I would like to use this thread to ask about best practices.
In the DataSHIELD projects I am involved in, there is interest in having version snapshots available on each node. The requirement is that if we release any results as part of a study, we should be able to rewind the data tables on the server to the point in time at which those results were obtained.
At this time, two approaches are apparent to us:
- Setting a data cut-off date, which is feasible as long as we have a column indicating data collection or upload time. The work can be repeated so long as no data is removed and the cut-off date is saved (i.e. we can still use the same January 1st 2023 cut-off in 2033). Some workarounds might be needed whenever new columns are introduced to the table.
- Creating a new table with a timestamp each time there is a major data revision, e.g. Project/Table_202401 for January, Project/Table_202403 for March, etc. This costs additional storage space and also requires constant communication with analysts about which table version they should target. However, unlike the first approach, this accommodates mistakes that may have been made in the past: if, for example, an individual was added by mistake in an earlier snapshot, these versioned tables will always retain that information.
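To make the two options concrete, here is a minimal sketch in plain Python. It only illustrates the logic and is not DataSHIELD or Opal API code. The function names, the `uploaded_on` column and the sample rows are all made up for illustration, and the `columns` argument is one possible workaround for columns added after the cut-off.

```python
from datetime import date


def rows_as_of(rows, cutoff, columns=None, date_key="uploaded_on"):
    """Approach 1: keep only rows uploaded on or before the cut-off date.

    `columns` (hypothetical workaround) restricts the result to the
    columns that existed at the cut-off, so later additions are ignored.
    """
    selected = [r for r in rows if r[date_key] <= cutoff]
    if columns is not None:
        selected = [{c: r.get(c) for c in columns} for r in selected]
    return selected


def snapshot_table_name(project, table, revision_date):
    """Approach 2: build a versioned table name such as Project/Table_202401."""
    return f"{project}/{table}_{revision_date:%Y%m}"


# Made-up example data; the second row arrives after the cut-off
# and carries a column ("smoker") that did not exist before.
rows = [
    {"id": 1, "uploaded_on": date(2022, 11, 3), "bmi": 24.1},
    {"id": 2, "uploaded_on": date(2023, 2, 14), "bmi": 27.8, "smoker": True},
]
cutoff = date(2023, 1, 1)

print(rows_as_of(rows, cutoff, columns=["id", "bmi"]))
print(snapshot_table_name("Project", "Table", date(2024, 1, 15)))
```

Note that the filter in the first function can only reproduce an earlier state if rows are never deleted, which is the condition stated above. The second function only generates the name; the copy of the table itself is what costs the extra storage.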
What lifecycle pattern have you followed for your DataSHIELD nodes?