11th October 2022
A statement prepared by the DataSHIELD research project - maintainers of ds.Base and core DataSHIELD software:
- Demetris Avraam (Statistics co-lead)
- Paul Burton (Statistics co-lead, former PI)
- Olly Butters (Infrastructure co-lead)
- Yannick Marcon (Infrastructure co-lead)
- Madeleine Murtagh (Data Governance and Ethics co-lead)
- Stuart Wheater (Infrastructure co-lead)
- Becca Wilson (DataSHIELD PI, Data Governance and Ethics co-lead)
During their use of DataSHIELD in the ORCHESTRA project, researchers at Helmholtz Munich and the University of Bonn identified that it is possible to reverse engineer individual-level data in some federated learning systems, including DataSHIELD (Huth et al., 2022, a bioRxiv preprint awaiting peer review). Huth et al. describe an algorithm for multi-step inferential disclosure; this is the first description in the literature of specific methods in relation to DataSHIELD. Recognising that such algorithms exist, rather than being only a theoretical possibility, emphasises the necessity of formal data access governance agreements as part of the DataSHIELD implementation framework. Data access agreements routinely prohibit data users from attempting to identify individual data or data subjects. The major consortia currently using DataSHIELD have such data access governance in place. The necessity of good data governance has been emphasised from the very beginning (Wolfson et al., 2010).
Since the first in-depth discussions of inferential disclosure at the BioSHARE-eu Tool Roll-Out in 2015, DataSHIELD best practice has been to notify the community via the DataSHIELD Forum and/or the core infrastructure team when such methods have been identified. This is precisely the process followed by Huth et al. when they reported their work three weeks ahead of publication. DataSHIELD is an intrinsically collaborative endeavour, and the identification of issues, together with papers such as Huth et al., makes an invaluable contribution to improving DataSHIELD.
We would like to reassure the community of DataSHIELD users, data providers and developers that, while the Huth et al. paper demonstrates that sophisticated disclosure methods are a practical reality, this does not materially change data protection in studies implementing the DataSHIELD framework with strong data access governance in place. This statement serves as a reminder of the non-technical and technical framework that must be in place when implementing DataSHIELD; data providers and users can use it to evaluate their data governance and operational processes. Specifically, there are 8 primary disclosure thresholds and 11 disclosure mitigating policies and processes.
DataSHIELD’s 8 primary disclosure ‘thresholds’ are aimed at preventing simple (single-step) approaches to inferential disclosure. They include rules on minimum cell counts in a contingency table (typically 1-5 by default), rules on minimum subset sizes, and rules relating the total number of terms in a regression model to the total number of useable observations in a study. The disclosure method described by Huth et al. is a multi-step procedure (i.e. it requires more than one function to be called sequentially), as were all the specific issues we discussed at the BioSHARE-eu Tool Roll-Out meeting and the subsequent algorithms that have been reported to us. Importantly, Huth et al. have taken the discussion forward by publishing their method on bioRxiv. We are not aware of any other group that has published on multi-step methods of disclosure vulnerability in the remote analysis of federated data.
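To make the idea of a single-step threshold concrete, the following Python sketch checks a minimum cell count and a terms-to-observations ratio. It is an illustration only and is not DataSHIELD code (the DataSHIELD server side runs in R). The function names, the threshold values 3 and 0.33, and the choice to check only cells that actually occur in the data are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical values for illustration; in practice the data custodian sets them.
MIN_CELL_COUNT = 3
MAX_TERMS_PER_OBSERVATION = 0.33


def safe_contingency_table(rows, cols, min_cell_count=MIN_CELL_COUNT):
    """Return cell counts, or refuse if any observed cell is below the threshold."""
    if len(rows) != len(cols):
        raise ValueError("rows and cols must have the same length")
    counts = Counter(zip(rows, cols))  # only combinations that occur are counted
    small_cells = [cell for cell, n in counts.items() if n < min_cell_count]
    if small_cells:
        # Refuse to return anything rather than reveal which cells are small.
        raise PermissionError("table blocked: at least one cell is below the minimum count")
    return dict(counts)


def regression_allowed(n_terms, n_observations, max_ratio=MAX_TERMS_PER_OBSERVATION):
    """Allow a model only if its number of terms is small relative to the data."""
    return n_observations > 0 and n_terms <= max_ratio * n_observations
```

In this sketch a table is either returned in full or blocked entirely, so a caller cannot learn which individual cell triggered the refusal.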
DataSHIELD is not a federated analysis technology that can be implemented in isolation. There are eleven specific disclosure protections and mitigations in place. Each one either represents a formal regulatory or professional condition for data use, is a method embedded in the DataSHIELD software, or is a strong recommendation based on best practice relating to implementation. These are as follows:
- DataSHIELD can only be used in compliance with formal data governance agreements and conditions, managed via user-control settings (including assignment of usernames and passwords for DataSHIELD), and dataset approvals are given based on those conditions. DataSHIELD users working with formally managed datasets (i.e. the vast majority of analysis undertaken on big-Epidemiology and/or health data) are constrained by the conditions of data access awards and by the contractual and cultural expectations to uphold professional standards. Deliberate attempts to identify individuals or to infer the value of variables in the data risk professional censure and sanctions, including dismissal from employment (responsible data use now forms part of academic and health service employment contracts) and denial of future access to managed datasets.
- DataSHIELD analyses are enacted on a database snapshot, not on the primary/live database for a study.
- DataSHIELD must be built on robust hardware and all transmissions must be encrypted.
- The server-side analysis environment (R) can only be called via Opal or Molgenis.
- Only valid characters/functions (via the R Parser) can be passed from the client side to the server side.
- Server-side functions block directly disclosive output, e.g. print; glm residuals & fitted values.
- Disclosure traps are built into each function that can call any of the 8 primary disclosure ‘thresholds’ (e.g. the contingency table function calls the minimum allowable cell size for a contingency table).
- Only the designated data controller/custodian can change the values of the 8 primary disclosure thresholds for analysis on their DataSHIELD server.
- All user commands on the DataSHIELD server are logged so they can later be interrogated if a disclosure event is suspected.
- DataSHIELD users and deployers are responsible for updating and maintaining their software. As always, new DataSHIELD infrastructure releases and updates will continue to be announced in the Releases section of the DataSHIELD forum.
- As an open-source community, DataSHIELD package developers are responsible for updating and maintaining their packages. Package developers are encouraged to send a request to datashield @ liverpool.ac.uk to join the monthly DataSHIELD full stack technical meeting where a variety of infrastructure and development issues are discussed.
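The restriction on valid characters and functions listed above can be pictured as an allowlist check applied before a call ever reaches the analysis environment. The Python sketch below is a simplified illustration under stated assumptions, not the actual R Parser. The function names `mean_fn` and `table_fn`, the permitted character set, and the single-call grammar are all hypothetical.

```python
import re

# Hypothetical allowlist; a real deployment would publish its own set of functions.
ALLOWED_FUNCTIONS = {"mean_fn", "table_fn"}

# Assumed grammar: one function name followed by a flat argument list made of
# letters, digits, underscores, dots, dollar signs, commas and spaces only.
VALID_CALL = re.compile(r"^([A-Za-z][A-Za-z0-9_.]*)\(([A-Za-z0-9_.$, ]*)\)$")


def is_call_permitted(call, allowed=ALLOWED_FUNCTIONS):
    """Accept a call only if it fits the grammar and names an allowed function."""
    match = VALID_CALL.match(call)
    return match is not None and match.group(1) in allowed
```

Under these assumptions `is_call_permitted("mean_fn(D$age)")` is accepted, while `print(D)` is rejected because `print` is not on the allowlist, and a nested call such as `mean_fn(system('ls'))` is rejected because its arguments contain characters outside the permitted set.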
There is a crucial distinction between one-step and multi-step methodology for disclosure. Disclosure events that occur with a single analysis step would be impossible to detect in the DataSHIELD log files or in the contents of the evolving server-side databases, i.e. it would not be possible to distinguish valid analysis from malign activity. Any function that allows one-step disclosure must be removed from DataSHIELD altogether, or modified with appropriate disclosure thresholds. For example, when we initially implemented ds.factor it rapidly became clear that if it was applied to a continuous variable it would print all values of that variable to the screen, so we modified and updated the function to prevent this. Any other function allowing single-step disclosure would similarly be deleted, or its methods updated, as soon as it was reported.

However, because a multi-step disclosure algorithm has to use two or more functions in a particular order, and generally with a particular structure to the output, there is typically a range of mitigating algorithms that could be run on the log files and server side to identify and flag disclosure risks. There is additionally an opportunity to modify and update the analysis methods exploited by a disclosure algorithm. So far, although it will take a significant piece of ongoing work, we believe that every example of a multi-step disclosure algorithm that we have seen (including that described by Huth et al.) can be identified and mitigated by the expansion and analysis of both current and additional log files.
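One simple form such a log-based mitigation could take is a scan for users whose commands contain a given sequence of function calls in order. The Python sketch below illustrates that idea only. The log format (a list of user and function-name pairs in chronological order), the function names `func_a`, `func_b` and `func_c`, and the pattern itself are hypothetical; the sketch does not reproduce the DataSHIELD log format or the algorithm described by Huth et al.

```python
from collections import defaultdict


def users_matching_pattern(log_entries, pattern):
    """Return users whose logged calls contain `pattern` as an ordered subsequence.

    log_entries: iterable of (user, function_name) pairs in chronological order.
    pattern: non-empty list of function names that must appear in this order,
             possibly with unrelated calls in between.
    """
    if not pattern:
        raise ValueError("pattern must contain at least one function name")
    matched_steps = defaultdict(int)  # user -> number of pattern steps seen so far
    flagged = set()
    for user, function_name in log_entries:
        if user in flagged:
            continue
        if function_name == pattern[matched_steps[user]]:
            matched_steps[user] += 1
            if matched_steps[user] == len(pattern):
                flagged.add(user)
    return flagged


# Hypothetical log: only the first user calls func_a and later func_b.
example_log = [
    ("alice", "func_a"),
    ("bob", "func_b"),
    ("alice", "func_c"),
    ("alice", "func_b"),
    ("bob", "func_a"),
]
suspects = users_matching_pattern(example_log, ["func_a", "func_b"])
```

A flagged user is a prompt for human review rather than proof of misuse, since a legitimate analysis could contain the same calls in the same order.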
Going forwards
We will work with the Helmholtz Munich and University of Bonn team to update functions and create mitigating systems in ds.Base. We will describe this work in upcoming posts and encourage other community DataSHIELD package developers to do the same to facilitate updating these functions. As always, new DataSHIELD infrastructure releases and updates will be announced in the Releases section of the DataSHIELD forum.
In subsequent posts, we will include a link to the bioRxiv publication once it is available. We hope that other groups interested in disclosure methodologies will engage in the project and collaborate with us to continue to expand DataSHIELD disclosure controls, as well as participate in collaborative funding proposals. To facilitate this, there is a planned discussion session on disclosure risk, control and mitigations at the DataSHIELD hybrid conference on 19th-21st October. Registration is still open for those wanting to participate (full conference details including a draft agenda are also now available). As with previous discussions, the community can continue to explore this topic via the DataSHIELD forum, and further statements may be released by DataSHIELD community stakeholders.