Hi all,
I’m currently authoring a comparison of different tools that do (semi) automated disclosure control.
As part of that, I wondered how much work it would be to make a version of DataSHIELD for general-purpose adoption that met standard disclosure control norms. For example (but not limited to):
no zeros in tables to avoid class disclosure (DS currently blocks cells with counts 1…threshold but allows 0s through)
configurable upper/lower bounds (typically 10%) on estimates that can be made of a single contributor’s values (rather than, for example, hard-coded +/- 5% when reporting max/min values via DS.range).
I would probably argue that in a semi-automated system, flagging zeros rather than blocking them is more reasonable (assuming there is a human check that will happen afterwards).
This is because zeros could be logical, and would not therefore be an issue of class disclosure. The example sometimes used in training is causes of injuries at work, with heavy machinery being logically zero for bankers or doctors.
If there isn’t human checking, then I would probably agree that prohibiting zeros is the safer option. Similarly, we should also be careful if a single cell contains all or nearly all of the counts.
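For concreteness, the cell-level rules discussed so far might look something like the sketch below. This is illustrative Python, not DataSHIELD's actual code: the function name, the `zero_mode` switch and the default values are all my own assumptions.

```python
# Hypothetical sketch of the cell-level rules discussed above -- not
# DataSHIELD's real implementation; names and defaults are illustrative.

def check_counts(cells, threshold=5, zero_mode="flag", dominance=0.9):
    """Return a dict of cell name -> decision ('ok', 'block', 'flag').

    - counts in 1..threshold are blocked (the small-cell rule)
    - zeros are blocked or merely flagged depending on zero_mode:
      'block' for fully automated checking, 'flag' when a human
      reviews outputs afterwards (per the point above)
    - a cell holding >= dominance of the total is flagged, since a
      table that is all (or nearly all) one cell is also disclosive
    """
    total = sum(cells.values())
    out = {}
    for name, n in cells.items():
        if n == 0:
            out[name] = "block" if zero_mode == "block" else "flag"
        elif n <= threshold:
            out[name] = "block"
        elif total and n / total >= dominance:
            out[name] = "flag"
        else:
            out[name] = "ok"
    return out

decisions = check_counts({"machinery": 0, "falls": 3, "other": 120},
                         threshold=5, zero_mode="flag")
# 'machinery' is flagged (zero), 'falls' blocked (small count),
# 'other' flagged (dominant cell: 120/123 >= 0.9)
```

The point of the `zero_mode` parameter is exactly the flag-versus-block trade-off above: a TRE with human output checking could run in 'flag' mode, while a fully automated pipeline would default to 'block'.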
Hi Simon,
I agree with your points, and the point about some zeros being ‘structural’.
However, at present DataSHIELD does not block any zeros, so it does not seem to protect against class disclosure. I assume this is because in the original use case it was decided that class disclosure was not an issue.
However, for a general TRE audience, it would be nice to be able to say that DataSHIELD does offer one potential solution, albeit possibly with more development work needed, rather than ruling it out.
The points you suggested can be easily implemented in DataSHIELD, and I agree that we need to make these modifications to meet the standard SDC norms. If you have any additional suggestions, please let me know.
We were also thinking of writing a paper describing the disclosure controls in DataSHIELD compared to the SACRO guidelines, but I don’t know whether that would overlap with your work, or how much information about DS you are going to provide.
It is an interesting question what the minimal set of disclosure controls is that is needed to provide both interactive access to, and correct governance of, sensitive data.
Thinking more about zero cells in tables: they can easily be blocked in a single-site setting, but a multi-site setting is more complicated. For example, say we have a variable showing birth plurality, and one study has two levels of this variable, singletons and twins, with enough observations in each level. Then the table can be returned with no issues. However, say a second study has three levels of this variable, singletons, twins and triplets, again with enough observations in each level. In this case, if we want to return the combined table across both studies, we would have to block all observations from study one.
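A toy illustration of that multi-site situation (plain Python, not DataSHIELD code; the study names and counts are made up):

```python
# Each study returns a table over its own observed levels; aligning the
# tables on the union of levels is what creates the implicit zeros.

study1 = {"singleton": 40, "twin": 12}                 # two levels, all safe
study2 = {"singleton": 55, "twin": 15, "triplet": 6}   # three levels, all safe

levels = sorted(set(study1) | set(study2))
aligned = {name: {lvl: tab.get(lvl, 0) for lvl in levels}
           for name, tab in [("study1", study1), ("study2", study2)]}

# study1 now carries an implicit zero for 'triplet', so a strict
# no-zeros rule would block study1's contribution to the pooled table,
# even though its own two-level table was fine in isolation.
zeros = {name: [lvl for lvl, n in tab.items() if n == 0]
         for name, tab in aligned.items()}
# zeros == {"study1": ["triplet"], "study2": []}
```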
Hi Demetris,
That’s a great example, and a really good illustration of why there is an urgent need for richer metadata to support semi- or fully automated output checking.
If only there were a standard ontology defining birth plurality as an ordinal (hence transitive) but not time-dependent variable (assuming a window of n minutes/hours for a birth event); then it would have been apparent at the time the first study’s results were provided that they implicitly contained zeros in the columns for triplets, quads etc.
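To make the metadata idea concrete: if an ontology declared the complete level set for a variable, a checker could spot the implicit zeros at the point the first study's table is released. The ontology structure below is entirely hypothetical, just to show the shape of the check:

```python
# A hypothetical machine-readable declaration of the complete level
# set for a variable -- no such standard ontology entry exists today,
# which is exactly the gap discussed above.
ONTOLOGY = {"birth_plurality": ["singleton", "twin", "triplet", "quad"]}

def implicit_zeros(variable, reported):
    """Levels the ontology defines for `variable` that are absent
    from the reported table -- i.e. the hidden zero cells."""
    return [lvl for lvl in ONTOLOGY[variable] if lvl not in reported]

implicit_zeros("birth_plurality", {"singleton": 40, "twin": 12})
# -> ["triplet", "quad"]
```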
Sadly we have a mix of ontologies (OMOP for medical, various others) and I’m not sure any are rich enough.
I’d forgotten until Becca reminded me that you had started to do an ‘interactive DS’.
As it stood, I was thinking just about fully automatic checking, and the desirability of an option for TREs to have a default setting that doesn’t allow zeros (as per Simon’s reply) or max/min values.
Hi Jim - welcome to the DataSHIELD community! (I knew you would see the light)
On the class disclosure bit - we’ve tended to find in our real-world deployments that most data controllers are comfortable with this, especially when balanced against the wider Five Safes framework risk mitigations. That said, I do agree that it would be an enhancement to add the functionality in, as others have mentioned above. With this being an open source community, you could get involved in developing it - reach out to the stats development group.
The max/min values have noise added to them here, but I take your point about making it more configurable. What we tend to have is a series of settings which data controllers can set for their data (things like minimum cell counts etc.); this list of settings does slowly expand as we generalise some hard-coded settings.
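The configurable-noise idea might look roughly like this. To be clear, this is not DataSHIELD's actual mechanism or API; the function, the `pct` parameter and the multiplicative-noise choice are assumptions for illustration, showing a data-controller-settable bound (e.g. 10%) instead of a hard-coded +/- 5%:

```python
# Illustrative sketch only: a range report where the perturbation
# width is a setting the data controller tunes per dataset.
import random

def noisy_range(values, pct=0.10, seed=None):
    """Return (min, max), each multiplied by a random factor in
    [1 - pct, 1 + pct]; pct would be a per-deployment setting."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    jitter = lambda x: x * (1 + rng.uniform(-pct, pct))
    return jitter(lo), jitter(hi)

noisy_range([3.2, 7.8, 5.1, 9.9], pct=0.10)
```

One design point worth noting: multiplicative noise scales with the magnitude of the value, which matches the "percentage bound on what can be learned about a single contributor" framing earlier in the thread better than fixed additive noise would.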
It is worth noting that a DataSHIELD deployment doesn’t have to have all of the DataSHIELD functions - they can easily be removed as the data controller sees fit.