Custom Data Disclosure Checks

aakkoc · 6 March 2025 08:55

The DataSHIELD disclosure checks are fairly well-defined. The other day a colleague suggested creating a custom disclosure check:

Suppose we have a column cohort to indicate the cohort of a patient. If an analyst is provisioned to access the cohort = diabetes set of patient data, they should not have access to rows where cohort = cancer.

Analysts can filter as needed, but is there any way to have invalid results raise an error on the server side?

becca.wilson · 9 March 2025 10:02

Could this be resolved as part of data governance? Once an analyst has approval for data access, the data is extracted and provisioned for access via DataSHIELD. I presume the data controller would only provision variables that the analyst is approved to access.

An alternative governance model is for groups of analysts to all have access to the same dataset and variables, but then the data access T&Cs / contracts should reflect this.

I should add that it is possible to develop additional disclosure checks, either within a package or, if a type of check could have wider application, generalised and abstracted out of packages so that others can use them like the DataSHIELD disclosure checks.

aakkoc · 10 March 2025 10:20

Data governance and groupings can help, and I think it’s the way we handle things now. But I also think a bespoke package would be the best solution.

This question arose while discussing the new dsOMOP package which may encourage certain practices. dsOMOP enables interfacing OMOP-CDM data through DataSHIELD, but it’s not uncommon for these databases to feature several cohorts together.

Again, a temporary solution is likely to be doing it the good old way, by ensuring that what arrives on Opal/Armadillo are separate cohorts whose access permissions are managed separately. But a cohort/filter disclosure check would most certainly be welcome.

I am curious whether anyone is working on something like this, @davidsarratgonzalez.

davidsarratgonzalez · 21 March 2025 14:44

Hi Ahmet,

Thank you for mentioning me and for bringing up this interesting (and very necessary!) topic.

Currently, the dsOMOP package only includes one custom disclosure check: it interprets DataSHIELD’s nfilter.subset rule to count unique patients instead of raw rows when querying an OMOP database. This was necessary because in the OMOP CDM, each event (like a medical condition occurrence or a drug exposure) is stored as a separate row. Without counting unique patients, an analyst could run a query that returns many rows but only a few distinct individuals, potentially risking disclosure of those individuals’ data. By switching the subset rule to count unique patients, we ensure that any query returning too few individuals will be blocked, regardless of how many rows those individuals have.
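As a rough illustration of why the counting unit matters, here is a minimal sketch in Python. It is not dsOMOP's actual code: the function names, the threshold value of 3, and the sample rows are all assumptions made for the example.

```python
# Hypothetical sketch; not dsOMOP's actual implementation.
NFILTER_SUBSET = 3  # assumed threshold, for illustration only


def count_unique_patients(rows):
    """Count distinct person_id values, ignoring how many rows each has."""
    return len({row["person_id"] for row in rows})


def passes_subset_check(rows, threshold=NFILTER_SUBSET):
    """Allow a result only if it covers at least `threshold` patients."""
    return count_unique_patients(rows) >= threshold


# Six event rows that belong to only two patients (made-up concept ids).
rows = [
    {"person_id": 1, "condition_concept_id": 1001},
    {"person_id": 1, "condition_concept_id": 1002},
    {"person_id": 1, "condition_concept_id": 1003},
    {"person_id": 2, "condition_concept_id": 1001},
    {"person_id": 2, "condition_concept_id": 1002},
    {"person_id": 2, "condition_concept_id": 1004},
]

row_based_ok = len(rows) >= NFILTER_SUBSET    # True: six rows look safe
patient_based_ok = passes_subset_check(rows)  # False: only two patients
```

A check on raw rows would let this result through, while the check on distinct patients blocks it.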

Apart from that measure, we have not implemented any disclosure check in dsOMOP to restrict data access by cohort on a per-user basis. It is true that an analyst can choose to focus on a specific cohort in their analysis—for example, filtering the data of a specific cohort using dsBase functions. However, this approach relies on the analyst’s voluntary action to apply the filter and doesn’t technically prevent them from accessing other cohorts if those are available in the underlying data. In essence, if the entire database is accessible, a determined user could still retrieve information about any cohort unless there is an additional layer of access control outside of the analysis itself.

As you mentioned, in many DataSHIELD deployments, separating data by cohort is achieved through data governance practices rather than through dynamic filtering. Administrators often prepare multiple views or tables of the dataset, each containing only one cohort’s data, and then assign permissions so that each user or group can only access the view corresponding to their authorized cohort. For example, an administrator might create a table or a view for a “diabetes” cohort and another for a “cancer” cohort, and grant an analyst permission only to the diabetes one. This way, even though the analyst is using DataSHIELD normally, they can only see the data for that one cohort because that’s the only table they have access to. This sandboxes each cohort’s data at the source, ensuring that analysts cannot accidentally or intentionally query outside their permitted subset.
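The per-cohort view idea can be sketched with Python's built-in sqlite3 module. The table name, column names, and values below are invented for the example. SQLite has no per-user grants, so the permission step (which view each analyst may read) is assumed to be handled by the hosting server or database and is not shown.

```python
import sqlite3

# In-memory database with a made-up patients table holding two cohorts.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patients (person_id INTEGER, cohort TEXT, hba1c REAL);
    INSERT INTO patients VALUES
        (1, 'diabetes', 7.1),
        (2, 'diabetes', 6.8),
        (3, 'cancer',   5.4);

    -- One view per cohort; each analyst would be granted only "their" view.
    CREATE VIEW patients_diabetes AS
        SELECT * FROM patients WHERE cohort = 'diabetes';
    CREATE VIEW patients_cancer AS
        SELECT * FROM patients WHERE cohort = 'cancer';
    """
)

# Whatever is queried through the diabetes view, cancer rows never appear.
diabetes_rows = conn.execute(
    "SELECT person_id, cohort FROM patients_diabetes ORDER BY person_id"
).fetchall()
cohorts_visible = {cohort for _, cohort in diabetes_rows}
```

The filter lives in the view definition rather than in the analyst's code, which is why it does not depend on the analyst remembering to apply it.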

However, things become more complex when the data is not fully contained within an Opal or Armadillo server but is instead accessed as an external database resource. In these external setups, DataSHIELD does not automatically enforce row-level filters because it relies on package implementations to query the external database. As a result, the responsibility falls on database administrators to impose those restrictions. One common strategy is to create separate database views for each cohort in the external database, similar to the approach with multiple tables at the server level. Another strategy is to use the database management system’s row-level security features to try to restrict what data each user can see. Unfortunately, both approaches run into a fundamental limitation: the database needs to know which user is making the query in order to apply the right filter, and by default it does not get that information from DataSHIELD.

Under the hood, DataSHIELD typically connects to an external database using a single set of credentials (a shared account for the DataSHIELD server, which is configured in the resource). This means that from the database’s perspective, every query coming from DataSHIELD looks the same and is coming from the same account, regardless of which analyst actually initiated it. Because of this, if you try to implement row-level security in the database, it cannot distinguish one analyst from another—the database sees all queries as coming from that one shared account unless you create separate resources with different credentials. Similarly, if you rely on predefined views per cohort, you also need to create separate DataSHIELD resources for each view and manage permissions accordingly. That can become complicated if an analyst is allowed to see multiple cohorts or combinations of cohorts, since it might require setting up a different resource for each scenario. Unless the DataSHIELD infrastructure is modified to supply user-specific credentials or session parameters to the database, the database itself has no way to enforce user-specific filters under the current setup.

Another workaround that we considered was using OMOP CDM’s COHORT table as a possible way to handle this, controlling it through dsOMOP’s logic. The COHORT table in OMOP is intended to list patients belonging to various cohorts (identified by cohort_definition_id), and one could imagine using it to filter queries. For example, dsOMOP could automatically restrict any query to only include patients with certain cohort IDs that the user is allowed to see (maybe we could control such permissions through custom disclosure control variables?). This would leverage the OMOP data model itself to enforce the rules and would be quite elegant, because the filter would be applied within every query transparently. However, in the current DataSHIELD architecture, we do not seem to have a way to inform the server-side package functions about the user’s identity or session. The DataSHIELD server-side R session has no concept of who the client is—I assume that it is deliberately sandboxed and blind to user identity for security. As a consequence, dsOMOP cannot know which cohort_definition_id values correspond to the logged-in analyst’s permissions. Without that knowledge, it cannot automatically filter out disallowed cohorts using the COHORT table.
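To show what such a transparent filter might look like, here is a hedged sketch using sqlite3. The `cohort` columns follow the OMOP CDM (`cohort_definition_id`, `subject_id`), but the function name, the sample data, and the concept ids are invented. The `allowed_cohort_ids` argument is exactly the piece of information the server-side session cannot currently obtain, so it is passed in by hand here.

```python
import sqlite3


def query_allowed_conditions(conn, allowed_cohort_ids):
    """Return condition rows only for patients in the allowed cohorts."""
    allowed = list(allowed_cohort_ids)
    if not allowed:
        return []  # no permitted cohorts means no data at all
    placeholders = ",".join("?" for _ in allowed)
    sql = f"""
        SELECT co.person_id, co.condition_concept_id
        FROM condition_occurrence AS co
        WHERE co.person_id IN (
            SELECT subject_id FROM cohort
            WHERE cohort_definition_id IN ({placeholders})
        )
        ORDER BY co.person_id
    """
    return conn.execute(sql, allowed).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cohort (cohort_definition_id INTEGER, subject_id INTEGER);
    CREATE TABLE condition_occurrence (
        person_id INTEGER, condition_concept_id INTEGER
    );
    -- Cohort 1 holds patients 10 and 11; cohort 2 holds patient 20.
    INSERT INTO cohort VALUES (1, 10), (1, 11), (2, 20);
    -- Concept ids are arbitrary placeholders, not real OMOP concepts.
    INSERT INTO condition_occurrence VALUES (10, 1001), (11, 1001), (20, 2002);
    """
)

visible = query_allowed_conditions(conn, [1])
```

With only cohort 1 allowed, patient 20 never appears in the result, whatever else the analyst asks for.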

We also considered incorporating the cohort restriction into the DataSHIELD resource configuration. That would mean when an administrator defines the data resource in DataSHIELD (in Opal or Armadillo), they specify a fixed cohort filter parameter that dsOMOP will always apply for that resource. For instance, an admin might define a resource for a particular cohort or combination of cohorts, which internally always applies cohort_definition_id filters to every query on the OMOP tables. While this could work, it essentially requires setting up a separate resource for every single cohort or every combination of cohorts that different users might need access to, which is practically the same complexity as maintaining multiple database views or users.
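To get a feel for how quickly this grows, the short sketch below enumerates the resources needed in the worst case, where every non-empty combination of cohorts is required by someone. The function name is made up, and real deployments would usually need far fewer combinations than this upper bound.

```python
from itertools import combinations


def resources_needed(cohort_ids):
    """List every non-empty combination of cohorts (one resource each)."""
    ids = sorted(cohort_ids)
    return [
        combo
        for size in range(1, len(ids) + 1)
        for combo in combinations(ids, size)
    ]


# Three cohorts already give seven possible resources (2**3 - 1).
resources = resources_needed([1, 2, 3])
```

For n cohorts the worst case is 2^n - 1 resources, so five cohorts could require up to 31.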

Looking toward the future, a more convenient solution would be for DataSHIELD to allow some way to pass user-level information into the server-side environment or directly into the database queries, if that does not interfere with its privacy-preserving principles. If DataSHIELD enabled each user to connect to resources with distinct credentials or provided a user/session identifier for the server-side packages, then dsOMOP (or similar resource-based packages) could use that information to enforce restrictions programmatically. This would remove the need to manually set up numerous separate data slices (either resources, tables or views) and would ensure that no matter what an analyst tries to do, they simply cannot breach the cohort boundaries because the server will not return disallowed data. Such a feature would be very beneficial, especially as more organizations adopt DataSHIELD in conjunction with large centralized databases—but I assume the discussion here should center on how secure and desirable this would be.

Until such improvements are available, our best option is to continue using the data governance approach—essentially creating distinct data objects or views for each cohort and using DataSHIELD’s existing permission system to control access. In practice, this means maintaining separate datasets (or database views) for each cohort and ensuring each analyst’s account can only connect to the dataset that contains their allowed subset of data. This may be a bit labor-intensive to set up, but at present it is the most robust way to prevent any cross-cohort data leakage when using DataSHIELD, especially in an external database scenario.

I am very interested to hear if others in the community have tackled this issue in a different way. A unified strategy for passing user-level access rules into DataSHIELD analyses would greatly benefit our work and the work of many other teams facing similar needs.

Best regards,

David
