DataSHIELD analysis with relational database


I am dealing with a specific use case for analysis with DataSHIELD where we work with data from a relational database. Namely, the following example depicts the situation: there are (at least) two tables in the database, Observation and Patient, where each observation has patient_id as a foreign key. The observations record some biomarkers’ values from the patient on a specific date. The same biomarker can be observed multiple times for the same patient.

Patient table (snippet):

id  gender  country
1   female  UK
2   male    UK

Observation table (snippet):

id  patient_id  observation_date  biomarker              value
1   1           2019-02-27        Fasting Blood Glucose
2   1           2020-02-27        Fasting Blood Glucose
3   2           2018-02-27        Total Cholesterol
4   2           2020-02-27        Total Cholesterol
5   2           2020-02-17        Fasting Blood Glucose

To my understanding, the DataSHIELD analyst does not have access to the IDs of either of these tables, which makes it impossible to join them on the analyst side. On the other hand, a problem also arises when trying to flatten this data into a single data object to make it ready for analysis in DataSHIELD; a flat table (one patient per row) seems infeasible for this case.

What would be an appropriate solution for this case?

Thank you in advance,


My understanding of the Opal server is that data loads consist of a single table, so the relational structure would have to be flattened into one table. Perhaps resources can help.


Hi Tanja,

It is possible to join different tables in DataSHIELD if they share one (or more than one) column, using the ds.merge function.

In your case you need to first reshape one of the tables in order to get one row per patient using the ds.reShape function and then apply the ds.merge function.
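The two steps above could look roughly like this, as a sketch only: it assumes an existing DataSHIELD connections object `conns`, with the Observation table assigned server-side as "obs" and the Patient table as "pat" (these symbols, and the exact argument names, are illustrative and should be checked against your installed dsBaseClient version).

```r
library(dsBaseClient)

# Reshape Observation from long to wide: one row per patient,
# one column per biomarker (values spread across new columns).
ds.reShape(data.name = "obs",
           v.names = "value",
           timevar.name = "biomarker",
           idvar.name = "patient_id",
           direction = "wide",
           newobj = "obs_wide",
           datasources = conns)

# Merge the widened observations with the Patient table on the patient ID.
ds.merge(x.name = "obs_wide",
         y.name = "pat",
         by.x.names = "patient_id",
         by.y.names = "id",
         newobj = "D",
         datasources = conns)
```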

Please have a look and let me know if you need any further help.

Many thanks, Demetris


Thank you for pointing out the ds.merge and ds.reShape functions.

However, in my case this would create a very wide and sparse table, and this data representation is not very convenient for the analysis.

Alternatively, is there functionality in DataSHIELD that aggregates values (something like a groupby)? I imagine that this way the table would consist of one observation per row, with columns patient_id, observation_date, biomarker, value, gender (after merging)… Then, this table could be transformed to one patient per row by grouping on patient_id.

Thank you!

Hi Tanja,

Yes, the ds.tapply.assign is a “groupby” type of function. Please have a look and check if it is useful for your case :slight_smile:
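A minimal sketch of that "groupby" step, assuming `conns` and a server-side data frame "D" with patient_id and value columns (all names here are illustrative):

```r
library(dsBaseClient)

# Compute the mean biomarker value per patient and keep the result
# server-side under the symbol "D_by_patient".
ds.tapply.assign(X.name = "D$value",
                 INDEX.names = "D$patient_id",
                 FUN.name = "mean",
                 newobj = "D_by_patient",
                 datasources = conns)
```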



This is a question regarding a step before the ds.tapply.assign function: ds.asFactor. When the variable has too many values relative to the number of rows in the study, the following error occurs:

"Failed to obtain levels: FAILED: this variable has too many levels and may be disclosive. The ds.asFactor function allows no more than x levels in this particular study. This variable has y"

Is it somehow possible to exclude from the study the rows with low occurrences of values in the variable that we want to factorise? With forced.factor.levels it is assumed that we know in advance which values are the ones with low occurrences that we want to exclude (and it also assumes general knowledge of all values contained in the variable). Is there a way around this?


Hi Tanja,

It depends. If the number of rows that you want to remove (e.g. by subsetting) is less than 3 (the default value for the disclosure control of subsets), you will still get the same error message; otherwise you can create a subset of the dataframe without those rows and then convert the specific variable to a factor.

An alternative approach is to recode those specific values (e.g. using the ds.recodeValues function), assuming that they belong to another category.
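Both options could be sketched roughly as follows, assuming `conns` and a server-side data frame "D" with a variable "biomarker" that has too many levels. The value "RareTest" and all object names are purely illustrative, and the argument names (and support for character variables) should be verified against your dsBaseClient version.

```r
library(dsBaseClient)

# Option 1: drop the rows with the rare value, then factorise.
# The resulting subset must still pass the disclosure checks.
ds.dataFrameSubset(df.name = "D",
                   V1.name = "D$biomarker",
                   V2.name = "'RareTest'",
                   Boolean.operator = "!=",
                   newobj = "D_sub",
                   datasources = conns)
ds.asFactor(input.var.name = "D_sub$biomarker",
            newobj.name = "biomarker_f",
            datasources = conns)

# Option 2: recode the rare value into an existing category instead.
ds.recodeValues(var.name = "D$biomarker",
                values2replace.vector = c("RareTest"),
                new.values.vector = c("Other"),
                newobj = "biomarker_rec",
                datasources = conns)
```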

I am not convinced these operators in DataSHIELD could achieve what Tanja would like to do. While they appear to share some elements based on set theory, their applications are in different contexts: one is applied to a dataframe for statistical analysis purposes, the other to data storage.

Tanja is working at the database level, that is, before the data are uploaded to an R server. Unless there is a separate data frame for each table, this becomes an issue. I feel it should be answered by Yannick Marcon; otherwise, Tanja will not be able to achieve what she is looking to do.

In data science, we tend to join such data together before we complete the analysis. If I were going to work at the database level, I would create a view or a query in the database management system and use it to apply my analysis; it is often much quicker and more efficient. Data scientists rely on flattening the content of relational databases all the time, a practice that stems from Big Data. Otherwise, we cannot complete our analyses.

Best wishes,



After discussing with my colleagues Stuart and Hugh, we thought we needed to mention Resources in the Opal server. A resource can link to the relational database management system (RDBMS) where the data are stored. On the RDBMS, the join can be achieved; this type of server-side software is specialised for exactly this purpose. The result can then be shared through the Opal server, using the resources.

It may be worth, if you can, asking the person managing the data to write the join for you. Then the Opal server can have the corresponding resources added. It is worth mentioning that I have asked Yannick Marcon to contribute to this post too.
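On the client side, consuming such a resource could look roughly like the sketch below. It assumes an Opal resource named "project.joined_view" has already been created, pointing at a database view that performs the join; the server URL, credentials, and resource name are all placeholders, and the exact calls should be checked against the DSI/DSOpal documentation.

```r
library(DSI)
library(DSOpal)

# Log in to the Opal server (illustrative URL and credentials).
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal.example.org",
               user = "analyst", password = "password",
               driver = "OpalDriver")
conns <- DSI::datashield.login(logins = builder$build())

# Assign the resource (the pre-joined view) to a server-side symbol.
DSI::datashield.assign.resource(conns, symbol = "res",
                                resource = "project.joined_view")

# Coerce the resource into a server-side data frame ready for analysis.
DSI::datashield.assign.expr(conns, symbol = "D",
                            expr = quote(as.resource.data.frame(res)))
```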

I hope we have not been too confusing.