Planned DataSHIELD v6.2 release

Hello,

We are starting to put together the v6.2 release of DataSHIELD; the current draft release notes are below. We are keen to hear any comments and feedback.

Stuart


Draft DataSHIELD Release Notes v6.2

Focus of Release

The changes in the v6.2 release of DataSHIELD mainly focus on enhancing the disclosure controls available to data owners, along with additional analytical and presentation methods for data analysis.

Changes from DataSHIELD v6.1.1 to v6.2

Checking Permissive PrivacyControlLevel

To support data owners who have particularly sensitive data, additional disclosure protection has been added to the v6.2 release. These changes permit a data owner to place a service into “permissive” (default) or “non-permissive” disclosure mode. This is done by setting the “datashield.privacyControlLevel” option. The service will be in “permissive” mode if the “datashield.privacyControlLevel” option has the value “permissive”; any other value will cause the service to be in “non-permissive” mode.

If a service is in “non-permissive” mode, certain methods are blocked from being invoked by the client. The blocked methods are:

dataFrameSubsetDS1 rbindDS
levelsDS recodeLevelsDS
cDS recodeValuesDS
cbindDS repDS
dataFrameDS reShapeDS
dataFrameSortDS seqDS
dataFrameSubsetDS2 subsetByClassDS
dmtC2SDS subsetDS

In addition, the method aliases for ‘base::c’, ‘base::cbind’ and ‘base::rep’ have been removed.

Not having access to these methods will mean that the Data Owner will be required to perform more data shaping for the Data User(s).
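The mode rule and the blocked list described above can be pictured with a minimal Python sketch. This is an illustration only, not DataSHIELD code: the function names are hypothetical, and treating an unset option as “permissive” is an assumption based on “permissive” being described as the default.

```python
# Illustrative sketch only: not part of dsBase. Function names are hypothetical.
BLOCKED_WHEN_NON_PERMISSIVE = {
    "dataFrameSubsetDS1", "rbindDS", "levelsDS", "recodeLevelsDS",
    "cDS", "recodeValuesDS", "cbindDS", "repDS",
    "dataFrameDS", "reShapeDS", "dataFrameSortDS", "seqDS",
    "dataFrameSubsetDS2", "subsetByClassDS", "dmtC2SDS", "subsetDS",
}

OPTION_NAME = "datashield.privacyControlLevel"


def is_permissive(options):
    """Return True only when the option has exactly the value "permissive".

    Assumption: an unset option falls back to the default, "permissive".
    """
    return options.get(OPTION_NAME, "permissive") == "permissive"


def may_invoke(method, options):
    """A method is callable unless the service is non-permissive and the
    method is on the blocked list."""
    return is_permissive(options) or method not in BLOCKED_WHEN_NON_PERMISSIVE
```

Under this reading, any value other than “permissive” (including a misspelling) puts the service into “non-permissive” mode, so a typo in the option can only make the service more restrictive.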

Changing disclosure settings

In this release, there are new disclosure settings that data owners can specify. The new “default.nfilter.levels.density” and “default.nfilter.levels.max” options have been added, with default values of 0.33 and 40 respectively. These options are described on the wiki page https://data2knowledge.atlassian.net/wiki/x/DoCaKg
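As a rough illustration of how two such thresholds could act together, here is a small Python sketch. The interpretation is an assumption, not something stated in these notes: it supposes that the density setting caps the number of distinct levels as a fraction of the number of observations, and that the max setting caps the absolute number of levels. The function name and the exact comparison operators are hypothetical; the wiki page linked above is the authoritative description.

```python
def factor_levels_allowed(n_levels, n_obs, density=0.33, levels_max=40):
    """Hypothetical check combining the two settings.

    Assumed reading (not confirmed by the release notes):
      - density: distinct levels must not exceed density * observations
      - levels_max: distinct levels must not exceed this absolute cap
    Defaults mirror the documented defaults of 0.33 and 40.
    """
    if n_obs <= 0:
        return False
    return n_levels <= levels_max and n_levels <= density * n_obs
```

With the defaults, a variable with 30 levels over 100 observations would pass under this reading, while 34 levels over 100 observations, or 41 levels over any number of observations, would not.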

New Functions

The following functions have been added to version 6.2 of the DataSHIELD dsBaseClient package.

ds.hetcor: computes a heterogeneous correlation matrix, consisting of Pearson product-moment correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables.

ds.lspline: computes the basis of a piecewise-linear spline such that, depending on the argument “marginal”, the coefficients can be interpreted as (1) slopes of consecutive spline segments, or (2) slope changes at consecutive knots. This is an assign function which saves the created object on the server-side.

ds.qlspline: this is similar to ds.lspline but it calculates the knot positions to be at quantiles of the input variable.

ds.elspline: this is similar to ds.lspline but it calculates the knot positions such that they cut the range of the input variable into n equal-width intervals.
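To make the three spline descriptions above concrete, the following is a plain-Python sketch of the underlying idea, not the DataSHIELD implementation. All function names are hypothetical, the quantile knots use simple linear interpolation (an assumption), and the two basis layouts are the standard constructions that give the "slopes of segments" and "slope change at knots" interpretations.

```python
def equal_width_knots(x, n):
    """Interior knots cutting the range of x into n equal-width intervals."""
    lo, hi = min(x), max(x)
    step = (hi - lo) / n
    return [lo + i * step for i in range(1, n)]


def quantile_knots(x, n):
    """Interior knots at the i/n quantiles of x (linear interpolation assumed)."""
    s = sorted(x)
    knots = []
    for i in range(1, n):
        pos = (len(s) - 1) * i / n
        low = int(pos)
        high = min(low + 1, len(s) - 1)
        knots.append(s[low] + (pos - low) * (s[high] - s[low]))
    return knots


def lspline_row(v, knots, marginal=False):
    """Basis row for one value v of a piecewise-linear spline.

    marginal=False: each column grows only inside its own segment, so a
                    regression coefficient is the slope of that segment.
    marginal=True:  columns are v and max(v - knot, 0), so a coefficient
                    is the change in slope at that knot.
    """
    if not knots:
        return [v]
    if marginal:
        return [v] + [max(v - k, 0) for k in knots]
    row = [min(v, knots[0])]
    for left, right in zip(knots, knots[1:]):
        row.append(min(max(v - left, 0), right - left))
    row.append(max(v - knots[-1], 0))
    return row
```

For example, with knots at 3 and 6, the value 4 maps to the row 3, 1, 0 in the segment-slope layout and to 4, 1, 0 in the slope-change layout.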

ds.ns: generates a basis matrix for representing the family of piecewise-cubic splines with a specified sequence of interior knots, and natural boundary conditions. This is an assign function which saves the created object on the serverside.

ds.dmtC2S: supports transferring complex variables from the client-side to the server-side(s). This is an assign type method. The types of variables which can be transferred are data.frame, matrix or tibble.

ds.asFactorSimple: converts an input variable into a factor. Unlike ds.asFactor and its serverside functions, ds.asFactorSimple does no more than coerce the class of a variable to factor in each study. It does not check for or enforce consistency of factor levels across sources or allow you to force an arbitrary set of levels unless those levels actually exist in the sources. In addition, it does not allow you to create an array of binary dummy variables that is equivalent to a factor. If you need to do any of these things, you will have to use the ds.asFactor function.

ds.metadata: obtains the non-disclosive metadata associated with a variable held on the server.

ds.ranksSecure: securely generates the ranks of a numeric vector and estimates true global quantiles across all data sources simultaneously (see https://data2knowledge.atlassian.net/wiki/x/AYDPog for details).

ds.unique: generates a variable on the server-side which represents a version of an existing variable but without any duplicate values.

ds.forestplot: draws a forestplot of the coefficients for Study-Level Meta-Analysis (*)

(*) Provided by Xavier Escribà Montagut, Barcelona Institute of Global Health (ISGlobal), Spain

Changed Functions

ds.replaceNA: This new version of ds.replaceNA can replace NAs in factor variables. The replaced values are then considered as additional levels of the factor.

ds.tapply.assign: Major refactoring which ensures that variables are present in all servers. Fixed an issue so that variables which include missing values, and not only complete cases, are dealt with correctly.

ds.tapply: Major refactoring which ensures that variables are present in all servers. Fixed an issue so that variables which include missing values, and not only complete cases, are dealt with correctly.

ds.mean: the behaviour when all values are NA has been changed; if ds.mean is called on a vector, on a server, which contains only NAs, the result from the server will be NA, instead of causing a disclosure block.

ds.var: the behaviour when all values are NA has been changed; if ds.var is called on a vector, on a server, which contains only NAs, the result from the server will be NA, instead of causing a disclosure block.

ds.table: The new version allows the user to specify only two options for the argument useNA either “no” or “always”. The option “ifany” which was available in v6.1.1, is not allowed any more.

ds.corTest: The new version allows the user to get Kendall’s tau or Spearman’s rho correlation coefficient for a pair of variables, in addition to the existing Pearson’s correlation. The new arguments added are: the method which can be one of “pearson” (default), “kendall”, or “spearman”, the exact which is a logical indicating whether an exact p-value should be computed for Kendall’s tau or Spearman’s rho, the conf.level which defines the level of the returned confidence interval, and the type which defines if a study-specific correlation coefficient is returned or a combined correlation across all studies (the combined correlation is an approximation of the exact pooled correlation and is estimated based on Fisher’s z transformation).
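The combined correlation mentioned above can be illustrated with a short Python sketch of the textbook Fisher z approach. The release notes say only that the combined value is estimated based on Fisher’s z transformation; the inverse-variance weights of n - 3 used here are the usual convention and are an assumption about, not a description of, what ds.corTest does. The function name is hypothetical.

```python
import math


def combined_correlation(rs, ns):
    """Pool study-level correlations via Fisher's z transformation.

    rs: per-study correlation coefficients (each strictly between -1 and 1)
    ns: per-study sample sizes (each greater than 3)
    Assumption: weights are n - 3, the inverse variance of Fisher's z.
    """
    weights = [n - 3 for n in ns]
    if any(w <= 0 for w in weights):
        raise ValueError("each study needs more than 3 observations")
    z_bar = sum(w * math.atanh(r) for r, w in zip(rs, weights)) / sum(weights)
    return math.tanh(z_bar)
```

The pooled value always lies between the smallest and largest study-level correlations, and larger studies pull it towards their own estimate.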

ds.glmSLMA, ds.lmerSLMA and ds.glmerSLMA: the changes to these functions are as follows:

  • we made sure that the grouping factor (i.e. the variable after the “|”) in the mixed model is no longer included in a set of checks that are normally used for standard GLMs. Applying those checks to the grouping factor was not appropriate, as it blocked users from running models when there was a small number of individuals in the groups (e.g. siblings in family groups). Having a small number of individuals in a group is not a disclosure issue for mixed models and hence it should be permitted.
  • we improved the handling of errors when something goes wrong in the underlying lme4 functions that are used. Previously the error message returned to the user was not the one from the underlying function, making it hard to debug what had gone wrong.
  • we have added, to ds.glmSLMA, a notify.of.progress argument which can enable or disable progress logging.

ds.histogram: function allows the user to plot distinct histograms (one for each study) or a combined histogram that merges the single plots.

ds.Boole: an issue was fixed which, under certain circumstances, could produce incorrect results. The incorrect behaviour could occur if the right-hand operand was negative.

ds.asNumeric: has been changed to deal with different types of variables (including characters)

Client-side Testing Infrastructure

Additional tests and general test improvements are included in this release.

Client methods now test that the variables being used exist and check their class.

Server-side Testing Infrastructure

Additional tests, general test improvements, added privacy control level tests and improved error messages are included in this release.

Backward compatibility with v6.1.1 dsBaseClient

There are no known significant issues with using v6.1.1 dsBaseClient with v6.2 dsBase. The changes in behaviour which have been observed are limited to changes to the text of error messages, changes to the circumstances under which a disclosure block could occur and bug fixes.

Supported Versions

DataSHIELD v6.2 is supported on R 3.5, R 3.6, R 4.0 and R 4.1, and would be expected to work with intermediate versions. At present the DataSHIELD client-side package is known to work on Ubuntu 18.04, Ubuntu 20.04, Windows 10 and macOS Big Sur (11.6). The DataSHIELD server-side package is known to work when deployed to Opal 4.3.3 running on Ubuntu 18.04 and 20.04.

Code Availability

(Planned) As ever, you can obtain the code at a variety of places:

New ds.histogram changelog description:

ds.histogram: change to how the function automatically checks for disclosure; it now compares the number of breaks with the disclosure parameter “nfilter.levels.density”, instead of comparing with “nfilter.levels” as previously.

@yannick I was wondering whether there is a link on the Obiba CRAN to historical package documentation. For example, if I want to cite in a paper a particular version of dsBase(Client) v6.0 or 6.1, readers/reviewers would know not just the version I used but which functions were available to me for the analysis. I know the CRAN lists the current release functions. I thought previously you could select which release version to view but now I can’t find it.

There is a more appealing online documentation:

but there is no documentation history.

If the datashield packages were in the official CRAN, you would have this history from the R documentation web site. See for instance that you can pick a version:

I’m after the information that is on this page (beautified would be even better!) but with the ability to view a list of functions for a given version number. Maybe we stopped doing this, and only the current release is shown now? People writing papers could link to the release version on GitHub, but the function list there isn’t as easily readable as the current list on the Obiba CRAN.

Hello,

I’m using the training VMs for version 6.1 (2020-10-31) and previously I have had no problems downloading the 6.2-dev version to try out some approved functions. As I’m very interested in testing the latest version I removed the dsBase package from the VM:s and tried to download the 6.2-dev version but nothing happened! And suddenly I’m not even capable of downloading the current 6.1 version from the master branch. Tried to fix it through R with opalr and ‘dsadmin.install_package’ but get an error message saying “Failed to install ‘dsBase’ from GitHub: (converted from warning) installation of package ‘nloptr’ had non-zero exit status”. Updated R to version 4.1.3 but the same message. Ran out of ideas what to do. Is it a problem that the Opal version is 3.0.3 in the current VM:s? When will a training VM be available with newer Opal and DataSHIELD v 6.2?

Kind Regards,

Bodil

Bodil,

My own general practice is to update the VM from the command line with “apt update -y; apt upgrade -y”. I haven’t recently had any problems with upgrading “Opal” or “Rock” this way.

I would check that the R packages required by “dsBase” (listed in the DESCRIPTION file) are installed. Certain versions of “nloptr” have caused me problems in the past because they require system library packages to be installed using “apt”. To find out what these packages are on your machine, I would recommend trying to install “nloptr” from the R command line and watching for errors/warnings in the output.

Just to be clear, the development package which will form the v6.2 release isn’t in the CRAN (Obiba’s or DataSHIELD’s) at the moment, only on GitHub. If you wish to try out the code base which is close to that which will be in v6.2, use “+ Add Package” on the “Administration” / “DataSHIELD” page, select “GitHub” with “User or organization name” of “datashield”, “Name” of “dsBase” and “Git reference” of “v6.2-dev”.

The v6.2 training VM currently being created by the VM creation system is 40GB, which needs to be addressed.

Stuart

Hi,

There is one of the dependencies of dsBase that recently changed: the system library libnlopt-dev must be installed. The VM should be updated accordingly.

sudo apt-get install libnlopt-dev

and then try to reinstall dsBase.

Regards
Yannick

Thank you Stuart and Yannick for your rapid responses. R code I understand, but I guess I’ll now reveal how little I understand about systems. So, in order to use the suggested commands for updating the VMs, I first need to log in on the screen that is shown when I start the VM (I’m using the VM:s on Windows). I tried the same user and password as for the Opal web interface but failed. I realised that I probably don’t have the proper keyboard setting as I’m Scandinavian, but I figured out which keys represented the non-alphabetic characters in the password; still failure. Then I created a new user in the Opal web interface with a password not including any non-alphabetic characters and gave it permission to administrate in the General Setting. Still no luck logging in at the command window. Any idea what to try next?

Kind Regards,

Bodil

Hi,

Opal is just an application in the system. The Opal user/password is not relevant for logging in to the system and performing this apt upgrade task. I don’t know the DS VM, but @swheater will certainly be able to guide you.

Best regards
Yannick

Bodil,

Currently the username is “root” and the password is “puppet” (this isn’t a secret; it is on the wiki. It is also one reason why it is not the best idea to use the tutorial VMs as the basis of a production VM).

You can use the command below to configure the right keyboard.

dpkg-reconfigure keyboard-configuration

Stuart

Thank you!

I was able to log in. The apt update/upgrade task did a lot of things, but unfortunately I think there were quite a lot of messages about failing too. Despite this, I continued with the installation of libnlopt following the instructions Yannick provided. Then I added dsBase from v6.2-dev using the Opal web interface and it worked! Although the VMs still have Opal 3.0.3, the dsBase is version 6.2.0-6. I’m quite happy with that! I can now test some of the new or approved functions and I’ll just wait for the new VMs with the latest version of Opal and the official v6.2 version of dsBase once it is released.

Again, thanks a lot for the rapid support!

All the best,

Bodil