Datashield R API

#1

Hi devs,

I am starting the development of a new datashield API so that the framework can be opened to other data repositories in addition to the opal one (EuCan-Connect project). There will be several steps in this process:

  1. Moving the opal R package source code out of datashield’s GitHub space and into obiba’s. A preview of this new package is already available at obiba/opalr. This will also be an opportunity to update the package dependencies (RCurl is not maintained anymore and is to be replaced by httr).
  2. Defining a datashield R API, using S4 classes, in a new R package called dsAPI (expected to be located in datashield’s space at datashield/dsAPI). This package will define base classes to be implemented by the R API of any data repository willing to contribute to the datashield framework. All the datashield client R packages will then depend on dsAPI instead of the opal R package.
  3. Developing a reference implementation of the datashield R API based on the new opal R package.
  4. Supporting other organisations during their implementation of this API (Molgenis group for instance).
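
To make step 2 a bit more concrete, here is a minimal sketch of what an S4-based interface can look like, in the spirit of DBI. The class and generic names (RepoConnection, MemoryConnection, repoListTables) are made up for illustration and are not the actual API; only the pattern (virtual base class, generic, repository-specific method) is the point.

```r
library(methods)

# Hypothetical virtual base class: it cannot be instantiated directly,
# each data repository package would extend it.
setClass("RepoConnection", contains = "VIRTUAL", slots = c(name = "character"))

# Hypothetical generic that every repository implementation must provide.
setGeneric("repoListTables", function(conn) standardGeneric("repoListTables"))

# Toy in-memory implementation standing in for a real repository.
setClass("MemoryConnection", contains = "RepoConnection", slots = c(tables = "list"))

setMethod("repoListTables", "MemoryConnection", function(conn) names(conn@tables))

conn <- new("MemoryConnection", name = "study1",
            tables = list(CNSIM1 = data.frame(x = 1:3)))
repoListTables(conn)
```

Client code written against the generic never needs to know which repository sits behind the connection object.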

The result is expected to be backward compatible with existing DS scripts.

If you have improvement requests (for instance the connection discovery function also known as getOpals()), please let me know.

Cheers, Yannick

#3

Hi,

It may seem trivial, but I have been thinking about the name of the Datashield API repository… When starting a new piece of software, one should pay attention to the name, as it is usually difficult to change afterwards. I think that dsAPI is not an appropriate name: it is easily confused with the naming scheme of the Datashield server-side packages, ds*, whereas it is not a server-side package and does not do any data analysis. I therefore propose to name it DSI, for DataShield Interface. This name is similar to DBI, the DataBase Interface, which is my reference in terms of API design.

Cheers Yannick

#4

Hi Yannick,

I agree that dsAPI has too much in common with server side packages.

My only suggestion is DSAPI (i.e. fully capitalised), because I wonder if the term DataSHIELD API has already gained too much traction. But if not, I am happy with DSI.

Tom

#5

I agree dsAPI is not an appropriate name, but I think API should form part of the name. Would DatashieldAPI be too long?

#6

DatashieldAPI is way too long for sure. I like DSI because it can easily be expanded for readability as Datashield Interface, whereas no one (I think) would ever write Datashield Application Programming Interface when spelling out DSAPI. It sounds less technical to me, but maybe I am wrong.

#7

Hi,

I have made very good progress with the DataSHIELD Interface (DSI) and before I go any further I would like to have your feedback. The new and updated packages are:

  • The DSI package defines the interface using S4, see the README for more details; I am pretty happy with the choice of S4 as it makes for a robust, strongly typed API.

  • DSOpal is the reference implementation of DSI, see README for more details; it is based on the new opalr package which is a merge between the opal and opaladmin legacy packages, without the DataSHIELD specific functions (datashield.login() etc.) that have been moved to the DSI package.

  • dsBaseClient (DSI branch) has been updated to use DSI and to remove all references to Opal, both in the code and in the documentation (more specifically, the magic getOpals() function is replaced by the reusable and generic DSI::findDSConnections()).
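
For readers wondering what "connection discovery" means in practice, here is a minimal sketch of the idea: scan an environment for lists of connection objects. ToyConnection and findToyConnections are hypothetical names, and this is only an illustration of the principle, not the actual DSI::findDSConnections() code.

```r
library(methods)

# Hypothetical stand-in for a connection class (not the real DSI class).
setClass("ToyConnection", slots = c(name = "character"))

# Return every non-empty list in `env` whose elements are all ToyConnection
# objects, keyed by the symbol under which the list was found.
findToyConnections <- function(env = globalenv()) {
  found <- list()
  for (sym in ls(envir = env)) {
    obj <- get(sym, envir = env)
    if (is.list(obj) && length(obj) > 0 &&
        all(vapply(obj, is, logical(1), "ToyConnection"))) {
      found[[sym]] <- obj
    }
  }
  found
}

# Usage: one list of connections and one unrelated object in a scratch environment.
env <- new.env()
assign("conns", list(study1 = new("ToyConnection", name = "study1")), envir = env)
assign("x", 42, envir = env)
names(findToyConnections(env))
```

Because the lookup relies on the class of the objects rather than on their origin, it works the same whatever repository created the connections.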

Example of code using DSI:

# install development packages
devtools::install_github("datashield/DSI")
devtools::install_github("datashield/dsBaseClient", ref = "DSI")
devtools::install_github("obiba/opalr")
devtools::install_github("datashield/DSOpal")

# example with dsBase
library(dsBaseClient)
library(DSOpal) # explicit load is now required
data(logindata.opal.demo)
conns <- datashield.login(logindata.opal.demo, assign = TRUE)
ds.summary(x='D$LAB_TSC')
ds.ls()
datashield.logout(conns)

This new API is backward compatible, richer, more flexible and more robust. Server side R packages and Opal are not affected. All the packages I have mentioned have gone through the R package check process (many fixes in the dsBaseClient DESCRIPTION file).

Could you incorporate that new API in your test pipelines? Apart from the set-up of the packages, the end-user code should not change. I am waiting for your feedback before continuing with the port of the other DataSHIELD client packages.

Cheers Yannick

#8

Hi Yannick, this does indeed look like great progress. While I don’t have a testing pipeline, are you looking for general feedback if I were to build a VM to run as a client and test it out? Or is that unlikely to uncover anything useful, and you are more looking for integration issues?

#9

Hi Tom,

The tests shipped with the dsBaseClient package are just smoke tests (no check of returned values) and are outdated (wrong arguments, missing functions…). I am pretty confident that it works the same as with the previous API, but having an automated non-regression test suite would be good practice. Does one exist outside of the dsBaseClient package?

Yannick

#10

Maybe @swheater can shed some light on this.

#11

Hi,

I had to merge your last changes in the dsBaseClient (rebase master to DSI branch). I see all the efforts made with listOpals and setDefaultOpals functions. Please be aware that these functions are too opal specific (which does not fit with the objective of having a data repository independent interface) and the right place for developing such utility functions is in the DSI package.

Note also that I propose to use a datashield.env option to specify in which environment the connection objects (a.k.a. opals) are to be found, see DSI::findDSConnections().
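
As an illustration of how such an option can drive the lookup, here is a small sketch. The helper name searchEnv is hypothetical, and falling back to the global environment when the option is unset is an assumption of this sketch, not a statement about what DSI does.

```r
# Hypothetical helper: resolve the environment to search from the
# datashield.env option, falling back to the global environment (assumed default).
searchEnv <- function() {
  env <- getOption("datashield.env", globalenv())
  stopifnot(is.environment(env))
  env
}

sandbox <- new.env()
assign("conns", list("placeholder"), envir = sandbox)

old <- options(datashield.env = sandbox)  # point lookups at the sandbox
exists("conns", envir = searchEnv(), inherits = FALSE)
options(old)                              # restore the previous setting
```

This keeps connection objects out of the global environment when the calling code (a package, a test, a Shiny app) prefers its own scope.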

By the way, to clarify, the packages DSI, DSOpal and opalr are fully functional. I am waiting for you to test the dsBaseClient (DSI branch) to validate and eventually amend the DSI package. The changes to be made in the client packages are minor (mainly dependencies and documentation update) but I am waiting for your feedback before applying them to all client packages.

Yannick

#12

Hi,

I have added the listDSConnections() and setDefaultDSConnections() functions that reproduce the ds.listOpals.o() and ds.setDefaultOpals.o() functions without referring to opal. The global variable is now called default.connections.

Update: I have applied a consistent naming with the other DSI functions. The connections management functions are now datashield.connections (print the list of connections), datashield.connections_default (setter and getter of the default connections) and datashield.connections_find (used by the client functions). See file datashield.connections.R.

Yannick

#13

Hi,

I am pleased to announce that DataSHIELD has a new backend: DSLite. Opal is still the only data repository that supports DataSHIELD; DSLite is a serverless (i.e. pure software) implementation of DSI: the DataSHIELD server-side operations happen in distinct R environments within the same R session as the DataSHIELD client. The datasets being analyzed live on the client side. DSLite also supports workspace save/restore. Its function call filtering is less strict than Opal’s, but that is not a security issue as the individual-level data are accessible anyway.

See DSLite README for an explanation of the architecture.
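
To illustrate the principle only (this is not the DSLite code), here is a toy sketch of a "server" living in the client's R session: tables are assigned into a separate environment, and aggregate calls are evaluated there after a simple check of the outermost function name. All names (newToyServer, toyAssign, toyAggregate) are hypothetical, and the filtering is deliberately naive.

```r
# Toy "server": holds tables, a whitelist of aggregate functions, and a
# server-side session environment that is distinct from the client's workspace.
newToyServer <- function(tables, allowed = c("mean", "length")) {
  server <- new.env()
  server$tables <- tables
  server$allowed <- allowed
  server$session <- new.env(parent = baseenv())
  server
}

# Assign a hosted table to a symbol inside the server-side session.
toyAssign <- function(server, symbol, table) {
  assign(symbol, server$tables[[table]], envir = server$session)
  invisible(TRUE)
}

# Evaluate a call in the server-side session, but only if its outermost
# function is whitelisted (arguments are not inspected in this toy version).
toyAggregate <- function(server, expr) {
  cl <- parse(text = expr)[[1]]
  if (!is.call(cl) || !is.name(cl[[1]])) stop("unsupported expression: ", expr)
  fun <- as.character(cl[[1]])
  if (!(fun %in% server$allowed)) stop("function not allowed: ", fun)
  eval(cl, envir = server$session)
}

server <- newToyServer(list(CNSIM1 = data.frame(BMI = c(20, 25, 30))))
toyAssign(server, "D", "CNSIM1")
toyAggregate(server, "mean(D$BMI)")
```

The symbol D exists only in the server-side session environment; the client receives just the aggregate value.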

The benefits of this:

  • a super-fast and lightweight development cycle for new DS functions: VMs and data uploads are not needed anymore, everything can happen on the developer’s workstation.
  • combined analysis of remotely accessible datasets in a secure data repository (Opal) and local datasets that cannot be shared.

This also proves the robustness of the DSI as only minor adjustments were needed to support both Opal and DSLite as DataSHIELD backends.

To give it a try:

# install required packages
install.packages("dsBase", repos="https://cran.obiba.org", dependencies=TRUE)
# install development packages
devtools::install_github("datashield/DSI")
devtools::install_github("datashield/dsBaseClient", ref = "DSI")
devtools::install_github("datashield/DSLite")

# example with dsBase
library(dsBaseClient)
# explicit load is now required
library(DSLite)

# prepare data in a light DS server
data("CNSIM1")
data("CNSIM2")
data("CNSIM3")
dslite.server <- newDSLiteServer(tables=list(CNSIM1=CNSIM1, CNSIM2=CNSIM2, CNSIM3=CNSIM3))

# datashield logins and assignments
data("logindata.dslite.demo")
conns <- datashield.login(logindata.dslite.demo, assign=TRUE, variables=c("GENDER","PM_BMI_CONTINUOUS"))
ds.summary(x='D$PM_BMI_CONTINUOUS')
ds.ls()
datashield.logout(conns)

You can also perform a mixed analysis on local datasets and the remote Opal demo datasets:

# install required packages
install.packages(c("opalr", "dsBase"), repos=c("https://cran.r-project.org", "https://cran.obiba.org"), dependencies=TRUE)
# install development packages
devtools::install_github("datashield/DSI")
devtools::install_github("datashield/dsBaseClient", ref = "DSI")
devtools::install_github("datashield/DSOpal")
devtools::install_github("datashield/DSLite")

# example with dsBase
library(dsBaseClient)
# explicit load is now required
library(DSOpal)
library(DSLite)

# prepare data in a light DS server
data("CNSIM2")
dslite.server <- newDSLiteServer(tables=list(CNSIM2=CNSIM2))

# prepare login data
server <- c("study1", "study2", "study3")
url <- c("https://opal-demo.obiba.org", "dslite.server", "https://opal-demo.obiba.org")
user <- c("administrator", "", "administrator")
password <- c("password", "", "password")
table <- c("datashield.CNSIM1", "CNSIM2", "datashield.CNSIM3")
options <- rep("", 3)
driver <- c("OpalDriver", "DSLiteDriver", "OpalDriver")
logindata.mixed.demo <- data.frame(server,url,user,password,table,options,driver)

conns <- datashield.login(logindata.mixed.demo, assign = TRUE)
ds.summary(x='D$LAB_TSC')
ds.mean(x='D$PM_BMI_CONTINUOUS')
ds.ls()
datashield.logout(conns)
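
As a side note on the driver column: one plausible way for a login function to support several backends at once is to turn each driver class name into a driver object and let S4 dispatch pick the right connect method. The sketch below shows that pattern with made-up names (ToyDriver, RemoteToyDriver, LocalToyDriver, toyConnect) and a placeholder URL; it is an illustration, not the DSI implementation.

```r
library(methods)

# Hypothetical driver hierarchy: one virtual base class, one class per backend.
setClass("ToyDriver", contains = "VIRTUAL")
setClass("RemoteToyDriver", contains = "ToyDriver")
setClass("LocalToyDriver", contains = "ToyDriver")

# Hypothetical generic; each backend supplies its own method.
setGeneric("toyConnect", function(drv, name, url) standardGeneric("toyConnect"))
setMethod("toyConnect", "RemoteToyDriver", function(drv, name, url) {
  paste0(name, ": remote session at ", url)
})
setMethod("toyConnect", "LocalToyDriver", function(drv, name, url) {
  paste0(name, ": in-process server '", url, "'")
})

# Login data mixing both backends (values are placeholders).
logindata <- data.frame(
  server = c("study1", "study2"),
  url    = c("https://repo.example.org", "local.server"),
  driver = c("RemoteToyDriver", "LocalToyDriver"),
  stringsAsFactors = FALSE
)

# Class name -> driver instance -> backend-specific connection.
conns <- lapply(seq_len(nrow(logindata)), function(i) {
  drv <- new(logindata$driver[i])
  toyConnect(drv, logindata$server[i], logindata$url[i])
})
names(conns) <- logindata$server
conns
```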

Cheers, Yannick

#14

I have updated the code snippet in my previous message: the “dsBase” installation instructions were missing. To get the DSLiteServer configuration, you can call dslite.server$config().

#15

Hi Yannick, this looks really helpful for development, and I shall give it a try when I work on some of the new functions for longitudinal data in the coming weeks.

#16

Hi Tom,

Then your client-side package will need to be DSI compatible, which mainly consists of replacing findLoginObjects() calls with datashield.connections_find() (it’s quickly done with a find-and-replace-all).
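
If you prefer to script it, the find-and-replace can be sketched in a few lines of base R. The helper name portToDSI is made up, and the replacement is a plain text substitution, so review the resulting diff under version control before committing.

```r
# Rewrite legacy calls in every .R file under `dir`;
# returns the paths of the files that were modified.
portToDSI <- function(dir) {
  files <- list.files(dir, pattern = "\\.R$", full.names = TRUE, recursive = TRUE)
  changed <- character(0)
  for (f in files) {
    before <- readLines(f, warn = FALSE)
    after <- gsub("findLoginObjects(", "datashield.connections_find(",
                  before, fixed = TRUE)
    if (!identical(before, after)) {
      writeLines(after, f)
      changed <- c(changed, f)
    }
  }
  changed
}

# Demo on a throwaway directory with a single source file.
demo <- file.path(tempdir(), "port-demo")
dir.create(demo, showWarnings = FALSE)
writeLines("datasources <- findLoginObjects()", file.path(demo, "ds.example.R"))
portToDSI(demo)
readLines(file.path(demo, "ds.example.R"))
```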

Yannick

#17

Hi,

I have added functions that facilitate writing unit tests for DataSHIELD packages (with testthat). The idea is to have a test environment that is as light as possible, so that developers do not have to rely on a complex data repository infrastructure: DSLite is the candidate.

The setup is as simple as one line of code, a call to setupCNSIMTest().

The setupCNSIMTest() function loads the CNSIM1, CNSIM2 and CNSIM3 datasets and the corresponding login data (these are part of the DSLite package), and instantiates a DSLiteServer that hosts these datasets and automatically discovers the DataSHIELD configuration from the descriptions of the server-side DataSHIELD packages. The loaded datasets are available in the current environment, allowing test writers to check the test expectations.

The setupCNSIMTest() function is based on the setupDSLiteServer() function, which can load datasets from any package (not only the DSLite ones).
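
To show why having the datasets in the current environment is handy for test writers, here is a toy sketch of the pattern in base R. The names setupToyTest and pooledMean and the TOY1/TOY2 data are invented for illustration; pooledMean merely stands in for a client function under test, and none of this is the DSLite API.

```r
# Hypothetical setup helper: loads toy datasets into the caller's
# environment and returns matching login data (columns are illustrative).
setupToyTest <- function(env = parent.frame()) {
  assign("TOY1", data.frame(BMI = c(21, 24, 30)), envir = env)
  assign("TOY2", data.frame(BMI = c(19, 27)), envir = env)
  data.frame(server = c("sim1", "sim2"), table = c("TOY1", "TOY2"),
             stringsAsFactors = FALSE)
}

# Stand-in for a client-side function under test: pools BMI over all tables.
pooledMean <- function(logindata, env = parent.frame()) {
  values <- unlist(lapply(logindata$table, function(tbl) get(tbl, envir = env)$BMI))
  mean(values)
}

logindata <- setupToyTest()

# The expectation is computed directly from the locally visible datasets.
stopifnot(isTRUE(all.equal(pooledMean(logindata), mean(c(TOY1$BMI, TOY2$BMI)))))
```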

Cheers, Yannick

#18

Hi Yannick,

I really like what you are doing with this. I am having issues running it though. In your example above I get

data("logindata.dslite.demo")
Warning message:
In data("logindata.dslite.demo") : data set ‘logindata.dslite.demo’ not found

Should the example above work without modification, or am I supposed to be editing it before it runs?

Cheers!

#19

Hi Olly,

I have recently replaced the “logindata.dslite.demo” login data object with “logindata.dslite.cnsim”, because I introduced the “dasim” and “survival.expand_with_missing” flavors. Have a look at the DSLite::setup* functions, which help with loading datasets, creating a server, and returning the corresponding login data.

I have also ported all the datashield packages to DSI, including dsBetaTest. For some of them there are changes on the server side as well (dsBase for instance). Make sure to pick up the DSI branches for clients and servers (when there is one for the server).

Let me know if you have issues, I would be glad to help.

Cheers Yannick

#20

I’m struggling a little to get going. When I try:

# prepare data in a light DS server
data("DASIM1")
data("DASIM2")
data("DASIM3")
dslite.server <- newDSLiteServer(tables=list(DASIM1=DASIM1, DASIM2=DASIM2, DASIM3=DASIM3))
dslite.server$config()

# datashield logins and assignments
data("logindata.dslite.dasim")
conns <- datashield.login(logindata.dslite.dasim, assign=TRUE)

I get

Logging into the collaborating servers

  No variables have been specified. 
  All the variables in the table 
  (the whole dataset) will be assigned to R!

Assigning data...
Error in data.frame(name = unique_name, type = type, status) : 
  arguments imply differing number of rows: 0, 1