DataSHIELD Resources

Hi,

In order to address the challenge of analyzing omics data with DataSHIELD (EuCanConnect project), I am currently working on introducing a new concept in the DataSHIELD infrastructure: Resources. Resources are datasets or computation units whose location is described by a URL and whose access is protected by credentials. When assigned to an R/DataSHIELD server session, remote big/complex datasets or high-performance computers become accessible to data analysts.

Instead of storing the data in Opal’s database, only the way to access them needs to be defined: the datasets are kept in their original format and location (a SQL database, an SPSS file, etc.) and are read directly from the R/DataSHIELD server-side session. So as soon as there is an R reader for the dataset, or a connector for the analysis service, a resource can be defined. Opal takes care of the DataSHIELD permissions (a DS user cannot see the resource’s credentials) and of assigning the resources to an R/DataSHIELD session.
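To give a concrete feel for the idea, here is a purely local sketch (not the Opal workflow; the file path and column names are invented): a resource built with the resourcer package is just a URL plus a format, and the data are only read when a client is asked for them.

```r
library(resourcer)

# Write a small CSV to stand in for a dataset kept "in its original format"
# (path and columns are made up for the sake of the sketch)
path <- file.path(tempdir(), "cnsim_demo.csv")
write.csv(data.frame(id = 1:3, LAB_HDL = c(1.2, 1.5, 0.9)), path, row.names = FALSE)

# A resource is only a description: a URL, a format and (optionally) credentials
res <- resourcer::newResource(
  name = "cnsim_demo",
  url = paste0("file://", path),
  format = "csv"
)

# A resolver picks a client able to read that URL/format; the data are
# read on demand, inside the R session
client <- resourcer::newResourceClient(res)
df <- client$asDataFrame()
client$close()
```

In the DataSHIELD setting the same thing happens on the server side, with the credentials kept hidden from the analyst.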


To facilitate the management of Resources on the R/DataSHIELD server side, I have implemented the resourcer R package. This package handles the main data sources (using tidyverse, DBI, dplyr, sparklyr, MongoDB, AWS S3, SSH, etc.) and is easily extensible to new ones (Molgenis, for instance). I have also prepared a test environment, with the Opal implementation of Resources and an appropriate R/DataSHIELD configuration:

https://opal-test.obiba.org (see the “test” project, Resources tab)

  • username: administrator
  • password: password
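To give an idea of the extension mechanism mentioned above: a new data source is supported by registering a resolver that recognises its URLs and returns an appropriate client. A minimal sketch (the class name and URL scheme below are invented):

```r
library(resourcer)
library(R6)

# A resolver claiming URLs with a custom scheme (names are invented)
MyResolver <- R6::R6Class("MyResolver",
  inherit = ResourceResolver,
  public = list(
    # claim only resources whose URL uses our scheme
    isFor = function(x) {
      "resource" %in% class(x) && startsWith(x$url, "myscheme://")
    },
    newClient = function(x) {
      # would return a ResourceClient subclass that knows how to read the data
      stop("not implemented in this sketch")
    }
  )
)

# make it discoverable by newResourceClient()
resourcer::registerResourceResolver(MyResolver$new())
```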

You can test this setup from the DataSHIELD client side. Please have a look at this, it has a huge potential!

Cheers, Yannick


Prerequisites

The demo requires some specific packages to be installed on the client side:

devtools::install_github("obiba/opalr", ref = "resources", dependencies = TRUE)
devtools::install_github("datashield/DSI", ref = "resources", dependencies = TRUE)
devtools::install_github("datashield/DSOpal", ref = "resources", dependencies = TRUE)
devtools::install_github("datashield/dsBaseClient", ref = "DSI", dependencies = TRUE)

CNSIM

Then start an analysis based on the CNSIM test dataset where:

  • CNSIM1 is a SQL table,
  • CNSIM2 is a local file in SPSS format,
  • CNSIM3 is a zipped CSV file stored in Opal’s file store.
library(DSOpal)
library(dsBaseClient)

# prepare login data and resources to assign
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-test.obiba.org", user = "dsuser", password = "password", resource = "test.CNSIM1", driver = "OpalDriver")
builder$append(server = "study2", url = "https://opal-test.obiba.org", user = "dsuser", password = "password", resource = "test.CNSIM2", driver = "OpalDriver")
builder$append(server = "study3", url = "https://opal-test.obiba.org", user = "dsuser", password = "password", resource = "test.CNSIM3", driver = "OpalDriver")
logindata <- builder$build()

# login and assign resources
conns <- datashield.login(logins = logindata, assign = TRUE, symbol = "res")

# assigned objects are of class ResourceClient (and others)
ds.class("res")

# coerce ResourceClient objects to data.frames
# (DataSHIELD config allows as.resource.data.frame() assignment function for the purpose of the demo)
datashield.assign.expr(conns, symbol = "D", expr = quote(as.resource.data.frame(res)))
ds.class("D")

# note that some dsBase functions do not handle a data.frame whose class
# attribute differs between studies (even though all are data.frames),
# so query the column names one server at a time:
lapply(conns, function(conn) {ds.colnames("D", datasources = conn)})

# do usual dsBase analysis
ds.summary('D$LAB_HDL')

# vector types are not necessarily the same depending on the data reader that was used
ds.class('D$GENDER')
ds.asFactor('D$GENDER', 'GENDER')
ds.summary('GENDER')

# or coerce to a dplyr's tbl, which is more suitable for large/big datasets analysis
# (DataSHIELD config allows as.resource.tbl() assignment function for the purpose of the demo)
datashield.assign.expr(conns, symbol = "T", expr = quote(as.resource.tbl(res)))
ds.class("T")

# DataSHIELD analysis using dplyr objects and functions is to be invented...

datashield.logout(conns)

R data objects

Another example uses Bioconductor’s ExpressionSet objects stored in R data files:

  • GSE66351 is downloaded from the web (GitHub),
  • GSE80970 is a local file.
library(DSOpal)
library(dsBaseClient)

builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-test.obiba.org", user = "dsuser", password = "password", resource = "test.GSE66351", driver = "OpalDriver")
builder$append(server = "study2", url = "https://opal-test.obiba.org", user = "dsuser", password = "password", resource = "test.GSE80970", driver = "OpalDriver")
logindata <- builder$build()

# login and assign resources
conns <- datashield.login(logins = logindata, assign = TRUE, symbol = "res")

# R data file resource
ds.class("res")

# coerce the ExpressionSet object (accessed by the resource client) to a data.frame
datashield.assign.expr(conns, symbol = "D", expr = quote(as.resource.data.frame(res)))
ds.class("D")

# analyse using dsBase
ds.class('D$Sex')
ds.asFactor('D$Sex', 'Sex')
ds.summary('Sex')

# or directly extract the R object (a Bioconductor's ExpressionSet)
# (DataSHIELD config allows as.resource.object() assignment function for the purpose of the demo)
datashield.assign.expr(conns, symbol = "ES", expr = quote(as.resource.object(res)))
ds.class("ES")

# DataSHIELD analysis using ExpressionSet objects is to be invented...

datashield.logout(conns)

Hi Yannick,

Are there plans to create a Debian package for 2.16-SNAPSHOT?

Stuart

We do not make system packages for snapshot versions any more. You can use Docker instead; see this docker-compose.yml file as a reference. It pulls an image built from the resources Opal development branch, along with the R server image with the appropriate settings. Then just run:

docker-compose up -d

You will also need to declare the coercing functions mentioned in the example script (as.resource.data.frame(), as.resource.tbl() and as.resource.object()). See how it is done on opal-test.obiba.org.

By the way, to avoid any confusion: Resources will not be part of the next 2.16 release. There is still some polishing to be done, and I am expecting some feedback and improvement ideas.

Hi Yannick,

This looks very exciting and I hope I might have time to look at it before our meeting on Monday. Unfortunately I am at a workshop today…

Thanks for sharing it with us

Tom

Hi Yannick, Thank you, I will look at the Docker version. This is very timely, as we have been approached about a bio (microarray data related) project which would certainly benefit from the use of resources.

Stuart

Hi Yannick,

I was able to run the examples that you provided above without any problems. I thought I would try something myself: simply making the SSH connection to our HPC. I set up a resource on opal-test.obiba.org called test_HPC. I have now removed the credentials for my account, but when I did have the correct details in there I got the following error:

> builder <- DSI::newDSLoginBuilder()
> builder$append(server = "hpc", url = "https://opal-test.obiba.org", user = "dsuser", password = "password", resource = "test.test_HPC", driver = "OpalDriver")
> logindata <- builder$build()
> 
> # login and assign resources
> conns <- datashield.login(logins = logindata, assign = TRUE, symbol = "res")

Logging into the collaborating servers

Assigning resource data...
Error in warning("Resource assignment of '", resources[i], "' failed for '",  : 
  no slot of name "error" for this object of class "OpalResult"

Please could you help me investigate further?

Thanks

Tom

Hi Tom,

It’s probably because the SSH resource is not completely defined (the fault lies with the SSH form in Opal, which is not strict enough): some allowed commands are missing.

Anyway, you would not have seen much in a DataSHIELD environment. What you can do instead, to test the connection, is to build a resource object directly:

devtools::install_github("obiba/resourcer", ref = "master", dependencies = TRUE)
library(resourcer)
# declare a resource
res <- resourcer::newResource(url="ssh://login-gpu.hpc.cam.ac.uk:22?exec=ls,pwd", identity = "xxxx", secret = "xxxx")
# make a resource client (performs the SSH connection)
client <- resourcer::newResourceClient(res)
client$exec("ls", "-la")
client$exec("pwd")
client$close()

Yannick

Hi Yannick,

Thanks for the code to build the resource object directly: it works perfectly. I will need to spend some more time trying to understand how all this works and the implications for analysis/disclosure.

Tom

That looks exciting. Can you please send us some documentation for the wiki?

P.

When it is released, there will be some documentation for the wiki. But the resources work will not be released before the DSI-compliant dsBaseClient is released. Too many changes to handle at the same time… Yannick

Hi,

Where do we set the SSL options? @yannick

P.

Hi,

Is this a resources related question or a DSOpal one?

Yannick

Hi,

Yes it is. With Resources, how do we set the SSL options? In some instances we needed to specify them with “opal”. How do we do that with the new functionality? Have you got an example, please?

P.


With resources, it is the R server that makes the SSL connection. This happens in a ResourceClient object, so if there are any SSL options to be applied, that is the place. You then have two possible strategies: (1) specify the SSL options in the resource’s URL (a bit cryptic, but this is how it works in a MongoDB URL, for instance) or (2) have the SSL options built into the ResourceClient implementation. Currently the ResourceClient subclasses provided by the resourcer package use httr (which itself uses libcurl) for HTTPS connections, and an invalid certificate (self-signed, for instance) would be rejected (this is good security practice!).
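To illustrate strategy (1), the options travel in the query part of the resource URL. A sketch (host, database and credentials are placeholders; the parameter name follows the MongoDB connection-string convention, and which parameters are honoured depends on the ResourceClient implementation):

```r
library(resourcer)

# SSL option carried by the resource URL itself (placeholders throughout)
res <- resourcer::newResource(
  name = "secure-db",
  url = "mongodb://mongo.example.org:27017/mydb?ssl=true",
  identity = "dsuser",
  secret = "xxxx"
)
```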

Hi Yannick,

I have tried to run the examples above from the opal-demo environment, https://opal-demo.obiba.org, with the RSRC project, since the test environment is no longer available. The examples work perfectly in this setup.

Then I tried setting up my own test project and resources, and here I ran into issues while coercing the ResourceClient objects to data.frames. The situation is as follows:

> builder <- DSI::newDSLoginBuilder()
> builder$append(server = "study1", url = "https://opal-demo.obiba.org", user = "administrator", password = "password", resource = "TEST.opal_file", driver = "OpalDriver")
> builder$append(server = "study2", url = "https://opal-demo.obiba.org", user = "administrator", password = "password", resource = "TEST.CNSIM3", driver = "OpalDriver")
> builder$append(server = "study3", url = "https://opal-demo.obiba.org", user = "administrator", password = "password", resource = "RSRC.CNSIM3", driver = "OpalDriver")
> logindata <- builder$build()
>
> # login and assign resources
> connections <- datashield.login(logins = logindata, assign = TRUE, symbol = "res")

Logging into the collaborating servers
  Logged in all servers [================================================================] 100% / 1s

Assigning resource data...
  Assigned all resources [===============================================================] 100% / 1s

> ds.class('res')
  
Aggregated (exists("res")) [===========================================================] 100% / 1s
Aggregated (classDS("res")) [==========================================================] 100% / 1s
$study1
[1] "TidyFileResourceClient" "FileResourceClient"    
[3] "ResourceClient"         "R6"                    

$study2
[1] "TidyFileResourceClient" "FileResourceClient"    
[3] "ResourceClient"         "R6"                    

$study3
[1] "TidyFileResourceClient" "FileResourceClient"    
[3] "ResourceClient"         "R6"  

> datashield.assign.expr(connections, symbol = "D", expr = quote(as.resource.data.frame(res)))
 
Assigned expr. (D <- as.resource.data.frame(res)) [====================================] 100% / 0s

> ds.class("D")
  
Aggregated (exists("D")) [=============================================================] 100% / 0s
Aggregated (classDS("D")) [============================================================] 100% / 0s
$study1
[1] "tbl_df"     "tbl"        "data.frame"

$study2
[1] "tbl_df"     "tbl"        "data.frame"

$study3
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

> ds.dim("D")
  
Aggregated (dimDS("D")) [==============================================================] 100% / 0s
$`dimensions of D in study1`
[1] 0 0

$`dimensions of D in study2`
[1] 0 0

$`dimensions of D in study3`
[1] 4128   12

$`dimensions of D in combined studies`
[1] 4128    0

So it seems the data.frames are not created correctly for the first two studies. In this example all three resources are data files stored on the Opal server:

  • the first one is my own CSV file stored in the TEST project in Opal,
  • the second one is CNSIM3.csv (identical to the one in the RSRC project) loaded into my TEST project, and
  • the third is one of your examples above (the zipped CSV).

Now, I am not sure whether the environment does not support some of the steps I am trying to execute, or whether the problem is in the way I coerce the objects into data.frames. I would appreciate help with further steps.

Thank you,

Tanja

Hi,

It’s hard to tell: the demo server is rebuilt every night (around 5am CEST), so everything has been wiped out. If the files were stored in Opal, you need to provide an authentication token when creating the resource, a token from a user that has the appropriate permissions for accessing the file. Was that the case?
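For reference, declaring a file kept in Opal’s file system as a resource, with a personal access token as the secret, would look roughly like this (the opal+https URL layout and the token value are placeholders; double-check against the resource form in Opal):

```r
library(resourcer)

# A file kept in Opal's file system, read with a personal access token
# (URL layout and token value are placeholders)
res <- resourcer::newResource(
  name = "opal_file",
  url = "opal+https://opal-demo.obiba.org/projects/TEST/data.csv",
  format = "csv",
  secret = "xxxx-personal-access-token"  # token of a user allowed to read the file
)
```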

Regards
Yannick

Dear all,

Is “DataSHIELD Resources” already available in the latest version of DataSHIELD? I would like to use it to load large omics data.

Regards, Hank

Hi,

If by “datashield” you mean the DataSHIELD infrastructure, then yes, “resources” are available with the latest Opal server (3.0) and DSI (1.1).
If you mean whether “resources” are part of dsBase, this is not the case, but you do not need resources support in dsBase to perform omics analyses. Use the dsOmics and dsOmicsClient packages, and see the resources/dsOmics documentation at: https://isglobal-brge.github.io/resource_bookdown/

Regards
Yannick

Dear Yannick,

Thanks a lot!

Regards,

Hank