Hi,
In order to address the challenge of analyzing omics data with DataSHIELD (EuCanConnect project), I am currently working on introducing a new concept in the DataSHIELD infrastructure: Resources. A Resource is a dataset or a computation unit whose location is described by a URL and whose access is protected by credentials. When assigned to an R/DataSHIELD server session, remote big or complex datasets and high-performance computing services become accessible to data analysts.
Instead of storing the data in Opal’s database, only the way to access them needs to be defined: the datasets are kept in their original format and location (a SQL database, a SPSS file, etc.) and are read directly from the R/DataSHIELD server-side session. As soon as there is an R reader for the dataset or a connector for the analysis services, a resource can be defined. Opal takes care of the DataSHIELD permissions (a DataSHIELD user cannot see a resource’s credentials) and of assigning resources to an R/DataSHIELD session.
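For instance, a SQL-backed resource could be declared and resolved along these lines; this is a minimal sketch, where the connection details are fictitious and the function and method names follow the resourcer resolver/client pattern:

```r
library(resourcer)

# A resource is only a description: where the data live, the credentials
# needed to reach them, and a format hint for the reader
res <- newResource(
  name = "CNSIM1",
  url = "postgresql://db.example.org:5432/cnsim/CNSIM1",
  identity = "dbuser",
  secret = "dbpass",
  format = "data.frame"
)

# resourcer looks up a resolver that understands the URL and builds a
# client that knows how to read the data
client <- newResourceClient(res)
df <- client$asDataFrame()
```

The data are pulled only at resolution time, inside the server-side session; the description itself carries no data.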
To facilitate the management of Resources on the R/DataSHIELD server side, I have implemented the resourcer R package. This package handles the main data sources (using tidyverse, DBI, dplyr, sparklyr, MongoDB, AWS S3, SSH, etc.) and is easily extensible to new ones (Molgenis, for instance). I have also prepared a test environment, with the Opal implementation of Resources and an appropriate R/DataSHIELD configuration:
https://opal-test.obiba.org (see the “test” project, Resources tab)
- username: administrator
- password: password
You can test this setup from the DataSHIELD client side. Please have a look; it has huge potential!
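To give an idea of how a new data source such as Molgenis could be plugged in, here is a sketch of a custom resolver; the class layout follows resourcer's R6 resolver pattern, while the "molgenis" URL scheme and the MolgenisResourceClient class are hypothetical:

```r
library(resourcer)
library(R6)

# A resolver declares which resources it can handle (isFor) and builds
# the matching client (newClient). MolgenisResourceClient is assumed to
# be implemented elsewhere.
MolgenisResourceResolver <- R6::R6Class(
  "MolgenisResourceResolver",
  inherit = ResourceResolver,
  public = list(
    isFor = function(x) {
      super$isFor(x) && startsWith(x$url, "molgenis://")
    },
    newClient = function(x) {
      if (self$isFor(x)) MolgenisResourceClient$new(x) else NULL
    }
  )
)

# make the resolver discoverable by resourcer
registerResourceResolver(MolgenisResourceResolver$new())
```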
Cheers, Yannick
Prerequisites
The demo requires some specific packages to be installed on the client side:
devtools::install_github("obiba/opalr", ref = "resources", dependencies = TRUE)
devtools::install_github("datashield/DSI", ref = "resources", dependencies = TRUE)
devtools::install_github("datashield/DSOpal", ref = "resources", dependencies = TRUE)
devtools::install_github("datashield/dsBaseClient", ref = "DSI", dependencies = TRUE)
CNSIM
Then start an analysis based on the CNSIM test dataset, where:
- CNSIM1 is a SQL table,
- CNSIM2 is a local file in SPSS format,
- CNSIM3 is a zipped CSV file stored in Opal’s file store.
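In the demo these three resources are already configured in Opal, but for reference their server-side declarations might look roughly as follows; the URL schemes, paths and credentials are illustrative, not the actual demo configuration:

```r
library(resourcer)

# CNSIM1: a table in a SQL database (fictitious connection details)
cnsim1 <- newResource(name = "CNSIM1",
                      url = "mysql://sql.example.org:3306/cnsim/CNSIM1",
                      identity = "dbuser", secret = "dbpass")

# CNSIM2: a local file, with a format hint selecting the SPSS reader
cnsim2 <- newResource(name = "CNSIM2",
                      url = "file:///data/CNSIM2.sav",
                      format = "spss")

# CNSIM3: a zipped CSV file served over HTTPS from Opal's file store
cnsim3 <- newResource(name = "CNSIM3",
                      url = "https://example.org/files/CNSIM3.zip",
                      format = "csv")
```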
library(DSOpal)
library(dsBaseClient)
# prepare login data and resources to assign
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-test.obiba.org", user = "dsuser", password = "password", resource = "test.CNSIM1", driver = "OpalDriver")
builder$append(server = "study2", url = "https://opal-test.obiba.org", user = "dsuser", password = "password", resource = "test.CNSIM2", driver = "OpalDriver")
builder$append(server = "study3", url = "https://opal-test.obiba.org", user = "dsuser", password = "password", resource = "test.CNSIM3", driver = "OpalDriver")
logindata <- builder$build()
# login and assign resources
conns <- datashield.login(logins = logindata, assign = TRUE, symbol = "res")
# assigned objects are of class ResourceClient (and others)
ds.class("res")
# coerce ResourceClient objects to data.frames
# (DataSHIELD config allows as.resource.data.frame() assignment function for the purpose of the demo)
datashield.assign.expr(conns, symbol = "D", expr = quote(as.resource.data.frame(res)))
ds.class("D")
# note that some dsBase functions do not accept data.frames whose class
# attributes differ between servers (even though all are data.frames),
# so query the column names one server at a time:
lapply(conns, function(conn) {ds.colnames("D", datasources = conn)})
# do usual dsBase analysis
ds.summary('D$LAB_HDL')
# vector types are not necessarily the same depending on the data reader that was used
ds.class('D$GENDER')
ds.asFactor('D$GENDER', 'GENDER')
ds.summary('GENDER')
# or coerce to a dplyr tbl, which is more suitable for analyzing large datasets
# (DataSHIELD config allows as.resource.tbl() assignment function for the purpose of the demo)
datashield.assign.expr(conns, symbol = "T", expr = quote(as.resource.tbl(res)))
ds.class("T")
# DataSHIELD analysis using dplyr objects and functions is to be invented...
datashield.logout(conns)
R data objects
Another example uses Bioconductor’s ExpressionSet objects stored in R data files:
- GSE66351 is downloaded from the web (GitHub),
- GSE80970 is a local file.
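For reference, these two resources might be declared roughly as below; the URLs are fictitious, and the format hint indicates which R object class to expect in the data file:

```r
library(resourcer)

# GSE66351: an R data file fetched over HTTPS
gse66351 <- newResource(name = "GSE66351",
                        url = "https://example.org/data/GSE66351.rda",
                        format = "ExpressionSet")

# GSE80970: the same kind of object, stored in a local file
gse80970 <- newResource(name = "GSE80970",
                        url = "file:///data/GSE80970.rda",
                        format = "ExpressionSet")
```

With such a declaration, as.resource.object() can hand the loaded ExpressionSet directly to the server-side session, without flattening it to a data.frame.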
library(DSOpal)
library(dsBaseClient)
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-test.obiba.org", user = "dsuser", password = "password", resource = "test.GSE66351", driver = "OpalDriver")
builder$append(server = "study2", url = "https://opal-test.obiba.org", user = "dsuser", password = "password", resource = "test.GSE80970", driver = "OpalDriver")
logindata <- builder$build()
# login and assign resources
conns <- datashield.login(logins = logindata, assign = TRUE, symbol = "res")
# R data file resource
ds.class("res")
# coerce the ExpressionSet object (accessed by the resource client) to a data.frame
datashield.assign.expr(conns, symbol = "D", expr = quote(as.resource.data.frame(res)))
ds.class("D")
# analyse using dsBase
ds.class('D$Sex')
ds.asFactor('D$Sex', 'Sex')
ds.summary('Sex')
# or directly extract the R object (a Bioconductor's ExpressionSet)
# (DataSHIELD config allows as.resource.object() assignment function for the purpose of the demo)
datashield.assign.expr(conns, symbol = "ES", expr = quote(as.resource.object(res)))
ds.class("ES")
# DataSHIELD analysis using ExpressionSet objects is to be invented...
datashield.logout(conns)