How to set up Opal servers on real machines and deploy DataSHIELD

Hello all,

I am Hank from Heidelberg University. We are preparing to apply DataSHIELD to our collaborative analysis of psychiatric genetics.

After following the training courses described on the DataSHIELD wiki, we can successfully run all functions in DataSHIELD v5 using the virtual machine. We now hope to deploy DataSHIELD and the Opal servers on 8 geo-distributed machines instead of virtual machines. Do you have a tutorial for this? Do we need to learn this from the Opal side? What is the most convenient way to achieve it?

Thanks, I am looking forward to your reply.

Regards, Hank

Hi,

See the installation instructions (followed by the configuration instructions): http://opaldoc.obiba.org/en/latest/admin/installation.html

If you are comfortable with Docker, we also provide images for Opal and R server.

The central analysis server will consist of an RStudio server with the DataSHIELD R packages installed.
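
For illustration, here is a minimal sketch of what the client-side R code on that central server could look like, assuming the DSI/DSOpal/dsBaseClient packages; the site names, URLs, credentials and table name are placeholders, not real endpoints:

```r
# Minimal sketch: log in from the central RStudio server to two remote Opal
# servers and run a non-disclosive command (placeholder URLs and credentials).
library(DSI)
library(DSOpal)
library(dsBaseClient)

builder <- DSI::newDSLoginBuilder()
builder$append(server = "site1", url = "https://opal.site1.example.org",
               user = "dsuser", password = "password",
               table = "project.gene_expression", driver = "OpalDriver")
builder$append(server = "site2", url = "https://opal.site2.example.org",
               user = "dsuser", password = "password",
               table = "project.gene_expression", driver = "OpalDriver")

# Log in to all sites and assign the table to the symbol 'D' on each R server
connections <- DSI::datashield.login(logins = builder$build(),
                                     assign = TRUE, symbol = "D")

# Non-disclosive summary computed on each server; only dimensions are returned
ds.dim(x = "D", datasources = connections)

DSI::datashield.logout(connections)
```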

Are you dealing with “large” datasets, like genetics?

Best, Yannick

Dear Yannick:

> If you are comfortable with Docker, we also provide images for Opal and R server.

It would be best if you could provide the Docker images of Opal and the R server, since we could then easily deploy them on different platforms.

> Are you dealing with “large” datasets, like genetics?

Yes, we are dealing with GWAS data. Do you have any comments on that? Any information is appreciated.

Regards, Hank

Hi,

You can find the Docker install instructions in Opal’s documentation: http://opaldoc.obiba.org/en/latest/admin/installation.html#docker-image-installation

My concern is more about how large the data are: are we talking about millions of rows and/or columns? The next version of Opal will be able to handle large data sources, but it is not ready yet (see this post).

Cheers, Yannick

Dear Yannick:

Thanks for the information.

The “Resource” idea is quite cool; in my opinion, it will make it much more convenient to integrate heterogeneous data types and resources.

Most of our analyses investigate gene expression, methylation and neuroimaging data from roughly 100–1,000 subjects. So on each Opal server the number of columns per table would be below 20,000 (except that methylation might have 40,000 columns), with fewer than 1,000 subjects. The genetic data are indeed large, but in our analyses we can cut them into pieces of the above scale. In short, if DataSHIELD can handle a matrix of size 1,000 x 20,000 on each Opal server, that would be more than enough for us. Note that, since we plan to implement machine-learning/multi-task-learning analyses in DataSHIELD, a matrix of that scale (1,000 x 20,000) has to be fully accessible in the matrix calculations.
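
For a rough sense of what one such fully loaded matrix costs in server-side R memory (my own back-of-envelope calculation, assuming double-precision values):

```r
# One 1,000 x 20,000 numeric matrix, 8 bytes per cell
n_rows <- 1000
n_cols <- 20000
n_rows * n_cols * 8 / 1024^2   # ~153 MB, before any copies made during analysis
```

So each Opal R server would need a few hundred MB of RAM per matrix held in the session, which seems manageable to us.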

Regards, Hank

Hi,

I must admit that I have never tried to import a dataset with that many columns. The “resource” feature (not yet available) is definitely the way to go, not only to avoid data storage and copy limitations but also to minimize R memory usage. In the meantime, you can try to import the dataset using a MongoDB backend (SQL databases will not support more than ~1000 columns).
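
Something like this minimal sketch with the opalr package could be a starting point (assuming a recent opalr version and a MongoDB database already registered in Opal's administration; the project, table, URL and credentials are placeholders):

```r
# Minimal sketch: push a wide data frame into an Opal project backed by MongoDB.
library(opalr)

o <- opal.login(username = "administrator", password = "password",
                url = "https://opal.site1.example.org")

# Create a project backed by the MongoDB database (name as registered in Opal)
opal.project_create(o, project = "TESTWIDE", database = "mongodb")

# Toy stand-in for the real 1,000 x 20,000 expression matrix (reduced width here);
# in practice you would load your actual expression data frame instead.
expr <- as.data.frame(matrix(rnorm(1000 * 100), nrow = 1000))
expr$id <- seq_len(nrow(expr))

# Save it as a table in the project, using the 'id' column as the identifier
opal.table_save(o, expr, project = "TESTWIDE", table = "gene_expression",
                id.name = "id", overwrite = TRUE)

opal.logout(o)
```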

Cheers, Yannick

Hi Yannick, thanks for this information. I will try to upload one gene expression matrix as a test, following your suggestions.

Regards, Hank