How to set up Opal servers on real machines and deploy DataSHIELD

Hello all,

I am Hank from Heidelberg University. We are preparing to apply DataSHIELD to our collaborative analysis of psychiatric genetics.

After following the training courses described on the DataSHIELD wiki, we can successfully run all functions in DataSHIELD v5 using the virtual machine. We now hope to deploy DataSHIELD and the Opal servers on 8 geo-distributed machines instead of virtual machines. Do you have a tutorial for this? Do we need to learn this from the Opal side? What is the most convenient way to achieve it?

Thanks, I am looking forward to your reply.

Regards, Hank

Hi,

See the installation instructions (followed by the configuration instructions): http://opaldoc.obiba.org/en/latest/admin/installation.html

If you are comfortable with Docker, we also provide images for Opal and R server.

The central analysis server will consist of an RStudio server with the DataSHIELD R packages installed.
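
For illustration, here is a minimal sketch of what the client-side R code on that central server could look like, assuming the DSI/DSOpal/dsBaseClient packages; the site names, URLs, credentials and table name are placeholders, not real endpoints:

```r
# Minimal sketch: log in from the central RStudio server to two remote Opal
# servers and run a non-disclosive command (placeholder URLs and credentials).
library(DSI)
library(DSOpal)
library(dsBaseClient)

builder <- DSI::newDSLoginBuilder()
builder$append(server = "site1", url = "https://opal.site1.example.org",
               user = "dsuser", password = "password",
               table = "project.gene_expression", driver = "OpalDriver")
builder$append(server = "site2", url = "https://opal.site2.example.org",
               user = "dsuser", password = "password",
               table = "project.gene_expression", driver = "OpalDriver")

# Log in to all sites and assign the table to the symbol 'D' on each R server
connections <- DSI::datashield.login(logins = builder$build(),
                                     assign = TRUE, symbol = "D")

# Non-disclosive summary computed on each server; only dimensions are returned
ds.dim(x = "D", datasources = connections)

DSI::datashield.logout(connections)
```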

Are you dealing with “large” datasets, like genetics?

Best, Yannick

Dear Yannick:

> If you are comfortable with Docker, we also provide images for Opal and R server.

It would be best if you could provide the Docker images of Opal and the R server, since we could then easily deploy them on different platforms.

> Are you dealing with “large” datasets, like genetics?

Yes, we are dealing with GWAS data. Do you have any comments on that? Any information is appreciated.

Regards, Hank

Hi,

You can find the Docker install instructions in Opal’s documentation: http://opaldoc.obiba.org/en/latest/admin/installation.html#docker-image-installation

My concern is more about how large the data are: are we talking about millions of rows and/or columns? The next version of Opal will be able to handle large data sources, but it is not ready yet (see this post).

Cheers, Yannick

Dear Yannick:

Thanks for the information.

The “Resource” idea is quite cool; in my opinion, it will make it much more convenient to integrate heterogeneous data types and resources.

Most of our analyses investigate gene expression, methylation and neuroimaging data from roughly 100–1,000 subjects. So on each Opal server the number of columns per table would be below 20,000 (except that methylation might have 40,000 columns), with fewer than 1,000 subjects. The genetic data are indeed large, but in our analyses we can cut them into pieces of the above scale. In short, if DataSHIELD can handle a matrix of size 1,000 x 20,000 on each Opal server, that would be more than enough for us. Note that, since we plan to implement machine-learning/multi-task-learning analyses in DataSHIELD, a matrix of that scale (1,000 x 20,000) has to be fully accessible in the matrix calculations.
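
For a rough sense of what one such fully loaded matrix costs in server-side R memory (my own back-of-envelope calculation, assuming double-precision values):

```r
# One 1,000 x 20,000 numeric matrix, 8 bytes per cell
n_rows <- 1000
n_cols <- 20000
n_rows * n_cols * 8 / 1024^2   # ~153 MB, before any copies made during analysis
```

So each Opal R server would need a few hundred MB of RAM per matrix held in the session, which seems manageable to us.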

Regards, Hank

Hi,

I must admit that I have never tried to import a dataset with that many columns. The “resource” feature (not yet available) is definitely the way to go, not only to avoid data storage and copy limitations but also to minimize R memory usage. In the meantime, you can try to import the dataset using a MongoDB backend (SQL databases will not support more than ~1000 columns).
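
Something like this minimal sketch with the opalr package could be a starting point (assuming a recent opalr version and a MongoDB database already registered in Opal's administration; the project, table, URL and credentials are placeholders):

```r
# Minimal sketch: push a wide data frame into an Opal project backed by MongoDB.
library(opalr)

o <- opal.login(username = "administrator", password = "password",
                url = "https://opal.site1.example.org")

# Create a project backed by the MongoDB database (name as registered in Opal)
opal.project_create(o, project = "TESTWIDE", database = "mongodb")

# Toy stand-in for the real 1,000 x 20,000 expression matrix (reduced width here);
# in practice you would load your actual expression data frame instead.
expr <- as.data.frame(matrix(rnorm(1000 * 100), nrow = 1000))
expr$id <- seq_len(nrow(expr))

# Save it as a table in the project, using the 'id' column as the identifier
opal.table_save(o, expr, project = "TESTWIDE", table = "gene_expression",
                id.name = "id", overwrite = TRUE)

opal.logout(o)
```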

Cheers, Yannick

Hi Yannick, thanks for this information. I will try to upload one gene expression matrix as a test, following your suggestions.

Regards, Hank