Error when calling ds.glm with foreach and doParallel

Hi DataSHIELD team,

I am testing the performance of ds.glm in a parallel setting, because we have many variables. See the code:

> fits=list()
> startTime=Sys.time()
> fits=foreach(i = 1:50) %dopar% {
+     fit=ds.glm(formula=paste0(expX[i], "~pheno+ages+gender+ph+pmi"), data = "expXY", family='gaussian')
+ }
Error in { : task 1 failed - "could not find function "ds.glm""

Does anyone know how to interpret this error?

Regards, Hank

Hi,

ds.glm is a client-side function that you are calling in your script. It looks like the dsBaseClient package was not loaded.
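A minimal sketch of one common cause, assuming the workers come from a PSOCK cluster created with `makeCluster`: each worker is a fresh R session, so a package attached in the main session is not attached there, and `foreach` has a `.packages` argument that loads packages on the workers. The `splines` package and its `bs` function are only a stand-in so the example runs without a DataSHIELD server; for the code in this thread the argument would presumably be `.packages = "dsBaseClient"`. Whether the DataSHIELD login objects can also be used from the workers is a separate question that this sketch does not address.

```r
library(foreach)
library(doParallel)

# Two PSOCK workers; each one is a fresh R session
cl <- makeCluster(2)
registerDoParallel(cl)

# 'splines' is a stand-in (assumption) for dsBaseClient so the sketch is
# self-contained. Without .packages, bs() would not be found on the workers.
res <- foreach(i = 1:3, .packages = "splines") %dopar% {
  # Dimensions of a B-spline basis with i + 2 columns
  dim(bs(seq(0, 1, length.out = 10), df = i + 2))
}

stopCluster(cl)
```

An alternative with the same effect is to call the function with its namespace prefix inside the loop body, for example `splines::bs(...)`.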

By the way, I doubt that parallelizing requests would help, because the R server is single-threaded: Opal will receive all the requests in parallel, but these will be executed by the R server one after the other.

Cheers, Yannick

Hi Yannick,

No, parallelizing might not help, not least because the running speed is bounded above by the network speed. However, it should work anyway, since the same code worked well in a normal for loop.

I wonder whether the next version of DataSHIELD will support a “batch-run” of glm, because it has huge potential to save many network calls.

For example, I have 500 glms to run, and for each glm only the outcome needs to be switched. Suppose each glm needs three iterations until convergence (in other words, each glm needs to call the server three times); then 500 × 3 = 1500 calls are needed in total. However, if a “batch-run” of glm were supported, 3 calls would be enough, because the summary information across all glms could be transmitted simultaneously in one call.
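The batching idea can be sketched in plain R for the gaussian, identity-link case, where the fit has a closed form and needs no iteration. This is an illustration under stated assumptions, not DataSHIELD's API and not how `ds.glm` works internally: the sites, the sizes and the helper `site_summary` are hypothetical, and disclosure controls are ignored. Because every model shares the same covariates, each site can return `X'X` once plus `X'Y` for all outcomes in a single call, and the client pools them.

```r
set.seed(1)
n_sites <- 3; n <- 40; p <- 3; n_outcomes <- 5   # illustrative sizes only

# Simulated per-site data: shared covariates X, one column of Y per outcome
sites <- lapply(seq_len(n_sites), function(s) {
  X <- cbind(1, matrix(rnorm(n * (p - 1)), n))
  B <- matrix(seq_len(p * n_outcomes) / 10, p)
  Y <- X %*% B + matrix(rnorm(n * n_outcomes, sd = 0.1), n)
  list(X = X, Y = Y)
})

# Hypothetical "server-side" step: one call returns summaries for ALL outcomes
site_summary <- function(d) {
  list(XtX = crossprod(d$X), XtY = crossprod(d$X, d$Y))
}
summaries <- lapply(sites, site_summary)

# "Client-side" step: add up the summaries and solve the normal equations
XtX <- Reduce(`+`, lapply(summaries, `[[`, "XtX"))
XtY <- Reduce(`+`, lapply(summaries, `[[`, "XtY"))
beta_batch <- solve(XtX, XtY)   # p x n_outcomes coefficient matrix

# Reference: fit all outcomes on the pooled individual-level data
X_all <- do.call(rbind, lapply(sites, `[[`, "X"))
Y_all <- do.call(rbind, lapply(sites, `[[`, "Y"))
beta_pooled <- coef(lm(Y_all ~ X_all - 1))
```

For non-gaussian families the weights in each iteration depend on the outcome, so a batched version would need to return one weighted `X'WX` per outcome per iteration; the number of round trips could still stay at the number of iterations rather than growing with the number of models.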

Regards, Hank

I might be well wide of the mark, but I am wondering if @HankCao is trying to do some kind of 'omics analysis. I suspect this because of things like running large numbers of GLMs and the variable names exp and pheno look like expression and phenotype.

If this is the case, then it might be appropriate to wait for the forthcoming work on DataSHIELD 'omics functions? I have seen that the use of the limma package really speeds up running '000s of glms…

Apologies if I have completely misunderstood

Tom

Hi Tom and other DS members,

Please allow me to introduce ourselves first.

We are a translational bioinformatics group at Heidelberg University, conducting a collaborative research project that includes several labs in Germany and Norway. We work on multi-omics data of psychiatric disorders and use DataSHIELD for privacy-preserving distributed learning. We may need to explore the inner workings of DataSHIELD in depth and will need a lot of help from you, because we hope to integrate our multi-task learning package (RMTL, on CRAN) into the DataSHIELD framework for our multi-omics analysis. I am sure RMTL would also benefit DataSHIELD, since it provides several common machine learning and multi-task learning methods, especially for high-dimensional data.

@tombishop, dsOmics indeed has a lot of potential, and we definitely want to use it. Do you already have a specific date for publishing dsOmics? If it is still six months away, we would prefer to use DataSHIELD as it is now and transition to dsOmics later. The limma package is indeed quite fast, but it does not provide distributed learning, or does it?

Regards, Hank

@tombishop I would also be very interested in knowing more about dsOmics. Is there already a development version? If so, I would be very happy to get access and contribute to it: I want to rewrite my variable selection function, and instead of writing my own code it would be good to be able to build on your package.

Hi Hank, Tom and Daniela.

We plan to organise a teleconference soon, where Yannick can show us the new DataSHIELD Interface (DSI) and Juan can give us a demonstration of the current version of dsOmics. I will check Paul’s availability for the next couple of weeks and then send around a Doodle poll where you can add your availability for the telecon.

Hi Demetris,

Sounds great. Thanks for the invitation.

Hank