How to pass models between data scientists and nodes (hospitals)?

We are trying to implement federated learning using DataSHIELD.

We were able to implement model training on the nodes, and it is working fine.

The model from each node is then returned to the data scientist as the result of a DataSHIELD aggregate function.

We then combine the models and want to send the combined model back to the nodes (hospitals).

Combining the models works, but we are unable to send the combined model back to a function on the node.

This is where it fails: either something goes wrong with deserialization or we hit the 10,000-byte limit.
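Roughly, the round trip we are aiming for looks like this (trainRFDS and D are just placeholders for our own server-side aggregate function and data symbol):

# 1) train a local model on each node and return it to the client
local_models <- DSI::datashield.aggregate(connections, quote(trainRFDS(D)))

# 2) combine the per-node models on the client (along the lines of randomForest::combine)
combined_model <- do.call(randomForest::combine, local_models)

# 3) send combined_model back to every node -- this is the step that fails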

How to pass models between data scientists and nodes?

Do you have any current examples that showcase that?

What about the next version of DataSHIELD?

Can we expect native support for FL scenarios?

Hello!

Could you please explain how you are building the expression passed to DSI::datashield.assign.expr()?

As you mention the 10,000-byte limit, I assume you are using something like:

cally <- as.symbol("...........")
DSI::datashield.assign.expr(conns, symbol, cally)

If that is indeed your case, just know that there is no need to pass a symbol; you can use a string and thereby bypass the 10k-byte limit imposed by the as.symbol() function. (Working example: dsOmicsClient/ds.PRS.R at master · isglobal-brge/dsOmicsClient · GitHub)
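In other words, something along these lines, where someFunc stands for whatever your server-side assign function is called:

payload <- "...your long encoded model..."
cally <- paste0("someFunc('", payload, "')")
# passing the expression as a plain string, rather than as.symbol(cally),
# avoids the 10k-byte limit on symbol names
DSI::datashield.assign.expr(conns, symbol, cally)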

Maybe this discussion also helps you: Send serialized object to DS server - #3 by xescriba

If that is not your case, please explain how you are trying to pass the model to the nodes so further help can be provided.

Regards, Xavier.

Thank you for your swift answer.

We have two functions implemented: an aggregate function to train the randomForest model, and an assign function. We want to assign the aggregation result to a symbol so that we can use the model in further aggregations.

rf.base64 <- base64encode(serialize(model, NULL))

DataSHIELD assign function:

my_func <- function(x){
  library(base64enc)  # base64decode() is provided by the base64enc package
  out <- unserialize(base64decode(x))
  return(out)
}

We have tried to assign it in two ways:

DSI::datashield.assign.expr(connections, "model", expr = quote(my_func(rf.base64)))

(this results in the error “rf.base64 object not found”)

cally <- paste0("my_func(", base64encode(serialize(model, NULL)), ")")

DSI::datashield.assign.expr(connections, "model", cally)

(which results in the error “400 bad request, R operator at line 1, column 1234. Was expecting one of Number […]”).

Any ideas?

I managed to achieve what you are proposing by using the following:

DataSHIELD assign function:

my_func <- function(x){
    out <- unserialize(wkb::hex2raw(x))  # hex string -> raw vector -> original R object
    return(out)
}

Client assign call:

cally <- paste0("my_func('", sf::rawToHex(serialize(model, NULL)), "')")

DSI::datashield.assign.expr(datasources, "model", as.symbol(cally))

So you will need the sf library on the client and the wkb library on the server.
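For what it is worth, my reading of your two earlier attempts is that quote(my_func(rf.base64)) ships the symbol rf.base64 to the server, where no such object exists (it only lives in your client session), while in the second attempt the base64 text is pasted into the expression unquoted, so the server-side parser tries to read it as R code. The hex encoding avoids that, because the whole payload travels as a quoted literal containing only the characters 0-9 and a-f, e.g.:

# hex payload: only [0-9a-f], passed as a single quoted string inside the expression
sf::rawToHex(serialize(model, NULL))
# base64 payload: contains characters such as '+', '/' and '=', which the parser
# will not accept when pasted into the expression unquoted
base64enc::base64encode(serialize(model, NULL))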

Note that I worked on this many months ago, so I am not sure whether any changes to the DataSHIELD parser have broken this implementation since then. Feel free to try it.

Regards, Xavier


This is so cool. The model can be shared directly. I may consider using this in my package. Thanks a lot. Hank

Hi Hank, could you describe how you share models between the nodes and the data scientist indirectly? While using rawToHex, we encountered another error: “C stack usage 16975741 is too close to the limit”. We tried increasing the limit to 33 GB, but that did not help.

Hi sztop,

I attached our paper and GitHub repository below. We implemented distributed Lasso and NMF, as well as several other multi-task learning algorithms.

We did not do anything special for the information transfer; we just control the number of bytes transferred. Here is an example.

ws <- round(ws, nDigits)
w.text <- paste0(as.character(ws), collapse = ",")
cally <- call('LS_iter_updateDS', w.text, X, Y)
iter_update <- DSI::datashield.aggregate(datasources, cally)
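On the server side the text is simply turned back into a numeric vector; roughly speaking (the actual code lives in our server-side package), something like:

# inside the server-side function, reverse the encoding of the weights
ws <- as.numeric(strsplit(w.text, ",", fixed = TRUE)[[1]])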

Paper: https://www.biorxiv.org/content/10.1101/2021.08.26.457778v1

Client: GitHub - transbioZI/dsMTLClient

Regards,

Hank

Can I ask how many features and samples you have, and which algorithms you are implementing?

Hank

Hi,

Would it help to have, in the DataSHIELD API, the possibility to upload a file? The parameter passed to the server-side function would then be the file name, and the function would be responsible for verifying that the file is not malicious. That would be much better than working around the limit with a big serialized object in the function call (and no less secure).
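To make the idea concrete, here is a purely hypothetical sketch (neither of these functions exists in DSI today):

# hypothetical upload API, not part of the current DSI
datashield.upload(conns, file = "combined_model.rds", name = "model_file")
# the server-side function (also hypothetical) would validate and load the uploaded file
DSI::datashield.assign.expr(conns, "model", "loadModelDS('model_file')")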

Regards
Yannick

Hi Yannick,

I imagine that if the file were a compressed file, it would be helpful because it would reduce the required memory. Or, more directly, it would be great if DataSHIELD could provide a function to compress (lossy or lossless) parameters before sending them.
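For lossless compression, something can already be pieced together from base R's memCompress()/memDecompress(), for example on top of the hex approach discussed above:

# client side: compress the serialized model before hex-encoding it
payload <- sf::rawToHex(memCompress(serialize(model, NULL), type = "gzip"))
# server side would reverse it:
# model <- unserialize(memDecompress(wkb::hex2raw(payload), type = "gzip"))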

Regards, Hank