How to pass models between data scientists and nodes (hospitals)?

We are trying to implement federated learning using DataSHIELD.

We were able to implement model training on the nodes, and it is working fine.

The model from each node is then returned to the data scientist as the result of a DataSHIELD aggregate function.

We then combine the models and want to send the combined model back to the nodes (hospitals).

Combining the models works, but we are unable to send the combined model back to a function on the node.

This is where it fails: either something goes wrong with deserialization or we hit the 10,000-byte limit.
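Roughly, the round trip we are aiming for looks like this (trainRFDS and D are just placeholders for our own server-side aggregate function and data symbol):

# 1) train a local model on each node and return it to the client
local_models <- DSI::datashield.aggregate(connections, quote(trainRFDS(D)))

# 2) combine the per-node models on the client (along the lines of randomForest::combine)
combined_model <- do.call(randomForest::combine, local_models)

# 3) send combined_model back to every node -- this is the step that fails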

How to pass models between data scientists and nodes?

Do you have any current examples that showcase that?

What about the next version of DataSHIELD?

Can we expect native support for FL scenarios?

Hello!

Could you please explain how you are building the expression passed to DSI::datashield.assign.expr()?

As you mention the 10,000-byte limit, I assume you are using something like:

cally <- as.symbol("...........")
DSI::datashield.assign.expr(conns, symbol, cally)

If that is indeed your case, just know that there is no need to pass a symbol; you can use a string and thereby bypass the 10k-byte limit imposed by the as.symbol() function. (Working example: dsOmicsClient/ds.PRS.R at master · isglobal-brge/dsOmicsClient · GitHub)
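In other words, something along these lines, where someFunc stands for whatever your server-side assign function is called:

payload <- "...your long encoded model..."
cally <- paste0("someFunc('", payload, "')")
# passing the expression as a plain string, rather than as.symbol(cally),
# avoids the 10k-byte limit on symbol names
DSI::datashield.assign.expr(conns, symbol, cally)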

Maybe this discussion also helps you: Send serialized object to DS server - #3 by xescriba

If that is not your case, please explain how you are trying to pass the model to the nodes so further help can be provided.

Regards, Xavier.

Thank you for your swift answer.

We have two functions implemented: an aggregate function to train the randomForest model, and an assign function. We want to assign the aggregation result to a symbol so that we can use the model in further aggregations.

rf.base64 <- base64encode(serialize(model, NULL))

DataSHIELD assign function:

my_func <- function(x){
  library(base64enc)  # base64decode() is provided by the base64enc package
  out <- unserialize(base64decode(x))
  return(out)
}

We have tried to assign it in two ways:

DSI::datashield.assign.expr(connections, "model", expr = quote(my_func(rf.base64)))

(this results in the error “rf.base64 object not found”)

cally <- paste0("my_func(", base64encode(serialize(model, NULL)), ")")

DSI::datashield.assign.expr(connections, "model", cally)

(which results in the error “400 bad request, R operator at line 1, column 1234. Was expecting one of Number […]”).

Any ideas?

I managed to achieve what you are proposing by using the following:

DataSHIELD assign function:

my_func <- function(x){
    out <- unserialize(wkb::hex2raw(x))  # hex string -> raw vector -> original R object
    return(out)
}

Client assign call:

cally <- paste0("my_func('", sf::rawToHex(serialize(model, NULL)), "')")

DSI::datashield.assign.expr(datasources, "model", as.symbol(cally))

So you will need the sf library on the client and the wkb library on the server.
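For what it is worth, my reading of your two earlier attempts is that quote(my_func(rf.base64)) ships the symbol rf.base64 to the server, where no such object exists (it only lives in your client session), while in the second attempt the base64 text is pasted into the expression unquoted, so the server-side parser tries to read it as R code. The hex encoding avoids that, because the whole payload travels as a quoted literal containing only the characters 0-9 and a-f, e.g.:

# hex payload: only [0-9a-f], passed as a single quoted string inside the expression
sf::rawToHex(serialize(model, NULL))
# base64 payload: contains characters such as '+', '/' and '=', which the parser
# will not accept when pasted into the expression unquoted
base64enc::base64encode(serialize(model, NULL))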

Note that I worked on this many months ago, so I am not sure whether any changes to the DataSHIELD parser have broken this implementation since then. Feel free to try it.

Regards, Xavier


This is so cool. The model can be shared directly. I may consider using this in my package. Thanks a lot. Hank

Hi Hank, could you describe how you share models between the nodes and the data scientist indirectly? While using rawToHex, we encountered another error: “C stack usage 16975741 is too close to the limit”. We tried increasing the limit to 33 GB, but that did not help.

Hi sztop,

I attached our paper and GitHub repository below. We implemented distributed Lasso and NMF, as well as several other multi-task learning algorithms.

We did not do anything special for the information transfer; we just control the number of bytes transferred. Here is an example.

ws <- round(ws, nDigits)
w.text <- paste0(as.character(ws), collapse = ",")
cally <- call('LS_iter_updateDS', w.text, X, Y)
iter_update <- DSI::datashield.aggregate(datasources, cally)
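On the server side the text is simply turned back into a numeric vector; roughly speaking (the actual code lives in our server-side package), something like:

# inside the server-side function, reverse the encoding of the weights
ws <- as.numeric(strsplit(w.text, ",", fixed = TRUE)[[1]])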

Paper: https://www.biorxiv.org/content/10.1101/2021.08.26.457778v1

Client: GitHub - transbioZI/dsMTLClient

Regards,

Hank

Can I ask how many features and samples you have, and which algorithms you are implementing?

Hank

Hi,

Would it help to have, in the DataSHIELD API, the possibility to upload a file? The parameter passed to the server-side function would then be the file name, and the function would be responsible for verifying that the file is not malicious. That would be much better than working around the limit with a big serialized object in the function call (and no less secure).
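To make the idea concrete, here is a purely hypothetical sketch (neither of these functions exists in DSI today):

# hypothetical upload API, not part of the current DSI
datashield.upload(conns, file = "combined_model.rds", name = "model_file")
# the server-side function (also hypothetical) would validate and load the uploaded file
DSI::datashield.assign.expr(conns, "model", "loadModelDS('model_file')")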

Regards
Yannick

Hi Yannick,

I imagine that if the file were a compressed file, it would be helpful because it would reduce the required memory. Or, more directly, it would be great if DataSHIELD could provide a function to compress (lossy or lossless) parameters before sending them.
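For lossless compression, something can already be pieced together from base R's memCompress()/memDecompress(), for example on top of the hex approach discussed above:

# client side: compress the serialized model before hex-encoding it
payload <- sf::rawToHex(memCompress(serialize(model, NULL), type = "gzip"))
# server side would reverse it:
# model <- unserialize(memDecompress(wkb::hex2raw(payload), type = "gzip"))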

Regards, Hank