Machine learning or deep learning possibilities with DataSHIELD

Hi Team,

I am curious to know what possibilities DataSHIELD offers with regard to machine learning or deep learning. I have gone through the tutorials and tried out the GLM model examples, and I am wondering whether there are plans to feature examples of other machine learning or deep learning algorithms in the tutorials. Also, what potential limitations could one foresee in this regard?

best regards, Ayisha

Hi Ayisha,

@xescriba is working on the development of some machine learning algorithms in DataSHIELD. You can see some of his developments here

We have created a package that can perform deep learning in DataSHIELD and just published a paper about it:

That is great news. Can you provide more information about the privacy-preserving features of the package, please? Deep learning can be disclose…

In the paper we compare different models with respect to disclosure risk on different data sets. With the methods we used we could not find an increased disclosure risk in deep learning per se. But of course, this is open for debate.

@xescriba When you implemented K-means or KNN, did you utilize the existing libraries or did you write the algorithm from scratch? A link to the github, if available, might be helpful.

best regards, Ayisha.

From scratch … you need to implement algorithms to be non-disclosive

The GitHub link is in the vignette.

Best Juan

@ayisharuna You can find the code for both functions here:

Further information about the implemented algorithms and their internal workings can be found in the vignette already linked by Demetris:

https://htmlpreview.github.io/?https://github.com/isglobal-brge/dsMLClient/blob/main/vignettes/dsML_vignette.html#k-nearest-neighbours

https://htmlpreview.github.io/?https://github.com/isglobal-brge/dsMLClient/blob/main/vignettes/dsML_vignette.html#k-means

Feel free to contact me if you need further information.

Regards, Xavier.

@xescriba, @jrgonzalez thank you for the information.

@stefan.lenz I was reading the paper you provided. Thank you for the link; some nice work there. Is it safe to say that you used the BoltzmannMachines Julia package inside your package, but added checks to ensure that no information was being leaked?

@ayisharuna Thanks for the interest. In this paper we focused on generating synthetic data and also performed some disclosure analysis to check whether applying such deep learning for generating synthetic data has an increased disclosure risk compared to simpler methods that are in principle already possible with DataSHIELD, e.g. GLM models. The package is mostly a wrapper around the Julia package, which restricts what can be done. It does not, however, check the data that is returned at this time. I do not see a way to make inferences about individuals via the data that is returned, especially because the DBM algorithm is a stochastic learning algorithm, so there is much randomness involved. But there is no absolute certainty, of course, and there is always a trade-off between utility and disclosure.

@stefan.lenz Thank you for the response. I tried to create a clustering package for DataSHIELD using existing packages such as the one for DBSCAN. While debugging my errors, I thought I would ask whether it is already possible to use an existing package for machine learning. Hence my few questions. I gather from @jrgonzalez that I would have to write it from scratch, but if I understand you correctly, I should be able to use an existing package and then build on that.

You can use existing packages if you want to run them at each Opal independently, return aggregated, non-disclosive results (such as mean, sd, var, n, “probably distance”, …), and then combine them on the client side to get results as if all the data were on the same computer.

Meta-analysis is a clear example, but you can also implement other methods this way, mainly if you are using parallel algorithms … This is what Xavier did in the dsML package, and this is what I meant by “writing the function from scratch”.
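To illustrate the pattern of "aggregate at each site, combine on the client", here is a minimal Python sketch (DataSHIELD itself is R-based; this is only an illustration, not DataSHIELD code). The function names `site_summary` and `pooled_mean_sd` and the toy data are assumptions made for this example. Each site returns only n, the sum, and the sum of squares, and the client recovers the same mean and sd it would get from the pooled data.

```python
import math

def site_summary(values):
    """Runs at one site: returns only aggregates, never individual rows."""
    return {
        "n": len(values),
        "sum": sum(values),
        "sum_sq": sum(v * v for v in values),
    }

def pooled_mean_sd(summaries):
    """Runs on the client: combines per-site aggregates into pooled mean and sd."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    # Sample variance from sufficient statistics (n - 1 in the denominator).
    var = (total_sq - n * mean * mean) / (n - 1)
    return mean, math.sqrt(var)

# Toy data held at two separate (hypothetical) sites.
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]

mean, sd = pooled_mean_sd([site_summary(site_a), site_summary(site_b)])
print(mean, sd)
```

A real server-side function would also refuse to answer when n is too small; that check is left out here to keep the sketch short.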

(dbscan is not a good fit - perhaps unless applied to RBM samples - since it doesn’t induce a model external to the data. Its clusters are defined explicitly in terms of points in the dataset, unlike the aggregate centroids in kmeans)
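The contrast with k-means can be made concrete. The following Python sketch shows one way a federated k-means update could work, under the assumption that each site only returns per-cluster sums and counts; it is not the dsML implementation, and the names `local_cluster_sums` and `update_centroids` are hypothetical.

```python
def local_cluster_sums(points, centroids):
    """Runs at one site: assigns points to the nearest centroid and
    returns only per-cluster coordinate sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        j = dists.index(min(dists))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts

def update_centroids(site_results, centroids):
    """Runs on the client: pools sums and counts from all sites."""
    k, dim = len(centroids), len(centroids[0])
    new_centroids = []
    for j in range(k):
        count = sum(counts[j] for _, counts in site_results)
        if count == 0:
            # Empty cluster: keep the previous centroid.
            new_centroids.append(list(centroids[j]))
            continue
        new_centroids.append(
            [sum(sums[j][d] for sums, _ in site_results) / count for d in range(dim)]
        )
    return new_centroids

# Toy data at two hypothetical sites and fixed starting centroids.
site_a = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
site_b = [(2.0, 0.0), (2.0, 2.0), (12.0, 10.0)]
centroids = [[0.0, 0.0], [10.0, 10.0]]

for _ in range(5):
    results = [local_cluster_sums(site, centroids) for site in (site_a, site_b)]
    centroids = update_centroids(results, centroids)

print(centroids)
```

The client only ever sees centroids and pooled counts, which is why a centroid-based method fits this setting better than DBSCAN, whose clusters are expressed through the data points themselves. Small per-cluster counts could still be disclosive, so a real implementation would need a minimum-count check.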

Hi ayisharuna,

1, About ML packages

We wrote a DS-based Lasso and a set of Lasso-based sparse multi-task learning methods. They are supposed to be released within the next two months, together with the paper.

2, distributed Clustering

There are about three ways to do clustering that incorporates multiple geo-distributed matrices (these just came to mind):

a, Bagging of local models (meta-analysis)

b, Distributed integrative matrix-factorization + local Clustering

c, Distributed Clustering

It would be great if you could include all of them in your package. We have done the second one: we implemented the distributed integrative matrix factorization, which extracts the common component of multiple matrices into one. If you like, you can use the result of our method as the input to the clustering. This is very simple to implement.
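For option (a), read here as classical meta-analysis of a per-site model coefficient, a minimal Python sketch of fixed-effect inverse-variance pooling looks like this. The function name `inverse_variance_pool` and the numbers are made up for illustration; combining whole clustering models across sites would need more than this.

```python
import math

def inverse_variance_pool(estimates):
    """Fixed-effect meta-analysis on the client.

    estimates: list of (estimate, standard_error) pairs, one per site.
    Returns the pooled estimate and its standard error.
    """
    weights = [1.0 / (se * se) for _, se in estimates]
    total_weight = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se

# Hypothetical coefficient estimates returned by two sites.
site_estimates = [(0.50, 0.10), (0.70, 0.20)]
pooled, pooled_se = inverse_variance_pool(site_estimates)
print(pooled, pooled_se)
```

Each site only shares an estimate and its standard error, so the local model can be fitted with any existing package.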

3, Privacy-preserving

DataSHIELD already provides quite a few mechanisms to protect privacy (see their paper), but most are not specific to machine learning. For distributed integrative matrix factorization, we only output an incomplete model, which I think (but am not sure) is robust to model inversion attacks. @all, more information would be appreciated, especially on attacks against the incomplete model. For example, in a 2-server environment, one shared matrix out of five matrices is returned as the output. This shared matrix is enough for subsequent clustering. Reconstructing the original data from it is not possible. But the inversion attack… no idea.

Regards, Hank

Regarding privacy-preserving mechanisms, a good suggestion is to look into the source code of the dsBase package to become familiar with the DataSHIELD filtering variables.
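As a rough illustration of what such a filter does, here is a Python sketch of a server-side aggregate that refuses to answer when too few observations are involved. This is not dsBase code: the constant `MIN_COUNT`, its value, and the function `safe_mean` are assumptions chosen for this example, and the real thresholds are configured on the server.

```python
MIN_COUNT = 3  # hypothetical threshold, for illustration only

def safe_mean(values, min_count=MIN_COUNT):
    """Server-side aggregate: returns n and the mean only if enough
    observations contribute; otherwise raises instead of answering."""
    n = len(values)
    if n < min_count:
        raise ValueError(
            f"Refusing to return a result: n={n} is below the minimum of {min_count}"
        )
    return {"n": n, "mean": sum(values) / n}

print(safe_mean([1.0, 2.0, 3.0, 4.0]))

try:
    safe_mean([42.0])  # a single value would reveal an individual record
except ValueError as err:
    print(err)
```

Any new aggregate function returned to the client would need a check of this kind before release.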

@ayisharuna If you are looking into developing new functionalities regarding machine learning in DS please be aware that there are already implementations of:

- PCA
- Kmeans
- KNN
- SVD
- FAMD

Also, we have planned to work on a Random Forest implementation in the near future (we have a so-so working prototype, but lots of revision and improvement is needed before a formal release).

What are the methods you are planning to implement?


@HankCao When are you planning to release the DS-Lasso to the public? It sounds interesting

Hi xescriba,

We are drafting a paper, which will probably require tuning some details of the package. So I would say before the end of next month for sure.

Regards, Hank

Hi Hank,

Please send the draft paper to Yannick and myself. We will be able to review it and perhaps review DataSHIELD and its achievements.

P.

Hi Patricia,

Sure, will do so.

Regards, Hank