Machine learning or deep learning possibilities with DataSHIELD

ayisharuna · 14 April 2021 12:57

Hi Team,

I am curious to know what possibilities exist in DataSHIELD with regard to machine learning or deep learning. I have gone through the tutorials and tried out the GLM examples, and I am wondering whether there are plans to feature examples of other machine learning or deep learning algorithms in the tutorials. Also, what potential limitations could one foresee in this regard?

best regards, Ayisha

demetris.avraam · 14 April 2021 17:53

Hi Ayisha,

@xescriba is working on the development of some machine learning algorithms in DataSHIELD. You can see some of his developments here

stefan.lenz · 15 April 2021 05:25

We have created a package that can perform deep learning in DataSHIELD and just published a paper about it:

PatRyserWelch · 15 April 2021 08:52

That is great news. Can you provide more information about the privacy-preserving features of the package, please? Deep learning can be disclose…

stefan.lenz · 15 April 2021 09:19

In the paper we compare different models with respect to disclosure risk on different data sets. With the methods we used we could not find an increased disclosure risk in deep learning per se. But of course, this is open for debate.

ayisharuna · 15 April 2021 09:41

@xescriba When you implemented K-means or KNN, did you use existing libraries or did you write the algorithms from scratch? A link to the GitHub repository, if available, would be helpful.

best regards, Ayisha.

jrgonzalez · 15 April 2021 10:05

From scratch: the algorithms need to be implemented so that they are non-disclosive.

The GitHub link is in the vignette.

Best Juan

xescriba · 15 April 2021 10:22

@ayisharuna You can find the code for both functions here:

github.com/isglobal-brge/dsML/blob/main/R/kmeansDS.R

#' @title Parallel k-means iteration
#' 
#' @description Performs an iteration of a k-means parallel algorithm (what in a multi-thread machine 
#' would be run on each thread). The client acts as the master and the servers as the slaves if thinking
#' like a regular parallel implementation.
#'
#' @param x \code{data frame} Train dataset for the k-means
#' @param ... \code{numeric} Parameters corresponding to the data frame from the server that contains the
#' centroids (updated on each iteration on the client)
#'
#' @return \code{list} with: \cr
#' -counts \code{numeric} vector with the counts per cluster \cr
#' -centers \code{data frame} New centroids calculated \cr
#' -assignations \code{numeric vector} ordered cluster assignations, to be used by the client
#' to assign them with the \code{kmeans.assign_result} function to a table on the servers to be later used
#' @export

kmeansDS <- function(x, ...){
  
  # Check 'x' for NAs, this algorithm does not work with NAs in the dataset

This file has been truncated. show original
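
Since the preview above is truncated, here is a minimal sketch of the kind of iteration the docstring describes: each server assigns its own rows to the nearest centroid and returns only per-cluster counts and centers, and the client combines them into new centroids. It is written in plain Python rather than R and is not the dsML code; the names `server_kmeans_step` and `client_update` are hypothetical, the `assignations` output is omitted, and no disclosure checks (for example on very small clusters) are included.

```python
from math import dist


def server_kmeans_step(rows, centroids):
    """Server side (hypothetical): return per-cluster counts and local centers only."""
    k, dim = len(centroids), len(centroids[0])
    counts = [0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for row in rows:
        # Index of the centroid closest to this row (Euclidean distance).
        nearest = min(range(k), key=lambda j: dist(row, centroids[j]))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += row[d]
    centers = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)  # an empty local cluster keeps the old centroid
    ]
    return {"counts": counts, "centers": centers}


def client_update(server_results, centroids):
    """Client side (hypothetical): count-weighted average of the servers' centers."""
    k, dim = len(centroids), len(centroids[0])
    new_centroids = []
    for j in range(k):
        total = sum(r["counts"][j] for r in server_results)
        if total == 0:
            new_centroids.append(list(centroids[j]))  # no rows anywhere: keep as is
            continue
        new_centroids.append([
            sum(r["counts"][j] * r["centers"][j][d] for r in server_results) / total
            for d in range(dim)
        ])
    return new_centroids


# Two sites with made-up data; only the aggregates leave each site.
site_a = [(0.0, 0.0), (0.0, 2.0)]
site_b = [(10.0, 10.0), (10.0, 12.0), (0.0, 1.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
results = [server_kmeans_step(site, centroids) for site in (site_a, site_b)]
new_centroids = client_update(results, centroids)
```

Weighting by the counts makes the combined centroid equal to the one that would be obtained if all rows were pooled on one machine, which is why the counts are returned alongside the centers.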

github.com/isglobal-brge/dsML/blob/main/R/knnDS.R

#' @title K-Nearest Neighbour Classification
#' 
#' @description Compute K-Nearest Neighbours of a query vector
#'
#' @param x \code{data frame} Dataset to get the neighbours and tags
#' @param neigh \code{numeric} number of neighbours considered
#' @param classificator_name \code{character} Name of column on the table 'x' that has the classifier factor
#' @param method.indicator \code{character} (default \code{"knn"}) specifies the method that is used to
#' generated non-disclosive coordinates to calculate the euclidean distance. This argument can be set as \code{'knn'}
#'  or \code{'noise'}
#' @param k \code{numeric} (default \code{3}) the number of the nearest neighbors for which their centroid is calculated
#' @param noise \code{numeric} (default \code{0.25}) the percentage of the initial variance that is used as the variance 
#' of the embedded noise if the argument method is set to \code{'noise'}
#' @param ... \code{numeric} Queried vector
#'
#' @return \code{list} with: \cr
#' -distance \code{numeric}: Distances of the queried vector to the anonymized dataset \cr
#' -classification \code{character}: Classification tag of the queried vector
#' @export

This file has been truncated. show original
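
The `noise` parameter in the docstring above is described as the fraction of the initial variance used as the variance of the embedded noise. One way to read that is sketched below in plain Python, not R; it is an assumption about the idea, not the dsML implementation. The function name `add_variance_scaled_noise` is hypothetical, and zero-mean Gaussian noise is assumed because the docstring does not name a distribution.

```python
import random
import statistics


def add_variance_scaled_noise(values, noise=0.25, rng=None):
    """Return a copy of `values` with zero-mean Gaussian noise added.

    The noise variance is `noise` times the sample variance of `values`
    (an assumed reading of the docstring, see the text above).
    """
    rng = rng or random.Random()
    noise_sd = (noise * statistics.variance(values)) ** 0.5
    return [v + rng.gauss(0.0, noise_sd) for v in values]


# Made-up column; distances would then be computed on the masked values.
ages = [34.0, 51.0, 47.0, 29.0, 62.0]
masked = add_variance_scaled_noise(ages, noise=0.25, rng=random.Random(1))
```

A larger `noise` value hides individual coordinates better but distorts the distances more, which is the utility versus disclosure trade-off discussed later in this thread.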

Further information about the implemented algorithms and their internal workings can be found in the vignette already linked by Demetris:

https://htmlpreview.github.io/?https://github.com/isglobal-brge/dsMLClient/blob/main/vignettes/dsML_vignette.html#k-nearest-neighbours
https://htmlpreview.github.io/?https://github.com/isglobal-brge/dsMLClient/blob/main/vignettes/dsML_vignette.html#k-means

Feel free to contact me if you need further information.

Regards, Xavier.

ayisharuna · 15 April 2021 11:48

@xescriba , @jrgonzalez thank you for the information

ayisharuna · 15 April 2021 11:59

@stefan.lenz I was reading the paper you provided. Thank you for the link; some nice work there. Is it safe to say that your package uses the BoltzmannMachines Julia package, but that you added checks to ensure that no information is leaked?

stefan.lenz · 15 April 2021 12:46

@ayisharuna Thanks for the interest. In this paper we focused on generating synthetic data and also performed some disclosure analysis to check whether applying such deep learning for generating synthetic data has an increased disclosure risk compared to simpler methods that are in principle already possible with DataSHIELD, e.g. GLM models. The package is mostly a wrapper around the Julia package, which restricts what can be done. It does not, however, check the data that is returned at this time. I do not see a way to make inferences about individuals via the data that is returned, especially because the DBM algorithm is a stochastic learning algorithm, so there is much randomness involved. But there is no absolute certainty, of course, and there is always a trade-off between utility and disclosure.

ayisharuna · 15 April 2021 12:59

@stefan.lenz Thank you for the response. I tried to create a clustering package for DataSHIELD using existing packages, such as the one for DBSCAN. While debugging my errors, I thought to ask whether it was already possible to use an existing package for machine learning, hence my questions. I understand from @jrgonzalez that I would have to write it from scratch, but if I understand you correctly, I should be able to use an existing package and then build on that.

jrgonzalez · 15 April 2021 13:20

You can use existing packages if you want to run them at each Opal independently, return the aggregated results (non-disclosive ones such as mean, sd, var, n, “probably distance”, …), and then combine them on the client side to get results as if all the data were on the same computer.

Meta-analysis is a clear example, but you can also implement other methods, mainly if you are using parallel algorithms. This is what Xavier did in the dsML package, and this is what I meant by “writing the function from scratch”.
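
As an illustration of combining per-site aggregates on the client, here is a minimal sketch in plain Python (not R, and not DataSHIELD code). Each site returns only `n`, the mean, and the sample variance, and the client pools them into the mean and variance of the combined data. The function names and the `min_count` threshold of 3 are hypothetical; the threshold only stands in for the kind of minimum-count check a real server-side function would apply.

```python
import statistics


def site_summary(values, min_count=3):
    """Server side (hypothetical): return n, mean and sample variance only."""
    if len(values) < min_count:
        # Assumed stand-in for a disclosure check on small samples.
        raise ValueError("too few observations to return a summary")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "var": statistics.variance(values),
    }


def pool_summaries(summaries):
    """Client side: pooled mean and sample variance from per-site summaries."""
    total_n = sum(s["n"] for s in summaries)
    pooled_mean = sum(s["n"] * s["mean"] for s in summaries) / total_n
    # Within-site sums of squares plus the between-site contribution.
    within = sum((s["n"] - 1) * s["var"] for s in summaries)
    between = sum(s["n"] * (s["mean"] - pooled_mean) ** 2 for s in summaries)
    return {
        "n": total_n,
        "mean": pooled_mean,
        "var": (within + between) / (total_n - 1),
    }


# Made-up data held at two sites; only the summaries reach the client.
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0, 6.0, 7.0]
pooled = pool_summaries([site_summary(site_a), site_summary(site_b)])
```

The pooled values match what `statistics.mean` and `statistics.variance` give on the concatenated data, which is the "as if all the data were on the same computer" property described above.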

jnothman · 15 April 2021 13:48

(dbscan is not a good fit - perhaps unless applied to RBM samples - since it doesn’t induce a model external to the data. Its clusters are defined explicitly in terms of points in the dataset, unlike the aggregate centroids in kmeans)

HankCao · 16 April 2021 11:40

Hi ayisharuna,

1, About ML packages

We wrote a DS-based Lasso and a set of Lasso-based sparse multi-task learning methods. It is supposed to be released within the next two months, together with the paper.

2, distributed Clustering

There are about three ways to do clustering that incorporates multiple geo-distributed matrices (these are just the ones that come to mind):

a, Bagging of local models (meta-analysis)

b, Distributed integrative matrix-factorization + local Clustering

c, Distributed Clustering

It would be great if you could include all of them in your package. We have done the second one: we implemented the distributed integrative matrix factorization, which extracts the common component of multiple matrices into one. If you like, you can use the result of our method as the input to the clustering. This is very simple to implement.

3, Privacy-preserving

DataSHIELD already provides quite a few mechanisms to protect privacy (see their paper), but most are not specific to machine learning. For distributed integrative matrix factorization, we only output an incomplete model, which I think (but am not sure) is robust to model inversion attacks. @all, more information would be appreciated, especially on attacks against an incomplete model. For example, in a two-server environment, one shared matrix out of five matrices is returned as the output. This shared matrix is enough for subsequent clustering. Reconstructing the original data from it is not possible. But as for an inversion attack… no idea.

Regards, Hank

xescriba · 16 April 2021 12:41

Regarding privacy-preserving mechanisms, a good suggestion is to look at the source code of the dsBase package to get familiar with the DataSHIELD filtering variables.

@ayisharuna If you are looking into developing new functionalities regarding machine learning in DS, please be aware that there are already implementations of:

- PCA
- Kmeans
- KNN
- SVD
- FAMD

Also, we have planned to work on a Random Forest implementation in the near future (we have a so-so working prototype, but lots of revision and improvement is needed before a formal release).

What are the methods you are planning to implement?

@HankCao When are you planning to release the DS-Lasso to the public? It sounds interesting

HankCao · 16 April 2021 12:52

Hi xescriba,

We are drafting a paper, which will probably require tuning some details of the package. So I would say before the end of next month for sure.

Regards, Hank

patricia.ryser-welch · 20 April 2021 07:42

Hi Hank,

Please send the draft paper to Yannick and myself. We will be able to review it and perhaps review DataSHIELD and its achievements.

P.

HankCao · 21 April 2021 19:12

Hi Patricia,

Sure, will do so.

Regards, Hank
