Is Lasso disclosive?

Hi DataSHIELD members,

Since we are dealing with high-dimensional data (n << p), i.e. omics data, we have been learning how to implement the common machine learning tools Lasso and Ridge regression in DataSHIELD. Of course, the implemented method must be non-disclosive.

However, during this work we came across the following statement on the page: “…This disclosure filter protects against fitting overly saturated models which can be disclosive. The choice of 0.37 is entirely arbitrary…”

Can anyone explain why a saturated model is disclosive? How could individual information be identified from a saturated model? This is important to us, since the Lasso model is sometimes saturated (and the Ridge model is always saturated). Is there any way to avoid disclosure for machine learning?

Regards, Hank

Hi Hank,

The point is that if the number of individuals and the number of parameters to be estimated are the same, the solution of the regression is exact. Reversing this solution would give you the individual-level data, making it disclosive.
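
A concrete toy illustration in plain R (pooled data, not DataSHIELD code): with n == p and a full-rank design, least squares interpolates the data exactly, so anyone who knows the design matrix can invert the coefficients back to the raw outcomes.

```r
# n == p: the regression has an exact, zero-residual solution.
set.seed(1)
n <- p <- 5
X <- matrix(rnorm(n * p), n, p)  # design matrix, assumed known to an attacker
y <- rnorm(n)                    # sensitive individual-level outcomes
beta <- solve(X, y)              # "fitted" coefficients; all residuals are 0
y_recovered <- X %*% beta        # reversing the solution
all.equal(as.vector(y_recovered), y)  # TRUE: individual data fully disclosed
```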

Unfortunately, I cannot say anything about Lasso or Ridge in DataSHIELD, as I haven’t looked into them yet, but I would be very happy to contribute if you like!

I have personally implemented a non-disclosive version of componentwise likelihood-based boosting in DataSHIELD. It is a machine learning method that automatically selects variables in a setting with n << p. In general, the results of this boosting approach and the Lasso are very similar; they only differ when there is a complex correlation structure among the variables, in which case boosting gives more robust results than the Lasso. If you are interested in this method, you can contact me. More information can be found in our manuscript on arXiv (we plan to submit it in the next two months): https://arxiv.org/abs/1803.00422
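
If it helps, here is a minimal pooled-data sketch of the componentwise (L2) boosting idea in plain R. It only illustrates the selection principle; it is not the distributed, likelihood-based DataSHIELD implementation.

```r
# Componentwise L2 boosting sketch: in each step, fit a univariate model to
# the current residuals for every candidate and update only the best one.
# Assumes a centred outcome y and standardized columns of X.
componentwise_boost <- function(X, y, steps = 100, nu = 0.1) {
  beta <- rep(0, ncol(X))
  for (s in seq_len(steps)) {
    r <- as.vector(y - X %*% beta)              # current residuals
    b <- colSums(X * r) / colSums(X^2)          # univariate slope per column
    sse <- colSums((r - sweep(X, 2, b, `*`))^2) # fit of each candidate
    j <- which.min(sse)                         # best single component
    beta[j] <- beta[j] + nu * b[j]              # small, shrunken update
  }
  beta  # only repeatedly selected components move away from 0
}
```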

Best wishes, Daniela

Hi Daniela,

Thanks. I understand that if n == p, the training data lie exactly on the fitted line because the residuals are 0. However, you still cannot identify the individuals from the fitted line, because there are infinitely many points on it.

We definitely want to test more machine learning methods on our data. Can we find all the necessary information to install and run your method on the GitHub page?

Regards, Hank

Hi Hank,

More or less, yes. The program is in Julia (calling R), and I still have some issues when deploying it in a real setting. The function automatically uses all assigned variables, because explicitly naming specific ones in a high-dimensional setting produces calls that are too large for the parser system. In addition, the function runs quite slowly, and the communication system sometimes crashes without an error message; this is not a problem of my code itself but of the fact that I make a lot of DataSHIELD calls. I am currently reworking the program so that fewer single calls are needed.

Best wishes, Daniela

Hi Daniela,

Thanks for the information. We are currently evaluating the running time of glm in a high-dimensional differential analysis. After this, we will come back to your method. Please also feel free to let me know once you have solved the deployment issues.

I have several questions about the running time of your method. How does it scale when linearly increasing:

  • the number of covariates

  • the number of subjects

  • the number of DataSHIELD clients

In our current testing, the running time of the differential analysis scales linearly with the number of covariates. But this might not be the case for your method, because of the multivariable analysis. This information is quite important for us in deciding how to design the analysis.

Regards, Hank

Hi Hank,

Right now, this is very slow.

I am basically using ds.glm and ds.cov. For every covariate, I have to call ds.glm (which I currently do with one call per covariate), so this takes a lot of time. I expect this to be much faster once the ds.omics package is out and I can request the univariable estimates for all covariates at once; otherwise I will write a wrapper for this.
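
For anyone following along, the per-covariate step looks roughly like the sketch below; `conns`, the server-side data frame 'D', its outcome 'y', and the covariate names are placeholders for illustration, not names from my actual code.

```r
library(dsBaseClient)

# One ds.glm call per candidate covariate; the repeated round trips are
# what makes this step slow.
covariates <- c("x1", "x2", "x3")  # hypothetical column names in 'D'
fits <- lapply(covariates, function(v) {
  ds.glm(formula = paste0("D$y ~ D$", v),
         family = "gaussian",
         datasources = conns)
})
```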

After the ds.glm calls, I call ds.cov for every already-selected covariate (I start with none selected), paired with the potentially interesting ones. Depending on the data structure (correlation), this can again be unnecessarily time-consuming, as single calls are needed. Here, I will write a wrapper for ds.cov that can compute several covariances simultaneously.
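
The covariance step is a similar loop, one ds.cov call per pair (same placeholder names and connections as above):

```r
# One ds.cov call per (selected, candidate) pair; again many round trips.
selected   <- c("x1")        # covariates selected so far (placeholder)
candidates <- c("x2", "x3")  # remaining potentially interesting ones
covs <- list()
for (s in selected) {
  for (cand in candidates) {
    covs[[paste(s, cand, sep = ":")]] <-
      ds.cov(x = paste0("D$", s), y = paste0("D$", cand),
             datasources = conns)
  }
}
```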

As I am using ds.glm and ds.cov, the scalability with the number of clients is the same as for those functions.

The only issue I have right now is that the data need to be standardized. For omics data this is a huge effort with the currently available DataSHIELD methods, which is why I looked at the effect of standardizing per cohort (before DataSHIELD). Luckily, this does not seem to be a problem, even with 20 sites.
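
By “standardizing per cohort” I mean simple local preprocessing at each site before the data enter DataSHIELD, along these lines:

```r
# Run locally at each site, before upload: centre and scale every column
# within that cohort (plain R, not a DataSHIELD call).
standardize_cohort <- function(df) {
  as.data.frame(scale(df))  # each column: mean 0, sd 1 within the cohort
}
```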

Another thing: my method is tailored to continuous endpoints, but we have evaluated the effect on variable selection when the endpoint is nevertheless binary, and the selection is still good!

Best wishes, Daniela

Hi Daniela,

Sounds great, I am looking forward to using your method.

Regards, Hank