Bad performance of covariance

Hi,

My name is Patrick and I’m working with DataSHIELD as part of the MIRACUM research project. I’m currently trying to use PCA in combination with DataSHIELD, for which I need the covariance or correlation matrix of my variables. I tried to compute the covariance matrix of a dataset with 10,000 variables using both the old covariance function and the new function from the dsBetaTest package. I adjusted the old function from dsStats locally to get better performance (https://github.com/datashield/dsStats/pull/3). When I tried the new function from the dsBetaTest package, my server (64 GB of RAM) quickly ran out of memory. Does the developer group intend to improve the performance of these functions? If not, I would love to contribute to these functions to make them feasible for larger datasets.

Best regards, Patrick

Hi Patrick,

It would be great if you could contribute to optimizing the performance of those functions. Just to let you know, the previous versions of ds.cov and ds.cor, which are included in the dsStatsClient package, can only calculate the covariance and correlation matrices for each study separately. The new versions in the dsBetaTestClient package can calculate the matrices for each study as well as the combined matrices for the pooled analysis.
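As a rough illustration of what a pooled calculation involves, here is a minimal sketch in plain Python. It is not the dsBetaTest implementation: the function names `study_summary` and `pooled_covariance` and the toy data are hypothetical. It assumes each study returns only its row count, column sums, and sums of cross-products, which the client then combines.

```python
# Hypothetical sketch: pool a covariance matrix across studies from
# per-study summaries only. This is NOT DataSHIELD code.

def study_summary(rows):
    """Return (n, column sums, cross-product sums) for one study."""
    n = len(rows)
    p = len(rows[0])
    sums = [0.0] * p
    cross = [[0.0] * p for _ in range(p)]
    for row in rows:
        for i in range(p):
            sums[i] += row[i]
            for j in range(p):
                cross[i][j] += row[i] * row[j]
    return n, sums, cross


def pooled_covariance(summaries):
    """Combine per-study summaries into one sample covariance matrix."""
    p = len(summaries[0][1])
    n_total = 0
    sums = [0.0] * p
    cross = [[0.0] * p for _ in range(p)]
    for n, s, c in summaries:
        n_total += n
        for i in range(p):
            sums[i] += s[i]
            for j in range(p):
                cross[i][j] += c[i][j]
    # cov(x, y) = (sum(x*y) - sum(x) * sum(y) / n) / (n - 1)
    return [
        [(cross[i][j] - sums[i] * sums[j] / n_total) / (n_total - 1)
         for j in range(p)]
        for i in range(p)
    ]


# Toy data: two studies, two variables each (made up for illustration).
study_a = [[1.0, 2.0], [2.0, 4.0], [3.0, 7.0]]
study_b = [[4.0, 8.0], [5.0, 9.0]]
cov = pooled_covariance([study_summary(study_a), study_summary(study_b)])
```

With p variables, each summary holds a p x p matrix of cross-products, so memory grows quadratically with the number of variables. The sum-of-products formula can also lose precision when the means are large relative to the spread, so a real implementation may prefer centered sums.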

Hi Demetris,

I have already made good progress in improving this function. However, I’m still puzzling over how to test it on the server I have already set up. Do you have any advice?

Hi Patrick,

To test your development, you can either use the new DataSHIELD Interface (see the discussion here: Datashield R API) or go the “traditional” way and upload the function as a script to your server (you can find a lot of information in the DataSHIELD wiki: https://data2knowledge.atlassian.net/wiki/spaces/DSDEV/pages/12943448/Notes+for+developers). Otherwise, we can arrange a Skype call in which I can show you how to test the function.