Seed for random number generation (RNG)

In some DataSHIELD functions we need to add a small random number to the outputs before we return them to the client (see, for example, the graphical functions). The embedded random numbers follow a normal distribution, so if the RNG seed is not fixed at a constant value, repeated calls to the same function on a given dataset will give different results, but their average will converge to the true values (by the law of large numbers), which is disclosive. To overcome this issue we must fix the seed, but we should keep it secret from the user.
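To make the disclosure risk concrete, here is a minimal sketch in R. The function `noisy_output`, the value 42, and the seed 12345 are all hypothetical and only stand in for a server-side function that perturbs its output; this is not the code of any actual DataSHIELD function.

```r
# Hypothetical true value that the server must not reveal exactly.
true_value <- 42

# Stand-in for a server-side function that adds normal noise to its output.
# When `seed` is NULL the RNG is left unseeded for this call.
noisy_output <- function(value, seed = NULL, sd = 1) {
  if (!is.null(seed)) set.seed(seed)
  value + rnorm(1, mean = 0, sd = sd)
}

# Unseeded calls: each call draws fresh noise, so a user who repeats the
# call many times can average the noise away.
set.seed(1)  # only here so that this demonstration is reproducible
unseeded <- replicate(10000, noisy_output(true_value))
mean_unseeded <- mean(unseeded)

# Fixed secret seed: every call returns the same perturbed value, so
# averaging repeated calls reveals nothing beyond a single call.
seeded <- replicate(10000, noisy_output(true_value, seed = 12345))
mean_seeded <- mean(seeded)
```

In this sketch `mean_unseeded` ends up very close to `true_value`, whereas `seeded` contains one value repeated, which differs from `true_value` by the fixed noise term.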

I have developed a server-side function (https://github.com/datashield/dsBetaTest/blob/master/R/seedDS.o.R) that generates the seed number based on an input vector; however, the way this number is generated is not optimal. Do you have any thoughts on how to optimise the generation of a seed number based on an input vector?

One idea is that Opal generates a study-specific seed number at the time the data are uploaded to the Opal server. Another idea is that the data owner specifies the seed number as part of the data dictionary, but this number should be kept secret from everyone else. @yannick do you think that either of these two options is possible?

Sure, there is no problem for Opal (or whatever data repository is used (think DSI)) to provide a (server instance specific) seed number at Datashield R session creation. This can be part of the specifications of a Datashield-compatible data repository. What would be the scope of this seed number? Simply study-specific or R session specific, etc.?

Yannick

Great! This should be study-specific, so that any user who analyses data from a given study always gets the same answer as any other user who does the same analysis on the same data.

OK, so this is the right time to do it, as there is an Opal release coming soon.

Would an R option datashield.seed work for you? In your server-side R code you would get the seed value with getOption("datashield.seed"). Opal will ensure the seed is always the same.
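As a rough sketch of how server-side code might consume such an option: the helper `add_noise_ds`, its `sd` parameter, and the seed value 2024 are hypothetical and used for illustration only. The `options()` call merely simulates locally what the data repository would do when it creates the R session.

```r
# Hypothetical server-side helper: perturbs a numeric vector with normal
# noise drawn from an RNG seeded by the repository-provided option.
add_noise_ds <- function(x, sd = 0.1) {
  seed <- getOption("datashield.seed")
  if (is.null(seed)) {
    stop("R option 'datashield.seed' is not set", call. = FALSE)
  }
  # Assumes the option holds a value that fits in an R integer.
  set.seed(as.integer(seed))
  x + rnorm(length(x), mean = 0, sd = sd)
}

# Local simulation only: on a real server the repository sets this option
# and the value is never shown to the user.
options(datashield.seed = 2024)

first  <- add_noise_ds(c(1, 2, 3))
second <- add_noise_ds(c(1, 2, 3))
same_result <- identical(first, second)
```

Because the RNG is re-seeded on every call, repeating the same analysis on the same data returns the same perturbed values, which is the behaviour asked for in this thread.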

Is there a preferred range of values for the seed number?

Yes, an R option datashield.seed would work. The seed can be any number; we don’t have a preferred range. The data owner can specify this number for their study. However, we want this number to be hidden from all users. We don’t want it to be visible in the Opal interface like the other R options, nor in any package’s DICTIONARY.

Opal already has a secret key which is instance-specific and generated at first run. We can have a seed number as well, hidden from the users.

The opal feature request issue.

Yannick

Great! Thanks Yannick.

What about the case where datasets from different studies are hosted on the same Opal? They could even be in the same project. The seed cannot depend on the dataset name or path, because one could assign several tables during the same DataSHIELD session. I am a bit lost as to what you need.

Hi @yannick. Yes, it's OK for different studies that are hosted on the same Opal to share the same seed.

Hi,

The datashield.seed R option is available in Opal 2.14, which was released yesterday.

Cheers

Yannick

Thanks Yannick. I will try the generated seed and send you any feedback I have.

Best, Demetris