Ds.cut - Numeric value to factor with intervals

Dear all,

I want to re-create the functionality of cut() from base R in DataSHIELD and am thinking about confidentiality. Basically, I want to be able to categorize, for example, BMI into <18.5, 18.5-25 and >25, or age into something like (0,10], (10,20], (20,30], (30,40], (40,50], (50,60], (60,120].
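For reference, this is roughly what the two categorizations look like with plain cut() in base R on a local vector. It is only a sketch with made-up values; note that cut() closes every interval on the same side, so the BMI grouping below treats exactly 18.5 as part of the middle group and exactly 25 as part of the top group, which is an assumption about the intended boundaries.

```r
# Illustrative values only, not real data
bmi <- c(17.2, 18.5, 22.0, 25.0, 31.4)
age <- c(5, 10, 20, 35, 61)

# BMI: left-closed intervals [-Inf,18.5), [18.5,25), [25,Inf)
bmi_cat <- cut(bmi, breaks = c(-Inf, 18.5, 25, Inf), right = FALSE,
               labels = c("<18.5", "18.5-25", ">=25"))

# Age: default right-closed intervals (0,10], (10,20], ..., (60,120]
age_cat <- cut(age, breaks = c(0, 10, 20, 30, 40, 50, 60, 120))

print(table(bmi_cat))
print(table(age_cat))
```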

If someone used this function several times with only slightly different break points, they might obtain individual-level information by combining the result with other categorical or categorized variables in tables. E.g. in the above example, changing the age break point from 20 to 20.1 and shifting the BMI break points slightly, one might end up with a table that differs in only one value from the table obtained with the first categorization.
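To make the concern concrete, here is a small local illustration in base R with invented data (the variable smoker is hypothetical and only stands in for "another categorical variable"). Moving the age break from 20 to 20.1 shifts exactly one person, and the difference of the two tables shows that person's other attribute.

```r
# Invented data: six people, one of them aged between 20 and 20.1
age    <- c(12, 18, 20.05, 27, 33, 45)
smoker <- c("no", "yes", "yes", "no", "no", "yes")

# Same two-way table, computed with two almost identical break points
t1 <- table(cut(age, breaks = c(0, 20, 120)), smoker)
t2 <- table(cut(age, breaks = c(0, 20.1, 120)), smoker)

# Cell-wise difference between the two tables
d <- matrix(as.vector(t2) - as.vector(t1), nrow = 2,
            dimnames = list(c("lower", "upper"), colnames(t1)))
print(d)
# Exactly one cell gains a person and one cell loses a person,
# so the attribute of the single shifted individual is revealed.
```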

Has anyone already thought about this? For me, there would be three options:

  1. Restrict the available options for categories, e.g. only into 2 or 4 groups based on the combined quartiles.

  2. Store somewhere which or how many categorizations the analyst has already performed and use this for checks, e.g. whether the number of individuals re-categorized compared to the previous categorization is large enough, or whether the number of different categorizations of a numeric variable is still below a given threshold.

  3. Only allow a re-categorization if at least n.filter.tab observations are at a break point.
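One possible reading of option 3 can be sketched in base R as follows. This is only an assumption about how the check could work: a shifted break point is accepted if it moves either nobody or at least n_min observations into another category, where n_min is a hypothetical argument standing in for n.filter.tab. The function names are made up for illustration and are not part of DataSHIELD.

```r
# Count observations that change category when a break point moves,
# assuming right-closed intervals (a, b] as in cut()'s default.
moved_between <- function(x, old_break, new_break) {
  lo <- min(old_break, new_break)
  hi <- max(old_break, new_break)
  sum(x > lo & x <= hi, na.rm = TRUE)
}

# Accept the new break only if no one moves or enough people move.
# n_min is a hypothetical stand-in for the n.filter.tab threshold.
break_shift_ok <- function(x, old_break, new_break, n_min = 3) {
  n_moved <- moved_between(x, old_break, new_break)
  n_moved == 0 || n_moved >= n_min
}

age <- c(12, 18, 20.05, 27, 33, 45)
print(break_shift_ok(age, 20, 20.1))  # one person moves
print(break_shift_ok(age, 20, 35))    # three people move
```

A check like this only compares two break points of one variable; it does not by itself cover combinations with other variables in a multi-way table.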

For all three options, I have problems:

  1. Apart from the options named above, I find it difficult to define categorizations that can be generated automatically from information I can obtain in DataSHIELD. For example, I cannot find an automatic rule that yields the two categorizations of BMI and age given above.
  2. I am not aware of a way to store such information, in particular across several analysis sessions. In addition: what would the process be for adding another threshold?
  3. In a two- or three-dimensional table, the problem described above might still be present, although it is less likely. Maybe use a higher value instead of n.filter.tab, e.g. n.filter.tab multiplied by 3? In addition, I think I would need to build rounding into the functions.
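For option 2, the bookkeeping itself is simple; the open question is where to keep it. The sketch below uses an in-memory environment, so it only lives for one R session and does not solve the persistence problem. The names registry, register_breaks and max_distinct are all hypothetical and nothing here is an existing DataSHIELD mechanism.

```r
# In-memory registry: one entry per variable, holding the sets of
# break points already used (hypothetical, lost when the session ends).
registry <- new.env()

# Returns TRUE if the categorization is allowed, FALSE otherwise.
# max_distinct is an assumed limit on distinct break sets per variable.
register_breaks <- function(variable, breaks, max_distinct = 3) {
  key <- paste(sort(breaks), collapse = ",")
  seen <- if (exists(variable, envir = registry, inherits = FALSE)) {
    get(variable, envir = registry)
  } else {
    character(0)
  }
  if (key %in% seen) return(TRUE)                  # repeat of an earlier request
  if (length(seen) >= max_distinct) return(FALSE)  # limit reached
  assign(variable, c(seen, key), envir = registry)
  TRUE
}
```

Repeating an earlier set of break points is always allowed here, since it gives the analyst nothing new; only a new set counts against the limit.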

I will start with implementing option 3 (combined with 1 as an alternative option), but I still feel that option 2 should be part of the checks.

Best, Daniela

Hi Daniela,

Good idea. Would any of the subset functions achieve this ….

P.

Hi Daniela,

We had similar thoughts in the past, but we still don't know the optimal solution. One idea was a system that monitors the commands each analyst uses (their order and how many times they are used) and can control and stop any attempt at disclosure, whether the attempt is deliberate or not. Such a system could use theoretical models such as "differential privacy". Another idea was to have a "white hat hacker" who plays through many different scenarios, checks whether specific combinations of functions can be used to disclose information from the data, and works out what we would have to change to block such scenarios. Of course, we need resources and people to do that :)

In relation to your specific question, a ds.cut() would be a useful function to have in DataSHIELD. At the moment we can do such a categorization using a combination of the ds.Boole and ds.make functions. For example, if you have a continuous bmi variable and you want to create a categorical variable with three levels, you can do something like this:

ds.Boole(V1 = 'bmi', V2 = 18.5, Boolean.operator = ">=", numeric.output = TRUE, newobj = 'var1')

# var1 has zeros if bmi<18.5 and ones otherwise

ds.Boole(V1 = 'bmi', V2 = 25, Boolean.operator = ">", numeric.output = TRUE, newobj = 'var2')

# var2 has zeros if bmi<=25 and ones otherwise

ds.make(toAssign = 'var1+var2', newobj = 'bmi.c')

# bmi.c can then take 3 possible values: 0 if var1 and var2 are both equal to zero (this means that bmi<18.5), 1 if var1 is equal to 1 and var2 is equal to 0 (this means that bmi is in the range 18.5-25) and 2 if both var1 and var2 are equal to 1 (this means that bmi>25).

Finally, you can convert bmi.c from numeric to a factor.
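The logic of the recipe can be checked locally in base R, without any DataSHIELD server. The snippet below is only a sketch with invented values; it mirrors the two Boolean indicators and their sum, and the factor labels are chosen here for illustration.

```r
# Invented BMI values, including the two boundary values
bmi <- c(17.2, 18.5, 22.0, 25.0, 31.4)

var1 <- as.numeric(bmi >= 18.5)  # 0 if bmi < 18.5, 1 otherwise
var2 <- as.numeric(bmi > 25)     # 0 if bmi <= 25, 1 otherwise
bmi_c <- var1 + var2             # 0, 1 or 2

# Convert the numeric code to a factor with readable labels
bmi_f <- factor(bmi_c, levels = 0:2,
                labels = c("<18.5", "18.5-25", ">25"))
print(table(bmi_f))
```

Because the two comparisons use >= and >, the middle group is closed on both sides (18.5 and 25 both fall into it), which a single call to cut() does not give directly.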

But anyway, we still don't block iterative attempts at the above procedure with slightly different break points.

It would be great if we could discuss these thoughts with you at some point (perhaps in a call?).

Best wishes, Demetris

Hi Demetris,

A call would be great! I have several things I am thinking about which might be good to discuss. In particular, we plan to establish DataSHIELD as a way to "publish" medical data alongside a paper. It would then be good to be able to allow specific requests that reproduce the paper's results even where only 1 or 2 patients are shown.

How shall we proceed? Can you set up a doodle?

Best, Daniela

Hi Daniela,

That’s great! Yes, I will set up a doodle and send it around.

Best wishes, Demetris