I want to re-create the functionality of cut() from base R and am thinking about confidentiality. Basically, I want to be able to categorize, for example, BMI into <18.5, 18.5-25 and >25, or age into something like (0,10], (10,20], (20,30], (30,40], (40,50], (50,60], (60,120].
If someone used this function several times with only slightly different break points, they might obtain individual-level information by combining the result with other categorical or categorized variables in tables. E.g. in the above example, changing the break point from 20 to 20.1 in age and shifting a BMI break point slightly, one might end up with a table that differs in only one value from the table obtained via the first categorization.
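To make the concern concrete, here is a minimal sketch in plain R (outside DataSHIELD) using a made-up age vector. The data and object names are only assumptions for illustration. Moving one break point from 20 to 20.1 changes exactly two cells by one count each, which tells the analyst that exactly one person has an age in (20, 20.1].

```r
# Hypothetical toy data, chosen only for illustration.
age <- c(5, 12, 20.05, 25, 33, 47, 58, 71)

breaks_a <- c(0, 10, 20, 30, 40, 50, 60, 120)
breaks_b <- c(0, 10, 20.1, 30, 40, 50, 60, 120)  # one break point moved slightly

tab_a <- table(cut(age, breaks = breaks_a))
tab_b <- table(cut(age, breaks = breaks_b))

# Compare counts by position, because the interval labels differ.
diff_counts <- as.vector(tab_b) - as.vector(tab_a)
diff_counts
```

Any cell difference of 1 (or, more generally, below the disclosure threshold) isolates the individuals lying between the old and the new break point.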
Has anyone already thought about this? For me, there would be three options:
1. Restrict the available options for categories, e.g. only into 2 or 4 groups based on the combined quartiles.
2. Store somewhere which or how many categorizations the analyst has already performed and use this for checks, i.e. whether the number of individuals re-categorized compared to the previous categorization is large enough, or whether the number of different categorizations of a numeric variable is still below a given threshold.
3. Only allow a re-categorization if at least n.filter.tab observations are at a break point.
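A rough sketch of how options 1 and 3 could look in plain R follows. Everything here is an assumption for illustration: the function names cut_quartiles and recut_allowed are made up, the threshold argument n_filter_tab and its value 3 are placeholders, and "observations at a break point" is interpreted as the observations lying between the old and the new position of a break point. In DataSHIELD the quartiles would have to be the combined ones across studies. Here a single vector stands in for that.

```r
# Option 1 (sketch): only allow 2 or 4 groups, with breaks taken from quantiles.
cut_quartiles <- function(x, groups = 4) {
  stopifnot(groups %in% c(2, 4))
  probs <- seq(0, 1, length.out = groups + 1)
  breaks <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  # Tied quantiles would give duplicated breaks, which cut() rejects.
  stopifnot(!anyDuplicated(breaks))
  cut(x, breaks = breaks, include.lowest = TRUE)
}

# Option 3 (sketch): a shifted break point is only accepted if at least
# n_filter_tab observations lie between its old and its new position.
# Assumes right-closed intervals (the cut() default) and break vectors of
# equal length, so break points can be compared pairwise.
recut_allowed <- function(x, old_breaks, new_breaks, n_filter_tab = 3) {
  stopifnot(length(old_breaks) == length(new_breaks))
  lower <- pmin(old_breaks, new_breaks)
  upper <- pmax(old_breaks, new_breaks)
  # Observations in (lower, upper] are exactly those that change category.
  moved <- mapply(function(lo, hi) sum(x > lo & x <= hi, na.rm = TRUE),
                  lower, upper)
  changed <- old_breaks != new_breaks
  all(moved[changed] >= n_filter_tab)
}

# Hypothetical toy data, chosen only for illustration.
age <- c(5, 12, 20.05, 25, 33, 47, 58, 71)
old <- c(0, 10, 20, 30, 40, 50, 60, 120)
new <- c(0, 10, 20.1, 30, 40, 50, 60, 120)

quartile_groups <- cut_quartiles(age)
allowed <- recut_allowed(age, old, new)  # only one observation lies in (20, 20.1]
```

Note that recut_allowed needs the previous break points as input, so in practice it still depends on some form of the bookkeeping described in option 2.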
For each of these options, I have problems:
- Apart from the quantile-based categories mentioned above, I find it difficult to define categories that I can generate automatically from information I can obtain in DataSHIELD. For example, I cannot see a way to derive the two categorizations of BMI and age given above automatically.
- I am not aware of a way to store such information, in particular across several analysis sessions. In addition: what would the process be for adding another threshold?
- In a two- or three-dimensional table, the problem described above might still be present, but is less likely. Maybe use a higher value instead of n.filter.tab, e.g. n.filter.tab multiplied by 3? In addition, I think I would need to build rounding into the functions.
I will start with implementing option 3 (with option 1 as an alternative), but I still feel that option 2 should be part of the checks.