I have a research problem where I want to scale all the variables in a dataset to zero mean and unit variance. For the continuous variables, I use the z-score transformation. Is it statistically correct to use the same transformation for the discrete variables (e.g. a binary)?
Short answer: No.
Long answer: It depends (the standard answer of a statistician)
First of all, we have to differentiate between ordinal variables (i.e. ordered variables like school grades, where the difference between two grades is not meaningful, only which one is better), nominal variables (i.e. categories like Treatment A, B, or C, where neither the order nor the difference is meaningful), and discrete counting processes (e.g. number of failures).
For the first two, the z-transformation is meaningless, as neither the mean nor the variance is meaningfully defined. Sometimes, ordinal data are treated as if they were interval or ratio data, and then it can be argued that the z-transformation is applicable.
For counting processes, it might be useful. For example, in genomic analyses, a SNP is often coded as “0 = SNP not present”, “1 = SNP present on one chromosome”, “2 = SNP present on both chromosomes”, and the standard approaches use the z-transformation (after the variables have been pre-screened so that only variables with a meaningfully high variance are analysed).
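As an illustration, here is a minimal sketch (Python with NumPy, using a made-up toy genotype matrix and an arbitrary variance threshold) of this pre-screen-then-standardise pattern:

```python
import numpy as np

# Hypothetical SNP matrix: rows = individuals, columns = SNPs coded 0/1/2
G = np.array([[0, 2, 1],
              [1, 2, 0],
              [2, 2, 1],
              [1, 2, 2]], dtype=float)

# Pre-screen: keep only SNPs whose variance exceeds a (made-up) threshold;
# the constant column (all 2s) is dropped here
var = G.var(axis=0)
keep = var > 0.1
G_kept = G[:, keep]

# Z-transform the remaining columns to zero mean and unit variance
Z = (G_kept - G_kept.mean(axis=0)) / G_kept.std(axis=0)
print(Z)
```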
Some scientists would use the z-transformation also for binary data, as the two categories then remain (you will still have only two values for the variable, just different values than before). Mostly, this is done when one wants to perform some kind of automatic variable selection and wants to eliminate the influence of the variance.
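A small sketch (toy data, Python/NumPy) of what happens when a binary variable is z-transformed: the variable still takes exactly two values, just relocated and rescaled.

```python
import numpy as np

# Hypothetical binary variable, e.g. 0 = "no", 1 = "yes"
x = np.array([0, 0, 1, 1, 1, 0, 1, 0], dtype=float)

# Z-transformation: the two categories remain, only their labels change
z = (x - x.mean()) / x.std()
print(sorted(set(z)))
```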
Please note that in a regression model, you would have to interpret the results very carefully!
As you might have read between the lines, the planned analysis also plays an important role in deciding whether the z-transformation is acceptable or not. Thus, if my answer doesn’t help you right now, please contact me with details on your planned analysis and I might be able to give you a more fitting answer.
Best wishes, Daniela
Hi @daniela.zoeller. Thanks for the helpful answer. My plan is not related to any kind of statistical analysis; rather, I am trying to develop a deterministic algorithm for data anonymization. In this algorithm, I treat the elements in each row of individual-level data as the coordinates of a point in an N-dimensional space (where N is the number of variables in the dataset).
One step of the algorithm finds the k-nearest neighbors of each data point, using the Euclidean (or Mahalanobis) distance as the proximity metric. This is the reason that I want to have the same scale in all the variables.
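To illustrate why the common scale matters for this step, here is a small sketch (made-up toy numbers) showing that the raw Euclidean distance is dominated by the variable with the largest scale, while the Mahalanobis distance rescales by the covariance so that both variables contribute:

```python
import numpy as np

# Toy data: column 0 on a small scale, column 1 on a much larger scale
X = np.array([[0.0,    0.0],
              [1.0, 1000.0],
              [2.0,  500.0],
              [1.5, 1500.0],
              [0.5, 2000.0]])

a, b = X[0], X[1]
diff = a - b

# Raw Euclidean distance: almost entirely determined by column 1
raw = np.linalg.norm(diff)

# Mahalanobis distance: weights by the inverse covariance, which is
# equivalent to z-scoring when the variables are uncorrelated
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
maha = float(np.sqrt(diff @ cov_inv @ diff))

print(raw, maha)
```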
At the moment I separate the continuous from the categorical variables, stratify on all combinations of the categories in the categorical data, and then search for the nearest neighbors within each stratum using only the continuous variables.
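The stratify-then-search step could be sketched roughly like this (Python/NumPy, with a hypothetical helper name and toy data; not the actual algorithm, just an illustration of the idea):

```python
import numpy as np

# Toy data: one categorical column and two continuous columns (made-up values)
cats = np.array(["A", "A", "B", "A", "B", "B"])
cont = np.array([[1.0, 2.0],
                 [1.1, 2.1],
                 [5.0, 5.0],
                 [0.9, 1.9],
                 [5.2, 4.8],
                 [9.0, 9.0]])

def knn_within_strata(cats, cont, k=1):
    """For each point, find its k nearest neighbors (Euclidean distance)
    among the points sharing the same categorical combination."""
    neighbours = {}
    for stratum in np.unique(cats):
        idx = np.where(cats == stratum)[0]
        sub = cont[idx]
        # z-score within the stratum: one way to put all continuous
        # variables on the same scale (epsilon guards constant columns)
        sub = (sub - sub.mean(axis=0)) / (sub.std(axis=0) + 1e-12)
        # pairwise Euclidean distances within the stratum
        d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
        order = np.argsort(d, axis=1)[:, :k]
        for row, nn in zip(idx, order):
            neighbours[int(row)] = [int(idx[j]) for j in nn]
    return neighbours

print(knn_within_strata(cats, cont, k=1))
```

Neighbours are only ever drawn from the same stratum, which mirrors the stratification described above.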
If you plan to attend the DataSHIELD workshop in September, I can show you more details of the algorithm.
Many thanks, Demetris
This is actually a very complex question, and I am aware of some discussions in the field of clustering concerning this combination. Unfortunately, I don’t think there is a consensus on what to do about it.