Corrupted variables after cbind

Hi all, I have noticed that when working in DataSHIELD some variables become corrupted when I create a new data frame from an existing one, for example after using the cbind function to bind a new variable to an existing data frame.

To demonstrate what I mean, the variable “sex” has become corrupted in one of my cohorts after I created the dataframe ‘D1’ from ‘D’:

ds.table1D(x='D$sex', datasource=opals['moba'])
$counts
      D$sex
1     45817
2     43724
Total 89541

$percentages
       D$sex
1      51.17
2      48.83
Total 100.00

$validity
[1] "All tables are valid!"

ds.table1D(x='D1$sex', datasource=opals['moba'])
$counts
      D1$sex
1      48791
2      36872
3      14782
4       2634
5        623
Total 103702

$percentages
      D1$sex
1      47.05
2      35.56
3      14.25
4       2.54
5       0.60
Total 100.00

$validity
[1] "All tables are valid!"

Whilst other variables are now empty:

ds.table1D(x='D$pets_preg', datasource=opals['moba'])
$counts
      D$pets_preg
0           70698
1           34072
Total      104770

$percentages
      D$pets_preg
0           67.48
1           32.52
Total      100.00

$validity
[1] "All tables are valid!"

ds.table1D(x='D1$pets_preg', datasource=opals['moba'])
$counts
      D1$pets_preg
0                0
1                0
Total            0

$percentages
      D1$pets_preg
0              NaN
1              NaN
Total          NaN

$validity
[1] "All tables are valid!"

The same thing hasn’t happened in other cohorts:

ds.table1D(x='D$sex', datasource=opals['dnbc'])
$counts
      D$sex
1     49674
2     47206
Total 96880

$percentages
       D$sex
1      51.27
2      48.73
Total 100.00

$validity
[1] "All tables are valid!"

ds.table1D(x='D1$sex', datasource=opals['dnbc'])
$counts
      D1$sex
1      49674
2      47206
Total  96880

$percentages
      D1$sex
1      51.27
2      48.73
Total 100.00

$validity
[1] "All tables are valid!"

ds.table1D(x='D$pets_preg', datasource=opals['dnbc'])
$counts
      D$pets_preg
0           52758
1           39358
Total       92116

$percentages
      D$pets_preg
0           57.27
1           42.73
Total      100.00

$validity
[1] "All tables are valid!"

ds.table1D(x='D1$pets_preg', datasource=opals['dnbc'])
$counts
      D1$pets_preg
0            52758
1            39358
Total        92116

$percentages
      D1$pets_preg
0            57.27
1            42.73
Total       100.00

$validity
[1] "All tables are valid!"

I created the new data frame (‘D1’) by binding three new variables to the original data frame:

ds.cbind(x=c('D', 'ige_cat_3', 'ige_cat_4', 'ige_cat_5'), newobj = 'D1', datasources = opals)

Has anybody experienced the same issue and/or have any idea what I am doing wrong?

Any suggestions are very welcome! Thanks very much in advance!

Angela

I have experienced a similar issue: one variable became 0 when I used ds.subset on a dataframe created via ds.dataframe.

Just an update:

We found what causes this issue. When ds.cbind (or ds.dataFrame) is used across more than one datasource, the function first gets the column names of the input components from each study separately, then builds a single vector of the unique column names across all studies, and passes that vector to each study to be used as the column names of the data frame generated there. However, when the columns of the input components are in a different order in different studies, the function assigns the names in the order of that shared vector, without considering the actual order of the variables in each study. As a result, a column can end up labelled with the name of a different variable.
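To make the mechanism concrete, here is a minimal sketch in plain base R that imitates the behaviour described above. It is not the DataSHIELD source code: the two local data frames stand in for two studies, and study_a, study_b, new_var and shared_names are made-up names used only for illustration.

```r
# Two hypothetical studies holding the same variables in a different column order.
study_a <- data.frame(sex = c(1, 2, 1), pets_preg = c(0, 1, 0))
study_b <- data.frame(pets_preg = c(0, 1, 0), sex = c(1, 2, 1))
new_var <- c(10, 20, 30)  # stands in for a variable such as ige_cat_3

studies <- list(a = study_a, b = study_b)

# Step 1: collect the column names from each study, then keep the unique ones.
all_names <- unlist(lapply(studies, function(d) c(names(d), "new_var")),
                    use.names = FALSE)
shared_names <- unique(all_names)

# Step 2: bind within each study, then apply the shared names by position.
bound <- lapply(studies, function(d) {
  out <- cbind(d, new_var)
  names(out) <- shared_names  # ignores the column order of this study
  out
})

# In study b the column labelled "sex" now holds the pets_preg values.
mislabelled <- identical(bound$b$sex, study_b$pets_preg)
print(mislabelled)
```

In this sketch the values themselves are untouched; only the labels move, which is why the same call can look fine in one study and wrong in another.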

We have corrected this behaviour: the new versions of ds.cbind and ds.dataFrame, which will be included in the v6.0 release of DataSHIELD, work independently for each study.
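As a rough illustration of what "independently for each study" could look like, the base R sketch below takes the column names from each study's own inputs instead of from a vector shared across studies. This is an assumption about the general idea, not the actual v6.0 implementation, and study_a, study_b and new_var are hypothetical names.

```r
# Two hypothetical studies with the same variables in a different column order.
study_a <- data.frame(sex = c(1, 2, 1), pets_preg = c(0, 1, 0))
study_b <- data.frame(pets_preg = c(0, 1, 0), sex = c(1, 2, 1))
new_var <- c(10, 20, 30)

studies <- list(a = study_a, b = study_b)

# Each study names its result from its own inputs, in its own column order.
bound <- lapply(studies, function(d) {
  out <- cbind(d, new_var)
  names(out) <- c(names(d), "new_var")
  out
})

# Every label stays attached to its own values in both studies.
labels_ok <- identical(bound$a$sex, study_a$sex) &&
  identical(bound$b$sex, study_b$sex)
print(labels_ok)
```

The column order still differs between the two results, so anything that addresses columns by position rather than by name would need care.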

We also plan to develop, for a future release, a “rename” function that lets the user rename the columns of a data frame and a “reorder” function that lets the user reorder them.