When analyzing data in R using DataShield, is it possible to join two tables from the same Opal project? For example we would like to compare someone’s age at baseline with their values at follow up (stored in a separate table). Is this possible?
Yes, that is possible. Assume that you have three studies (‘study1’, ‘study2’, ‘study3’) with 2 tables each (‘tableA’, ‘tableB’). By using the function datashield.login() you login to one of the tables from all the studies, and by using the function datashield.assign() you assign the second table. Below is an example:
server <- c(“study1”,“study2”,“study3”)
url <- c(“http://XXX…”,“http://YYY…”,“http://ZZZ…”)
tablesA <- c(“tableA1”,“tableA2”,“tableA3”)
logindata <- data.frame(server,url,user="???",password="???",table=tablesA)
opals <- datashield.login(logins=logindata,assign=TRUE, symbol=‘A’)
datashield.assign(opals, “B”, “tableB”)
Then you can use any DataSHIELD function as usual: For example: ds.table2D(x=‘A$variable1Name’, y=‘B$variable2Name’) ds.glm(formula=‘B$variable2Name~A$variable1Name’, …)
or you can use the new data manipulation functions to merge the tables in one. See for example the ds.dataFrame.o() and the ds.merge.o() functions
Hi Demetris,
How does the merge works? There is no column for the participant IDs in the data frame, these are in the row names instead; Opal is perfectly capable of making a column for the IDs, but the current behavior with DataSHIELD context is to use row.names (this is legacy). Do you think we should change this, and make the IDs more easy to work with?
Regards Yannick
Hi Yannick,
Yes, I think that in some cases is nessecary to work with the IDs and it would be helpful if we can make the IDs more easy to work with, but let’s discuss with Paul for his opinion. At the moment to deal with this issue, we replicate the column of row.names as participants IDs or we generate a new column with a sequence of numbers from 1 to N (where N is the total number of rows) before uploading the data to Opal. For example, Paul recently developed a reshape function that converts data from long to wide and from wide to long format. In the long format we have multiple rows representing the same participant and we “groupped” each participant’s data when converting the dataset to the wide format. This is feasible only if we have a column with the IDs. Similar need we have when we apply survival analysis where again multiple rows represent data for the same participants and therefore the column with the IDs is necessary. On the other hand the ds.merge function can be used by comparing any other variable rather than the ID (if there is not a column for IDs). For example it can merge two datasets using a “date of birth” column.
Another point that we can discuss with you and Paul, and check if we can change it, is the default behaviour of Opal in sorting the IDs (the row.names). At the moment Opal sorts IDs in an alphabetical order rather than a numerical order.
It would be great if we can arrange a skype call to discuss these points.
Sure, I regularly have Skype calls on Fridays with Patricia and Stuart at between 12-1pm CEST, next one is tomorrow.
The reason for sorting the IDs alphabetically is that these are not necessarily numerics (can have a prefix for instance).
Yannick
FYI, I have added the possibility to specify a column name for the ID in the datashield.login and datashield.assign functions. Only available with DSI and will be supported by the next release Opal (due in June).
Yannick
Great! Thanks Yannick.
HI Yannick, Is the June release of OPAL still happening in June with this feature?
Yes, end of June. Anyway you would not be able to use this feature as it will require the next API of Datashield (DSI) which is due this autumn.
Yannick
Hi Demetris,
I also met the same problem.
datashield.assign(opals, “B”, “tableB”)
This command doesnot work for me, where are “B” and “tableB" defined?
Regards, Hank
Hi Hank,
Did you get a specific error message? Can you write here the entire block of code that you use for logging in to the servers and assigning the tables?
Hi Demetris,
For example, I want to datashield.login to load the table CNSIM1 from sever dstesting-100-base and dstesting-101-base, and then add the table CNSIM2 to the same opal connection using datashield.assign.
Here is my logindata
:
server url user password table
1 dstesting-100-base http://192.168.56.100:8080 administrator datashield_test& CNSIM.CNSIM1
2 dstesting-101-base http://192.168.56.101:8080 administrator datashield_test& CNSIM.CNSIM1
Step 1: load CNSIM1
opals <- datashield.login(logins=my_logindata, assign=TRUE, symbol="D")
Step 2: load CNSIM2 in the same opal connection
table <- c("CNSIM.CNSIM2","CNSIM.CNSIM2")
datashield.assign(opals, "D2", table)
The above codes doesnot report any errors, but D2 is not successfully defined
ds.exists("D2")
$`dstesting-100-base`
[1] FALSE
$`dstesting-101-base`
[1] FALSE
Please let me know if I understood any step uncorrectly.
Regards, Hank
I guess you are using the released version of datashield. In which case the datashield.assign
function is rather limited. It will not guess that the first table name applies to the first server etc. In the next version of datashield (DSI
based), the datashield.assign
will handle a named list of values like datashield.assign(opals, list('dstesting-100-base'='CNSIM.CNSIM2', 'dstesting-101-base'='CNSIM.CNSIM2'))
. Other parameters options are also available (see the DSI::datashield.assign.table documentation). In the meantime you must call the assign for each of the server:
datashield.assign(opals[1], "CNSIM.CNSIM1")
datashield.assign(opals[2], "CNSIM.CNSIM2")
# etc.
# using lapply() is certainly more elegant, but you get the idea
A special case is when the table name is the same in each of the server, then this should work:
datashield.assign(opals, "CNSIM.CNSIM2")
Yannick
Hi Yannick,
Thanks a lot, it worked perfect for me.
Regards, Hank