Linking Opal tables for access through DataSHIELD

When analyzing data in R using DataSHIELD, is it possible to join two tables from the same Opal project? For example, we would like to compare someone's age at baseline with their values at follow-up (stored in a separate table). Is this possible?

Yes, that is possible. Assume that you have three studies ('study1', 'study2', 'study3') with two tables each ('tableA', 'tableB'). Using the function datashield.login() you log in to the first table on all the studies, and using the function datashield.assign() you assign the second table. Below is an example:

server <- c("study1", "study2", "study3")
url <- c("http://XXX…", "http://YYY…", "http://ZZZ…")
tablesA <- c("tableA1", "tableA2", "tableA3")
logindata <- data.frame(server, url, user="???", password="???", table=tablesA)
opals <- datashield.login(logins=logindata, assign=TRUE, symbol="A")
datashield.assign(opals, "B", "tableB")

Then you can use any DataSHIELD function as usual, for example:

ds.table2D(x="A$variable1Name", y="B$variable2Name")
ds.glm(formula="B$variable2Name~A$variable1Name", …)

Alternatively, you can use the new data manipulation functions to merge the two tables into one; see, for example, the ds.dataFrame.o() and ds.merge.o() functions.
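A hedged sketch of what such a merge call might look like. The argument names (x.name, by.x.names, newobj) and the "id" column are assumptions for illustration; check your dsBaseClient version's documentation for the actual signature:

```r
# Illustrative only: merge the two assigned tables server-side on an ID column.
# Argument names are assumed and may differ across dsBaseClient versions.
ds.merge.o(x.name = "A", y.name = "B",
           by.x.names = "id", by.y.names = "id",
           newobj = "AB", datasources = opals)
```

This mirrors base R's merge() semantics, with the result stored server-side under the new symbol rather than returned to the client.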

Hi Demetris,

How does the merge work? There is no column for the participant IDs in the data frame; these are in the row names instead. Opal is perfectly capable of making a column for the IDs, but the current behaviour in a DataSHIELD context is to use row.names (this is legacy). Do you think we should change this and make the IDs easier to work with?

Regards Yannick

Hi Yannick,

Yes, I think that in some cases it is necessary to work with the IDs, and it would be helpful if we could make them easier to work with, but let's ask Paul for his opinion. At the moment, to deal with this issue, we either replicate the row.names as a column of participant IDs or generate a new column with a sequence of numbers from 1 to N (where N is the total number of rows) before uploading the data to Opal.

For example, Paul recently developed a reshape function that converts data between long and wide format. In the long format, multiple rows represent the same participant, and we grouped each participant's data when converting the dataset to the wide format; this is feasible only if we have a column with the IDs. We have a similar need in survival analysis, where again multiple rows represent data for the same participant, so the ID column is necessary. On the other hand, the ds.merge function can match on any other variable rather than the ID (if there is no ID column); for example, it can merge two datasets using a "date of birth" column.
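The pre-upload workaround described above can be sketched in plain R (the data frame and column names here are hypothetical):

```r
# Hypothetical data frame whose participant IDs live only in the row names
df <- data.frame(age = c(34, 51, 47), row.names = c("p1", "p2", "p3"))

# Workaround 1: replicate the row names as an explicit ID column
df$id <- row.names(df)

# Workaround 2: generate a sequence 1..N as a surrogate ID
df$seq_id <- seq_len(nrow(df))
```

Either column survives the upload to Opal as ordinary data, so it remains available to server-side functions that need to group rows by participant.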

Another point that we could discuss with you and Paul, and check whether we can change it, is Opal's default behaviour when sorting the IDs (the row.names): at the moment Opal sorts IDs in alphabetical order rather than numerical order.

It would be great if we can arrange a skype call to discuss these points.

Sure, I regularly have Skype calls with Patricia and Stuart on Fridays between 12 and 1pm CEST; the next one is tomorrow.

The reason for sorting the IDs alphabetically is that they are not necessarily numeric (they can have a prefix, for instance).
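The difference between the two orderings is easy to see in plain R (the IDs here are illustrative):

```r
# Character IDs sort alphabetically: "10" comes before "2"
ids <- c("1", "2", "10", "p3")
sort(ids)

# Purely numeric IDs would sort numerically instead
sort(as.numeric(c("1", "2", "10")))
```

Since an ID like "p3" cannot be converted to a number, alphabetical ordering is the only one that works for every ID, which is why Opal uses it by default.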

Yannick

FYI, I have added the possibility to specify a column name for the IDs in the datashield.login and datashield.assign functions. This is only available with DSI and will be supported by the next release of Opal (due in June).

Yannick

Great! Thanks Yannick.

Hi Yannick, is the June release of Opal still happening, with this feature?

Yes, end of June. However, you would not be able to use this feature yet, as it requires the next DataSHIELD API (DSI), which is due this autumn.

Yannick

Hi Demetris,

I have run into the same problem.

datashield.assign(opals, "B", "tableB")

This command does not work for me. Where are "B" and "tableB" defined?

Regards, Hank

Hi Hank,

Did you get a specific error message? Can you post here the entire block of code that you use for logging in to the servers and assigning the tables?

Hi Demetris,

For example, I want to use datashield.login to load the table CNSIM1 from the servers dstesting-100-base and dstesting-101-base, and then add the table CNSIM2 to the same Opal connection using datashield.assign.

Here is my logindata:

  server             url                        user          password         table
1 dstesting-100-base http://192.168.56.100:8080 administrator datashield_test& CNSIM.CNSIM1
2 dstesting-101-base http://192.168.56.101:8080 administrator datashield_test& CNSIM.CNSIM1

Step 1: load CNSIM1

opals <- datashield.login(logins=my_logindata, assign=TRUE, symbol="D")

Step 2: load CNSIM2 in the same opal connection

table <- c("CNSIM.CNSIM2","CNSIM.CNSIM2") 
datashield.assign(opals, "D2", table)

The code above does not report any errors, but D2 is not successfully defined:

ds.exists("D2")
$`dstesting-100-base`
[1] FALSE

$`dstesting-101-base`
[1] FALSE

Please let me know if I understood any step incorrectly.

Regards, Hank

I guess you are using the released version of DataSHIELD, in which case the datashield.assign function is rather limited: it will not guess that the first table name applies to the first server, and so on. In the next (DSI-based) version of DataSHIELD, datashield.assign will handle a named list of values, like datashield.assign(opals, "D2", list('dstesting-100-base'='CNSIM.CNSIM2', 'dstesting-101-base'='CNSIM.CNSIM2')). Other parameter options are also available (see the DSI::datashield.assign.table documentation). In the meantime you must call the assign for each of the servers:

datashield.assign(opals[1], "D2", "CNSIM.CNSIM1")
datashield.assign(opals[2], "D2", "CNSIM.CNSIM2")
# etc.
# using lapply() is certainly more elegant, but you get the idea
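The lapply() approach mentioned in the comment could be sketched as follows. This is illustrative only: it assumes the opals object from datashield.login() above and one table name per server, and it needs live connections to actually run:

```r
# Illustrative sketch: assign one (possibly different) table per server.
# Assumes 'opals' from datashield.login(); not runnable without live servers.
tables <- c("CNSIM.CNSIM1", "CNSIM.CNSIM2")
invisible(lapply(seq_along(opals), function(i) {
  datashield.assign(opals[i], "D2", tables[i])
}))
```

Indexing both opals and tables by position keeps each table name paired with its server, which is exactly the pairing the released datashield.assign cannot infer on its own.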

A special case is when the table name is the same on each of the servers; then this should work:

datashield.assign(opals, "D2", "CNSIM.CNSIM2")

Yannick

Hi Yannick,

Thanks a lot, it worked perfectly for me.

Regards, Hank