How to create a new DataSHIELD user?

Hi all,

We’ve set up Opal servers to try creating privacy-preserving federated data opportunities. We’ve struggled to find documentation on how to create a DataSHIELD-only user to access the data in Opal. Do I create a user in Opal for them to access tables via datashield?

I’ve also tried looking at how the users are configured in the tutorial VMs, but at present I’m having trouble getting VirtualBox to work. The username “administrator” for logging into datashield does not persuade me that the account is unable to access the private data.

Please help clarify or point to docs about creating datashield-only users.

Thanks,

Joel

Hi Joel,

You create a user in Opal, give them permission to view summaries on a table/view and also give them permission to use DataSHIELD.

Then you specify these credentials and the table you gave the permission on as part of the login object:

builder$append(server = "study1",  url = "http://192.168.56.100:8080/",
           user = "my_user", password = "my_password",
           table = "CNSIM.CNSIM1", driver = "OpalDriver")

Does that help?

Tom

Thank you Tom, very helpful. Sorry for the slow response.

I will admit, I’m now stuck a few steps down the path from there. Together with Vincent van Hees and others I’m trying to prove the concept of uploading a harmonised tabular dataset and performing regressions using datashield, with an eye to developing a procedure for installing new nodes that minimises ongoing administration (i.e. node owners should focus on uploading current data / appending rows according to a harmonised table definition).

With apologies, I’ve not worked through the documentation/tutorials in a rigorous manner. I have now understood that:

[User Management]

  • We will need to deploy a centralised OpenID Connect authority where we can create new user accounts.
  • We will be best to develop an Opal image that comes with that OpenID Connect authority configured to a datashield-permitted group. (I’ve not yet investigated how easily this can be achieved with CLI.)
  • On the datashield user side, I suppose that will then mean having a script to acquire an OpenID token to enable the datashield logins.

[Data Ingestion]

  • We may need to create all 1000+ table variables in advance: is there a way to do that from CLI, or is documentation for the relevant REST APIs available?
  • We will still require someone CLI or R-savvy to execute the data import (likely from CSV) as I’ve not identified UI for this.
  • The import of a new table will then require permitting the OpenID group to access its summaries (as this can’t be permitted in advance of table creation). Also possible from CLI.
  • Not yet sure what the best way is to handle the addition of new records, if necessary, to a node’s table: whether this should involve the creation of a new table with a new version of the data, copying the variables and permissions from the previous table, etc, and disseminating the new table name to users.

I expect I will need to make more time for going through the tutorials/docs. :wink:

The Obiba community’s support through this POC process is much appreciated.

Thank you,

DR JOEL NOTHMAN | Data Science / Research Engineer Group Lead

Sydney Informatics Hub | Core Research Facilities | Research Portfolio

THE UNIVERSITY OF SYDNEY

Hi,

Most of your needs (if not all) are covered by the opalr R package: data import/export, dictionary, permissions, DataSHIELD setup etc. There is also a Opal python client, that you can use for your data workflow. And of course all functionalities are accessible through the GUI.

Opal supports authentication to OpenID Connect providers. See also the Personal Access Token feature.

Regarding data management, my preference goes to making “releases” of datasets (different table name in the same project or same table name in a different project), so that users do not get surprised when analyses results change or to make previous analyses reproducible. You can use a View (logical table) to always point to the latest release.

Best regards
Yannick

I would add that the data can be imported via the Opal UI, so it doesn’t necessarily need to be done by someone with CLI expertise. @Yannick can correct me, but I think everything that you do on the CLI can be done in Opal? It’s just if you want automation (esp to do something like set permissions in bulk) then the CLI is more convenient.

If you import via CSV then you will also need to create an accompanying data dictionary to describe the data - so it might be better to looking into using one of the other formats supported (Stata, R etc.)

I agree with @Yannick 's comments about versioning of the data

So I haven’t found the import functionality in Opal UI, but it’s good to know it’s hiding in there. I haven’t obviously found variable creation in the CLI or the ability to clone a table’s variable definitions without cloning its data (or is this something I can do with export/import annotations), so that seems to make CSV import fairly useless when one’s variables are not all text.

Thanks for the advice regarding versioning and user management. We will need to think more about openid solutions for our consortium.

To import data, create a project (select in which database project’s data are to be stored), go to this project, select Tables tab, there is a big Import button.

In Opal there are plugins that you can add (Administration section) and one of them (opal-datasource-readr) imports data in CSV format with data type discovery, which can be convenient.

When you say “to clone a table’s variable definitions without cloning its data”, you are referring to a View: select the variables of table, then click on “Add to view” and you are done. Another way of doing this: add variables to the cart, go to the cart and make a view. You will be able to change the data type of each variable, and much more, using the view.

Note that with the R client you can prepare your dataset, the way you like (from any kind of input format that is readable by R and then perform any data transformation/cleansing) and then save it into Opal. There is definitely no reason why one would stick with text types from CSV files.

Regards
Yannick

Indeed the import button is there staring me in the face. Not sure how I lost it! I suppose it is just a matter of familiarising oneself with the layers of navigation and toolbars.

By copying the variable definitions, I mean that were CSV import to be useful, one needs to be able to import another CSV with the same datatype definitions. I’ve not worked out how to do that with Add to View, but perhaps the bottom line is to use a self-describing data format. No worries.

Thanks again

Joel - I now understand from Doua that uploading the test data as SAS-file instead of csv-file was her solution: The variable formats were automatically recognised and she was able to run some basic statistics tests on the data. Note that Opal also accepts files from Stata, SPSS, SQL, and more. It is probably good to communicate this to data owners as a possible alternative upload route.

Hi Yannick,

I’d like to clarify the datashield login procedure for a multi-node setup with an OpenID Connect provider common to all nodes. Will the datashield user need to:

  • log into the web UI of each opal node
  • generate a personal access token
  • copy-paste that into a list of login details

Have I missed a simpler solution that doesn’t require the user to perform per-node bureaucracy?

Thanks,

Joel

Hi,

Exactly, this per-node bureaucracy is only required once, whereas making the OIDC dance would have to be done each time the token has expired, for each node. The personal access token is the most suitable and safer way to authenticate using an API, as opposed to OIDC that is for human interaction (function call that redirects to a web browser, then callback to the “script”, is a bit weird). See also this Personal access token description.

Regards,
Yannick

It would still be ideal to me if there were some way to facilitate this dance when a user needs to acquire PATs from multiple nodes and compose a login script. I also note that creating a PAT requires four clicks and a text input in the UI.

So if we were to facilitate the login but still require the user to copy-paste, it would be something like:

  • Give the user an interface that allows them to click through a list of links to /ui/index.html#!profile for each node, with a form to paste in the collected tokens
  • Clicking on a link opens a window where the user authenticates, clicks Add Access Token, copies the token to the form, enters name, clicks Analyse and Use DataSHIELD and Save
  • Once all collected tokens have been pasted into the form, it spits out an R datashield login sciript.

Otherwise, I can imagine using a web driver to execute the same flow, with user input at authentication. But distributing a web driver-based app to non-tech-savvy users is not ideal.

The third option would indeed involve a script opening a web browser to authenticate then using that OIDC flow to define a PAT with an Opal REST API. I’m not sure whether cross-site protections would make this difficult.

Thoughts?

Joel

What you would expect is a kind of dashboard for the DataSHIELD nodes. That could also include some admin tools for monitoring the different node states. There is no such application right now.

What I can propose:

  • facilitate the creation of PATs in the Opal interface, by adding a more obvious link in the landing page and shortcuts + simplified forms for creating PATs for different common purposes (like DataSHIELD, data import/export, SQL etc.)
  • add some functions in opalr R package to login using OIDC (the weird OIDC dance initiated from a script, not sure it works from Rstudio server by the way) and to create PATs.

Regards
Yannick

+1 for quick links to get PATs as a quick solution. I don’t mind where the start page is as long as it can be reached from a fixed URL, but it should then be at most three clicks to generate, copy and save the PAT, at least for DataSHIELD access since for that case you expect the user has minimal interest in the specific content of one node.

AFAICT, to use OIDC from a script requires knowing the client ID and listening on a port on localhost for the returned id/access token. No, opening a port won’t work in rstudio server, at least not when run on another host. But OIDC login from script by itself is not as useful as performing multiple OIDC logins from a script/CLI and returning a personal access token for each node in a way that you can then just paste that code into RStudio (server or otherwise).

And yes, I suppose users would find node status information to be helpful.

Cheers,

Joel