Tutorial for creating central web portal?

Hello, I am exploring how to implement DataSHIELD for the ProPASS research consortium. I found your documentation for data providers, users, and function developers, but I am struggling to find documentation for setting up the client portal that acts as a central node in the network of Opal data warehouses, as depicted in Figure 3 of the Wilson 2017 paper https://datascience.codata.org/articles/10.5334/dsj-2017-021/. Is this central client portal still needed? If yes, where can I find documentation for creating such a portal?

Kind regards, Vincent

Hi Vincent,

There are different philosophies around how you manage the client side of things. It is possible for every user to work with the servers holding the data by installing R and the DataSHIELD packages locally on their own computers. For example, on a Windows machine:

https://data2knowledge.atlassian.net/wiki/spaces/DSDEV/pages/1146454017/v6.1+Windows+Installation+Instructions and see the part “Install DataSHIELD client packages”.
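For reference, a minimal command-line sketch of that "Install DataSHIELD client packages" step. This assumes R is already installed locally; the package names (DSI, DSOpal, dsBaseClient) follow the wiki page linked above, and the OBiBa CRAN repository is the usual source for dsBaseClient:

```shell
# Install the DataSHIELD client packages from the command line.
# DSI and DSOpal come from CRAN; dsBaseClient comes from the OBiBa CRAN repo.
Rscript -e "install.packages(c('DSI', 'DSOpal'), repos = 'https://cloud.r-project.org')"
Rscript -e "install.packages('dsBaseClient', repos = c('https://cloud.r-project.org', 'https://cran.obiba.org'))"
```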

I think the disadvantage of that approach is that each user has to make sure the versions of their DataSHIELD client packages match the server-side packages. The advantage is that the consortium doesn’t have to run a web portal.

Consortia such as InterConnect and LifeCycle (I think? @sidohaakma) provide RStudio Server, a hosted version of RStudio that acts as a web portal for all users. You can just follow the standard instructions for installing RStudio Server available on the web, and then install the DataSHIELD client packages for all users.
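As a sketch of that last step, once RStudio Server is installed per the official instructions, the client packages can be put in the site library so every user picks them up. The library path below is the usual Debian/Ubuntu default and may differ on your distribution:

```shell
# Install the DataSHIELD client packages system-wide (run as root).
# /usr/local/lib/R/site-library is the typical Debian/Ubuntu site library;
# adjust for your distribution.
sudo Rscript -e "install.packages(c('DSI', 'DSOpal'), repos = 'https://cloud.r-project.org', lib = '/usr/local/lib/R/site-library')"
sudo Rscript -e "install.packages('dsBaseClient', repos = c('https://cloud.r-project.org', 'https://cran.obiba.org'), lib = '/usr/local/lib/R/site-library')"
```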

Best wishes

Tom


Hi,

I would add to Tom’s comment that when some data node owners require access to be restricted to a known client only, a central trusted RStudio Server is the preferred option.

Regards
Yannick


Thanks Tom and Yannick for your replies in November! It took me a while to reflect on this and to obtain input from my consortium partners. I have a couple of follow-up questions. I see there is also an OBiBa user group https://groups.google.com/g/obiba-users, if you think my questions are better asked there then please tell me.

If we were to go for a central portal solution:

  1. RStudio Server does not come with user and credential management, so we would need a separate tool for Linux user management. Can you advise on this, or is this actually what Agate and Mica are designed for? If yes, will it manage both the Opal and the DataSHIELD credentials?

  2. I am assuming that the plan is to give every researcher in the consortium their own unique credentials, so that their activities can be audited, correct?

  3. For the purpose of reproducible research, is there a mechanism to log the state of each Opal server and communicate this back to the DataSHIELD users, so that they can reproduce their analysis at a future point in time?

Best wishes,

Vincent

Hi Vincent,

Here are my replies, but of course it would be great to get @yannick 's view too.

It’s actually a good question about where this type of discussion belongs. In the past, I think obiba-users has been quite focused on the detailed technicalities of the Opal, Mica etc. applications. I think the intention of the DataSHIELD forum was originally to offer support on using and developing DataSHIELD functions. However, as we all know, if you are actually using DataSHIELD in a real-life project, there are a lot of infrastructure considerations that we are all grappling with. So I suggest for now we use this forum for these issues too!

For your questions:

  1. You are correct that RStudio Server uses Linux credentials. Agate is for managing users across the OBiBa suite (Opal, Mica etc., and hence also DataSHIELD). Mica is the OBiBa cataloguing product. I know that @sidohaakma is now looking at using JupyterHub as the analysis portal in the LifeCycle project because it is able to utilise existing authentication services. I have not been able to look into it in detail yet, and maybe Sido can describe it better than me.

  2. It is for your consortium to decide who has credentials to access the system, but I think it would generally be a good idea to have credentials per user for auditing, as you say.

  3. The question of reproducibility is always challenging, I think. For the analysis to be reproduced at some point in the future, it needs to be guaranteed that the Opals are all still running. Given that they are run by independent groups, this cannot be guaranteed; for example, a group may run out of funding to keep its server running. For InterConnect we try to keep things alive at least until papers have been reviewed and published. I guess this is no worse than the current situation with genetics consortia, which send cookbooks to each study to run an analysis; it is unlikely that this could be reproduced easily. Maybe someone in another consortium has a better answer.

Best wishes

Tom

Hi,

My contribution to this discussion:

  1. One possible option is to configure the Linux server (where RStudio Server is running) so that authentication is delegated to an LDAP server (quite an advanced sysadmin job!). You could then use the user federation feature of Keycloak to manage the LDAP users with a modern user management tool. If you want to go one step further, Opal can delegate authentication to Keycloak (using OpenID Connect), and then the LDAP server becomes the central authentication service of the whole DataSHIELD infrastructure.
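To give a flavour of the first step, here is a sketch of delegating Linux logins to LDAP using SSSD. The hostname and base DN are placeholders for your own directory (e.g. the LDAP tree that Keycloak federates); a real deployment would also need the matching PAM/NSS configuration:

```ini
# /etc/sssd/sssd.conf (sketch; values are placeholders)
[sssd]
services = nss, pam
domains = consortium

[domain/consortium]
id_provider = ldap
auth_provider = ldap
ldap_uri = ldaps://ldap.example.org
ldap_search_base = dc=example,dc=org
cache_credentials = true
```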

  2. Yes, definitely one credential per user for activity auditing.

  3. Apart from the infrastructure sustainability issue mentioned by Tom, my recommendation would be (1) to use the Opal project backup/restore feature to save the DataSHIELD data setup, and (2) to use containerized R servers (Docker images with all the needed packages installed). This is the minimum you can do to make the science reproducible.
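A minimal sketch of such a containerized R server, pinning both the R version and the package source so the analysis environment can be rebuilt later. The base image and repositories here are illustrative; pin whatever image and package versions your consortium actually deploys (e.g. the OBiBa Rock images), and archive the built image alongside the Opal project backup:

```dockerfile
# Pinned R server image for reproducibility (sketch).
FROM rocker/r-ver:4.2.1
# Install the server-side DataSHIELD package from the OBiBa CRAN repo.
RUN Rscript -e "install.packages('dsBase', repos = c('https://cloud.r-project.org', 'https://cran.obiba.org'))"
```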

Regards
Yannick