Hi all, I am new to the community and I’ve been asked to look at implementing DataSHIELD in our secure data environment. Looking at the documentation, I’ve seen that there are two options for server deployment:
Opal
Armadillo
We are using a largely Kubernetes-based infrastructure, so container integration is essential. I see that both of these have Docker definitions, though at a glance it appears that the Opal implementation may be more mature? This is just a first impression on my part.
What are the benefits and drawbacks of each option?
You have it mostly right. Opal is the mature sibling and Armadillo is the younger sibling. Most of the DataSHIELD documentation is written with Opal in mind, so we still assume Opal to be the default DataSHIELD flavor.
To the Armadillo Team’s credit, they have learned many lessons from Opal so Armadillo is also better, in some ways.
Armadillo comes with some extensions to Opal, such as ds-upload which makes it possible to automate data updates.
The Armadillo team designs with the intention of being ‘more lightweight’ than Opal (which has a lot of legacy code).
On the subject of Kubernetes, the DataSHIELD developers are indeed a bit more pro-Docker than pro-Kubernetes.
@wilmar.igl wrote a tutorial on different ways to install DataSHIELD, and I know from there that there was a project called CORAL, though I think they used something like Docker Stack.
Hi Shaun,
As an Armadillo administrator, I can be positive about the Kubernetes support. Older versions (2.x) were hosted by Molgenis on Kubernetes. Since the release of Armadillo 3, we dropped Kubernetes support (because there was no demand and we were migrating our internal production to Azure VMs).
But good news: I’m currently working on a CI/CD build to do test releases to Kubernetes. If you are willing, maybe we can share some thoughts and cooperate to get this to production.
Hi Shaun, welcome to the forum. At Liverpool Uni we have installed a proof-of-concept DataSHIELD infrastructure in the NHS NW SDE (Kubernetes based, hosted by Lancs Teaching Hospital Trust) and are now in the planning stage for a pilot install in the region across 2 of our 3 ICBs. Can I suggest getting in contact with @olly (Olly . Butters @ Liverpool .ac.uk ), who can share more about lessons learned and the approach in the region.
Regarding Opal vs Armadillo, others have made some comments above. Within consented longitudinal studies we find that studies are typically using multiple components of the Obiba or Molgenis clinical software stacks, so the analytical functionality of DataSHIELD is compatible with both, to give flexibility to consortia who may already be using associated software. There are differences in the way analysts interact with each system: Opal typically with RStudio, while Armadillo uses Jupyter (I am not sure if this is always the case; @DickPostma can clarify). So how users will interact with the analysis may be a factor to consider.
To clarify: Armadillo also uses R (RStudio). However, we work in multiple large consortia where users interact with multiple cohorts (using both Armadillo and Opal). To streamline access, resources, DataSHIELD package versions etc., we created a JupyterHub setup where users can use RStudio and run their analysis. But this setup is of course not necessary.
I agree with Marije: both Armadillo and Opal are accessible using an R API. RStudio is better for R development, but JupyterHub is used because the “community” version of RStudio Server does not offer central authentication with an external provider (an institution’s OpenID service, for instance), which makes user management more cumbersome. Opal also has a Python API for system administration and data management (stable) and for DataSHIELD analysis (still experimental).
Regarding Opal vs. Armadillo, Opal has more features in terms of data and metadata management. More specifically, Opal supports various data input formats, and it is possible to fully describe the data harmonization process. This is scientifically very important when doing DataSHIELD analysis (how else would you know whether the harmonized variables of each study are comparable and can be combined in a DataSHIELD analysis?). Opal interoperates with the Mica data catalog to display this harmonization information (module developed by Maelstrom Research). The Coral distribution integrates all these services.
In terms of performance, they are very similar, mainly because of the “resources” approach, which can handle datasets of any size and format. During a DataSHIELD session, the hard work happens on the R server(s), so this is where you should put your k8s efforts, I think: as DataSHIELD is a multi-user environment, the computation load may vary a lot depending on the type of analysis, the size of the datasets and the competition between users for hardware. Opal supports connections to multiple R servers (horizontal scaling), with the same or different profiles. DataSHIELD profiles are managed from the Opal admin interface, including specific access rights settings.
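To make the scaling point a bit more concrete, here is a minimal sketch of the replica calculation that the Kubernetes Horizontal Pod Autoscaler documents (desired = ceil(current * currentMetric / targetMetric)), applied to a pool of R server pods. This is only an illustration and not part of Opal or Armadillo: the function name, the parameter names and the min/max bounds are hypothetical, and it assumes average CPU utilisation is a reasonable proxy for DataSHIELD load, which may not hold for memory-heavy analyses.

```python
import math


def desired_r_server_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                              min_replicas=1, max_replicas=10, tolerance=0.1):
    """Return how many R server pods to run, following the documented HPA formula.

    All names and default bounds here are hypothetical, for illustration only.
    """
    if current_replicas < 1 or target_cpu_pct <= 0:
        raise ValueError("need at least one replica and a positive target")

    ratio = current_cpu_pct / target_cpu_pct

    # Skip scaling when the load is close enough to the target (0.1 is the
    # default tolerance in Kubernetes), to avoid constant small adjustments.
    if abs(ratio - 1.0) <= tolerance:
        wanted = current_replicas
    else:
        wanted = math.ceil(current_replicas * ratio)

    # Keep the result inside the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Two R servers at 90% average CPU with a 60% target.
    print(desired_r_server_replicas(2, 90, 60))
```

In practice you would let the autoscaler do this for you; the sketch is only meant to show why bursty, concurrent analyses translate into a varying number of R servers.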
Some familiar voices here that I met at the Groningen conference last year and who I look forward to meeting again this year. Thank you all for your help with this.
@nhs-sjt and I are working together on building a secure data environment for healthcare data research. The entire analytics platform is built upon Kubernetes (in Microsoft Azure) and has various components, including JupyterHub, OHDSI, etc.
DataSHIELD is an important component of this wider architecture. @yannick , you may recall my email last year about auto-scaling on the Opal-Rock-DataSHIELD stack - this is the continuation of that effort.
In the networks we serve using Armadillo, we now have central infra with an access point using JupyterHub (where users can use the DS client in RStudio) plus FusionAuth for OIDC-based identity federation (using LS AAI to connect to institutes, which works in the EU). This runs on Kubernetes to enable scaling.
We also used to run Armadillo on Kubernetes, but in most cases our partners want to install Armadillo locally and don’t have Kubernetes. We do host Armadillo for some data partners ‘as a service’ on Azure. To keep our operations as similar as possible to those of our users, we therefore moved from Kubernetes to Azure-hosted VMs, using Docker to manage the DS profiles. But if you are exploring Kubernetes, we would be interested.
We have 3 weekly ‘tech’ meetings in the DataSHIELD project that you might want to join someday to see if we can make adaptations to serve your needs (either Armadillo or Opal or both).
Yes, I do remember the spawner pattern, and still have it in mind… Since our last meeting the main focus has been on migrating to Java 21. Once that is done, we would be happy to continue with making the stack k8s friendly.