Armadillo vs Opal

Hi all, I’m new to the community and I’ve been asked to look at implementing DataSHIELD in our secure data environment. Looking at the documentation, I’ve seen that there are two options for server deployment:

  • Opal
  • Armadillo

We are using a largely Kubernetes-based infrastructure, so container integration is essential. I see that both of these have Docker definitions, though at a glance the Opal implementation appears to be more mature. That is just my first impression.

What are the benefits and drawbacks of each option?

Hi, welcome to the DataSHIELD community!

You have it mostly right. Opal is the mature sibling and Armadillo is the younger sibling. Most of the DataSHIELD documentation is written with Opal in mind, so we still assume Opal to be the default DataSHIELD flavor.

To the Armadillo team’s credit, they have learned many lessons from Opal, so Armadillo is also better in some ways.

  • Armadillo is developed by the Molgenis team.
  • Armadillo comes with some extensions to Opal, such as ds-upload which makes it possible to automate data updates.
  • The Armadillo team designs with the intention of being ‘more lightweight’ than Opal (which has a lot of legacy code).

On the subject of Kubernetes, the DataSHIELD developers do indeed tend to favour Docker over Kubernetes.

@wilmar.igl wrote a tutorial on different ways to install DataSHIELD, and I know from there that there was a project called CORAL, though I think they used something like Docker Stack.

I have converted a Docker Compose based deployment to one based on Kubernetes, but I didn’t try to incorporate autoscaling.

Stuart

Hi Shaun, as an Armadillo administrator I can be positive about the Kubernetes support. Older versions (2.x) were hosted by Molgenis on Kubernetes. Since the release of Armadillo 3, we have dropped Kubernetes support (because there was no demand and we were migrating our internal production to Azure VMs).

But good news: I’m currently working on a CI/CD build to do test releases to Kubernetes. If you are willing, maybe we can share some thoughts and cooperate to get this to production.

Sincerely, Dick Postma Molgenis / Armadillo

Hi Shaun, welcome to the forum. At Liverpool Uni we have installed a proof-of-concept DataSHIELD infrastructure in the NHS NW SDE (Kubernetes-based, hosted by Lancs Teaching Hospital Trust) and are now in the planning stage for a pilot install in the region across 2 of our 3 ICBs. Can I suggest getting in contact with @olly (Olly . Butters @ Liverpool .ac.uk ), who can share more about lessons learned and the approach in the region.

Regarding Opal vs Armadillo, others have made some comments above. Within consented longitudinal studies we find that studies are typically using multiple components of the Obiba or Molgenis clinical software stacks, so the analytical functionality of DataSHIELD is compatible with both, which gives flexibility to consortia who may already be using associated software. There are differences in the way analysts interact under each system: Opal users typically work in RStudio and Armadillo users in Jupyter (I am not sure if this is always the case; @DickPostma can clarify), so how users will interact with the analysis may be a factor to consider.

Hi Shaun,

Welcome!

To clarify: Armadillo also uses R (RStudio). However, we work in multiple large consortia where users interact with multiple cohorts (using both Armadillo and Opal). To streamline access, resources, DataSHIELD package versions, etc., we created a JupyterHub setup where users can use RStudio and run their analyses. But this setup is of course not necessary.

Hi,

I agree with Marije: both Armadillo and Opal are accessible using an R API. RStudio is better for R development, but JupyterHub is used because the “community” version of RStudio Server does not offer central authentication with an external provider (an institution’s OpenID service, for instance), which makes user management more cumbersome. Opal also has a Python API for system administration and data management (stable) and for DataSHIELD analysis (still experimental).

Regarding Opal vs. Armadillo, Opal has more features in terms of data and metadata management. More specifically, Opal supports various data input formats, and it is possible to fully describe the data harmonization process. This is scientifically very important when doing a DataSHIELD analysis (how else would you know whether the harmonized variables of each study are comparable and can be combined in a DataSHIELD analysis?). Opal interoperates with the Mica data catalog to display this harmonization information (module developed by Maelstrom Research). The Coral distribution integrates all these services.

In terms of performance, they are very similar, mainly because of the “resources” approach, which can handle datasets of any size and format. During a DataSHIELD session, the hard work happens on the R server(s), so I think this is where you should put your k8s efforts: as DataSHIELD is a multi-user environment, the computation load may vary a lot depending on the type of analysis, the size of the datasets and the competition for hardware between concurrent users. Opal supports connections to multiple R servers (horizontal scaling), with the same or different profiles. DataSHIELD profiles are managed from the Opal admin interface, including specific access rights settings.
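
To make the horizontal scaling point concrete, here is a rough Python sketch of why spreading concurrent sessions over several R servers helps. It is only a toy illustration: the function name `assign_sessions`, the session "costs" and the greedy least-loaded rule are all assumptions made up for this example, and it does not describe how Opal actually dispatches sessions to R servers.

```python
import heapq


def assign_sessions(session_costs, n_servers):
    """Greedy least-loaded placement (hypothetical policy, not Opal's).

    Each incoming session is placed on the R server whose accumulated
    load is currently the smallest.
    Returns (placement, loads): the server index chosen for each session
    and the final total load per server.
    """
    # Min-heap of (current_load, server_index)
    heap = [(0.0, i) for i in range(n_servers)]
    heapq.heapify(heap)
    placement = []
    for cost in session_costs:
        load, server = heapq.heappop(heap)
        placement.append(server)
        heapq.heappush(heap, (load + cost, server))
    loads = [0.0] * n_servers
    for load, server in heap:
        loads[server] = load
    return placement, loads


# Made-up relative costs for six concurrent analysis sessions
costs = [8, 1, 1, 6, 2, 4]
_, single = assign_sessions(costs, 1)
placement, spread = assign_sessions(costs, 3)
print("busiest server with 1 R server: ", max(single))
print("busiest server with 3 R servers:", max(spread))
print("placement:", placement)
```

With one R server every session competes for the same hardware. With three, the heaviest server carries far less of the total, which is the effect you would be aiming for when scaling the R server layer on k8s.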

Regards
Yannick
