How should we name our community?

Recently, at various meetings, there has been some confusion about the name “DataSHIELD”: is it only the R software used for federated analysis, or does it also include the whole community of software and processes that are required for working with federated data in a non-disclosive way (e.g. including software like Molgenis/Armadillo, the OBiBa suite, Coral, etc., but also the processes we use to communicate this way of working to governance/ethics groups)?

So far this has come up at the Full Stack technical meetings, the ongoing community governance workshops, and the recent EUCAN-Connect assembly, with people agreeing that clarification is needed so that we can distinguish the R packages from the overall community.

This post is being made on behalf of the technical group who met in the recent Full Stack meeting to request input into this issue from across the community:

  • Do you have a suggestion for a new name for the community?

  • Or, do you have alternate ideas for how to distinguish the R packages from the overall community?

Please use this thread for discussion (as a reminder, you can turn email notifications on or off by selecting from the drop down menu at the bottom of the thread).

We will continue to discuss this topic over the next few months. And, perhaps it will be possible for us to reach agreement by the time of the conference in Barcelona :slight_smile:

Thanks for opening this thread, Tom!

Indeed, while working on the community governance documents, there has been quite some discussion about the word “DataSHIELD”, as different people infer different concepts/structures behind it.

From my point of view, I have always described “DataSHIELD” as a series of R packages, and thus would stick with the term “DataSHIELD” for those R packages. Personally, I think that the new name should not diverge too far from “DataSHIELD” itself, though, as much of the software is associated with the R packages.

My ideas for a term representing the whole community of software related to DataSHIELD would be:

  • DataSHIELD Universe
  • DataVERSE
  • DataSPHERE

What do others think?

It’s nice that we have grown so much that we have this problem :slight_smile:

In my mind “DataSHIELD” is the core R packages. I do see the confusion around the other tools in the tool box though.

I guess it depends on who actually belongs to the community at present - does everyone either use or facilitate the use of the DataSHIELD R packages?

Maybe DataSHIELD R packages and DataSHIELD users existing in the DataSHIELD ecosystem?

Hi,

Here are my definitions:

  • For me “DataSHIELD” is more a “method”: making distributed privacy preserving computations. DataSHIELD results validity can be proved mathematically. The R packages, and the computation nodes infrastructure (Opal etc) are just implementation details. One could imagine a Python based DataSHIELD implementation for instance.

  • The “DataSHIELD community” is the group of users/developers/ethics experts etc. that are making DataSHIELD a reality.

  • The “DataSHIELD ecosystem” is the toolbox (R packages, Opal, resources, docker images, tutorials etc).

Yannick

I agree with Yannick’s point about DataSHIELD representing a method. If there was a new development of the same analysis tool in Python, my instinct is that it would be an example of DataSHIELD too.

However, I have a thought-experiment dilemma: what if Professor X of Consortium Y took it upon themselves to code it all up in Python, and then the Python package took off and soon had 3x the monthly users of the R version? If theirs were the bigger of the two DataSHIELD implementations, could they call themselves the “leader” and subvert Becca’s role as current PI? If there was a disagreement over, say, a new disclosure method which the R team wanted to implement but Professor X didn’t want to implement for Python DataSHIELD, would they both remain DataSHIELD?

I’m not sure about the above, and not sure if it’s an example in favour of creating a new name for the wider “method”?

I respectfully request that these discussions take a pause. The one person who might have an interest in contributing, as initiator of the DataSHIELD method/ecosystem, is entirely excluded because he is in a hospital fighting for his life. This is not hyperbole; this is Paul’s reality. Please pay him the respect that he is due and allow the time needed to enable him to engage. Facilitating broad and appropriate engagement is surely the very essence of a collaborative and democratic community process.

Yannick’s comments are a solid foundation for an interregnum.

Madeleine (DataSHIELD ethics and governance advisor)

Prof Madeleine Murtagh Chair of Social Data Science University Of Glasgow

An excellent description

Prof Madeleine Murtagh Chair of Social Data Science University Of Glasgow

Dear @Madeleine_Murtagh,

I am very sorry to hear that Paul is in hospital, I hope he responds well (and quickly) to treatment. Please do pass on my best wishes for a rapid recovery and let him know that I am thinking about him - indeed, I’m sure I can speak for many here and say that we are all concerned and thinking about him, and wish him the very best.

As @tombishop mentioned in the first post, this is a question that has come up repeatedly in multiple fora - including at meetings where Paul was in attendance (indeed, Paul also acknowledged at several of the recent governance workshops that clarifying names was a very important thing and something that we need to do quite urgently). Furthermore, the discussion will be going on for at least the next couple of months: the suggestion above seems to be that no firm decisions are made before the conference in October. I am not sure therefore that “pausing” the discussion is helpful (when would you like it to restart?), particularly as there may well be natural pauses as people take holidays over the summer. And, of course, Paul’s is definitely a voice that we want to be included in the discussion, so I for one hope that it will be possible for him to participate in the discussion again very soon.

Best wishes,

Andrei

Hi,

There was a meeting last week to discuss names and nomenclature. Afterwards, I thought a bit more about this. In particular, I have been reading an article on privacy that some might find interesting:

I’m only about halfway through, but there is a quote that says:

“A problem well put is half-solved”

(by John Dewey).

This got me thinking: what is our problem? I think it is that:

“DataSHIELD” is currently used by different people and in different circumstances to refer to:

  1. The method of using R packages for disclosure-minimising federated analyses.
  2. The additional software that is used for storage, harmonisation etc.
  3. The community that has developed around the storage, cataloguing, analysis, etc of federated data.
  4. The research team in Liverpool.

This is confusing.

I think everyone agrees with 1 (i.e. that DataSHIELD is a method). Where opinions start to diverge is around the other three possible uses / definitions of DataSHIELD.

One solution that has been proposed is to have clarifying suffixes. For example:

  • “DataSHIELD ecosystem” (for 2)
  • “DataSHIELD community” or “DataSHIELD project” (for 3)
  • “DataSHIELD research project” or “DataSHIELD research team” (for 4)

It is clear, here, that there is in particular a lot of debate and discussion about the use of the word “project”. Perhaps it would be sensible to move away from this word altogether?

That is, the Liverpool team become known as the DataSHIELD Research Team (or DataSHIELD Research Group), and the community around DataSHIELD becomes known as the DataSHIELD Community.

An alternate suggestion is to come up with entirely new and different names for these things. For example:

  • PAMA (ecosystem | community) - from PAul and MAdeleine
  • PABUR (ecosystem | community) - from PAul BURton
  • FreeSHIELD - combining free (or open source) software with current name

However, this potentially still leaves a conflict between DataSHIELD (the method) and the research group at Liverpool.

What do others think? We said at the meeting that we would dedicate some time to talk about this at the conference. This will either be a workshop (e.g. at the end of the day on Wednesday) or could potentially be during the day as we have quite a bit of ‘free’ time allocated in the schedule.

In any case, I’m looking forward to seeing you all later this week :slight_smile:

Best,

– Andrei

Hi Tom

I really like the idea of having a proper nomenclature, and of starting to be specific about DataSHIELD ‘xyz’ to name the different concepts we have in our community. DS xyz for short.

Below is an attempt at such a list of definitions, probably incomplete. I hope this overlaps with the emerging constitution; consider this a test :slight_smile:

Definitions:

  • a DS Component = a piece that is part of the DataSHIELD ecosystem (server, client, ELSI, training, etc.)
  • the DS Ecosystem = the sum of all components
  • the DS Core ecosystem = all components you minimally need. These are the ones to be governed.
  • the DS Wider ecosystem = components we might have in projects, such as Mica, MOLGENIS catalogue, federated AAI, Jupyter, RStudio, Coral, etc.

Examples of technical components:

  • the DS Interface = the core DataSHIELD component that defines DataSHIELD, the most important standard we have. There can be only one
  • a DS Package = component that provides statistics functionality, just like any R package
  • a DS Software = some software component in the DataSHIELD ecosystem
  • a DS Client = the software used by the end user to start their analysis
  • a DS Server = machine running Opal/Armadillo with the data at the data provider's end
  • a DS Network = set of servers that have agreed to integrated analysis
  • a DS Server host = organization hosting a DataSHIELD server
  • a DS Server software = software to install a service such as Opal/Armadillo
  • a DS Server interface = set of services you need to implement to serve DSI
  • a DS Central Analysis server = machine sometimes used to centrally control access to a DataSHIELD network via a central server, e.g. one running Jupyter notebooks or RStudio

Non-technical components:

  • todo, but involving training materials, ELSI building blocks, forum, wiki, etc.

Actors:

  • the DS Community = people somehow involved in DataSHIELD
  • a DS User = people actually using any of the DataSHIELD components
  • a DS Stakeholder = people somehow depending on DataSHIELD
  • DS Support = people helping users
  • DS Developers = people developing software
  • DS Researcher = people researching aspects of the DS ecosystem
  • DS Package developer, Server developer, etc. = more specific flavor of developer
  • DS PI = people leading teams involved in DataSHIELD (support, developer, research)

Organisational bodies (still tbd, of course):

  • DS steering committee
  • DS advisory board
  • DS infrastructure working group
  • etc

I’m just adding here a link to the padlet for conference discussion topics. DataSHIELD discussion topics