CZI funding application meeting
17th December, 2019
- Andrei Morgan (INSERM, Paris, France)
- Paul Burton (Newcastle, UK)
- Patricia Ryser-Welch (Newcastle, UK)
- Hugh Garner (Newcastle, UK)
- Olly Butters (Newcastle, UK)
- Stuart Wheater (Newcastle, UK)
- Tom Bishop (Cambridge, UK)
- Becca Wilson (Newcastle, UK)
- Artur Rocha (INESC-TEC, Porto, Portugal)
- Rui Camacho (INESC-TEC, Porto, Portugal)
- Juan R. Gonzalez (ISGlobal, Barcelona, Spain)
Andrei started with a summary of the funding call (as described in the post above).
Andrei then provided a list of potential ideas to start discussion:
- development of hierarchical modelling
- development of mediation analysis
- development of survival analysis (but already in development?)
- code sprints/hackathons
- governance developments
Andrei’s big question was how to potentially avoid overlap with work that is already being done or is planned by EUCAN-Connect.
There was then a long discussion about the current issues and needs of the DataSHIELD community; the main points follow as a bulleted list.
- The EUCAN-Connect project is currently funding development of (some) specific functionalities. Specific functionalities may also have less generalisable usage.
- An important issue (and bottleneck) at the moment is ensuring that final checks on disclosure are carried out; this is currently being done primarily by only one person. Continued testing and integration of functionality is nonetheless essential for the project.
Another issue that has been identified recently relates to the group working with encryption and non-parametric testing where a certain matrix could not be inverted without creating a large security hole. (Note from Andrei - I didn’t completely follow this point, so it may not be quite accurate!)
- A good security person/hacker would be ideal to deal with these sorts of situations and, in particular, to identify such holes. These could even be two different roles: one for security/penetration testing, the other for auditing/quality assurance and ensuring that there is no disclosure.
- Education and training are also very important: the DataSHIELD team has so far spent a lot of time helping new(er) users, and this takes time away from development.
- There was discussion about the potential for distance or remote learning, including technologies that help keep information up to date (e.g. video tutorials, scripted updates to documentation that pull new screenshots, etc.). The major problem with DS is that it has been moving quite quickly, so it is hard to ensure that screenshots etc. are current. This has been an issue previously: the user guides took so long to produce that the software had already been updated and the guides were out of date on release.
- Management strategies were discussed, perhaps using an Agile approach to software development (“under which requirements and solutions evolve through the collaborative effort of self-organizing and cross-functional teams and their end users. It advocates adaptive planning, evolutionary development, early delivery, and continual improvement, and it encourages rapid and flexible response to change.” From Wikipedia)
- Further concerns were raised about how to ensure sustainability: long-term financing, whether there needs to be interaction with industry, paid add-ons, etc. The ITIL framework was mentioned, which is “a set of detailed practices for IT service management (ITSM) that focuses on aligning IT services with the needs of business” (from Wikipedia).
From all this discussion, we arrived at two principal ideas, and a couple of supplementary ideas also came out in further discussion since then (suggestions 3 and 4 below):
Improvements to project management – incorporating:
- support and development of project governance
- establishing links between developers, and between developers and end users
- potentially adopting new strategies for development, e.g. possible introduction of Agile concepts and working strategies
- establishing support mechanisms for end users and identifying the best way for new people to learn the basics
- also include investigating how to make it easier to bring functions developed externally into the main DataSHIELD codebase (noting that many who present their work at the DataSHIELD workshop have not integrated their code into mainstream DataSHIELD). Is the bottleneck disclosure testing or something else? Are there any action points from the Software Sustainability Institute review that was done a few years back that have not been worked on?
Developing auditing/security strategies – specifically:
- How to identify and mitigate disclosure risks (e.g. from a malicious user who does repeated low-number analyses)
- Security concerns, particularly with the potential future use of encryption to enable data flows between nodes.
- Penetration testing, e.g. to ensure data nodes are robust to external attackers.
- Passing data from one node to another should be a platform service, as doing this requires a defined key pair for bullet-proof asymmetric encryption.
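To illustrate the first point in this list, here is a minimal Python sketch of why a minimum-count check on each query does not, on its own, stop a user who runs repeated analyses on slightly different subsets. The names `safe_mean`, `differencing_attack` and `MIN_CELL_COUNT` are hypothetical and are not part of DataSHIELD; the threshold value is an assumption chosen for the example.

```python
from statistics import mean

MIN_CELL_COUNT = 5  # hypothetical threshold, not a DataSHIELD setting


def safe_mean(values, min_count=MIN_CELL_COUNT):
    """Return the mean only when enough records contribute to it."""
    if len(values) < min_count:
        raise ValueError("refused: too few records")
    return mean(values)


def differencing_attack(values, target_index):
    """Recover one record from two aggregate queries that each pass the size check."""
    everyone = list(values)
    without_target = everyone[:target_index] + everyone[target_index + 1:]
    # Both calls are allowed because each subset meets the minimum count.
    total_all = safe_mean(everyone) * len(everyone)
    total_rest = safe_mean(without_target) * len(without_target)
    # The difference of the two totals is exactly the excluded record.
    return total_all - total_rest
```

With the values 10, 20, 30, 40, 50, 60, both queries pass the size check, yet `differencing_attack(values, 2)` recovers the third record (30). This suggests that auditing would need to look at sequences of queries and not only at each query in isolation.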
The DataSHIELD platform should/must not be neglected. The platform capabilities can limit or, on the contrary, open new horizons in terms of the analyses, tools and kinds of data sources that can be handled. The recent post on “resources” illustrates this. Expanding capabilities and usage also means that the R/DataSHIELD platform will soon have to face the problems of:
- scalability (as this is a multi-user environment and memory and computation resources are currently limited to what a single R server can offer),
- security (as “resources” are too powerful not to be contained, using AppArmor for instance),
- flexibility (various DataSHIELD configurations/user profiles on the same node),
- robustness (error handling),
- monitoring, auditing (to identify hacking attempts) etc.
Building a demonstrator of the capabilities of DataSHIELD for the analysis of genomic sequences.
This would be very much in line with both lines of projects sought:
- Foundational tools and infrastructure that enable a wide variety of downstream software across several domains of science and computational research (e.g., numerical computation, data structures, workflows, reproducibility). Here we could include the effort related to the resourcer package and its integration with the rest of the software stack to ensure reproducibility and security.
- Domain-specific software for analyzing, visualizing, and otherwise working with the specific data types that arise in biomedical science (e.g., genomic sequences, microscopy images, molecular structures). Here we could include one or two types of analysis of genetic data, serving as a test case for the foundational part and allowing us to put effort into implementing specific algorithms.
As noted above, the terms of the call actually allow for multiple applications from a single project, but it was felt that it may be best to focus on an initial application (i.e. project management/governance/integration) and then make a further application in the next round (which is due to commence in June 2020).
There was some discussion about pragmatic aspects of a funding application: it is likely that money and coordination would be best managed via Newcastle University, particularly as this is where the bulk of development has been coordinated thus far. This is particularly true for the first suggestion (an application related to overall project governance and integration between developers, and between developers and users).
Andrei – volunteered to write up notes from this meeting;
All - will review the notes for accuracy.
- A summary of this meeting will be posted on the DataSHIELD forum by the end of this week to enable other members of the community to comment and to reflect and discuss over the winter holiday period.
All – will reflect and try to firm up an idea for the CZI application over the winter holiday period.
Olly – will try to find the list of people from the DataSHIELD workshop in September who said they were interested in participating in discussion about project development.
- Project governance (a separate issue) will be discussed in more detail at the next meeting (and will have its own thread on the forum).
Next teleconference will be on Tuesday, 7th January 2020 from 09:00 to 11:00 UTC (GMT – equivalent to 10:00 to 12:00 CET)