This thread is specifically to talk about the funding possibility. More details and notes from a first teleconference meeting held on Tuesday, 17th December are included in the next couple of posts.
The Chan-Zuckerberg Initiative has a current appeal for projects under their call “Essential Open Source Software for Science, Cycle 2”, which provides funding of US$50-250k. The closing date is 04 February 2020.
“[seek] applications for software projects that are essential to biomedical research, have already demonstrated impact, and can show potential for continued improvement.”
“aim to provide software projects with resources to support their tools and the communities behind them. Whether it’s hiring an additional developer, improving documentation, addressing usability, improving compatibility, onboarding contributors, or convening a community, we hope our support can help make the computational foundations of biological research more usable and robust.”
Two types of projects will be supported:
Domain-specific software for analyzing, visualizing, and otherwise working with the specific data types that arise in biomedical science (e.g., genomic sequences, microscopy images, molecular structures).
Foundational tools and infrastructure that enable a wide variety of downstream software across several domains of science and computational research (e.g., numerical computation, data structures, workflows, reproducibility).
NB, a key point is that application is open to organizations only: “Grants are not permitted to individuals; only to organizations.” Additionally, they will “consider and potentially fund multiple applications from the same organization, multiple applications related to the same open source software project(s), and multiple applications that include the same staff and/or software project contributors. However, the proposed work in such applications must be distinct.” It is therefore possible to submit more than one proposal.
Proposal evaluation
“The Chan Zuckerberg Initiative’s core values center around people, technology, collaboration, and open science. We adhere to those values in both proposal selection and evaluation of progress.
Applications will be evaluated for their expected impact, the quality of the open source software project(s) involved, the feasibility of the proposal, and their diversity, equity, and inclusion statement—each of which will be assessed through quantitative and qualitative factors."
NB, any software code developed must be produced under a permissive license – which may rule out actual development on DataSHIELD itself (which I think is predominantly GPL/LGPL compatible).
Andrei started with a resumé of the funding call (as described in the post above).
Andrei then provided a list of potential ideas to start discussion:
development of hierarchical modelling
development of mediation analysis
development of survival analysis (but already in development?)
code sprints/hackathons
governance developments
Andrei’s big question was how to potentially avoid overlap with work that is already being done or is planned by EUCAN-Connect.
There was then a big discussion about current issues/needs of the DataSHIELD community – some bullet points follow.
EUCAN-Connect project is currently funding development of (some) specific functionalities. Specific functionalities may also have less generalisable usage.
Important issue (and bottleneck) at the moment is ensuring there are final checks on disclosure – this is primarily being done by only one person at present. But continued testing and integration of functionality is essential for the project.
Another issue that has been identified recently relates to the group working with encryption and non-parametric testing where a certain matrix could not be inverted without creating a large security hole. (Note from Andrei - I didn’t completely follow this point, so it may not be quite accurate!)
A good security person/hacker would be ideal to try and deal with these sorts of situations and, in particular, to identify such holes! These could even be two different roles – one for security/penetration-testing, the other for auditing/quality assurance and ensuring that there is no disclosure.
Education and training is also very important – a lot of time has been spent by the DataSHIELD team so far in helping new(er) users and this takes away from development.
Discussion about the potential for distance or remote learning, including the use of technologies to help ensure information stays updated (e.g. video tutorials, scripted updates to documentation that pull new screenshots etc). The major problem with DS is that it has been moving quite quickly and so it’s hard to ensure that screenshots etc are current. This has been an issue previously when the user guides took so long to produce that the software had already been updated and the guides were therefore out-of-date.
Management strategies, perhaps using an Agile approach to software development (“under which requirements and solutions evolve through the collaborative effort of self-organizing and cross-functional teams and their end users. It advocates adaptive planning, evolutionary development, early delivery, and continual improvement, and it encourages rapid and flexible response to change.” From Wikipedia)
Further concerns are about how to ensure sustainability – long term financing, does there need to be interaction with industry, paid addons, etc. Mention of the ITIL framework, which is “a set of detailed practices for IT service management (ITSM) that focuses on aligning IT services with the needs of business.” (from Wikipedia)
From all this discussion, we arrived at two principal ideas, and a couple of supplementary ideas also came out in further discussion since then (suggestions 3 and 4 below):
Improvements to project management – incorporating:
support and development of project governance
establishing links between developers, and between developers and end users
potentially adopting new strategies for development, e.g. possible introduction of Agile concepts and working strategies
establishing support mechanisms for end users and identifying the best way for new people to learn the basics
also include investigating how to make it easier to bring functions developed externally into the main DataSHIELD codebase (noting that many who present their work at the DataSHIELD workshop have not integrated their code into mainstream DataSHIELD). Is the bottleneck disclosure testing or something else? Are there any action points from the Software Sustainability Institute review that was done a few years back that have not been worked on?
How to identify and mitigate disclosure risks (e.g. from a malicious user who does repeated low-number analyses)
Security concerns, particularly with the potential future use of encryption to enable data flows between nodes.
Penetration testing, e.g. to ensure data nodes are robust to external attackers.
Passing data from one node to another should be a platform service, as doing this requires a defined key pair for robust asymmetric encryption.
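To make the "repeated low-number analyses" risk above concrete, here is a minimal sketch in Python (not DataSHIELD code). It assumes a node only releases an aggregate when the subset has at least a minimum number of rows. The threshold value and the function names `guarded_sum` and `is_differencing_risk` are hypothetical, chosen only for illustration. It shows that two queries which each pass the size check can still reveal one individual's value when subtracted, and that comparing a new query against the query history is one possible way to flag this.

```python
from typing import List, Optional, Sequence, Set

MIN_CELL_COUNT = 5  # hypothetical disclosure threshold, not a DataSHIELD default


def guarded_sum(values: Sequence[float], subset: Set[int],
                min_n: int = MIN_CELL_COUNT) -> Optional[float]:
    """Return the sum over the subset, or None if the subset is too small."""
    if len(subset) < min_n:
        return None
    return sum(values[i] for i in subset)


def is_differencing_risk(new: Set[int], history: List[Set[int]],
                         min_n: int = MIN_CELL_COUNT) -> bool:
    """Flag a query whose subset differs from an earlier one by 1..min_n-1 rows."""
    return any(0 < len(new ^ old) < min_n for old in history)


# Toy data: six individuals, one value each.
values = [50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
everyone = {0, 1, 2, 3, 4, 5}
all_but_last = {0, 1, 2, 3, 4}

# Both queries pass the size check, yet their difference is one person's value.
leaked = guarded_sum(values, everyone) - guarded_sum(values, all_but_last)

# An audit of the query history would flag the second query.
flagged = is_differencing_risk(all_but_last, [everyone])
```

A size check on each query alone does not stop this kind of differencing, which is one reason auditing of query histories came up in the discussion. A real check would need to work on how subsets are defined rather than on explicit row indices.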
DataSHIELD development
The DataSHIELD platform should/must not be neglected. The platform capabilities can limit or, on the contrary, open new horizons in terms of analysis, tools and kind of data sources that can be handled. The recent post on “resources” illustrates that. Expanding capabilities and usage also means that the R/DataSHIELD platform will soon have to face problems of:
scalability (as this is a multi-user environment and memory + computation resources are currently limited to what a single R server can offer),
security (as “resources” are too powerful not to be contained (using AppArmor, for instance)),
flexibility (various DataSHIELD configurations/user profiles on the same node),
robustness (error handling),
monitoring, auditing (to identify hacking attempts) etc.
Building a demonstrator of the capabilities of DataSHIELD for the analysis of genomic sequences.
This would be very much in line with both lines of projects sought:
Foundational tools and infrastructure that enable a wide variety of downstream software across several domains of science and computational research (e.g., numerical computation, data structures, workflows, reproducibility). Here we could have the effort related to the resourcer package and its integration with the rest of the software stack to ensure reproducibility and security.
Domain-specific software for analyzing, visualizing, and otherwise working with the specific data types that arise in biomedical science (e.g., genomic sequences, microscopy images, molecular structures). Here we could have one or two types of analyses made with genetic data, serving as a test case for the foundational part and allowing us to put effort into implementing specific algorithms.
Further discussion
As noted above, the terms of the call actually allow for multiple applications from a single project, but it was felt that it may be best to focus on an initial application (i.e. project management/governance/integration) and then make a further application in the next round (which is due to commence in June 2020).
There was some discussion about pragmatic aspects of a funding application: it is likely that money and coordination would be best managed via Newcastle University - particularly as this is where the bulk of development has been coordinated thus far. This is particularly true for the first suggestion (an application related to overall project governance and integration between developers/developers and users).
Next steps
Andrei – volunteered to write up notes from this meeting;
All - will review the notes for accuracy.
A summary of this meeting will be posted on the DataSHIELD forum by the end of this week to enable other members of the community to comment and to reflect and discuss over the winter holiday period.
All – will reflect and try to firm up an idea for the CZI application over the winter holiday period.
Olly – will try to find the list of people from the DataSHIELD workshop in September who said they were interested in participating in discussion about project development.
Project governance (a separate issue) will be discussed in more detail at the next meeting (and will have its own thread on the forum).
Next meeting
Next teleconference will be on Tuesday, 7th January 2020 from 09:00 to 11:00 UTC (GMT – equivalent to 10:00 to 12:00 CET)
There, I have copied information from the “Detailed Application Instructions” so that we can start working together on the project proposal.
NB, I also contacted CZI to find out whether it was possible to “re-use” parts of the application (e.g. “Diversity, Equity and Accessibility” statement) in different project applications - the answer is “yes”.
Hi, just to update people… We’ve had a couple of teleconf calls this week to discuss this funding application. We’ve decided at the moment to focus on two ideas:
To improve sustainability of DataSHIELD
To work on the genomics idea
There will be another teleconf on Friday to discuss the second idea in more detail (work is already progressing on the first idea). If there’s anyone who’s interested in being involved and hasn’t been invited to participate in either of these ideas yet, please reply to this thread or contact me or the team at Newcastle directly via email.
What time is the telecon this Friday? I’m not sure if I can participate this Friday but can you please add me in the invitation for the next discussions?
I cannot attend the teleconf (this week is extremely busy). I have been working a lot on omic data analysis using DataSHIELD and Bioconductor and I would like to be in the discussion. The package dsOmics has been really improved. My question is whether this call could be postponed to next week.
Any day (morning) is fine for me. Actually, I could share a preliminary vignette describing the main advances I made jointly with Yannick.
We have your interest noted already! I think it’s probably best that we continue with this teleconf on Friday and we will keep you informed with plans etc. There will surely be another one - probably next week - to follow-up on this second idea. Are there any bad times for you that we should avoid, or good ones that we should try to aim for when we meet again?
Ok, that’s fine for me. Letting you know that I’m working on DataSHIELD/Bioconductor is fine, so that you can take this into consideration when discussing future plans.
Re good times, as I previously mentioned, any day next week is fine. The 20th and 21st are also ok. The rest of the month is complicated for me.
sustainability: 15:30 UTC / 16:30 CET Wednesday January 15
omics: 09:30 UTC / 10:30 CET Friday January 17
Most people who’ve already expressed interest should have received an invite for these - if you haven’t and you would like to join in, please let us know.