jezcope/rdmf18.md

## rdmf18.md

      
This is an attempt at using a gist to facilitate liveblogging in a static site. Thanks for joining me for the ride…
The event programme is available online. I'll be co-presenting a talk about using the figshare API with figshare's own Megan Hardeman on the Tuesday at 09.40.
Well, I’ve arrived and obtained biscuits and tea.

Day 1


Martin gives us the now standard housekeeping slide
Overview of the programme (see the link above)
I’m interested to hear about what they’ve been up to at Lancaster with their institutional RDM reporting dashboard
There will also be breakout groups tomorrow — I’m sure suggestions for these on the #rdmf18 hashtag will be welcome too, even if you can’t make it!

Keynote: What are the challenges of Data Science?

Prof Magnus Rattray, Professor of Computational & Systems Biology/Director of the Data Science Institute, University of Manchester
An example: Physics


Large Synoptic Survey Telescope (LSST): 3.2 Gpixel camera -> 2,000 exposures (= 20TB) per night -> 10 year survey = 100PB data
Large Hadron Collider (LHC): theoretical output of 68TB/s (!!!) -> about 1.5GB/s to disk -> 200PB total
Square Kilometre Array will produce more data than can be processed today, but will be curated and analysed over years
But this isn’t unexpected for physics: it’s being dealt with
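
As a rough sanity check on those figures, here is a back-of-envelope sketch in Python. It assumes decimal units (1PB = 1,000TB) and that the LSST observes on every night of the year. Neither assumption was stated in the talk, so treat the results as orders of magnitude only.

```python
# Back-of-envelope arithmetic on the data volumes quoted in the talk.
# Assumptions (mine, not the speaker's): decimal units, 365 observing nights a year.

TB_PER_PB = 1000
GB_PER_TB = 1000


def lsst_gb_per_exposure(tb_per_night=20, exposures_per_night=2000):
    """Average size of one exposure in GB."""
    return tb_per_night * GB_PER_TB / exposures_per_night


def lsst_survey_pb(tb_per_night=20, nights_per_year=365, years=10):
    """Raw data collected over the whole survey, in PB."""
    return tb_per_night * nights_per_year * years / TB_PER_PB


def lhc_reduction_factor(raw_tb_per_s=68, disk_gb_per_s=1.5):
    """How many times smaller the stored stream is than the theoretical output."""
    return raw_tb_per_s * GB_PER_TB / disk_gb_per_s


if __name__ == "__main__":
    print(f"LSST: ~{lsst_gb_per_exposure():.0f}GB per exposure")
    print(f"LSST: ~{lsst_survey_pb():.0f}PB of raw exposures in 10 years")
    print(f"LHC: stored stream is ~{lhc_reduction_factor():,.0f} times smaller")
```

Under these assumptions the raw LSST exposures come to about 73PB, the same order of magnitude as the 100PB quoted, and the LHC keeps roughly one part in 45,000 of its theoretical output.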

Another example: Geography


Network analysis of 26m commuter journeys from 2011 census data
Classify journeys into 9 super-groups and a total of 40 groups
Individual journeys not interesting, but emerging patterns are
The tricky stuff is not the machine learning or analysis, but bringing together data from different sources

Mental health


Use of wearable devices to track location of people with mental illnesses
Handle missing data (e.g. due to mobile/GPS blackspots)
Classify places and activities
Overlay health status to identify patterns

Research is increasingly data driven


Bottom-up modelling: based on assumptions about microscopic principles; develop simulation, run and then compare to reality; refine assumptions
Data-driven modelling: identify measurable variables; fit a statistical model to data; make inferences and learn about system by identifying hidden variables
Increasingly connected: mixing “mechanistic” prior knowledge into data-driven models
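
To make the contrast between the two styles concrete, here is a toy sketch in Python. The linear system, the numbers and the function names are all invented for illustration; the talk did not give a worked example.

```python
# Toy contrast between bottom-up and data-driven modelling, on made-up data.

def simulate(rate, times):
    """Bottom-up: assume a mechanism (constant growth rate) and simulate it."""
    return [rate * t for t in times]


def mean_squared_error(predicted, observed):
    """Average squared difference between simulation and observation."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def fit_rate(times, observed):
    """Data-driven: least-squares estimate of the rate for y = rate * t."""
    return sum(t * y for t, y in zip(times, observed)) / sum(t * t for t in times)


times = [1, 2, 3, 4]
observed = [2.1, 3.9, 6.2, 7.8]  # hypothetical measurements

# Bottom-up: start from an assumed rate, compare to reality, then refine.
assumed_error = mean_squared_error(simulate(1.5, times), observed)

# Data-driven: let the data choose the rate.
fitted_rate = fit_rate(times, observed)
fitted_error = mean_squared_error(simulate(fitted_rate, times), observed)
```

In this toy the linear form itself plays the role of the "mechanistic" prior knowledge, and only the rate is learned from the data.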

Challenges for data science


Scalability
Complexity
Cleaning messy data (missing data, noise, poor formatting, poor/absent experimental design)
Human data (privacy, ethics)
Accessibility/availability (openness, reproducibility; e.g. clinicians who protect “their” data to safeguard their future career)

Example: genomics


Massive drop in cost of genome sequencing over the last decade
“It costs more to analyse a genome than to sequence it.” David Haussler
100k Genomes project now collecting a huge number of genomes
But once you can sequence genomes you can examine much more: transcriptomics, epigenetics, proteomics
So we can now use this technology to investigate layer-upon-layer of different interacting systems and subsystems
E.g. asthma

Good for a cohort study because a lot of people have asthma
Inconsistency and complexity indicate multiple (sub-)diseases
E.g. 2 different versions of CD14 gene are associated with different risk levels in different parts of the world
Commonly thought to be a progression: eczema -> asthma -> rhinitis
Large-scale analysis shows this progression appears in only a small fraction of the population, i.e. the common assumption is false


Towards genomic medicine


100k Genomes project: 30PB data held securely, restricted access through secure virtual desktop (“Inuvika”)
Privacy of individuals’ genomes is important but difficult

Next revolution: scaling down to single cells


Existing methods effectively take an average of ~10k cells
As well as looking at large populations of people, we can also go down to individual cell level
Single-cell methods show e.g. diverse sub-populations in particular cell types
Each cell is now a high-dimensional data point
E.g. can trace different mutations through sub-populations of tumour cells
Profile individual tumour cells circulating in the blood: can diagnose and design a drug regime based on a blood sample instead of an invasive biopsy
Sophisticated modelling required to disambiguate features of interest from multiple confounding factors

Dealing with the challenges


Data volume: move compute to the data (e.g. cloud solutions); but will analysis be reproducible in the future, or even across current platforms?
Data analysis: scale up algorithms (e.g. deep learning, TensorFlow); use approximate methods; streaming data processing; clever tricks to avoid computationally-intensive tasks

Things that used to be considered “software engineering” (e.g. object orientation, testing) are now important for everything


Data quality: big data often not collected for a single purpose, so no experimental design
Robust & reproducible research: record arbitrary modelling choices and vary them to test for robustness; hypothesis selection & p-hacking; keep track of all hypotheses considered (e.g. electronic lab notebook)
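
One way to picture the advice to record arbitrary modelling choices and vary them is a small "multiverse"-style analysis: rerun the same analysis under every combination of choices and keep a log of each result. This is my own sketch, not something shown in the talk; the data, the two choices (outlier cutoff and summary statistic) and all names are hypothetical.

```python
import itertools
import statistics

data = [4.8, 5.1, 5.0, 4.9, 5.2, 9.7]  # made-up measurements with one outlier

# Arbitrary modelling choices, written down explicitly rather than hard-coded.
choices = {
    "outlier_cutoff": [None, 8.0],
    "summary": ["mean", "median"],
}


def analyse(values, outlier_cutoff, summary):
    """Run one variant of the analysis under a given set of choices."""
    if outlier_cutoff is not None:
        values = [v for v in values if v <= outlier_cutoff]
    return statistics.mean(values) if summary == "mean" else statistics.median(values)


def run_all(values, choices):
    """Try every combination of choices and record what was done."""
    log = []
    names = list(choices)
    for combo in itertools.product(*(choices[n] for n in names)):
        settings = dict(zip(names, combo))
        log.append({**settings, "estimate": analyse(values, **settings)})
    return log


results = run_all(data, choices)
estimates = [r["estimate"] for r in results]
spread = max(estimates) - min(estimates)  # a large spread suggests a fragile result
```

Keeping the full `results` log, rather than only the variant that looks best, is also a simple guard against the p-hacking problem mentioned above.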

Conclusions


Research is increasingly data-driven; data science ubiquitous
Big & complex data: people (especially statisticians and computer scientists) are already motivated to solve these problems
How do we motivate people to confront the problems of messiness, human data and openness (or the lack of it)?

Day 2


Aaaand we're back again for day 2: a full day of content after yesterday's afternoon session

Case study: CRIS, Research Data & Institutional Reporting

Becky Gordon, Lancaster University

Research services view on data about research
Work quite closely with library: overlap primarily centred around Pure CRIS
Systems:

HR, student information, costing/pFact, finance → Pure
Pure → Departmental webpages, research directory, repository, data management, equipment register


Reporting

Financial reports: monthly (really valued by senior academic staff) & annual
Organisational unit performance
Individual performance: promotions etc.
External requirements: OA, REF, HESA, ResearchFish


Current project: strategic research management tool

Reduce time spent manually generating reports
Single hub with live, up-to-date data


Business questions - want data on:

Awards (number, value)
Applications (inc. success rates)
Impact (publications, OA compliance, …?)


Process overview:

Define data and pull out into a data warehouse
Build reports on top of this (using Tableau)
Additional internal exception reports to track things that might go wrong
Data audit & cleaning
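
A minimal sketch of what the report and exception-report layers in that process could look like, using Python's built-in `sqlite3` as a stand-in for the data warehouse. The table layout, column names and sample rows are invented; the talk did not describe Lancaster's actual schema, and the real reports are built in Tableau.

```python
import sqlite3

# In-memory database standing in for the warehouse (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE applications (id INTEGER, department TEXT, value REAL, awarded INTEGER)"
)
conn.executemany(
    "INSERT INTO applications VALUES (?, ?, ?, ?)",
    [
        (1, "Physics", 250000.0, 1),
        (2, "Physics", 90000.0, 0),
        (3, "History", 40000.0, 1),
        (4, "History", None, 1),  # missing value: should be flagged
        (5, None, 120000.0, 0),   # missing department: should be flagged
    ],
)

# Report: applications, awards, success rate and awarded value per department.
report = conn.execute(
    """
    SELECT department,
           COUNT(*) AS applications,
           SUM(awarded) AS awards,
           ROUND(1.0 * SUM(awarded) / COUNT(*), 2) AS success_rate,
           SUM(CASE WHEN awarded = 1 THEN value ELSE 0 END) AS awarded_value
    FROM applications
    WHERE department IS NOT NULL
    GROUP BY department
    ORDER BY department
    """
).fetchall()

# Exception report: records that need cleaning before they can be trusted.
exceptions = conn.execute(
    "SELECT id FROM applications WHERE department IS NULL OR value IS NULL ORDER BY id"
).fetchall()
```

Note how the missing value silently deflates the History total in the main report, which is exactly the kind of problem an exception report is there to surface.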


Challenges

Differences in reporting criteria
Not enough good-quality data to work with
Difficult to make historical comparisons with older reports


Next steps

Continue to produce manual reports & develop tool & Tableau reports in parallel
Agree reporting criteria with senior management
Ongoing data cleaning


Case study: data repository APIs

No updates from me for a while because I’m part of this talk!
Our slides are available on figshare (of course!)

Managing research throughout its lifecycle

Prof Paul Jeffreys, Institute of Cancer Research

About the ICR

8 diverse research divisions
Able to recharge infrastructure costs to research so can fund development
Future plans: dynamic adaptive therapy

As you treat it in an individual, cancer mutates and evolves so you have to keep changing treatment to keep up
Data must be live and online


Big data is a key pillar in current strategic plan


HPC infrastructure

1,800 cores × 12–16 GB, designed for parallel workload
Dominated by next generation sequencing; approx 70% usage
Jisc data centre in Slough


Architecture

6PiB provisioned (expandable to at least 20PiB)
2 tier: tier 1 is fast storage (2PiB); tier 2 an object store (4PiB)
NAS layer on top so that storage tiers are a black box for users


Policy-based migration from tier 1 → 2

Typically migrated if not used for 90 days, but other possibilities exist
Migrated to long-term archive at some later date
Most files mirrored across 3 sites; smaller (<10MB) files only 2 sites
Object store cannot provide quotas, so charge based on actual usage
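
The policy rules above can be sketched in a few lines of Python. This is only an illustration of the rules as I noted them: the class, the function names and the price per TiB are invented, and I am assuming that "10MB" means 10 × 10^6 bytes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MIGRATE_AFTER = timedelta(days=90)  # typical threshold mentioned in the talk
SMALL_FILE_BYTES = 10 * 10**6       # "<10MB", assuming decimal megabytes


@dataclass
class StoredFile:
    path: str
    size_bytes: int
    last_used: datetime
    tier: int = 1  # 1 = fast storage, 2 = object store


def should_migrate(f: StoredFile, now: datetime) -> bool:
    """Tier 1 files untouched for 90 days are candidates for tier 2."""
    return f.tier == 1 and now - f.last_used >= MIGRATE_AFTER


def replica_sites(f: StoredFile) -> int:
    """Most files are mirrored across 3 sites; small ones across only 2."""
    return 2 if f.size_bytes < SMALL_FILE_BYTES else 3


def usage_charge(files: list[StoredFile], price_per_tib: float) -> float:
    """No quotas on the object store, so bill for the bytes actually stored.

    price_per_tib is a made-up parameter; no prices were given in the talk.
    """
    used = sum(f.size_bytes for f in files if f.tier == 2)
    return used / 2**40 * price_per_tib
```

Other migration triggers were mentioned as possible, so in a real system the 90-day rule would presumably be one policy among several rather than a fixed constant.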


Projects to develop 2 new components for sharing & syncing; also currently using a Dropbox Business service
Looking for a metadata catalogue solution

Many solutions (e.g. iRods, DSpace) aimed at facilities or libraries
Need something easy to use for scientists, and off-the-shelf (able to deliver a proof of concept in one person-month)
Open to suggestions!


Scaling and empowering cultural change

Shoaib Sufi, Community Lead, Software Sustainability Institute (SSI)

SSI: national facility since 2010 to "cultivate better, more sustainable research software to enable world-class research"

Software development: to build and maintain expertise in software
Training: essential software skills for researchers
Policy: campaigning for research software support and career recognition/development for research software engineers
Community: workshops & fellowship
Outreach: website, blog, social media


Fellowship programme

£3000 travel/event bursary for people who want to improve research software
Funded by support grants from research councils
Turns out that "SSI Fellow" is quite a sought-after badge of recognition
Fellows = ambassadors


What makes a good fellow?

Strong plan: novelty (for institution/domain); have the skills/experience to succeed; will make a difference
Content: demonstrate ability to create impact
Communications skills


Typical activities

Workshops/conferences/training (including tailored carpentries)
Promote SSI and contribute to its success
Contribute to SSI blog


Some amazing lasting outcomes from the fellowship programme

Development of services (Melody Sandells)
Contribution to RSE conference & organisation (Alys Brett)
Library Carpentry (James Baker)
recipy workflow management software (Robin Wilson)
Open source versions of common commercial research software (Robin Grant)
Data science for doctors training (Steve Harris)
Establishing reproducible research as standard in a major research group (Stephen Eglen)


Conclusions

The right people to effect change are in the research community
Need support and community
Cross-pollinate ideas across different domains


Collaborations Workshop 2018 focus on themes of Culture Change, Productivity, Sustainability

Lunchtime!

And now it's time for lunch, but after that there will be three parallel breakout groups:

Supporting resources for RDM: toolkits & workflows
Integrating data systems & catalogues
Impact & metrics: reporting & evidencing success

Breakout group feedback

1. Supporting resources for RDM: toolkits & workflows

This includes some information from surveys and interviews around the Jisc research data toolkit project.

Presenting content through journeys is a useful approach
If available, quite a lot of people would use resources in an RDM toolkit to augment their teaching
Preferred mechanism would be working group of HEI-based RDM professionals with Jisc support
Interesting possible features: institutional subdomains with customisable content; CC-BY license; funder policy summaries; regular newsletters

2. Integrating data systems & catalogues


Important themes: ownership, provenance, privacy
Audit trails important, but

3. Impact & metrics: reporting & evidencing success