@jezcope
Last active November 21, 2017 21:17
RDMF18 liveblog

This is an attempt at using a gist to facilitate liveblogging in a static site. Thanks for joining me for the ride…

The event programme is available online. I'll be co-presenting a talk about using the figshare API with figshare's own Megan Hardeman on the Tuesday at 09.40.

Well, I’ve arrived and obtained biscuits and tea.

Day 1

  • Martin gives us the now standard housekeeping slide
  • Overview of the programme (see the link above)
  • I’m interested to hear about what they’ve been up to at Lancaster with their institutional RDM reporting dashboard
  • There will also be breakout groups tomorrow — I’m sure suggestions for these on the #rdmf18 hashtag will be welcome too, even if you can’t make it!

Keynote: What are the challenges of Data Science?

Prof Magnus Rattray, Professor of Computational & Systems Biology/Director of the Data Science Institute, University of Manchester

An example: Physics

  • Large Synoptic Survey Telescope (LSST): 3.2 Gpixel camera -> 2,000 exposures (= 20TB) per night -> 10 year survey = 100PB data
  • Large Hadron Collider (LHC): theoretical output of 68TB/s (!!!) -> about 1.5GB/s to disk -> 200PB total
  • Square Kilometre Array will produce more data than can be processed today, but will be curated and analysed over years
  • But this isn’t unexpected for physics: it’s being dealt with
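As a rough sanity check on the LSST figures, the arithmetic can be sketched in Python. The 365 observing nights per year and the decimal units (1 PB = 1000 TB) are my own assumptions rather than anything from the talk, so treat the result as an order-of-magnitude estimate only.

```python
# Back-of-envelope check of the LSST figures quoted above.
# Assumptions (not from the talk): observing every night of the year,
# and decimal units (1 PB = 1000 TB).
TB_PER_NIGHT = 20
NIGHTS_PER_YEAR = 365
SURVEY_YEARS = 10


def survey_volume_pb(tb_per_night, nights_per_year, years):
    """Total raw volume in petabytes for a survey of the given length."""
    return tb_per_night * nights_per_year * years / 1000


raw_pb = survey_volume_pb(TB_PER_NIGHT, NIGHTS_PER_YEAR, SURVEY_YEARS)
print(f"Raw images over {SURVEY_YEARS} years: about {raw_pb:.0f} PB")
```

Under those assumptions the raw total comes out in the tens of petabytes, the same order of magnitude as the 100PB quoted.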

Another example: Geography

  • Network analysis of 26m commuter journeys from 2011 census data
  • Classify journeys into 9 super-groups and a total of 40 groups
  • Individual journeys not interesting, but emerging patterns are
  • The tricky stuff is not the machine learning or analysis, but bringing together data from different sources

Mental health

  • Use of wearable devices to track location of people with mental illnesses
  • Handle missing data (e.g. due to mobile/GPS blackspots)
  • Classify places and activities
  • Overlay health status to identify patterns

Research is increasingly data driven

  • Bottom-up modelling: based on assumptions about microscopic principles; develop simulation, run and then compare to reality; refine assumptions
  • Data-driven modelling: identify measurable variables; fit a statistical model to data; make inferences and learn about system by identifying hidden variables
  • Increasingly connected: mixing “mechanistic” prior knowledge into data-driven models

Challenges for data science

  • Scalability
  • Complexity
  • Cleaning messy data (missing data, noise, poor formatting, poor/absent experimental design)
  • Human data (privacy, ethics)
  • Accessibility/availability (openness, reproducibility; e.g. clinicians who protect “their” data to safeguard their future career)

Example: genomics

  • Massive drop in cost of genome sequencing over the last decade
  • “It costs more to analyse a genome than to sequence it.” David Haussler
  • 100k Genomes project now collecting a huge number of genomes
  • But once you can sequence genomes you can examine much more: transcriptomics, epigenetics, proteomics
  • So we can now use this technology to investigate layer-upon-layer of different interacting systems and subsystems
  • E.g. asthma
    • Good for a cohort study because a lot of people have asthma
    • Inconsistency and complexity indicate multiple (sub-)diseases
    • E.g. 2 different versions of CD14 gene are associated with different risk levels in different parts of the world
    • Commonly thought to be a progression: eczema -> asthma -> rhinitis
    • Large-scale analysis shows this progression only presents in a small fraction of the population: i.e. as a general rule it is false

Towards genomic medicine

  • 100k Genomes project: 30PB data held securely, restricted access through secure virtual desktop (“Inuvika”)
  • Privacy of individuals’ genomes is important but difficult

Next revolution: scaling down to single cells

  • Existing methods effectively take an average of ~10k cells
  • As well as looking at large populations of people, we can also go down to individual cell level
  • Single-cell methods show e.g. diverse sub-populations in particular cell types
  • Each cell is now a high-dimensional data point
  • E.g. can trace different mutations through sub-populations of tumour cells
  • Profile individual tumour cells circulating in the blood: can diagnose and design a drug regime based on a blood sample instead of an invasive biopsy
  • Sophisticated modelling required to disambiguate features of interest from multiple confounding factors

Dealing with the challenges

  • Data volume: move compute to the data (e.g. cloud solutions); but will analysis be reproducible in the future, or even across current platforms?
  • Data analysis: scale up algorithms (e.g. deep learning, TensorFlow); use approximate methods; streaming data processing; clever tricks to avoid computationally-intensive tasks
    • Things that used to be considered “software engineering” (e.g. object orientation, testing) are now important for everything
  • Data quality: big data often not collected for a single purpose, so no experimental design
  • Robust & reproducible research: record arbitrary modelling choices and vary them to test for robustness; hypothesis selection & p-hacking; keep track of all hypotheses considered (e.g. electronic lab notebook)
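The point about recording arbitrary modelling choices and varying them can be illustrated with a toy sketch. Everything here (the data, the two choices, the `analyse` function) is hypothetical and invented for illustration; it is not the speaker's method, just one simple way of running an analysis under every combination of choices and keeping a log of what was tried.

```python
import itertools
import statistics

# Toy data with one obvious outlier; purely illustrative.
data = [4.8, 5.1, 5.0, 4.9, 5.2, 9.7]

# Arbitrary modelling choices that would normally be made once and forgotten.
choices = {
    "outlier_cutoff": [None, 8.0],  # drop values above this, or keep everything
    "estimator": ["mean", "median"],
}


def analyse(values, outlier_cutoff, estimator):
    """Run the toy analysis under one combination of choices."""
    if outlier_cutoff is not None:
        values = [v for v in values if v <= outlier_cutoff]
    func = statistics.mean if estimator == "mean" else statistics.median
    return func(values)


# Run every combination and keep a record of what was tried.
log = []
for combo in itertools.product(*choices.values()):
    params = dict(zip(choices.keys(), combo))
    log.append({**params, "estimate": analyse(data, **params)})

for row in log:
    print(row)

# A large spread means the conclusion depends on the arbitrary choices.
spread = max(r["estimate"] for r in log) - min(r["estimate"] for r in log)
print(f"Spread across choices: {spread:.2f}")
```

If the spread is large relative to the effect of interest, the result is sensitive to choices that were never really justified, which is exactly what the log is there to expose.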

Conclusions

  • Research is increasingly data-driven; data science ubiquitous
  • Big & complex data: people (especially statisticians and computer scientists) are already motivated to solve these
  • How do we motivate people to confront problems of messiness, human data, openness (or lack of it)?

Day 2

  • Aaaand we're back again for day 2: a full day of content after yesterday's afternoon session

Case study: CRIS, Research Data & Institutional Reporting

Becky Gordon, Lancaster University

  • Research services view on data about research
  • Work quite closely with library: overlap primarily centred around Pure CRIS
  • Systems:
    • HR, student information, costing/pFact, finance → Pure
    • Pure → Departmental webpages, research directory, repository, data management, equipment register
  • Reporting
    • Financial reports: monthly (really valued by senior academic staff) & annual
    • Organisational unit performance
    • Individual performance: promotions etc.
    • External requirements: OA, REF, HESA, ResearchFish
  • Current project: strategic research management tool
    • Reduce time spent manually generating reports
    • Single hub with live, up-to-date data
  • Business questions - want data on:
    • Awards (number, value)
    • Applications (inc. success rates)
    • Impact (publications, OA compliance, …?)
  • Process overview:
    • Define data and pull out into a data warehouse
    • Build reports on top of this (using Tableau)
    • Additional internal exception reports to track things that might go wrong
    • Data audit & cleaning
  • Challenges
    • Differences in reporting criteria
    • Not enough good-quality data to work with
    • Difficult to make historical comparisons with older reports
  • Next steps
    • Continue to produce manual reports & develop tool & Tableau reports in parallel
    • Agree reporting criteria with senior management
    • Ongoing data cleaning
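To make the "business questions" and "exception reports" ideas above a bit more concrete, here is a minimal sketch of my own. The record layout and field names (`dept`, `value`, `awarded`) are hypothetical and are not taken from Pure or from Lancaster's data warehouse; the real reports are built in Tableau.

```python
from collections import defaultdict

# Hypothetical application records; field names are assumptions.
applications = [
    {"dept": "Physics", "value": 120_000, "awarded": True},
    {"dept": "Physics", "value": 80_000, "awarded": False},
    {"dept": "History", "value": 40_000, "awarded": True},
    {"dept": "History", "value": None, "awarded": True},  # missing value
]


def summarise(records):
    """Return per-department figures plus a list of records needing cleaning."""
    summary = defaultdict(
        lambda: {"applications": 0, "awards": 0, "award_value": 0}
    )
    exceptions = []
    for rec in records:
        if rec["value"] is None:
            # Exception report: incomplete records are flagged, not counted.
            exceptions.append(rec)
            continue
        row = summary[rec["dept"]]
        row["applications"] += 1
        if rec["awarded"]:
            row["awards"] += 1
            row["award_value"] += rec["value"]
    for row in summary.values():
        row["success_rate"] = row["awards"] / row["applications"]
    return dict(summary), exceptions


summary, exceptions = summarise(applications)
for dept, row in summary.items():
    print(dept, row)
print(f"{len(exceptions)} record(s) need cleaning")
```

Keeping the exceptions separate from the headline figures is one way to stop poor-quality data silently skewing the success rates.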

Case study: data repository APIs

No updates from me for a while because I’m part of this talk!

Our slides are available on figshare (of course!)

Managing research throughout its lifecycle

Prof Paul Jeffreys, Institute of Cancer Research

  • About the ICR
    • 8 diverse research divisions
    • Able to recharge infrastructure costs to research so can fund development
    • Future plans: dynamic adaptive therapy
      • As you treat it in an individual, cancer mutates and evolves so you have to keep changing treatment to keep up
      • Data must be live and online
    • Big data is a key pillar in current strategic plan
  • HPC infrastructure
    • 1,800 cores × 12–16 GB, designed for parallel workload
    • Dominated by next generation sequencing; approx 70% usage
    • Jisc data centre in Slough
  • Architecture
    • 6PiB provisioned (expandable to at least 20PiB)
    • 2 tier: tier 1 is fast storage (2PiB); tier 2 an object store (4PiB)
    • NAS layer on top so that storage tiers are a black box for users
  • Policy-based migration from tier 1 → 2
    • Typically migrated if not used for 90 days, but other possibilities exist
    • Migrated to long-term archive at some later date
    • Most files mirrored across 3 sites; smaller (<10MB) files only 2 sites
    • Object store cannot provide quotas, so charge based on actual usage
  • Projects to develop 2 new components for sharing & syncing; also currently using a Dropbox Business service
  • Looking for a metadata catalogue solution
    • Many solutions (e.g. iRods, DSpace) aimed at facilities or libraries
    • Need something easy to use for scientists, and off-the-shelf (able to deliver a proof of concept in one person-month)
    • Open to suggestions!
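The tier 1 → 2 migration and mirroring rules described above could be expressed as a simple policy check. This is only a sketch under my own assumptions: the `FileRecord` type and function names are hypothetical, "10MB" is taken as decimal megabytes, and the ICR's actual storage system will certainly work differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

IDLE_THRESHOLD = timedelta(days=90)  # "not used for 90 days"
SMALL_FILE_BYTES = 10 * 1000 * 1000  # "<10MB"; decimal megabytes assumed


@dataclass
class FileRecord:
    """Hypothetical metadata for one file on tier 1 storage."""
    path: str
    size_bytes: int
    last_accessed: datetime


def should_migrate(record, now):
    """True if the file has been idle long enough to move to tier 2."""
    return now - record.last_accessed >= IDLE_THRESHOLD


def mirror_sites(record):
    """Number of sites to mirror to: 2 for small files, 3 otherwise."""
    return 2 if record.size_bytes < SMALL_FILE_BYTES else 3


now = datetime(2017, 11, 21)
files = [
    FileRecord("run42/reads.bam", 5_000_000_000, datetime(2017, 6, 1)),
    FileRecord("run42/notes.txt", 2_000, datetime(2017, 11, 20)),
]
for f in files:
    print(f.path, "migrate:", should_migrate(f, now), "sites:", mirror_sites(f))
```

Because the rule is a plain function of file metadata, other policies (the "other possibilities" mentioned above) could be swapped in without touching the rest of the system.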

Scaling and empowering cultural change

Shoaib Sufi, Community Lead, Software Sustainability Institute (SSI)

  • SSI: national facility since 2010 to "cultivate better, more sustainable research software to enable world-class research"
    • Software development: to build and maintain expertise in software
    • Training: essential software skills for researchers
    • Policy: campaigning for research software support and career recognition/development for research software engineers
    • Community: workshops & fellowship
    • Outreach: website, blog, social media
  • Fellowship programme
    • £3000 travel/event bursary for people who want to improve research software
    • Funded by support grants from research councils
    • Turns out that "SSI Fellow" is quite a sought-after badge of recognition
    • Fellows = ambassadors
  • What makes a good fellow?
    • Strong plan: novelty (for institution/domain); have the skills/experience to succeed; will make a difference
    • Content: demonstrate ability to create impact
    • Communications skills
  • Typical activities
    • Workshops/conferences/training (including tailored carpentries)
    • Promote SSI and contribute to its success
    • Contribute to SSI blog
  • Some amazing lasting outcomes from the fellowship programme
    • Development of services (Melody Sandells)
    • Contribution to RSE conference & organisation (Alys Brett)
    • Library Carpentry (James Baker)
    • recipy workflow management software (Robin Wilson)
    • Open source versions of common commercial research software (Robin Grant)
    • Data science for doctors training (Steve Harris)
    • Establishing reproducible research as standard in a major research group (Stephen Eglen)
  • Conclusions
    • The right people to effect change are in the research community
    • Need support and community
    • Cross-pollinate ideas across different domains
  • Collaborations Workshop 2018 will focus on the themes of Culture Change, Productivity, Sustainability

Lunchtime!

And now it's time for lunch, but after that there will be three parallel breakout groups:

  1. Supporting resources for RDM: toolkits & workflows
  2. Integrating data systems & catalogues
  3. Impact & metrics: reporting & evidencing success

Breakout group feedback

1. Supporting resources for RDM: toolkits & workflows

This includes some information from surveys and interviews around the Jisc research data toolkit project.

  • Presenting content through journeys is a useful approach
  • If available, quite a lot of people would use resources in an RDM toolkit to augment their teaching
  • Preferred mechanism would be working group of HEI-based RDM professionals with Jisc support
  • Interesting possible features: institutional subdomains with customisable content; CC-BY license; funder policy summaries; regular newsletters

2. Integrating data systems & catalogues

  • Important themes: ownership, provenance, privacy
  • Audit trails important, but

3. Impact & metrics: reporting & evidencing success
