Last active August 29, 2015 14:16
JISC #digifest15

Big Data and the Data Arts (2015 Mar 10)

JISC Digital Festival 9-10 March 2015 #digifest15

COSMOS (Collaborative Online Social Media Observatory) Web Observatory, Peter Burnap (Cardiff University)

Biodiversity Heritage Library, Riza Batista-Navarro

Big Data for Lexical Research, Jack Grieve (Aston University)

Most words are uncommon, so good lexical analysis needs massive corpora, i.e. billions of documents.

Finding newly emerging words using Twitter. Measured the frequency of the 67,000 most common words each day from a Twitter corpus. Able to look for patterns, e.g. "and" is fairly uniform across a year, while "strawberry" peaks in summer.

Looking for words that emerge from zero, increase and stay e.g. unbothered.
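A minimal sketch of the kind of search described above, assuming each day's tweets have already been split into tokens. The function names, the thresholds and the toy data are illustrative assumptions, not the method used in the talk.

```python
from collections import Counter


def relative_frequencies(tokens_by_day):
    """Return one {word: share of that day's tokens} dict per day."""
    result = []
    for tokens in tokens_by_day:
        counts = Counter(tokens)
        total = sum(counts.values())
        result.append({w: c / total for w, c in counts.items()} if total else {})
    return result


def emerging_words(freqs_by_day, baseline_days, tail_days, min_tail_freq):
    """Words absent for the first `baseline_days` days that stay at or
    above `min_tail_freq` on every one of the last `tail_days` days."""
    baseline = freqs_by_day[:baseline_days]
    tail = freqs_by_day[-tail_days:]
    candidates = set().union(*(day.keys() for day in tail))
    found = []
    for word in sorted(candidates):
        absent_at_start = all(word not in day for day in baseline)
        stays = all(day.get(word, 0.0) >= min_tail_freq for day in tail)
        if absent_at_start and stays:
            found.append(word)
    return found


# Toy corpus (made up): four "days" of tokens.
days = [
    "i love strawberry and cream".split(),
    "i and you and me".split(),
    "i am unbothered and calm".split(),
    "unbothered and unbothered i stay".split(),
]
freqs = relative_frequencies(days)
new_words = emerging_words(freqs, baseline_days=2, tail_days=2, min_tail_freq=0.1)
print(new_words)
```

With a real corpus the baseline and tail windows would be weeks or months rather than days, and the threshold would be far smaller.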

"I" is the most common word on Twitter. Not narcissistic at all then ("the" and "and" are the most common in the English language). However it is also a declining word: Twitter is becoming less personal.

Can drill down not only into usage but also into which words are increasing and decreasing more quickly.
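One simple way to compare how quickly words rise or fall is to fit a straight line to each word's daily frequency and rank by the slope. This is a sketch under that assumption; the talk did not say which measure was used, and the frequencies below are invented for illustration.

```python
def trend_slope(series):
    """Least-squares slope of a frequency series against day index 0..n-1."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = sum(series) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


def rank_by_trend(series_by_word):
    """Return (slope, word) pairs, fastest-rising first."""
    return sorted(
        ((trend_slope(s), w) for w, s in series_by_word.items()),
        reverse=True,
    )


# Invented daily relative frequencies for three words.
series_by_word = {
    "unbothered": [0.000, 0.001, 0.002, 0.003],
    "i": [0.050, 0.048, 0.046, 0.044],
    "and": [0.030, 0.030, 0.030, 0.030],
}
ranked = rank_by_trend(series_by_word)
for slope, word in ranked:
    print(f"{word}: {slope:+.4f} per day")
```

A positive slope marks a rising word and a negative slope a declining one; words with very few occurrences will give noisy slopes.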

Most words are uncommon but if you drill down you are exposing the very edge of language.

The data is also geocoded so can also map the spread of these words. Not necessarily where these words start but pretty close and shows where they are being amplified and spread.

For example, mapped the spread of "unbothered" on a US map across time and animated a visualisation of its spread across the country (both extensive and intensive).
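The data behind such a map can be sketched as two summaries per region: when the word was first seen there (extensive spread) and how often it was used there (intensive spread). The record layout, region codes and observations below are hypothetical.

```python
from collections import defaultdict


def spread_by_region(observations, word):
    """Summarise geocoded uses of `word`.

    `observations` is an iterable of (day, region, word) tuples.
    Returns (first_seen, counts): the first day the word appears in
    each region, and the number of uses per region.
    """
    first_seen = {}
    counts = defaultdict(int)
    for day, region, seen_word in observations:
        if seen_word != word:
            continue
        counts[region] += 1
        if region not in first_seen or day < first_seen[region]:
            first_seen[region] = day
    return first_seen, dict(counts)


# Hypothetical geocoded observations: (day number, region code, word).
observations = [
    (1, "GA", "unbothered"),
    (2, "GA", "unbothered"),
    (2, "TX", "strawberry"),
    (3, "TX", "unbothered"),
    (5, "NY", "unbothered"),
]
first_seen, counts = spread_by_region(observations, "unbothered")
order_of_arrival = sorted(first_seen, key=first_seen.get)
print(order_of_arrival, counts)
```

Plotting `first_seen` and `counts` on a map frame by frame would give an animation like the one described.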

Large corpora allow you to examine rarer and rarer words. The research is not that different, but it allows you to ask slightly different questions. Helpful for exploring rare events.

Interesting applications in forensic linguistics.

Twitter makes its data easily available through its API, so it is easier to do bigger data experiments with it, and it is used for research more often than other social media platforms.


Carole Goble Keynote (2015 Mar 10)

JISC Digital Festival 9-10 March 2015 #digifest15

The key problem: Information flow “too slow and too impoverished”


We bury data into the publication. RIP “Rest in Publication”.

Utopia Documents "datafies" PDFs by pulling the data out.

Why do we have to break publications into pieces to get at data instead of making data “born reproducible”?

Scientific Publications are “virtual witnessing”.

Publications are not the scholarship. They advertise the scholarship.

A lot of papers have no access to primary data, broken links, no software versioning, no released code etc.

Not only can you not access the data, you can't access the method either. You need both to be able to reproduce.

Broken software = broken science.

Hilariously scathing about the effort put into creating software and training scientists in use of software tools compared to say laboratory equipment.

FAIR Publishing:

  • Findable
  • Accessible
  • Interoperable
  • Reusable

Involves tools, standards, machine actionable, formats, reporting, policies, practices.


Libraries are being crushed between the “republic of science” and the regulation of science.

Discusses at length many of the ways science can go wrong: messiness, honest error, deliberate fraud, or problems inherent to the type of experiment.

Scientists' desks are messy. Scientists find it difficult to reproduce their own research in their own labs.

There can be problems with the scientific method: poor training and approach. There are also problems from the social environment with pressure to publish, impact factor mania, broken peer review, time pressures and general disorganisation.

Really fragmented research and publishing environments/ecosystems.


  • Data collection
  • Data discovery
  • Data assembly, cleaning, refinement
  • Modeling
  • Statistical analysis
  • Insights
  • Scholarly Communication and reporting

A myExperiment pack contains all the assets needed to report and reproduce an experiment.

Research Object

Aggregate outputs. Compound investigations, research products. These are units of exchange. These form a commons and provide contextual metadata on the input to experiments not just the outputs.

Research objects are First Class Citizens. They include data, software, methods and paper. They have IDs, they can be managed, credited, tracked, profiled.

The resources in them may span multiple assets, not just those contained within the repository.
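As a rough illustration of the idea, a research object can be pictured as an identified aggregate of typed resources, some held locally and some only referred to. The sketch below is a toy model: the class names, fields and example.org identifiers are assumptions made for illustration, not the Research Object model or any published manifest format.

```python
from dataclasses import dataclass, field
from typing import List

REQUIRED_KINDS = {"data", "software", "method", "paper"}


@dataclass(frozen=True)
class Resource:
    identifier: str         # ID so the resource can be cited and tracked
    kind: str               # "data", "software", "method" or "paper"
    location: str           # URI; may point outside the hosting repository
    embedded: bool = False  # True if held inside the object, False if referred to


@dataclass
class ResearchObject:
    identifier: str
    title: str
    creators: List[str] = field(default_factory=list)
    resources: List[Resource] = field(default_factory=list)

    def add(self, resource: Resource) -> None:
        self.resources.append(resource)

    def external(self) -> List[Resource]:
        """Resources that are referred to rather than embedded."""
        return [r for r in self.resources if not r.embedded]

    def is_complete(self) -> bool:
        """True if data, software, method and paper are all aggregated."""
        return REQUIRED_KINDS <= {r.kind for r in self.resources}


# Hypothetical study; every identifier and URI below is made up.
ro = ResearchObject("ro:example-1", "Example study", creators=["A. Researcher"])
ro.add(Resource("res:1", "data", "https://example.org/data.csv"))
ro.add(Resource("res:2", "software", "https://example.org/code.git"))
ro.add(Resource("res:3", "method", "workflow.txt", embedded=True))
print(ro.is_complete(), len(ro.external()))
ro.add(Resource("res:4", "paper", "https://example.org/paper.pdf"))
print(ro.is_complete(), len(ro.external()))
```

Keeping `location` as a reference rather than a copy mirrors the point that resources may live outside the repository that holds the object.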

TARDIS: Time and Relative Dimension in Scholarship

Move from:

  • closed to open
  • local to alien
  • embed to refer
  • fixed to fluid

These multi-typed, stewarded, sited, authored objects span research, researchers, platforms and time. How do we store, cite and steward these?

Also a shift from

  • document to package
  • publish to release

Research objects being used to package code, study, data and metadata and send it to others.

Mozilla Science Lab has been working on code as a research object.

Research is not a series of static documents that are published but a series of research objects that are released like software. They fork and merge like software. They are version controlled and cited much as software is. This applies to all of the research object's components.

It is the entire study that backs a paper … not a piece of data.

FAIRDOM: Aggregated Commons infrastructure = uber cataloguing tool. Holds all of the pieces together for a particular study.

Research objects can be thought of not just as metadata packages but as instruments. Data and software as instrument. The Research Object workflow as an instrument. Reproducibility is facing uncertainty and change. The lab changes, science changes.

“The questions don’t change but the answers do” - Dan Reed

You have to “prepare to repair”

Be careful with The Cloud. Try replacing the word Cloud with Clown and see how it sounds. If you use The Cloud make sure there is a way to get your data out: a lifeboat, an escape pod.

Different types of reproducibility:

  • Rerun (Robust) - Variations/Internal
  • Repeat (Defend) - Same Experiment/Internal
  • Replicate (Certify) - Same Experiment/Peer Review
  • Reproduce (Compare) - Variations on Experiment
  • Reuse (Transfer) - Different Experiment

RARE Research

  • Robust
  • Accountable
  • Reproducible
  • Explained

It is a big jump from the RARE space (research environment) to the FAIR space (publishing environment)

  • Reproduce by Reading Archived Record/Retaining
  • Reproduce by Running (Virtual Machines)

Goble confessing to slight bitterness on the research/REF process. Tells of how she was criticised for writing a paper so that people would be able to read it.

Model and standards for packaging and publishing research object manifests.

RARE Research and FAIR Publishing

It all sounds good, but it is a small part, used by computationally savvy researchers. The reality is lab books with things stuck in them, files and spreadsheets.

To move from there to RARE and FAIR we need:

  • stealthy progress (reduce friction, optimise The Neylon Equation). For example better data structures, controlled vocabulary in spreadsheets.
  • auto-magical end-to-end instrumentation. For example electronic lab notebooks.
  • get over credit. Credit is not the same as authorship. Need to optimise love, money, fame and duty
  • training. For example software and data carpentry. Also establish pool of software engineers that researchers can call on to help them develop software.
  • need to make reproducibility (public good) a side effect of personal productivity.

The Shift:

  • incremental for infrastructure providers
  • moderate for policy makers and stewards
  • paradigm for researchers and institutions


  • method matters
  • studies born reproducible
  • be smart about reproducibility
  • think commons not repository
  • think release not publish


Livestream recording:


Richard Watson Keynote (2015 Mar 10)

JISC Digital Festival 9-10 March 2015 #digifest15

Richard Watson, What's Next

Digital vs. Human

Technologies have benefits, they also have downsides. Most of all they have consequences.

Volume: we are flooded with information*
*is this necessarily information?

Digital technologies should be used to enhance human relationships, communication and judgement not replace them.

Distraction, and the constant anticipation of distraction, make people 20% more stupid.

Working with multiple screens is equivalent to losing 10 IQ points (the same as not having slept for 36 hours).

You can work ... but not necessarily at your best.

Discussion of smart machines, smart because they are connected to the internet. They will replace humans in jobs that can be automated (possibly 1 in 3).

What can humans do that machines can't? We are curious, creative, like to interact in physical spaces and are caring. Connecting with people emotionally and intuitively may be more important than being smart.

Seems though that many people around the world are finding digital representations (2D or 3D) more satisfying than reality.

Discussing attention spans. More screens and more interactive content is responding to an attention deficiency?

Distraction and attention are solvable but is the loss of deep thinking? Screens are great for finding and filtering stuff fast. The price to be paid is possibly the loss of focused, contextual, reflective thought.

Only 1% of people go past page 1 of Google.

If you are searching for wisdom via the cross-fertilisation of ideas it matters if everyone is looking at the same small patch of information.

Key message is it's not Digital vs. Human it's Digital and Human*
*or Digital as Human?

Ways to Work Smarter

  • Switching Off
  • Match Technology to Task
  • Sleep

Switching Off

We should ritualise being without devices for at least a day a week. Useful to separate devices into home and work and turn work devices off after 7pm. Having a day of rest one day per week is important. This includes switching the mind off from thinking.

Stillness, silence and slowness hugely under-rated in the digital world.

Understand the Best Technology for the Task

Work out the problem you are trying to solve and pick the best technology. Paper is good for contextual arguments, good for spotting mistakes. A pencil is a technology and one that has endured because it is particularly good at some things.


Sleep

Get enough sleep. Our brains don't switch off when we sleep: they are busy processing the day's information and stabilising it as memories. In a sleep state we actively filter information: remembering some, forgetting some, filing and connecting to create new ideas.

Also, bed used to be for sleeping; now a greater range of tasks is being performed in bed, in particular the use of devices. The information and the type of light they release disturb sleep patterns.


Some thoughts on the keynote:

What I often think when I hear these types of talks is that they feel too personal or anecdotal. Is this a particular perspective representing a personal preference, or is there more widespread and rigorous evidence? Sure, I could go and read Watson's book, plus others (Nicholas Carr's The Shallows springs to mind), to find out more. But having watched the earlier keynote on Research Objects and communicating the study, not just the output, it's strange to then watch a keynote that throws out statements without referencing evidence.

In the Q&A at the end Watson even confesses to not being an expert on attention and provides only 'anecdata' based on observation of his kids to back up his claims. Goes back to the argument about the rise of the amateur as producer, and prosumption reducing quality, so that navigating the dross to get to quality is too difficult.

As another example I checked Watson's website and the latest edition (Issue 36) of his newsletter. Most sources are periodicals e.g. newspapers and web sites. They are referenced, but not linked, and even given an integrity rating (though The Telegraph is given a 5 star rating). This means the newsletter is abstracting periodical articles (and maybe adding opinion?) that are themselves reporting on studies. This reflects back on the discussion of the dissemination of scientific information we have been having in #citylis classes.

This doesn't mean the keynote didn't contain interesting points, but I just found it a bit too frothy after all that preceded it. It is one opinion, but I find myself ending by being sceptical about Watson's scepticism.
