csv,conf,v2

Notes by @philandstuff, 4 May 2016

Ben Foxall, Serving CSV from the browser

  • @benjaminbenben @pusher

why do I love CSV?

  • accessibility - don’t need specialized programs
  • it’s the start of something, not the end
    • you don’t print out a csv document, you do something further with it
  • how do you get your data back from the cloud?

example: runkeeper

  • runkeeper - GPS routes of running
  • how do I get my data from it?
    • attempt #1: download
      • .zip file, containing .gpx coordinates and .csv with heart rate etc
      • 👌
      • but:
        • no format control
        • functionality might change or disappear
        • need to go online to retrieve different times
          • eg if you need a different set of dates
      • can we gather together our data and cut it up ourselves offline?
    • attempt #2: script
      • bash script, jq for json processing
      • github gist benfoxall/runkeeper-export.sh
      • 👌
        • format choices
        • sharable
        • offline
      • 👎
        • inaccessible
          • downloading a csv is easy, writing a bash script is hard
    • attempt #3: web service
      • runkeeper-to-csv.herokuapp.com
      • connect to runkeeper api, convert to csv documents
      • 👌
        • accessible
      • 👎
        • non-trivial backend
        • handling sensitive data
        • online only
    • attempt #4: serve from the browser
      • 👌
        • accessible
        • data stored locally
      • 👎

what we will implement

  • request -> process -> serve
  • a small runkeeper API
    • javascript fetch() API (it’s the new ajax, returns a promise)
    • dataForUser(..)
  • process
    • turning JSON into CSV (see the first sketch after this list)
  • serve to the user
    • data URIs
      • present csv as a data URI
  • 🚢
  • how can we make it better?
    • support bigger files
      • csv might be big
      • data uri does base64 which makes it even bigger
      • solution: Blob()
        • supported by browsers (except IE9)
        • creates an object outside the javascript stack
          • it’s also immutable
        • avoids churning through VM memory
        • can generate URLs to download Blobs
    • no persistence
      • IndexedDB (+ Dexie)
      • chrome devtools resources tab shows the IndexedDB contents
    • no permanent URLs
      • the Blob URL is only valid while the page is loaded
      • a static script can’t pull from this URL
      • Service Workers!
        • a script which runs separately from your UI thread
        • allows offline-first websites
        • can serve cached content
        • can serve synthesized responses (see the second sketch after this list)
        • proper URL, but no web request
      • 👌
        • we can cache the service worker response
        • we can serve different views on our data
          • geoJSON, csv, html
        • frontend code can use Service Workers without knowing they exist
        • all of this is now offline-capable
  • https://runkeeper-data.herokuapp.com
    • when you visit, log in with OAuth
    • service worker starts and downloads data into IndexedDB
    • continues (even if you close the tab!)
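
A minimal sketch of the request → process → serve flow above. The endpoint URL and field names are assumptions for illustration, not the talk’s actual code; fetch(), Blob and URL.createObjectURL are the standard browser APIs the talk refers to:

```js
// request: fetch() is promise-based ("the new ajax")
function dataForUser(token) {
  return fetch('https://api.runkeeper.com/fitnessActivities', { // assumed endpoint
    headers: { Authorization: 'Bearer ' + token },
  }).then(res => res.json());
}

// process: naive JSON-array-to-CSV (real code would quote/escape values)
function toCSV(rows) {
  const headers = Object.keys(rows[0]);
  const lines = rows.map(row => headers.map(h => row[h]).join(','));
  return [headers.join(','), ...lines].join('\n');
}

// serve: a Blob lives outside the JS heap and is immutable, so large CSVs
// avoid VM memory churn.
// (the data URI version would be 'data:text/csv;base64,' + btoa(csv),
//  but base64 makes big files ~33% bigger)
function csvURL(rows) {
  const blob = new Blob([toCSV(rows)], { type: 'text/csv' });
  return URL.createObjectURL(blob); // blob: URL, valid only while the page lives
}
```

And a second hedged sketch of the Service Worker idea: a fetch handler that synthesizes a CSV response at a proper URL with no web request (the /data.csv path is an assumption; in the talk the rows would come out of IndexedDB):

```js
// sw.js: runs separately from the UI thread
self.addEventListener('fetch', event => {
  const url = new URL(event.request.url);
  if (url.pathname === '/data.csv') { // assumed path
    event.respondWith(
      new Response('time,heart_rate\n0,72\n', { // synthesized response
        headers: { 'Content-Type': 'text/csv' },
      })
    );
  }
});
```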

Richard Jones, CSV as the master dataset

  • founder, Cottage Labs
  • software dev agency, higher education work

what are we trying to do?

  • bespoke information systems, quickly and cost-effectively
    • clients work with spreadsheets
    • upload into a datastore
    • people can then query the datastore and see interesting views

why?

  • humans love spreadsheets
    • especially in the non-technical world
    • tabular data is easy to work with
    • the desktop toolchain is excellent (much as we might complain)
    • we could never meet the needs that these tools meet (especially on our time and budget and skills)
  • lots of information systems are basically the same
    • most of the differences are the kind of data being worked on
    • workflows exist, but they happen in the admin area
  • admin areas are expensive and boring to build
    • lots of web forms – create/edit/delete record
    • I’d like it if people could manage their data outside of the admin system
  • data visualisation, data science, data journalism are all in
    • but also specialist domains and outside the reach of small organizations
    • (I’m not a data specialist - no machine learning or stats – but I can help the client cut up their data)
  • we find ourselves doing much the same thing over and over again

the weird things people do with spreadsheets

  • they put blurb above their header rows
    • the actual table of data is a few rows down
    • a spreadsheet is a document, not a dataset
  • they colour cells in, with the colour carrying meaning
    • this disappears on export-to-csv
    • the form-vs-function distinction isn’t clear when seeing a spreadsheet as a document
  • sloppy with hard formats (like numbers)
    • eg -£1,00,0000.0
  • they break boundaries of acceptable use for typed fields
    • eg cost column containing “$100 to about 200”
    • data models are brittle, humans are flexible

how do we read a spreadsheet?

  • decode the bits
    • welcome to encoding hell!
      • excel might give you latin-1 or Windows-1252 (similar, but not the same!)
      • excel/numbers might give you MacRoman on OS X
      • Calc will hopefully give you UTF-8
      • any of them could do any one of hundreds of encodings
    • some encodings are interchangeable, but the newline character is not a common link
    • we check we’ve actually got a rectangular dataset for confidence
  • read the data
    • ignore supporting documentation above the dataset
    • translate the header rows
    • trim content, ignore empty values, and “N/A” values
    • coerce data into something cleaner (“£1,000” -> 1000; see the sketch after this list)
    • we’re not scrubbing the data, just allowing for the humanity in the book-keeping
    • output: JSON
  • make it queryable
    • Elasticsearch
  • publish interactive interfaces
    • javascript frontend on top of elasticsearch query engine
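
A hedged sketch of the “read the data” step above: trim, drop empties and “N/A”, and coerce hard formats like “£1,000” into numbers. The function name is mine, not Cottage Labs’ actual code:

```js
// Illustrative cell cleaner (an assumption, not Cottage Labs' code)
function cleanCell(raw) {
  if (raw == null) return null;
  const value = String(raw).trim();
  if (value === '' || value.toUpperCase() === 'N/A') return null;
  // Coerce "hard" formats: allow for the humanity in the book-keeping
  // without scrubbing the data
  const numeric = value.replace(/[£$,\s]/g, '');
  if (/^-?\d+(\.\d+)?$/.test(numeric)) return Number(numeric);
  return value; // leave anything else as text
}

cleanCell(' £1,000 ');          // -> 1000
cleanCell('N/A');               // -> null
cleanCell('$100 to about 200'); // -> "$100 to about 200" (humans are flexible)
```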

other work we’ve done

  • open access spectrum
  • lantern (CSV-only interface)

what’s hard?

  • some data is hard to represent in spreadsheets
    • hierarchical or highly relational data
    • don’t make people use a spreadsheet the way we’d use a database!
  • consistent use of dictionary terms
    • if the spreadsheet maintainers can use consistent names for things, like Countries, it can make things much easier

tech roll call

  • we’re not trying to duplicate: open refine, trifacta, tableau
  • things we do use:
    • d3 + nvd3
    • elasticsearch
    • objectpath (xpath-like language for JSON)
  • things we tried but aren’t currently using
    • highcharts
    • tablib

Q&A

  • what’s your largest elasticsearch dataset? largest index?
    • 2.5 million records; 25GB

Mathias Buus (filling in for Karissa), distributing open data with dat

what is dat?

  • http://dat-data.com/
  • open source project for sharing open data
  • funded by Alfred P Sloan foundation
  • meetings are open youtube hangouts
  • 3 person team
  • >800 modules on npm
    • around half a percent of all npm modules!
  • dat is a p2p file sharing network
  • written in javascript
  • works in browser
  • move the data to the code (don’t move your code to your data)
  • data is just files
  • you don’t need all the files
  • move just the files you need to the code
  • similar to BitTorrent
  • install: npm install dat

sharing data

  • dat link ~/big-file.csv
    • creates a content-addressable link dat://9620fb285...
    • can give the link to a friend, then they run dat dat://9620fb285... and automatically discover you and start downloading the dataset

how does it work?

  • split file into chunks which are unlikely to change
    • git does one-chunk-per-line
    • if I change one line, I only have to sync that one line, even if the file is large
    • only works for text files
  • rabin fingerprinting (content-defined chunking; see the sketch after this list)
    • scans through the file and creates chunks based on actual file content
    • if you insert something in the middle, a rabin fingerprint will create the same chunks on each side of the change
  • npm install rabin
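
A toy sketch of content-defined chunking. The real thing uses Rabin polynomial fingerprints (the npm rabin module does this natively); this stand-in uses a simpler Rabin-Karp-style rolling hash over a fixed window, which still shows the key property: a boundary depends only on the bytes near it, so inserting data in the middle leaves the chunks on either side unchanged:

```js
// Toy content-defined chunker (illustration only, not the rabin module's API)
const B = 31;       // rolling-hash base
const M = 0xffffff; // hash modulus
const W = 32;       // window size in bytes
const MASK = 0xfff; // => average chunk size around 4KB

// B^W mod M, precomputed so a byte can "leave" the window
let BW = 1;
for (let i = 0; i < W; i++) BW = (BW * B) % M;

function chunkBoundaries(buf) {
  const boundaries = [];
  let h = 0;
  for (let i = 0; i < buf.length; i++) {
    h = (h * B + buf[i]) % M;                            // byte enters the window
    if (i >= W) h = (h - (buf[i - W] * BW) % M + M) % M; // byte leaves the window
    // the cut decision looks only at the last W bytes of content,
    // so edits elsewhere in the file can't move this boundary
    if (i >= W && (h & MASK) === MASK) boundaries.push(i + 1);
  }
  if (boundaries[boundaries.length - 1] !== buf.length) boundaries.push(buf.length);
  return boundaries;
}
```

Sync then only needs to transfer the chunks a peer doesn’t already have.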

demo

Zara Rahman, Bridging the gap: tech <-> activism

  • @zararah

background

  • open knowledge, school of data, engineroom
  • bridging gaps between communities who don’t talk to each other, or people who do talk but in different ways

responsible data program

using sensitive data

  • Physicians for Human Rights
    • programme on sexual violence in conflict zones
    • lots of victims don’t come forward to report
    • even when they do, there are challenges in recording accurately
    • Kenya and eastern Democratic Republic of the Congo
    • MediCapt
      • standardising data collection
      • digitising data collection
        • mobile network penetration is very high, but the data is sensitive
      • iterating upon tool choice
        • tried an off-the-shelf tool, piloted, found it too cumbersome
        • developed a new tool, user research with people on the ground
      • reality check
        • evaluate at the end
        • start all over again and iterate
        • slow development
  • Sharing reports of violence
    • a non-profit wanted to support a community which faces a lot of violence
    • they weren’t particularly experienced in technology
    • started thinking of developing an app
      • report a perpetrator of violence to anyone in the area
    • legal, privacy issues
      • can’t have PII because this is an allegation
      • but without PII the report isn’t that useful
      • need to tread a fine line
    • future proofing
      • data minimization
      • don’t want to hold data which could in future put people at risk
      • people were put off from using app if they had to give too much information
    • collaboration
    • launch

analysing data

HRDAG

  • Human Rights Data Analysis Group
  • https://hrdag.org
  • data on casualties in Syria
    • listing different groups documenting
  • “Numbers are only human”
    • how do you categorise civilian vs military death?
    • how do you categorise death due to conflict vs “natural causes”?
  • should you use exact (but uncertain) figures to draw attention to causes?

data in the Ebola response

  • http://cis-india.org/papers/ebola-a-big-data-disaster
  • in some countries there was a push to release Call Detail Records (CDRs) from mobile companies
  • getting access to the data
    • in Sierra Leone and Guinea, they released this data; in Liberia they didn’t
  • decision-making
    • the call was to have the data anonymised
      • but: it’s hard to anonymise such detailed information
      • and: in the Ebola response, the data is most useful when it can be linked to real personal identities
    • privacy rights weren’t respected
  • digital infrastructure

questions to ask yourself

  • what might an adversary do with your data?
    • not necessarily your adversary
    • what malicious things could they do with your data and how might they gain from that?
    • what would happen then?
  • what’s your holistic security plan?
  • what does informed consent look like for your users?
    • if you know that no one’s reading your Ts & Cs
    • are you making things visible that your users should know about?
  • what levels of technical literacy do your users have?
  • in your team, whose job is it to think about the ethics?

conclusion

  • tech & data projects can have unintended consequences, even when well-intentioned

Q&A

  • do you have examples where they managed to embed context with the data
    • the MediCapt team found the context crucial
    • the HRDAG work has lots of asides and nuanced explanations
      • they’re very careful about what they say, though they are probably more sure about their findings than many other groups
  • this reminds me of an app for reporting requests for bribes. how do organizations share anonymised data securely?

Jeni Tennison, Making CSV part of the web

  • Technical Director, ODI

the dream

motivating example: election data

  • data on wikipedia about the last local elections
    • data table + map
  • all of this is hardcoded behind the scenes in table rows
  • if you want to get hold of the data, you need to parse the html
  • election results are often entered on wikipedia really quickly
    • it’d be really cool to be able to get them out quickly too
  • it’s also not great to have the same data duplicated
  • could we reference the CSV data directly?
  • we can do it with images <img src="url://">; why not with tables of data?
  • <table src="uk-local-election-summary-2015.csv">
  • reference source for party-to-colour mapping
    • could bring it into your maps and tables

benefits

  • it would help people presenting data
  • improve quality of data available for us
    • motivate machine-readable data
    • motivate fixing of errors
      • visualisations of your tabular data demonstrate errors very quickly!
    • motivate publishers to give accurate metadata

getting to a standard

  • CSV on the Web @ W3C completed 2016
  • building on and learning from:
    • OKFN’s data packages / Tabular Data Format
    • Google’s Dataset Publishing Language
    • national archives validation
    • existing CSV parsers
    • broad set of documented use cases & requirements

the difficult bits

discovering metadata

  • CSV needs metadata
    • “these columns contain numbers”
    • “this column should be displayed as +/-”
    • “this column is a pointer to this other table”
    • the metadata needs to be in a separate file
  • CSVW metadata standard
  • people want to download CSV, not a zipped-up package or the JSON metadata
    • the JSON metadata has a link to the CSV so you could discover it (in principle)
    • but: normal people won’t do this
    • the link generally needs to be to the CSV file itself
    • how do we find the metadata?
    • RFC 5988 link:
Link: <metadata.json>; rel="describedBy"; type="application/csvm+json"
  • often can’t control Link: headers though
    • default filenames
      • just add -metadata.json to the end of the csv file’s name (see the sketch after this list)
    • there’s some geeky stuff about /.well-known if you care about that
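
A hedged sketch of that discovery logic: honour the Link header where the publisher controls it, otherwise fall back to the -metadata.json filename convention (a real client would also check the /.well-known location mentioned above):

```js
// Find the CSVW metadata for a CSV URL (illustrative, not a spec-complete client)
async function findMetadata(csvUrl) {
  const res = await fetch(csvUrl, { method: 'HEAD' });
  const link = res.headers.get('Link');
  if (link) {
    // e.g. Link: <metadata.json>; rel="describedBy"; type="application/csvm+json"
    const match = link.match(/<([^>]+)>\s*;[^,]*rel="describedBy"/);
    if (match) return new URL(match[1], csvUrl).href;
  }
  return csvUrl + '-metadata.json'; // default filename convention
}
```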

linking between CSVs

  • hard links
    • you can (if you want) think of CSVs as being like a relational database
    • foreign key relationships in your metadata (included in the example metadata below)
  • soft links

machine/human readability

  • CSV is on the boundary between these two worlds
  • human variability in CSV headers
    • “country” vs “Country”
    • “unemployment” vs “Unemployment rate”
    • CSVW metadata standard allows you to give different options for titles and indicate they mean the same thing (see the example after this list)
    • locale-specific variation
      • {en:country, de:Land}
  • formats for dates and numbers
    • use standard number & date formats
      • Unicode Technical Standard #35
    • minimal set that MUST be implemented
      • nothing that requires actually knowing languages
      • eg names of months, currency units
    • Implementations can do more
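
Pulling the last two sections together, a hedged example of what the CSVW metadata might look like: multilingual titles, a UTS #35 number format, and a hard link (foreign key) to another CSV. Property names follow the CSVW spec as I recall it; treat this as illustrative:

```js
const metadata = {
  "@context": "http://www.w3.org/ns/csvw",
  "url": "uk-local-election-summary-2015.csv",
  "tableSchema": {
    "columns": [
      { "name": "country",
        "titles": { "en": "country", "de": "Land" } },          // locale variants
      { "name": "unemployment",
        "titles": ["unemployment", "Unemployment rate"],        // both headers accepted
        "datatype": { "base": "decimal", "format": "#,##0.0" } } // UTS #35 pattern
    ],
    "foreignKeys": [{
      "columnReference": "country",                             // hard link between CSVs
      "reference": { "resource": "countries.csv", "columnReference": "code" }
    }]
  }
};
```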

what’s next

  • Implementations
    • validation
    • conversion
      • into JSON and into RDF
    • authoring metadata
    • not yet for display
      • tables, maps, etc
      • it’d be really cool to have some web component type stuff
      • <table src="...">
    • annotation?
    • navigation?
  • https://www.w3.org/TR/tabular-data-primer

Q&A

  • sometimes there’s a value in the header (eg “election results 2014”). how do you deal with that?
    • there is a facility for “virtual columns” for static information

Matt Chadburn, Democratising data at the FT

  • principal engineer, FT

about the FT

  • 800,000 subscribers
  • company licences

users of data

  • page analytics
    • education
    • when do you remove something from the front page because it’s becoming stale?
  • email communication with users

summary

  • focus on the user’s needs
  • learnable
  • ease of use (APIs to get stuff in and out)
  • iterative

Mouse Reeve, Grimoires, Demonology and Databases

  • I work for the Internet Archive, but I’m not here to talk about that
  • @tripofmice
  • grimoire.org

what is a grimoire?

  • a book of magic spells and invocations - OED
  • scope for this talk: 16th and 17th century, European Christian tradition
  • in this time:
    • no clear divide between magic, religion, science
    • cunning folk prevalent in Europe
      • “low” magic
      • common people, often illiterate
      • medicine, divination, folk magic
    • ceremonial magic
      • “high” magic
      • summoning angels, demons, spirits, fairies
      • piously christian (sometimes, at least)
    • witchcraft
      • capital offence
      • nobody self-identifies as a witch
  • what’s okay vs a capital offence? what’s for scholars vs common people? it’s a bit woolly
  • England, 1580
    • Queen Elizabeth I
    • John Dee
      • some of his magical items are now in the British Museum
    • William Shakespeare
      • Prospero from The Tempest (based on John Dee?)
      • Oberon from A Midsummer Night’s Dream
        • grimoires offered spells to summon Oberon
    • Pseudomonarchia Daemonum (1577)
    • Lesser Key of Solomon (1641)
  • King Solomon’s Temple
    • Solomon was able to summon, control, and use demons to help build his temple, aided by the archangel Gabriel

demons

  • examples:
    • agares
    • crocell
    • buer
  • every demon is given a sigil, which is a calling card used to summon them
  • summoning a demon is really involved
    • elaborate circles
    • if you get it wrong, you might get eaten
  • crocell’s powers:
    • make it sound like it’s raining
    • run you a warm bath
    • teach you geometry
    • that’s it!

what I want to know

  • what are grimoires for?
    • how do they get used?

how I did it

  • it’s tough to model in a relational database
  • lots of many-many relationships (eg demon <-> grimoire)
    • join tables
  • I used neo4j to model this as a graph problem (query sketch after this list)
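
A hedged sketch of what a query over that graph might look like, using the neo4j-driver package; the labels, relationship name and properties are my guesses, not the actual grimoire.org schema:

```js
const neo4j = require('neo4j-driver');

const driver = neo4j.driver('bolt://localhost:7687',
                            neo4j.auth.basic('neo4j', 'password'));

async function demonsIn(title) {
  const session = driver.session();
  try {
    // many-to-many demon <-> grimoire without join tables:
    // the relationship *is* the data model
    const result = await session.run(
      `MATCH (d:Demon)-[:APPEARS_IN]->(g:Grimoire {title: $title})
       RETURN d.name AS name`,
      { title }
    );
    return result.records.map(r => r.get('name'));
  } finally {
    await session.close();
  }
}

demonsIn('Lesser Key of Solomon').then(console.log); // e.g. [ 'Agares', 'Crocell', 'Buer', ... ]
```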

spells!

  • eg: glue to fix a porcelain vase (?!)

graph data structures

  • advantages:
    • designed for relationships & connections
    • flexible
    • no migrations
  • disadvantages
    • no schema for consistency
    • non-performant for simple tabular data
  • common use cases
    • social networks
    • public transport systems

results

Q&A

  • do any of these demons appear in paintings?
    • don’t know
  • what did people use these grimoires for?
    • hard to know
  • do you have a way to tell how comprehensive your dataset is?
    • the complete dataset is borderline infinite
    • there’s a finite number of grimoires that have survived and been translated into english
  • you mentioned neo4j for pictorial representation. anything else for this purpose?
    • no
    • I have tables of spells and a timeline, but not much else in terms of data visualisation
  • could you use this dataset to perform unsupervised learning to generate new spells or demons?
    • sure why not

Sarah Gold, keynote: designing for data

  • @sarahtgold
  • @projectsbyif

my background

  • government, politics, civics, …
  • GDS
  • currently: IF
    • a design studio
    • we make things that change how people think about data
    • we are multidisciplinary
      • product development
      • design
      • security
    • we understand technology and design as disciplines which inform each other
    • everything we do is centred on people
      • people who understand the things they use make better decisions about how to use them

problem space

  • more things are becoming data conscious
    • more data being collected
    • more things being connected to the internet
    • it’s never been so cheap to put a chip in it
    • IoT
    • Internet of Shit
      • @InternetOfShit
        • there’s a lot of nonsense
  • we are producing a lot of personal data
    • phones, laptops, fitbits, etc
    • data maximalism
  • Ts & Cs are our default consent model
    • and they don’t work
    • samsung smart TV privacy policy: “Don’t talk in front of the TV”
  • objects are becoming informants
  • we don’t know if something is working properly
  • software is politics – Richard Pope

monitoring & testing

  • gherkin syntax
  • makerversity

design for data

  • design for minimum viable data
  • know which data type you’re designing with

consent models

Q&A

  • the more informed people are about the implications of tracking, the more likely they are to say no; how do companies which provide free services deal with this?
    • it’s very complicated
    • ad blockers
    • not enough time to do this justice
    • with instances like royal parks, they could give their patrons information about how useful their data has been

Jenny Bryan, keynote: spreadsheets 😱

  • professor of statistics at UBC
  • @JennyBryan @STAT545

spreadsheets!

  • it’s nice to be allowed to talk about spreadsheets for once
  • people like to moan about them
  • slides (with references!) https://github.com/jennybc/2016-05_csvconf-spreadsheets
  • inspiration: csv,conf,v1 talk Felienne Hermans “Spreadsheets are code”
  • it’s okay to care about spreadsheets!
  • how I pick people to work with:

“some of my best friends use spreadsheets”

  • inequality is toxic in a whole lot of contexts
    • in this case: ability to do what you want with data
    • there’s this “data 1%”
    • anything we want to do, we know how, or how to figure it out, or how to find someone who knows
    • lots of people I teach at UBC are much less able to get these things done, feel paralysed
    • down with software elitism
    • up with the last mile of data munging
  • I supported myself for ~4 years doing spreadsheets
    • I was doing a management consulting gig
    • during grad school I supported myself doing high-end excel work
    • there’s a lot you can do with these consumer-level tools
    • I’d like to create a more porous border between spreadsheets and R/python/etc
  • https://twitter.com/tomaspetricek/status/687947134088392704
    • “Ouch. “50 million accountants use monads in Excel. They just don’t go around explaining monads to everyone…” @Felienne #ndclondon”
  • reactivity is one of the main things people love about spreadsheets
    • spreadsheets have pushed computer science to deal with reactivity
    • I was talking on a podcast about the future of spreadsheets and whether they will go away; I felt reactivity was key
    • with R, I write a Makefile to rebuild everything from scratch
      • but I still have to kick this thing
  • spreadsheets also have less syntax bullshittery
    • argument names, separators, etc
    • you can just select things with your mouse and click “average”
  • FACTS!
    • about 1 billion people use MS Office
    • about 650 million people use spreadsheets
    • up to half use formulas
    • 250k - 1m use R
    • 1-5m use Python
  • you go into data analysis with the tools you know, not the tools you need

crazy spreadsheet stories

  • what you think people are doing ≠ what you think people should be doing ≠ what people are actually doing
  • most tools are designed for the middle thing (what you think people should be doing)
  • The Enron Corpus
    • “the Pompeii of spreadsheets”
    • 600k emails
    • 15k spreadsheets
  • example:
    • some cells are data
    • some are formulas
    • some are phone numbers
    • visualizations
    • spreadsheets within spreadsheets (ie a rectangular group of cells)
    • Hermans, Murphy-Hill (research paper on the corpus)
  • lots of colour
    • data and formatting blurred together
    • font choice and colour of cell gives you a categorical variable
  • inconsistency between rows and columns
  • references to other spreadsheets, that you don’t have
  • columns of intermediate computations are so boring, so they get hidden
  • http://xkcd.com/1667/

what makes spreadsheets so vexing?

  • machine readable & human readable
    • (see JeniT’s keynote further up)
    • a spreadsheet is often neither machine nor human readable
      • technically, yes you can open them and look at them
      • but a machine cannot get useful data out in an unsupervised, scalable way
      • and a human reading someone else’s spreadsheet is like reading another person’s codebase
    • spreadsheets are (data ∩ formatting ∩ programming logic)
      • but often we only care about one or two of these concerns
      • (can we separate them after the fact?)

how do we fix this?

  • what are the problems?
  • which ones can we solve?
    • with training?
      • sometimes people use spreadsheets for inappropriate things and we can train them to stop it
    • with tooling?
      • (just a subset; not all problems can be solved with tooling)
  • two angles:
    • create new spreadsheet implementations that use, eg, R or python for computation and visualization
      • anticipate version control, collaboration
      • AlphaSheets
      • stencila
    • accept spreadsheets as they are
      • create tools to get goodies out
      • maybe write back into sheets?
  • the googlesheets R package
    • (google sheets are much less common than excel, but they’re still reasonably common)
  • goal: spreadsheet reading tools in R
    • with no non-R dependency
  • Book: Spreadsheet Implementation Technology

Q&A

  • what are the interesting differences between excel and google sheets (for ingesting data)
    • the excel spec is 6000 pages long; the google sheets spec is 0 pages long
    • I wish there was something in between
    • they’re both very verbose xml
    • not really big differences in parsing
    • google sheets has to chase excel and be super compatible with excel


Rufus Pollock and Dan Fowler, Frictionless Data

motivation

  • getting UK government to publish data on all their spending
    • in CSV format
    • with a spec
      • defined columns
  • but: problems
    • a 401 html error page saved as csv :/
  • friction
  • containerization for data (example descriptor after this list)
    • docker docker docker
  • key principles
    • simplicity
    • web oriented
    • existing tools
    • open
  • validation
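
A hedged example of the “container” idea: a Tabular Data Package wraps a plain CSV in a small descriptor (datapackage.json) that declares the columns; the field names follow the Frictionless Data spec as I recall it:

```js
const datapackage = {
  "name": "uk-government-spending",
  "resources": [{
    "path": "spending.csv",     // the data stays a plain CSV
    "schema": {                 // the "spec with defined columns"
      "fields": [
        { "name": "department", "type": "string" },
        { "name": "amount",     "type": "number" },
        { "name": "date",       "type": "date" }
      ]
    }
  }]
};
```

A validator can then check each incoming file against the schema, and reject, say, a 401 html error page saved as csv.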

Darren Barnes, Data Baker: Pretty Spreadsheets to Useful CSVs

  • a success story from the previous csv,conf

Context

  • ONS produces thousands of spreadsheets each year on our website
    • we’re getting more efficient at it
    • the underlying structures no longer exist for us to get that data in a machine-readable way
    • we’ve gotten so good at producing these spreadsheets but neglected the source data
    • we have CSVs, but “we can’t publish that on the website”
      • I can’t do my aggregation in there
  • how do we get to a point where we publish CSVs?

history

  • scraperwiki + ONS at csv,conf,v1
  • Dragon Dave McKee’s talk on XYPath
  • version 1
    • python
    • command-line
    • not pretty but functional
  • example
    • spreadsheet with merged cells, multiple tabs, hidden columns, etc etc (see Jenny Bryan’s keynote above)
    • we set up some recipes to instruct Data Baker:
      • what files we want to look at
      • where the data is
      • what transformations we want to do
    • run the command
      • slurp in the .xls files
      • generates some output .xls files
      • one output: a colour-coded .xls file to show how the data was sliced up
        • sanity check to make sure we’re doing it right
  • code! https://github.com/scraperwiki/databaker

Jeremy Freeman, open source neuroscience

  • the Janelia Research Campus (“the bell labs of neuroscience”)
    • northern Virginia
    • research institute, non-profit funded

motivations: why do we study the brain?

  • there’s a lot we don’t know
    • try talking to fifth graders!
    • “how is it that I can hear a phone number and the next day I still remember that phone number?”
    • “why do I always dream about robots and dinosaurs?”
  • mice as a model
    • two-photon imaging

using data

  • we often want to analyse data as quickly as possible to drive decisions about what experiment to do next
  • random access two photon mesoscope
  • rich data patterns of brain activity
  • the 80/20 problem
    • time spent doing incredible measurements
    • time spent doing other stuff
    • used to be 80% data gathering & experimental research; 20% analysis
    • now, it’s all changed; only 20% doing actual science
  • analysis isn’t a linear process
    • lots of backtracking and dead ends
    • lots of reinventing the wheel between different labs
      • no sharing of infrastructure
      • often no source control
  • goal: lots of modules that solve well-defined small problems, that can be glued together
    • eg thunder project & bolt-project
    • thunder: a collection of modules for image and time series data analysis
    • neurofinder.codeneuro.org
      • analysing a picture and determining which groups of pixels correspond to neurons
      • a really common neuroscience problem!
      • but every lab has come up with their own independent way of doing it
      • website to allow people to submit results from their algorithms (against training and testing datasets)
      • (Question: why didn’t you use Kaggle?
        • this seemed like a simple enough problem to solve for ourselves rather than buying into the Kaggle space
        • we originally thought about having people submit code and run it in a container, but running MATLAB in a container is somewhere between difficult and illegal)
  • lightning-viz.org – modular visualization things
  • https://github.com/mikolalysenko/regl
    • webgl and 3d is a really important part of the future of scientific visualization
  • the 1 to 2 problem:
    • starting collaboration between two individuals
    • jupyter notebooks
  • https://github.com/sofroniewn/tactile-coding
    • github is great for sharing code (and to some degree, data)
    • it doesn’t solve the problem of making an environment usable on someone else’s machine
    • can we use things like docker to take jupyter notebooks and data and code and bundle them all together?
      • that would be good: previously we had to repeat the complex process each time
  • mybinder.org
    • tell us a github repo
      • has to have a certain set of contents
        • code needed to run your notebooks
        • some metadata
        • (not required: a complete Dockerfile)
      • builds a docker image
      • then embed a button in your github repo
        • the button launches into a running environment
    • what’s the value in being able to reproduce someone else’s analysis?
      • if someone can rerun this and, as a result, start a collaboration, that’s really cool
  • BuzzFeed made a binder to analyse refugee data
    • data relevant for policy decisions: we should have access
    • the analysis should be open too
  • binder doesn’t address data sharing
    • you can put it in a github repo
      • but it’s not a wonderfully sustainable solution
    • dat sounds really cool though! http://dat-data.com
  • Question: nick had a live image render in a jupyter notebook – how do you do that?
    • the data comes off the microscope
    • goes directly to the machines in a cluster
    • crunching happens
    • then gets absorbed into html rendering in the notebook

back to brains

  • mouse VR
    • data from neurons as a mouse’s whiskers get closer or further from a wall
  • hexaworld

Q&A

  • what do you do about describing the data? where did it come from? when was it measured?
    • almost no coordination of metadata right now in neuroscience
    • I don’t know how to get two postdocs in the same lab to coordinate on data

Serah Njambi Rono, Life/death decisions powered by CSVs

  • @CallMeAlien
  • developer advocate, @CodeForAfrica
    • a civic tech organization
    • works to empower citizens by giving them access to information
  • call for action: build more tools that directly impact the communities we live in

the problem

  • access to proper healthcare is a basic human right; but the WHO estimates about a third of the world’s population has no access to the most basic medicines
  • in Kenya, quack doctors are very common
    • story: my boss (from south africa) had a business trip to kenya
      • got really sick, sought medical advice, got treated, felt better, returned to SA
      • then got even worse
      • visited his regular family doctor
      • SA requested medical records from kenyan treatment
      • when the SA doctor’s office contacted the kenyan doctor’s office, it turned out the “doctor” was in fact a vet
    • a lot of people in rural africa or south east asia struggle to access doctors
    • how sure are they that they’re seeing a registered practitioner?

the solution

  • Code For Africa collaborated with The Star, the largest blue-collar newspaper
    • http://bit.ly/starHeatlh
    • enter the name of the town you’re in
    • get a list of medical practitioners you can see, what their speciality is, what clinics they are in
  • story: a woman went to the police and reported she had been drugged and raped by an alleged gynaecologist
    • it hit the news, then many more women came forward
    • it turned out he was a quack doctor; he wasn’t even registered
    • just put up a sign
    • and people trusted him with their lives
    • public outcry
    • The Star started publicising the platform and people started using it

the data

  • Kenya Medical Practitioners and Dentists Board is the authority
    • published the list across >300 web pages
    • websites are not universally accessible
    • a lot of people still have feature phones
  • our service has an SMS interface
    • text us a request and we can tell you details about specific doctors
  • we don’t just take the data from the government; we also validate and report errors back to the government
  • it’s now been replicated by a newsroom in Nigeria
    • they’ve started adding medicine prices too

Q&A

  • is the data available too?
    • yes it’s available, I can point you to the github
  • re: sms delivery: how do people submit the names?
    • people submit a name
    • we have to do some normalization to allow for variability: “D”, “Dr”, “Doctor”, etc (see the sketch after this Q&A)
    • another issue: the database only has 11,000 doctors
      • we have 44 million people in kenya!
      • either we have only 1 doctor per 4000 people (far too low!)
      • or there are many many unregistered doctors (also bad!)
  • could you look at geographical variability? eg pockets of the country with low coverage
    • yes, definitely
  • how do you keep the data up to date?
    • the scrapers are automated
    • re-scrape on a weekly basis
    • in January this year, we realised that our scrapers themselves hadn’t been updated
    • it’s a constant gardening effort
  • have you reached out to the organization to see if you could get a data dump?
    • there’s a big trend in kenya (#dodgydoctors hashtag, and another Swahili hashtag)
    • people are calling for all government services to have SMS interfaces
    • it’s a bit complicated to get the data from the government
  • https://github.com/CodeForAfrica/theStarHealth
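
A hedged sketch of the normalization step mentioned above; the function name and exact rules are mine, not Code for Africa’s:

```js
// Collapse "D", "Dr", "Dr.", "Doctor" etc so lookups match the register
function normalizeQuery(text) {
  return text
    .trim()
    .toLowerCase()
    .replace(/^(d|dr|doctor)\.?\s+/, '') // strip title variants
    .replace(/\s+/g, ' ');               // collapse whitespace
}

normalizeQuery('Dr. Jane Mwangi');    // -> "jane mwangi" (hypothetical name)
normalizeQuery('DOCTOR Jane Mwangi'); // -> "jane mwangi"
```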