csv,conf,v2

Notes by @philandstuff, 4 May 2016

Ben Foxall, Serving CSV from the browser

  • @benjaminbenben @pusher

why do I love CSV?

  • accessibility - don’t need specialized programs
  • it’s the start of something, not the end
    • you don’t print out a csv document, you do something further with it
  • how do you get your data back from the cloud?

example: runkeeper

  • runkeeper - GPS routes of running
  • how do I get my data from it?
    • attempt #1: download
      • .zip file, containing .gpx coordinates and .csv with heart rate etc
      • 👌
      • but:
        • no format control
        • functionality might change or disappear
        • need to go online to retrieve different times
          • eg if you need a different set of dates
      • can we gather together our data and cut it up ourselves offline?
    • attempt #2: script
      • bash script, jq for json processing
      • github gist benfoxall/runkeeper-export.sh
      • 👌
        • format choices
        • sharable
        • offline
      • 👎
        • inaccessible
          • downloading a csv is easy, writing a bash script is hard
    • attempt #3: web service
      • runkeeper-to-csv.herokuapp.com
      • connect to runkeeper api, convert to csv documents
      • 👌
        • accessible
      • 👎
        • non-trivial backend
        • handling sensitive data
        • online only
    • attempt #4: serve from the browser
      • 👌
        • accessible
        • data stored locally
      • 👎

what we will implement

  • request -> process -> serve
  • a small runkeeper API
    • javascript fetch() API (it’s the new ajax, returns a promise)
    • dataForUser(..)
  • process
    • turning JSON into CSV (see the first sketch after this list)
  • serve to the user
    • data URIs
      • present csv as a data URI
  • 🚢
  • how can we make it better?
    • support bigger files
      • csv might be big
      • data uri does base64 which makes it even bigger
      • solution: Blob()
        • supported by browsers (except IE9)
        • creates an object outside the javascript stack
          • it’s also immutable
        • avoids churning through VM memory
        • can generate URLs to download Blobs
    • no persistence
      • IndexedDB (+ Dexie)
      • chrome devtools resources tab shows the IndexedDB contents
    • no permanent URLs
      • the Blob URL is only valid while the page is loaded
      • a static script can’t pull from this URL
      • Service Workers!
        • a script which runs separately from your UI thread
        • allows offline-first websites
        • can serve cached content
        • can serve synthesized responses (see the second sketch after this list)
        • proper URL, but no web request
      • 👌
        • we can cache the service worker response
        • we can serve different views on our data
          • geoJSON, csv, html
        • frontend code can use Service Workers without knowing they exist
        • all of this is now offline-capable
  • https://runkeeper-data.herokuapp.com
    • when you visit, log in with OAuth
    • service worker starts and downloads data into IndexedDB
    • continues (even if you close the tab!)
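
A minimal sketch of the request → process → serve flow above. The endpoint URL and field names are assumptions for illustration, not the talk’s actual code; fetch(), Blob and URL.createObjectURL are the standard browser APIs the talk refers to:

```js
// request: fetch() is promise-based ("the new ajax")
function dataForUser(token) {
  return fetch('https://api.runkeeper.com/fitnessActivities', { // assumed endpoint
    headers: { Authorization: 'Bearer ' + token },
  }).then(res => res.json());
}

// process: naive JSON-array-to-CSV (real code would quote/escape values)
function toCSV(rows) {
  const headers = Object.keys(rows[0]);
  const lines = rows.map(row => headers.map(h => row[h]).join(','));
  return [headers.join(','), ...lines].join('\n');
}

// serve: a Blob lives outside the JS heap and is immutable, so large CSVs
// avoid VM memory churn.
// (the data URI version would be 'data:text/csv;base64,' + btoa(csv),
//  but base64 makes big files ~33% bigger)
function csvURL(rows) {
  const blob = new Blob([toCSV(rows)], { type: 'text/csv' });
  return URL.createObjectURL(blob); // blob: URL, valid only while the page lives
}
```

And a second hedged sketch of the Service Worker idea: a fetch handler that synthesizes a CSV response at a proper URL with no web request (the /data.csv path is an assumption; in the talk the rows would come out of IndexedDB):

```js
// sw.js: runs separately from the UI thread
self.addEventListener('fetch', event => {
  const url = new URL(event.request.url);
  if (url.pathname === '/data.csv') { // assumed path
    event.respondWith(
      new Response('time,heart_rate\n0,72\n', { // synthesized response
        headers: { 'Content-Type': 'text/csv' },
      })
    );
  }
});
```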

Richard Jones, CSV as the master dataset

  • founder, Cottage Labs
  • software dev agency, higher education work

what are we trying to do?

  • bespoke information systems, quickly and cost-effectively
    • clients work with spreadsheets
    • upload into a datastore
    • people can then query the datastore and see interesting views

why?

  • humans love spreadsheets
    • especially in the non-technical world
    • tabular data is easy to work with
    • the desktop toolchain is excellent (much as we might complain)
    • we could never meet the needs that these tools meet (especially on our time and budget and skills)
  • lots of information systems are basically the same
    • most of the differences are the kind of data being worked on
    • workflows exist, but they happen in the admin area
  • admin areas are expensive and boring to build
    • lots of web forms – create/edit/delete record
    • I’d like it if people could manage their data outside of the admin system
  • data visualisation, data science, data journalism are all in
    • but also specialist domains and outside the reach of small organizations
    • (I’m not a data specialist - no machine learning or stats – but I can help the client cut up their data)
  • we find ourselves doing much the same thing over and over again

the weird things people do with spreadsheets

  • they put blurb above their header rows
    • the actual table of data is a few rows down
    • a spreadsheet is a document, not a dataset
  • they colour cells in, with the colour carrying meaning
    • this disappears on export-to-csv
    • the form-vs-function distinction isn’t clear when seeing a spreadsheet as a document
  • sloppy with hard formats (like numbers)
    • eg -£1,00,0000.0
  • they break boundaries of acceptable use for typed fields
    • eg cost column containing “$100 to about 200”
    • data models are brittle, humans are flexible

how do we read a spreadsheet?

  • decode the bits
    • welcome to encoding hell!
      • excel might give you latin-1 or Windows-1252 (similar, but not the same!)
      • excel/numbers might give you MacRoman on OS X
      • Calc will hopefully give you UTF-8
      • any of them could do any one of hundreds of encodings
    • some encodings are interchangeable, but the newline character is not a common link
    • we check we’ve actually got a rectangular dataset for confidence
  • read the data
    • ignore supporting documentation above the dataset
    • translate the header rows
    • trim content, ignore empty values, and “N/A” values
    • coerce data into something cleaner (“£1,000” -> 1000; see the sketch after this list)
    • we’re not scrubbing the data, just allowing for the humanity in the book-keeping
    • output: JSON
  • make it queryable
    • Elasticsearch
  • publish interactive interfaces
    • javascript frontend on top of elasticsearch query engine
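
A hedged sketch of the “read the data” step above: trim, drop empties and “N/A”, and coerce hard formats like “£1,000” into numbers. The function name is mine, not Cottage Labs’ actual code:

```js
// Illustrative cell cleaner (an assumption, not Cottage Labs' code)
function cleanCell(raw) {
  if (raw == null) return null;
  const value = String(raw).trim();
  if (value === '' || value.toUpperCase() === 'N/A') return null;
  // Coerce "hard" formats: allow for the humanity in the book-keeping
  // without scrubbing the data
  const numeric = value.replace(/[£$,\s]/g, '');
  if (/^-?\d+(\.\d+)?$/.test(numeric)) return Number(numeric);
  return value; // leave anything else as text
}

cleanCell(' £1,000 ');          // -> 1000
cleanCell('N/A');               // -> null
cleanCell('$100 to about 200'); // -> "$100 to about 200" (humans are flexible)
```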

other work we’ve done

  • open access spectrum
  • lantern (CSV-only interface)

what’s hard?

  • some data is hard to represent in spreadsheets
    • hierarchical or highly relational data
    • don’t make people use a spreadsheet the way we’d use a database!
  • consistent use of dictionary terms
    • if the spreadsheet maintainers can use consistent names for things, like Countries, it can make things much easier

tech roll call

  • we’re not trying to duplicate: open refine, trifacta, tableau
  • things we do use:
    • d3 + nvd3
    • elasticsearch
    • objectpath (xpath-like language for JSON)
  • things we tried but aren’t currently using
    • highcharts
    • tablib

Q&A

  • what’s your largest elasticsearch dataset? largest index?
    • 2.5 million records; 25GB

Mathias Buus (filling in for Karissa), distributing open data with dat

what is dat?

  • http://dat-data.com/
  • open source project for sharing open data
  • funded by Alfred P Sloan foundation
  • meetings are open youtube hangouts
  • 3 person team
  • >800 modules on npm
    • around half a percent of all npm modules!
  • dat is a p2p file sharing network
  • written in javascript
  • works in browser
  • move the data to the code (don’t move your code to your data)
  • data is just files
  • you don’t need all the files
  • move just the files you need to the code
  • similar to BitTorrent
  • install: npm install dat

sharing data

  • dat link ~/big-file.csv
    • creates a content-addressable link dat://9620fb285...
    • can give the link to a friend, then they run dat dat://9620fb285... and automatically discover you and start downloading the dataset

how does it work?

  • split file into chunks which are unlikely to change
    • git does one-chunk-per-line
    • if I change one line, I only have to sync that one line, even if the file is large
    • only works for text files
  • rabin fingerprinting (content-defined chunking; see the sketch after this list)
    • scans through the file and creates chunks based on actual file content
    • if you insert something in the middle, a rabin fingerprint will create the same chunks on each side of the change
  • npm install rabin
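
A toy sketch of content-defined chunking. The real thing uses Rabin polynomial fingerprints (the npm rabin module does this natively); this stand-in uses a simpler Rabin-Karp-style rolling hash over a fixed window, which still shows the key property: a boundary depends only on the bytes near it, so inserting data in the middle leaves the chunks on either side unchanged:

```js
// Toy content-defined chunker (illustration only, not the rabin module's API)
const B = 31;       // rolling-hash base
const M = 0xffffff; // hash modulus
const W = 32;       // window size in bytes
const MASK = 0xfff; // => average chunk size around 4KB

// B^W mod M, precomputed so a byte can "leave" the window
let BW = 1;
for (let i = 0; i < W; i++) BW = (BW * B) % M;

function chunkBoundaries(buf) {
  const boundaries = [];
  let h = 0;
  for (let i = 0; i < buf.length; i++) {
    h = (h * B + buf[i]) % M;                            // byte enters the window
    if (i >= W) h = (h - (buf[i - W] * BW) % M + M) % M; // byte leaves the window
    // the cut decision looks only at the last W bytes of content,
    // so edits elsewhere in the file can't move this boundary
    if (i >= W && (h & MASK) === MASK) boundaries.push(i + 1);
  }
  if (boundaries[boundaries.length - 1] !== buf.length) boundaries.push(buf.length);
  return boundaries;
}
```

Sync then only needs to transfer the chunks a peer doesn’t already have.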

demo

Zara Rahman, Bridging the gap: tech <-> activism

  • @zararah

background

  • open knowledge, school of data, engineroom
  • bridging gaps between communities who don’t talk to each other, or people who do talk but in different ways

responsible data program

using sensitive data

  • Physicians for Human Rights
    • programme on sexual violence in conflict zones
    • lots of victims don’t come forward to report
    • even when they do, there are challenges in recording accurately
    • Kenya and eastern Democratic Republic of the Congo
    • MediCapt
      • standardising data collection
      • digitising data collection
        • mobile network penetration is very high, but the data is sensitive
      • iterating upon tool choice
        • tried an off-the-shelf tool, piloted, found it too cumbersome
        • developed a new tool, user research with people on the ground
      • reality check
        • evaluate at the end
        • start all over again and iterate
        • slow development
  • Sharing reports of violence
    • a non-profit wanted to support a community which faces a lot of violence
    • they weren’t particularly experienced in technology
    • started thinking of developing an app
      • report a perpetrator of violence to anyone in the area
    • legal, privacy issues
      • can’t have PII because this is an allegation
      • but without PII the report isn’t that useful
      • need to tread a fine line
    • future proofing
      • data minimization
      • don’t want to hold data which could in future put people at risk
      • people were put off from using app if they had to give too much information
    • collaboration
    • launch

analysing data

HRDAG

  • Human Rights Data Analysis Group
  • https://hrdag.org
  • data on casualties in Syria
    • listing different groups documenting
  • “Numbers are only human”
    • how do you categorise civilian vs military death?
    • how do you categorise death due to conflict vs “natural causes”?
  • should you use exact (but uncertain) figures to draw attention to causes?

data in the Ebola response

  • http://cis-india.org/papers/ebola-a-big-data-disaster
  • in some countries there was a push to release Call Detail Records (CDRs) from mobile companies
  • getting access to the data
    • in Sierra Leone and Guinea, they released this data; in Liberia they didn’t
  • decision-making
    • the call was to have the data anonymised
      • but: it’s hard to anonymise such detailed information
      • and: in the Ebola response, the data is most useful when it can be linked to real personal identities
    • privacy rights weren’t respected
  • digital infrastructure

questions to ask yourself

  • what might an adversary do with your data?
    • not necessarily your adversary
    • what malicious things could they do with your data and how might they gain from that?
    • what would happen then?
  • what’s your holistic security plan?
  • what does informed consent look like for your users?
    • if you know that no one’s reading your Ts & Cs
    • are you making things visible that your users should know about?
  • what levels of technical literacy do your users have?
  • in your team, whose job is it to think about the ethics?

conclusion

  • tech & data projects can have unintended consequences, even when well-intentioned

Q&A

  • do you have examples where they managed to embed context with the data
    • the MediCapt team found the context crucial
    • the HRDAG work has lots of asides and nuanced explanations
      • they’re very careful about what they say, though they are probably more sure about their findings than many other groups
  • this reminds me of an app for reporting requests for bribes. how do organizations share anonymised data securely?

Jeni Tennison, Making CSV part of the web

  • Technical Director, ODI

the dream

motivating example: election data

  • data on wikipedia about the last local elections
    • data table + map
  • all of this is hardcoded behind the scenes in table rows
  • if you want to get hold of the data, you need to parse the html
  • election results are often entered on wikipedia really quickly
    • it’d be really cool to be able to get them out quickly too
  • it’s also not great to have the same data duplicated
  • could we reference the CSV data directly?
  • we can do it with images <img src="url://">; why not with tables of data?
  • <table src="uk-local-election-summary-2015.csv">
  • reference source for party-to-colour mapping
    • could bring it into your maps and tables

benefits

  • it would help people presenting data
  • improve quality of data available for us
    • motivate machine-readable data
    • motivate fixing of errors
      • visualisations of your tabular data demonstrate errors very quickly!
    • motivate publishers to give accurate metadata

getting to a standard

  • CSV on the Web @ W3C completed 2016
  • building on and learning from:
    • OKFN’s data packages / Tabular Data Format
    • Google’s Dataset Publishing Language
    • national archives validation
    • existing CSV parsers
    • broad set of documented use cases & requirements

the difficult bits

discovering metadata

  • CSV needs metadata
    • “these columns contain numbers”
    • “this column should be displayed as +/-”
    • “this column is a pointer to this other table”
    • the metadata needs to be in a separate file
  • CSVW metadata standard
  • people want to download CSV, not a zipped-up package or the JSON metadata
    • the JSON metadata has a link to the CSV so you could discover it (in principle)
    • but: normal people won’t do this
    • the link generally needs to be to the CSV file itself
    • how do we find the metadata?
    • RFC 5988 link:
Link: <metadata.json>; rel="describedBy"; type="application/csvm+json"
  • often can’t control Link: headers though
    • default filenames
      • just add -metadata.json to the end of the csv file’s name (see the sketch after this list)
    • there’s some geeky stuff about /.well-known if you care about that
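
A hedged sketch of that discovery logic: honour the Link header where the publisher controls it, otherwise fall back to the -metadata.json filename convention (a real client would also check the /.well-known location mentioned above):

```js
// Find the CSVW metadata for a CSV URL (illustrative, not a spec-complete client)
async function findMetadata(csvUrl) {
  const res = await fetch(csvUrl, { method: 'HEAD' });
  const link = res.headers.get('Link');
  if (link) {
    // e.g. Link: <metadata.json>; rel="describedBy"; type="application/csvm+json"
    const match = link.match(/<([^>]+)>\s*;[^,]*rel="describedBy"/);
    if (match) return new URL(match[1], csvUrl).href;
  }
  return csvUrl + '-metadata.json'; // default filename convention
}
```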

linking between CSVs

  • hard links
    • you can (if you want) think of CSVs as being like a relational database
    • foreign key relationships in your metadata (included in the example metadata below)
  • soft links

machine/human readability

  • CSV is on the boundary between these two worlds
  • human variability in CSV headers
    • “country” vs “Country”
    • “unemployment” vs “Unemployment rate”
    • CSVW metadata standard allows you to give different options for titles and indicate they mean the same thing (see the example after this list)
    • locale-specific variation
      • {en:country, de:Land}
  • formats for dates and numbers
    • use standard number & date formats
      • Unicode Technical Standard #35
    • minimal set that MUST be implemented
      • nothing that requires actually knowing languages
      • eg names of months, currency units
    • Implementations can do more
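
Pulling the last two sections together, a hedged example of what the CSVW metadata might look like: multilingual titles, a UTS #35 number format, and a hard link (foreign key) to another CSV. Property names follow the CSVW spec as I recall it; treat this as illustrative:

```js
const metadata = {
  "@context": "http://www.w3.org/ns/csvw",
  "url": "uk-local-election-summary-2015.csv",
  "tableSchema": {
    "columns": [
      { "name": "country",
        "titles": { "en": "country", "de": "Land" } },          // locale variants
      { "name": "unemployment",
        "titles": ["unemployment", "Unemployment rate"],        // both headers accepted
        "datatype": { "base": "decimal", "format": "#,##0.0" } } // UTS #35 pattern
    ],
    "foreignKeys": [{
      "columnReference": "country",                             // hard link between CSVs
      "reference": { "resource": "countries.csv", "columnReference": "code" }
    }]
  }
};
```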

what’s next

  • Implementations
    • validation
    • conversion
      • into JSON and into RDF
    • authoring metadata
    • not yet for display
      • tables, maps, etc
      • it’d be really cool to have some web component type stuff
      • <table src="...">
    • annotation?
    • navigation?
  • https://www.w3.org/TR/tabular-data-primer

Q&A

  • sometimes there’s a value in the header (eg “election results 2014”). how do you deal with that?
    • there is a facility for “virtual columns” for static information

Matt Chadburn, Democratising data at the FT

  • principal engineer, FT

about the FT

  • 800,000 subscribers
  • company licences

users of data

  • page analytics
    • education
    • when do you remove something from the front page because it’s becoming stale?
  • email communication with users

summary

  • focus on the user’s needs
  • learnable
  • ease of use (APIs to get stuff in and out)
  • iterative

Mouse Reeve, Grimoires, Demonology and Databases

  • I work for the Internet Archive, but I’m not here to talk about that
  • @tripofmice
  • grimoire.org

what is a grimoire?

  • a book of magic spells and invocations - OED
  • scope for this talk: 16th and 17th century, European Christian tradition
  • in this time:
    • no clear divide between magic, religion, science
    • cunning folk prevalent in Europe
      • “low” magic
      • common people, often illiterate
      • medicine, divination, folk magic
    • ceremonial magic
      • “high” magic
      • summoning angels, demons, spirits, fairies
      • piously christian (sometimes, at least)
    • witchcraft
      • capital offence
      • nobody self-identifies as a witch
  • what’s okay vs a capital offence? what’s for scholars vs common people? it’s a bit woolly
  • England, 1580
    • Queen Elizabeth I
    • John Dee
      • some of his magical items are now in the British Museum
    • William Shakespeare
      • Prospero from The Tempest (based on John Dee?)
      • Oberon from A Midsummer Night’s Dream
        • grimoires offered spells to summon Oberon
    • Pseudomonarchia Daemonum (1577)
    • Lesser Key of Solomon (1641)
  • King Solomon’s Temple
    • Solomon was able to summon, control, and use demons to help build his temple, aided by the archangel Gabriel

demons

  • examples:
    • agares
    • crocell
    • buer
  • every demon is given a sigil, which is a calling card used to summon them
  • summoning a demon is really involved
    • elaborate circles
    • if you get it wrong, you might get eaten
  • crocell’s powers:
    • make it sound like it’s raining
    • run you a warm bath
    • teach you geometry
    • that’s it!

what I want to know

  • what are grimoires for?
    • how do they get used?

how I did it

  • it’s tough to model in a relational database
  • lots of many-many relationships (eg demon <-> grimoire)
    • join tables
  • I used neo4j to model this as a graph problem (query sketch after this list)
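
A hedged sketch of what a query over that graph might look like, using the neo4j-driver package; the labels, relationship name and properties are my guesses, not the actual grimoire.org schema:

```js
const neo4j = require('neo4j-driver');

const driver = neo4j.driver('bolt://localhost:7687',
                            neo4j.auth.basic('neo4j', 'password'));

async function demonsIn(title) {
  const session = driver.session();
  try {
    // many-to-many demon <-> grimoire without join tables:
    // the relationship *is* the data model
    const result = await session.run(
      `MATCH (d:Demon)-[:APPEARS_IN]->(g:Grimoire {title: $title})
       RETURN d.name AS name`,
      { title }
    );
    return result.records.map(r => r.get('name'));
  } finally {
    await session.close();
  }
}

demonsIn('Lesser Key of Solomon').then(console.log); // e.g. [ 'Agares', 'Crocell', 'Buer', ... ]
```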

spells!

  • eg: glue to fix a porcelain vase (?!)

graph data structures

  • advantages:
    • designed for relationships & connections
    • flexible
    • no migrations
  • disadvantages
    • no schema for consistency
    • non-performant for simple tabular data
  • common use cases
    • social networks
    • public transport systems

results

Q&A

  • do any of these demons appear in paintings?
    • don’t know
  • what did people use these grimoires for?
    • hard to know
  • do you have a way to tell how comprehensive your dataset is?
    • the complete dataset is borderline infinite
    • there’s a finite number of grimoires that have survived and been translated into english
  • you mentioned neo4j for pictorial representation. anything else for this purpose?
    • no
    • I have tables of spells and a timeline, but not much else in terms of data visualisation
  • could you use this dataset to perform unsupervised learning to generate new spells or demons?
    • sure why not

Sarah Gold, keynote: designing for data

  • @sarahtgold
  • @projectsbyif

my background

  • government, politics, civics, …
  • GDS
  • currently: IF
    • a design studio
    • we make things that change how people think about data
    • we are multidisciplinary
      • product development
      • design
      • security
    • we understand technology and design as disciplines which inform each other
    • everything we do is centred on people
      • people who understand the things they use make better decisions about how to use them

problem space

  • more things are becoming data conscious
    • more data being collected
    • more things being connected to the internet
    • it’s never been so cheap to put a chip in it
    • IoT
    • Internet of Shit
      • @InternetOfShit
        • there’s a lot of nonsense
  • we are producing a lot of personal data
    • phones, laptops, fitbits, etc
    • data maximalism
  • Ts & Cs are our default consent model
    • and they don’t work
    • samsung smart TV privacy policy: “Don’t talk in front of the TV”
  • objects are becoming informants
  • we don’t know if something is working properly
  • software is politics – Richard Pope

monitoring & testing

  • gherkin syntax
  • makerversity

design for data

  • design for minimum viable data
  • know which data type you’re designing with

consent models

Q&A

  • the more informed people are about the implications of tracking, the more likely they are to say no; how do companies which provide free services deal with this?
    • it’s very complicated
    • ad blockers
    • not enough time to do this justice
    • with instances like royal parks, they could give their patrons information about how useful their data has been

Jenny Bryan, keynote: spreadsheets 😱

  • professor of statistics at UBC
  • @JennyBryan @STAT545

spreadsheets!

  • it’s nice to be allowed to talk about spreadsheets for once
  • people like to moan about them
  • slides (with references!) https://github.com/jennybc/2016-05_csvconf-spreadsheets
  • inspiration: csv,conf,v1 talk Felienne Hermans “Spreadsheets are code”
  • it’s okay to care about spreadsheets!
  • how I pick people to work with:

“some of my best friends use spreadsheets”

  • inequality is toxic in a whole lot of contexts
    • in this case: ability to do what you want with data
    • there’s this “data 1%”
    • anything we want to do, we know how, or how to figure it out, or how to find someone who knows
    • lots of people I teach at UBC are much less able to get these things done, feel paralysed
    • down with software elitism
    • up with the last mile of data munging
  • I supported myself for ~4 years doing spreadsheets
    • I was doing a management consulting gig
    • during grad school I supported myself doing high-end excel work
    • there’s a lot you can do with these consumer-level tools
    • I’d like to create a more porous border between spreadsheets and R/python/etc
  • https://twitter.com/tomaspetricek/status/687947134088392704
    • “Ouch. “50 million accountants use monads in Excel. They just don’t go around explaining monads to everyone…” @Felienne #ndclondon”
  • reactivity is one of the main things people love about spreadsheets
    • spreadsheets have pushed computer science to deal with reactivity
    • I was talking on a podcast about the future of spreadsheets and whether they will go away; I felt reactivity was key
    • with R, I write a Makefile to rebuild everything from scratch
      • but I still have to kick this thing
  • spreadsheets also have less syntax bullshittery
    • argument names, separators, etc
    • you can just select things with your mouse and click “average”
  • FACTS!
    • about 1 billion people use MS Office
    • about 650 million people use spreadsheets
    • up to half use formulas
    • 250k - 1m use R
    • 1-5m use Python
  • you go into data analysis with the tools you know, not the tools you need

crazy spreadsheet stories

  • what you think people are doing ≠ what you think people should be doing ≠ what people are actually doing
  • most tools are designed for the middle thing (what you think people should be doing)
  • The Enron Corpus
    • “the Pompeii of spreadsheets”
    • 600k emails
    • 15k spreadsheets
  • example:
    • some cells are data
    • some are formulas
    • some are phone numbers
    • visualizations
    • spreadsheets within spreadsheets (ie a rectangular group of cells)
    • Hermans, Murphy-Hill (research paper on the corpus)
  • lots of colour
    • data and formatting blurred together
    • font choice and colour of cell gives you a categorical variable
  • inconsistency between rows and columns
  • references to other spreadsheets, that you don’t have
  • columns of intermediate computations are so boring, so they get hidden
  • http://xkcd.com/1667/

what makes spreadsheets so vexing?

  • machine readable & human readable
    • (see JeniT’s keynote further up)
    • a spreadsheet is often neither machine nor human readable
      • technically, yes you can open them and look at them
      • but a machine cannot get useful data out in an unsupervised, scalable way
      • and a human reading someone else’s spreadsheet is like reading another person’s codebase
    • spreadsheets are (data ∩ formatting ∩ programming logic)
      • but often we only care about one or two of these concerns
      • (can we separate them after the fact?)

how do we fix this?

  • what are the problems?
  • which ones can we solve?
    • with training?
      • sometimes people use spreadsheets for inappropriate things and we can train them to stop it
    • with tooling?
      • (just a subset; not all problems can be solved with tooling)
  • two angles:
    • create new spreadsheet implementations that use, eg, R or python for computation and visualization
      • anticipate version control, collaboration
      • AlphaSheets
      • stencila
    • accept spreadsheets as they are
      • create tools to get goodies out
      • maybe write back into sheets?
  • the googlesheets R package
    • (google sheets are much less common than excel, but they’re still reasonably common)
  • goal: spreadsheet reading tools in R
    • with no non-R dependency
  • Book: Spreadsheet Implementation Technology

Q&A

  • what are the interesting differences between excel and google sheets (for ingesting data)
    • the excel spec is 6000 pages long; the google sheets spec is 0 pages long
    • I wish there was something in between
    • they’re both very verbose xml
    • not really big differences in parsing
    • google sheets has to chase excel and be super compatible with excel


Rufus Pollock and Dan Fowler, Frictionless Data

motivation

  • getting UK government to publish data on all their spending
    • in CSV format
    • with a spec
      • defined columns
  • but: problems
    • a 401 html error page saved as csv :/
  • friction
  • containerization for data (example descriptor after this list)
    • docker docker docker
  • key principles
    • simplicity
    • web oriented
    • existing tools
    • open
  • validation
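
A hedged example of the “container” idea: a Tabular Data Package wraps a plain CSV in a small descriptor (datapackage.json) that declares the columns; the field names follow the Frictionless Data spec as I recall it:

```js
const datapackage = {
  "name": "uk-government-spending",
  "resources": [{
    "path": "spending.csv",     // the data stays a plain CSV
    "schema": {                 // the "spec with defined columns"
      "fields": [
        { "name": "department", "type": "string" },
        { "name": "amount",     "type": "number" },
        { "name": "date",       "type": "date" }
      ]
    }
  }]
};
```

A validator can then check each incoming file against the schema, and reject, say, a 401 html error page saved as csv.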

Darren Barnes, Data Baker: Pretty Spreadsheets to Useful CSVs

  • a success story from the previous csv,conf

Context

  • ONS produces thousands of spreadsheets each year on our website
    • we’re getting more efficient at it
    • the underlying structures no longer exist for us to get that data in a machine-readable way
    • we’ve gotten so good at producing these spreadsheets but neglected the source data
    • we have CSVs, but “we can’t publish that on the website”
      • I can’t do my aggregation in there
  • how do we get to a point where we publish CSVs?

history

  • scraperwiki + ONS at csv,conf,v1
  • Dragon Dave McKee’s talk on XYPath
  • version 1
    • python
    • command-line
    • not pretty but functional
  • example
    • spreadsheet with merged cells, multiple tabs, hidden columns, etc etc (see Jenny Bryan’s keynote above)
    • we set up some recipes to instruct Data Baker:
      • what files we want to look at
      • where the data is
      • what transformations we want to do
    • run the command
      • slurp in the .xls files
      • generates some output .xls files
      • one output: a colour-coded .xls file to show how the data was sliced up
        • sanity check to make sure we’re doing it right
  • code! https://github.com/scraperwiki/databaker

Jeremy Freeman, open source neuroscience

  • the Janelia Research Campus (“the bell labs of neuroscience”)
    • northern Virginia
    • research institute, non-profit funded

motivations: why do we study the brain?

  • there’s a lot we don’t know
    • try talking to fifth graders!
    • “how is it that I can hear a phone number and the next day I still remember that phone number?”
    • “why do I always dream about robots and dinosaurs?”
  • mice as a model
    • two-photon imaging

using data

  • we often want to analyse data as quickly as possible to drive decisions about what experiment to do next
  • random access two photon mesoscope
  • rich data patterns of brain activity
  • the 80/20 problem
    • time spent doing incredible measurements
    • time spent doing other stuff
    • used to be 80% data gathering & experimental research; 20% analysis
    • now, it’s all changed; only 20% doing actual science
  • analysis isn’t a linear process
    • lots of backtracking and dead ends
    • lots of reinventing the wheel between different labs
      • no sharing of infrastructure
      • often no source control
  • goal: lots of modules that solve well-defined small problems, that can be glued together
    • eg thunder project & bolt-project
    • thunder: a collection of modules for image and time series data analysis
    • neurofinder.codeneuro.org
      • analysing a picture and determining which groups of pixels correspond to neurons
      • a really common neuroscience problem!
      • but every lab has come up with their own independent way of doing it
      • website to allow people to submit results from their algorithms (against training and testing datasets)
      • (Question: why didn’t you use Kaggle?
        • this seemed like a simple enough problem to solve for ourselves rather than buying into the Kaggle space
        • we originally thought about having people submit code and run it in a container, but running MATLAB in a container is somewhere between difficult and illegal)
  • lightning-viz.org – modular visualization things
  • https://github.com/mikolalysenko/regl
    • webgl and 3d is a really important part of the future of scientific visualization
  • the 1 to 2 problem:
    • starting collaboration between two individuals
    • jupyter notebooks
  • https://github.com/sofroniewn/tactile-coding
    • github is great for sharing code (and to some degree, data)
    • it doesn’t solve the problem of making an environment usable on someone else’s machine
    • can we use things like docker to take jupyter notebooks and data and code and bundle them all together?
      • that would be good: previously we had to repeat the complex process each time
  • mybinder.org
    • tell us a github repo
      • has to have a certain set of contents
        • code needed to run your notebooks
        • some metadata
        • (not required: a complete Dockerfile)
      • builds a docker image
      • then embed a button in your github repo
        • the button launches into a running environment
    • what’s the value in being able to reproduce someone else’s analysis?
      • if someone can rerun this and, as a result, start a collaboration, that’s really cool
  • BuzzFeed made a binder to analyse refugee data
    • data relevant for policy decisions: we should have access
    • the analysis should be open too
  • binder doesn’t address data sharing
    • you can put it in a github repo
      • but it’s not a wonderfully sustainable solution
    • dat sounds really cool though! http://dat-data.com
  • Question: nick had a live image render in a jupyter notebook – how do you do that?
    • the data comes off the microscope
    • goes directly to the machines in a cluster
    • crunching happens
    • then gets absorbed into html rendering in the notebook

back to brains

  • mouse VR
    • data from neurons as a mouse’s whiskers get closer or further from a wall
  • hexaworld

Q&A

  • what do you do about describing the data? where did it come from? when was it measured?
    • almost no coordination of metadata right now in neuroscience
    • I don’t know how to get two postdocs in the same lab to coordinate on data

Serah Njambi Rono, Life/death decisions powered by CSVs

  • @CallMeAlien
  • developer advocate, @CodeForAfrica
    • a civic tech organization
    • works to empower citizens by giving them access to information
  • call for action: build more tools that directly impact the communities we live in

the problem

  • access to proper healthcare is a basic human right; but the WHO estimates about a third of the world’s population has no access to the most basic medicines
  • in Kenya, quack doctors are very common
    • story: my boss (from south africa) had a business trip to kenya
      • got really sick, sought medical advice, got treated, felt better, returned to SA
      • then got even worse
      • visited his regular family doctor
      • SA requested medical records from kenyan treatment
      • when the SA doctor’s office contacted the kenyan doctor’s office, it turned out the “doctor” was in fact a vet
    • a lot of people in rural africa or south east asia struggle to access doctors
    • how sure are they that they’re seeing a registered practitioner?

the solution

  • Code For Africa collaborated with The Star, the largest blue-collar newspaper
    • http://bit.ly/starHeatlh
    • enter the name of the town you’re in
    • get a list of medical practitioners you can see, what their speciality is, what clinics they are in
  • story: a woman went to the police and reported she had been drugged and raped by an alleged gynaecologist
    • it hit the news, then many more women came forward
    • it turned out he was a quack doctor; he wasn’t even registered
    • just put up a sign
    • and people trusted him with their lives
    • public outcry
    • The Star started publicising the platform and people started using it

the data

  • Kenya Medical Practitioners and Dentists Board is the authority
    • published the list across >300 web pages
    • websites are not universally accessible
    • a lot of people still have feature phones
  • our service has an SMS interface
    • text us a request and we can tell you details about specific doctors
  • we don’t just take the data from the government; we also validate and report errors back to the government
  • it’s now been replicated by a newsroom in Nigeria
    • they’ve started adding medicine prices too

Q&A

  • is the data available too?
    • yes it’s available, I can point you to the github
  • re: sms delivery: how do people submit the names?
    • people submit a name
    • we have to do some normalization to allow for variability: “D”, “Dr”, “Doctor”, etc (see the sketch after this Q&A)
    • another issue: the database only has 11,000 doctors
      • we have 44 million people in kenya!
      • either we have only 1 doctor per 4000 people (far too low!)
      • or there are many many unregistered doctors (also bad!)
  • could you look at geographical variability? eg pockets of the country with low coverage
    • yes, definitely
  • how do you keep the data up to date?
    • the scrapers are automated
    • re-scrape on a weekly basis
    • in January this year, we realised that our scrapers themselves hadn’t been updated
    • it’s a constant gardening effort
  • have you reached out to the organization to see if you could get a data dump?
    • there’s a big trend in kenya (#dodgydoctors hashtag, and another Swahili hashtag)
    • people are calling for all government services to have SMS interfaces
    • it’s a bit complicated to get the data from the government
  • https://github.com/CodeForAfrica/theStarHealth
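
A hedged sketch of the normalization step mentioned above; the function name and exact rules are mine, not Code for Africa’s:

```js
// Collapse "D", "Dr", "Dr.", "Doctor" etc so lookups match the register
function normalizeQuery(text) {
  return text
    .trim()
    .toLowerCase()
    .replace(/^(d|dr|doctor)\.?\s+/, '') // strip title variants
    .replace(/\s+/g, ' ');               // collapse whitespace
}

normalizeQuery('Dr. Jane Mwangi');    // -> "jane mwangi" (hypothetical name)
normalizeQuery('DOCTOR Jane Mwangi'); // -> "jane mwangi"
```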