Ben Foxall, Serving CSV from the browser

  • @benjaminbenben @pusher

why do I love CSV?

  • accessibility - don’t need specialized programs
  • it’s the start of something, not the end
    • you don’t print out a csv document, you do something further with it
  • how do you get your data back from the cloud?

example: runkeeper

  • runkeeper - GPS routes of running
  • how do I get my data from it?
    • attempt #1: download
      • .zip file, containing .gpx coordinates and .csv with heart rate etc
      • 👌
      • but:
        • no format control
        • functionality might change or disappear
        • need to go online to retrieve different times
          • eg if you need a different set of dates
      • can we gather together our data and cut it up ourselves offline?
    • attempt #2: script
      • bash script, jq for json processing
      • github gist benfoxall/
      • 👌
        • format choices
        • sharable
        • offline
      • 👎
        • inaccessible
          • downloading a csv is easy, writing a bash script is hard
    • attempt #3: web service
      • connect to runkeeper api, convert to csv documents
      • 👌
        • accessible
      • 👎
        • non-trivial backend
        • handling sensitive data
        • online only
    • attempt #4: serve from the browser
      • 👌
        • accessible
        • data stored locally
      • 👎

what we will implement

  • request -> process -> serve
  • a small runkeeper API
    • javascript fetch() API (it’s the new ajax, returns a promise)
    • dataForUser(..)
  • process
    • turning JSON into CSV
  • serve to the user
    • data URIs
      • present csv as a data URI
  • 🚢
  • how can we make it better?
    • support bigger files
      • csv might be big
      • data uri does base64 which makes it even bigger
      • solution: Blob()
        • supported by browsers (except IE9)
        • creates an object outside the javascript stack
          • it’s also immutable
        • avoids churning through VM memory
        • can generate URLs to download Blobs
    • no persistence
      • IndexedDB (+ Dexie)
      • chrome devtools resources tab shows the IndexedDB contents
    • no permanent URLs
      • the Blob URL is only valid while the page is loaded
      • a static script can’t pull from this URL
      • Service Workers!
        • a script which runs separately from your UI thread
        • allows offline-first websites
        • can serve cached content
        • can serve synthesized responses
        • proper URL, but no web request
      • 👌
        • we can cache the service worker response
        • we can serve different views on our data
          • geoJSON, csv, html
        • frontend code can use Service Workers without knowing they exist
        • all of this is now offline-capable
    • when you visit, log in with OAuth
    • service worker starts and downloads data into IndexedDB
    • continues (even if you close the tab!)
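
The request → process → serve pipeline above can be sketched like this (the record fields are hypothetical stand-ins for the Runkeeper data; the CSV, data-URI and Blob mechanics are the point):

```javascript
// process: turn an array of JSON records into one CSV string
function toCSV(records) {
  const headers = Object.keys(records[0]);
  const escape = (v) =>
    /[",\n]/.test(String(v))
      ? '"' + String(v).replace(/"/g, '""') + '"'
      : String(v);
  const rows = records.map((r) => headers.map((h) => escape(r[h])).join(","));
  return [headers.join(","), ...rows].join("\n");
}

// serve, take 1: a data URI -- note the base64 step inflates the CSV by ~33%
function toDataURI(csv) {
  const b64 = Buffer.from(csv, "utf8").toString("base64"); // btoa(csv) in a browser
  return "data:text/csv;base64," + b64;
}

// serve, take 2 (browser only): a Blob URL skips base64 and keeps the bytes
// outside the JS heap:
//   const url = URL.createObjectURL(new Blob([csv], { type: "text/csv" }));

// hypothetical records standing in for the Runkeeper API response
const csv = toCSV([
  { date: "2016-05-03", distance_km: 5.2, heart_rate: 148 },
  { date: "2016-05-05", distance_km: 8.1, heart_rate: 152 },
]);
```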

Richard Jones, CSV as the master dataset

  • founder, Cottage Labs
  • software dev agency, higher education work

what are we trying to do?

  • bespoke information systems, quickly and cost-effectively
    • clients work with spreadsheets
    • upload into a datastore
    • people can then query the datastore and see interesting views


  • humans love spreadsheets
    • especially in the non-technical world
    • tabular data is easy to work with
    • the desktop toolchain is excellent (much as we might complain)
    • we could never meet the needs that these tools meet (especially on our time and budget and skills)
  • lots of information systems are basically the same
    • most of the differences are the kind of data being worked on
    • workflows exist, but they happen in the admin area
  • admin areas are expensive and boring to build
    • lots of web forms – create/edit/delete record
    • I’d like it if people could manage their data outside of the admin system
  • data visualisation, data science, data journalism are all in
    • but also specialist domains and outside the reach of small organizations
    • (I’m not a data specialist - no machine learning or stats – but I can help the client cut up their data)
  • we find ourselves doing much the same thing over and over again

the weird things people do with spreadsheets

  • they put blurb above their header rows
    • the actual table of data is a few rows down
    • a spreadsheet is a document, not a dataset
  • they colour cells in, with the colour carrying meaning
    • this disappears on export-to-csv
    • the form-vs-function distinction isn’t clear when seeing a spreadsheet as a document
  • sloppy with hard formats (like numbers)
    • eg -£1,00,0000.0
  • they break boundaries of acceptable use for typed fields
    • eg cost column containing “$100 to about 200”
    • data models are brittle, humans are flexible

how do we read a spreadsheet?

  • decode the bits
    • welcome to encoding hell!
      • excel might give you ISO-8859-1 or Windows-1252 (not the same!)
      • excel/numbers might give you MacRoman on OSX
      • Calc will hopefully give you UTF-8
      • any of them could do any one of hundreds of encodings
    • some encodings are interchangeable, but the newline character is not a common link
    • we check we’ve actually got a rectangular dataset for confidence
  • read the data
    • ignore supporting documentation above the dataset
    • translate the header rows
    • trim content, ignore empty values, and “N/A” values
    • coerce data into something cleaner (“£1,000” -> 1000)
    • we’re not scrubbing the data, just allowing for the humanity in the book-keeping
    • output: JSON
  • make it queryable
    • Elasticsearch
  • publish interactive interfaces
    • javascript frontend on top of elasticsearch query engine
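
A minimal sketch of the reading steps above, assuming rows arrive as arrays of cells (the rules and header names are illustrative, not Cottage Labs' actual pipeline):

```javascript
// coerce a messy human value into something cleaner; values that don't fit
// a hard format (e.g. "$100 to about 200") are left alone as strings
function coerce(raw) {
  const v = String(raw).trim();
  if (v === "" || v.toUpperCase() === "N/A") return null;
  const money = v.match(/^[£$€]\s*([\d,]+(?:\.\d+)?)$/);
  if (money) return Number(money[1].replace(/,/g, "")); // "£1,000" -> 1000
  return v;
}

// rows: arrays of cells, possibly with blurb above the real header row
function readSheet(rows, expectedHeader) {
  // ignore supporting documentation: skip down to the actual header row
  const start = rows.findIndex(
    (r) => r[0] && String(r[0]).trim().toLowerCase() === expectedHeader
  );
  if (start === -1) throw new Error("header row not found");
  const headers = rows[start].map((h) => String(h).trim().toLowerCase());
  return rows.slice(start + 1).map((r) =>
    Object.fromEntries(headers.map((h, i) => [h, coerce(r[i])]))
  );
}

const records = readSheet(
  [
    ["Quarterly spending report", "", ""], // blurb, ignored
    ["", "", ""],
    ["Country", "Cost", "Notes"],
    ["UK", "£1,000", "N/A"],
  ],
  "country"
);
// records[0] -> { country: "UK", cost: 1000, notes: null }
```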

other work we’ve done

  • open access spectrum
  • lantern (CSV-only interface)

what’s hard?

  • some data is hard to represent in spreadsheets
    • hierarchical or highly relational data
    • don’t make people use a spreadsheet the way we’d use a database!
  • consistent use of dictionary terms
    • if the spreadsheet maintainers can use consistent names for things, like Countries, it can make things much easier

tech roll call

  • we’re not trying to duplicate: open refine, trifacta, tableau
  • things we do use:
    • d3 + nvd3
    • elasticsearch
    • objectpath (xpath-like language for JSON)
  • things we tried but aren’t currently using
    • highcharts
    • tablib


  • what’s your largest elasticsearch dataset? largest index?
    • 2.5 million records; 25Gb

Mathias Buus (filling in for Karissa), distributing open data with dat

what is dat?

  • open source project for sharing open data
  • funded by Alfred P Sloan foundation
  • meetings are open youtube hangouts
  • 3 person team
  • >800 modules on npm
    • around half a percent of all npm modules!
  • dat is a p2p file sharing network
  • written in javascript
  • works in browser
  • move the data to the code (don’t move your code to your data)
  • data is just files
  • you don’t need all the files
  • move just the files you need to the code
  • similar to BitTorrent
  • install: npm install dat

sharing data

  • dat link ~/big-file.csv
    • creates a content-addressable link dat://9620fb285...
    • can give the link to a friend, then they run dat dat://9620fb285... and automatically discover you and start downloading the dataset

how does it work?

  • split file into chunks which are unlikely to change
    • git does one-chunk-per-line
    • if I change one line, I only have to sync that one line, even if the file is large
    • only works for text files
  • rabin fingerprinting (content-defined chunking)
  • scans through the file and creates chunks based on actual file content
    • if you insert something in the middle, a rabin fingerprint will create the same chunks on each side of the change
  • npm install rabin
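
A toy version of content-defined chunking, using a rolling sum over a sliding window as a stand-in for the real Rabin polynomial fingerprint (that's what `npm install rabin` gives you):

```javascript
// A boundary is declared whenever the window hash hits a magic value, so
// boundaries depend only on nearby bytes: insert something in the middle of
// a file and the chunks on either side of the edit come out the same.

const WINDOW = 16; // bytes of context the hash looks at
const MASK = 31;   // boundary when hash % 32 === 0 -> ~32-byte chunks

function chunk(buf) {
  const chunks = [];
  let start = 0;
  let hash = 0;
  for (let i = 0; i < buf.length; i++) {
    hash += buf[i];                           // byte enters the window
    if (i >= WINDOW) hash -= buf[i - WINDOW]; // byte leaves the window
    // minimum chunk size of WINDOW avoids degenerate tiny chunks
    if ((hash & MASK) === 0 && i + 1 - start >= WINDOW) {
      chunks.push(buf.slice(start, i + 1));
      start = i + 1;
    }
  }
  if (start < buf.length) chunks.push(buf.slice(start)); // tail
  return chunks;
}
```

Real Rabin fingerprinting uses polynomial arithmetic over GF(2) instead of a plain sum, which gives much better-distributed cut points, but the boundary-from-content idea is the same.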


Zara Rahman, Bridging the gap: tech <-> activism

  • @zararah


  • open knowledge, school of data, engineroom
  • bridging gaps between communities who don’t talk to each other, or people who do talk but in different ways

responsible data program

using sensitive data

  • Physicians for Human Rights
    • programme on sexual violence in conflict zones
    • lots of victims don’t come forward to report
    • even when they do, challenges to accurately record
    • Kenya and eastern Democratic Republic of the Congo
    • MediCapt
      • standardising data collection
      • digitising data collection
        • mobile network penetration is v high, but the data is sensitive
      • iterating upon tool choice
        • tried an off-the-shelf tool, piloted, found it too cumbersome
        • developed a new tool, user research with people on the ground
      • reality check
        • evaluate at the end
        • start all over again and iterate
        • slow development
  • Sharing reports of violence
    • a non-profit wanted to support a community which faces a lot of violence
    • they weren’t particularly experienced in technology
    • started thinking of developing an app
      • report a perpetrator of violence to anyone in the area
    • legal, privacy issues
      • can’t have PII because this is an allegation
      • but without PII the report isn’t that useful
      • need to tread a fine line
    • future proofing
      • data minimization
      • don’t want to hold data which could in future put people at risk
      • people were put off from using app if they had to give too much information
    • collaboration
    • launch

analysing data


  • human rights data analysis group
  • data on casualties in Syria
    • listing different groups documenting
  • “Numbers are only human”
    • how do you categorise civilian vs military death?
    • how do you categorise death due to conflict vs “natural causes”?
  • should you use exact (but uncertain) figures to draw attention to causes?

data in the Ebola response

  • in some countries there was a push to release Call Detail Records (CDRs) from mobile companies
  • getting access to the data
    • in Sierra Leone and Guinea, they released this data; in Liberia they didn’t
  • decision-making
    • the call was to have the data anonymised
      • but: it’s hard to anonymise such detailed information
      • and: in the Ebola response, the data is most useful when it can be linked to real personal identities
    • privacy rights weren’t respected
  • digital infrastructure

questions to ask yourself

  • what might an adversary do with your data?
    • not necessarily your adversary
    • what malicious things could they do with your data and how might they gain from that?
    • what would happen then?
  • what’s your holistic security plan?
  • what does informed consent look like for your users?
    • if you know that no one’s reading your Ts & Cs
    • are you making things visible that your users should know about
  • what levels of technical literacy do your users have?
  • in your team, whose job is it to think about the ethics?


  • tech & data projects can have unintended consequences, even when well-intended


  • do you have examples where they managed to embed context with the data
    • the MediCapt team found the context crucial
    • the HRDAG work has lots of asides and nuanced explanations
      • they’re very careful about what they say, though they are probably more sure about their findings than many other groups
  • this reminds me of an app for reporting requests for bribes. how do organizations share anonymised data securely?

Jeni Tennison, Making CSV part of the web

  • Technical Director, ODI

the dream

motivating example: election data

  • data on wikipedia about last local elections
    • data table + map
  • all of this is hardcoded behind the scenes in table rows
  • if you want to get hold of the data, you need to parse the html
  • election results are often entered on wikipedia really quickly
    • it’d be really cool to be able to get them out quickly too
  • it’s also not great to have the same data duplicated
  • could we reference the CSV data directly?
  • we can do it with images <img src="url://">; why not with tables of data?
  • <table src="uk-local-election-summary-2015.csv">
  • reference source for party-to-colour mapping
    • could bring it into your maps and tables


  • it would help people presenting data
  • improve quality of data available for us
    • motivate machine-readable data
    • motivate fixing of errors
      • visualisations of your tabular data demonstrate errors very quickly!
    • motivate publishers to give accurate metadata

getting to a standard

  • CSV on the Web @ W3C completed 2016
  • building on and learning from:
    • OKFN’s data packages / Tabular Data Format
    • Google’s Dataset Publishing language
    • national archives validation
    • existing CSV parsers
    • broad set of documented use cases & requirements

the difficult bits

discovering metadata

  • CSV needs metadata
    • “these columns contain numbers”
    • “this column should be displayed as +/-”
    • “this column is a pointer to this other table”
    • the metadata needs to be in a separate file
  • CSVW metadata standard
  • people want to download CSV, not a zipped-up package or the JSON metadata
    • the JSON metadata has a link to the CSV so you could discover it (in principle)
    • but: normal people won’t do this
    • the link generally needs to be to the CSV file itself
    • how do we find the metadata?
    • RFC 5988 link:
      Link: <metadata.json>; rel="describedBy"; type="application/csvm+json"
  • often can’t control Link: headers though
    • default filenames
      • just add -metadata.json to end of csv file
    • there’s some geeky stuff about /.well-known if you care about that
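
The discovery rules above, sketched (very loose Link: parsing, single-link case only):

```javascript
// prefer an RFC 5988 Link header with rel="describedBy"; fall back to the
// default-filename convention of appending -metadata.json to the CSV's URL
function metadataURL(csvURL, linkHeader) {
  if (linkHeader) {
    const m = linkHeader.match(/<([^>]+)>\s*;[^,]*rel="describedBy"/i);
    if (m) return new URL(m[1], csvURL).href; // resolve relative to the CSV
  }
  return csvURL + "-metadata.json";
}

metadataURL("http://example.org/elections-2015.csv", null);
// -> "http://example.org/elections-2015.csv-metadata.json"

metadataURL(
  "http://example.org/elections-2015.csv",
  '<metadata.json>; rel="describedBy"; type="application/csvm+json"'
);
// -> "http://example.org/metadata.json"
```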

linking between CSVs

  • hard links
    • you can (if you want) think of CSVs as being like a relational database
    • foreign key relationships in your metadata
  • soft links

machine/human readability

  • CSV is on the boundary between these two worlds
  • human variability in CSV headers
    • “country” vs “Country”
    • “unemployment” vs “Unemployment rate”
    • CSVW metadata standard allows you to give different options for titles and indicate they mean the same thing
    • locale-specific variation
      • {en:country, de:Land}
  • formats for dates and numbers
    • use standard number & date formats
      • Unicode Technical Standard #35
    • minimal set that MUST be implemented
      • nothing that requires actually knowing languages
      • eg names of months, currency units
    • Implementations can do more
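
The title-variant and format features above map onto CSVW metadata roughly like this (a minimal, illustrative fragment using the column names from the talk; the number-format pattern follows UTS #35):

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "unemployment.csv",
  "tableSchema": {
    "columns": [
      {
        "name": "country",
        "titles": { "en": "country", "de": "Land" }
      },
      {
        "name": "unemployment",
        "titles": [ "unemployment", "Unemployment rate" ],
        "datatype": { "base": "decimal", "format": "#,##0.0" }
      }
    ]
  }
}
```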

what’s next

  • Implementations
    • validation
    • conversion
      • into JSON and into RDF
    • authoring metadata
    • not yet for display
      • tables, maps, etc
      • it’d be really cool to have some web component type stuff
      • <table src="...">
    • annotation?
    • navigation?


  • sometimes there’s a value in the header (eg “election results 2014”). how do you deal with that?
    • there is a facility for “virtual columns” for static information

Matt Chadburn, Democratising data at the FT

  • principal engineer, FT

about the FT

  • 800,000 subscribers
  • company licences

users of data

  • page analytics
    • education
    • when do you remove something from the front page because it’s becoming stale?
  • email communication with users


  • focus on the users need
  • learnable
  • ease of use (APIs to get stuff in and out)
  • iterative

Mouse Reeve, Grimoires, Demonology and Databases

  • I work for the Internet Archive, but I’m not here to talk about that
  • @tripofmice

what is a grimoire?

  • a book of magic spells and invocations - OED
  • scope for this talk: 16th and 17th century, european christian tradition
  • in this time:
    • no clear divide between magic, religion, science
    • cunning folk prevalent in Europe
      • “low” magic
      • common people, often illiterate
      • medicine, divination, folk magic
    • ceremonial magic
      • “high” magic
      • summoning angels, demons, spirits, fairies
      • piously christian (sometimes, at least)
    • witchcraft
      • capital offence
      • nobody self-identifies as a witch
  • what’s okay vs a capital offence? what’s for scholars vs common people? it’s a bit woolly
  • England, 1580
    • Queen Elizabeth I
    • John Dee
      • some of his magical items are now in the British Museum
    • William Shakespeare
      • Prospero from The Tempest (based on John Dee?)
      • Oberon from A Midsummer Night’s Dream
        • grimoires offered spells to summon Oberon
    • Pseudomonarchia Daemonum (1577)
    • Lesser Key of Solomon (1641)
  • King Solomon’s Temple
    • Solomon was able to summon and control and use demons to help build his temple, aided by archangel Gabriel


  • examples:
    • agares
    • crocell
    • buer
  • every demon is given a sigil, which is a calling card used to summon them
  • summoning a demon is really involved
    • elaborate circles
    • if you get it wrong, you might get eaten
  • crocell’s powers:
    • make it sound like it’s raining
    • run you a warm bath
    • teach you geometry
    • that’s it!

what I want to know

  • what are grimoires for?
    • how do they get used?

how I did it

  • it’s tough to model in relational model
  • lots of many-many relationships (eg demon <-> grimoire)
    • join tables
  • I used neo4j to model this as a graph problem


  • eg: glue to fix a porcelain vase (?!)

graph data structures

  • advantages:
    • designed for relationships & connections
    • flexible
    • no migrations
  • disadvantages
    • no schema for consistency
    • non-performant for simple tabular data
  • common use cases
    • social networks
    • public transport systems



  • do any of these demons appear in paintings?
    • don’t know
  • what did people use these grimoires for?
    • hard to know
  • do you have a way to tell how comprehensive your dataset is?
    • the complete dataset is borderline infinite
    • there’s a finite number of grimoires that have survived and been translated into english
  • you mentioned neo4j for pictorial representation. anything else for this purpose?
    • no
    • I have tables of spells and a timeline, but not much else in terms of data visualisation
  • could you use this dataset to perform unsupervised learning to generate new spells or demons?
    • sure why not

Sarah Gold, keynote: designing for data

  • @sarahtgold
  • @projectsbyif

my background

  • government, politics, civics, …
  • GDS
  • currently: IF
    • a design studio
    • we make things that change how people think about data
    • we are multidisciplinary
      • product development
      • design
      • security
    • we understand technology and design as disciplines which inform each other
    • everything we do is centred on people
      • people who understand the things they use make better decisions about how to use them

problem space

  • more things are becoming data conscious
    • more data being collected
    • more things being connected to the internet
    • it’s never been so cheap to put a chip in it
    • IoT
    • Internet of Shit
      • @InternetOfShit
        • there’s a lot of nonsense
  • we are producing a lot of personal data
    • phones, laptops, fitbits, etc
    • data maximalism
  • Ts & Cs are our default consent model
    • and they don’t work
    • samsung smart TV privacy policy: “Don’t talk in front of the TV”
  • objects are becoming informants
  • we don’t know if something is working properly
  • software is politics – Richard Pope

monitoring & testing

  • gherkin syntax
  • makerversity

design for data

  • design for minimum viable data
  • know which data type you’re designing with

consent models


  • the more informed people are about the implications of tracking, the more likely they are to say no; how do companies which provide free services deal with this?
    • it’s very complicated
    • ad blockers
    • not enough time to do this justice
    • with instances like royal parks, they could give their patrons information about how useful their data has been

Jenny Bryan, keynote: spreadsheets 😱

  • professor of statistics at UBC
  • @JennyBryan @STAT545


  • it’s nice to be allowed to talk about spreadsheets for once
  • people like to moan about them
  • slides (with references!)
  • inspiration: csv,conf,v1 talk Felienne Hermans “Spreadsheets are code”
  • it’s okay to care about spreadsheets!
  • how I pick people to work with:

“some of my best friends use spreadsheets”

  • inequality is toxic in a whole lot of contexts
    • in this case: ability to do what you want with data
    • there’s this “data 1%”
    • anything we want to do, we know how, or how to figure it out, or how to find someone who knows
    • lots of people I teach at UBC are much less able to get these things done, feel paralysed
    • down with software elitism
    • up with the last mile of data munging
  • I supported myself for ~4 years doing spreadsheets
    • I was doing a management consulting gig
    • during grad school I supported myself doing high-end excel work
    • there’s a lot you can do with these consumer-level tools
    • I’d like to create a more porous border between spreadsheets and R/python/etc
    • “Ouch. “50 million accountants use monads in Excel. They just don’t go around explaining monads to everyone…” @Felienne #ndclondon”
  • reactivity is one of the main things people love about spreadsheets
    • spreadsheets have pushed computer science to deal with reactivity
    • i was talking on a podcast about the future of spreadsheets and whether they will go away; i felt reactivity was key
    • with R, I write a Makefile to rebuild everything from scratch
      • but I still have to kick this thing
  • spreadsheets also have less syntax bullshittery
    • argument names, separators, etc
    • you can just select things with your mouse and click “average”
  • FACTS!
    • about 1 billion people use MS Office
    • about 650 million people use spreadsheets
    • up to half use formulas
    • 250k - 1m use R
    • 1-5m use Python
  • you go into data analysis with the tools you know, not the tools you need

crazy spreadsheet stories

  • what you think people are doing ≠ what you think people should be doing ≠ what people are actually doing
  • most tools are designed for the middle thing (what you think people should be doing)
  • The Enron Corpus
    • “the pompeii of spreadsheets”
    • 600k emails
    • 15k spreadsheets
  • example:
    • some cells are data
    • some are formulas
    • some are phone numbers
    • visualizations
    • spreadsheets within spreadsheets (ie a rectangular group of cells)
    • Hermans, Murphy-Hill (research paper on the corpus)
  • lots of colour
    • data and formatting blurred together
    • font choice and colour of cell gives you a categorical variable
  • inconsistency between rows and columns
  • references to other spreadsheets, that you don’t have
  • columns of intermediate computations are so boring, so they get hidden

what makes spreadsheets so vexing?

  • machine readable & human readable
    • (see JeniT’s keynote further up)
    • a spreadsheet is often neither machine nor human readable
      • technically, yes you can open them and look at them
      • but a machine cannot get useful data out in an unsupervised, scalable way
      • and a human reading someone else’s spreadsheet is like reading another person’s codebase
    • spreadsheets are (data ∩ formatting ∩ programming logic)
      • but often we only care about one or two of these concerns
      • (can we separate them after the fact?)

how do we fix this?

  • what are the problems?
  • which ones can we solve?
    • with training?
      • sometimes people use spreadsheets for inappropriate things and we can train them to stop it
    • with tooling?
      • (just a subset; not all problems can be solved with tooling)
  • two angles:
    • create new spreadsheet implementations that use, eg, R or python for computation and visualization
      • anticipate version control, collaboration
      • AlphaSheets
      • stencila
    • accept spreadsheets as they are
      • create tools to get goodies out
      • maybe write back into sheets?
  • googlesheets R package
    • (google sheets are much less common than excel, but they’re still reasonably common)
  • goal: spreadsheet reading tools in R
    • with no non-R dependency
  • Book: Spreadsheet implementation technology


  • what are the interesting differences between excel and google sheets (for ingesting data)
    • the excel spec is 6000 pages long; the google sheets spec is 0 pages long
    • I wish there was something in between
    • they’re both very verbose xml
    • not really big differences in parsing
    • google sheets has to chase excel and be super compatible with excel


Rufus Pollock and Dan Fowler, Frictionless Data


  • getting UK government to publish data on all their spending
    • in CSV format
    • with a spec
      • defined columns
  • but: problems
    • 401 html document saved as csv :/
  • friction
  • containerization for data
    • docker docker docker
  • key principles
    • simplicity
    • web oriented
    • existing tools
    • open
  • validation
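
The “containerization for data” idea is concrete in the Data Package spec: a datapackage.json sits next to the CSVs and declares the columns. A minimal, illustrative example for the spending use case (field names are assumptions, not the actual UK spending spec):

```json
{
  "name": "uk-government-spending",
  "resources": [
    {
      "name": "spending",
      "path": "spending.csv",
      "schema": {
        "fields": [
          { "name": "date", "type": "date" },
          { "name": "supplier", "type": "string" },
          { "name": "amount", "type": "number" }
        ]
      }
    }
  ]
}
```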

Darren Barnes, Data Baker: Pretty Spreadsheets to Useful CSVs

  • a success story from the previous csv,conf


  • ONS produces thousands of spreadsheets each year on our website
    • we’re getting more efficient at it
    • the underlying structures no longer exist for us to get that data in a machine-readable way
    • we’ve gotten so good at producing these spreadsheets but neglected the source data
    • we have CSVs, but “we can’t publish that on the website”
      • I can’t do my aggregation in there
  • how do we get to a point where we publish CSVs?


  • scraperwiki + ONS at csv,conf,v1
  • Dragon Dave McKee’s talk on XYPath
  • version 1
    • python
    • command-line
    • not pretty but functional
  • example
    • spreadsheet with merged cells, multiple tabs, hidden columns, etc etc (see Jenny Bryan’s keynote above)
    • we set up some recipes to instruct Data Baker:
      • what files we want to look at
      • where the data is
      • what transformations we want to do
    • run the command
      • slurp in the .xls files
      • generates some output .xls files
      • one output: a colour-coded .xls file to show how the data was sliced up
        • sanity check to make sure we’re doing it right
  • code!

Jeremy Freeman, open source neuroscience

  • the Janelia research campus (“the bell labs of neuroscience”)
    • northern virginia
    • research institute, non-profit funded

motivations: why do we study the brain?

  • there’s a lot we don’t know
    • try talking to fifth graders!
    • “how is it that I can hear a phone number and the next day I still remember that phone number?”
    • “why do I always dream about robots and dinosaurs?”
  • mice as a model
    • two-photon imaging

using data

  • we often want to analyse data as quickly as possible to drive decisions about what experiment to do next
  • random access two photon mesoscope
  • rich data patterns of brain activity
  • the 80/20 problem
    • time spent doing incredible measurements
    • time spent doing other stuff
    • used to be 80% data gathering & experimental research; 20% analysis
    • now, it’s all changed; only 20% doing actual science
  • analysis isn’t a linear process
    • lots of backtracking and dead ends
    • lots of reinventing the wheel between different labs
      • no sharing of infrastructure
      • often no source control
  • goal: lots of modules that solve well-defined small problems, that can be glued together
    • eg thunder project & bolt-project
    • thunder: a collection of modules for image and time series data analysis
      • analysing a picture and determining which groups of pixels correspond to neurons
      • a really common neuroscience problem!
      • but every lab has come up with their own independent way of doing it
      • website to allow people to submit results from their algorithms (against training and testing datasets)
      • (Question: why didn’t you use kaggle?
        • this seemed like a simple enough problem to solve for ourselves rather than buying into the kaggle space
        • we originally thought about having people submit code and run it in a container but running matlab in a container is somewhere between difficult and illegal)
  • – modular visualization things
    • webgl and 3d is a really important part of the future of scientific visualization
  • the 1 to 2 problem:
    • starting collaboration between two individuals
    • jupyter notebooks
    • github is great for sharing code (and to some degree, data)
    • it doesn’t solve the problem of making an environment usable on someone else’s machine
    • can we use things like docker to take jupyter notebooks and data and code and bundle them all together?
      • previously, we had to repeat the complex process each time
    • tell us a github repo
      • has to have a certain set of contents
        • code needed to run your notebooks
        • some metadata
        • (not required: a complete Dockerfile)
      • builds a docker image
      • then embed a button in your github repo
        • the button launches into a running environment
    • what’s the value in being able to reproduce someone else’s analysis?
      • if someone can rerun this and, as a result, start a collaboration, that’s really cool
  • buzzfeed made a binder to analyse refugee data
    • data relevant for policy decisions: we should have access
    • the analysis should be open too
  • binder doesn’t address data sharing
    • you can put it in a github repo
      • but it’s not a wonderfully sustainable solution
    • dat sounds really cool though!
  • Question: nick had a live image render in a jupyter notebook – how do you do that?
    • the data comes off the microscope
    • goes directly to the machines in a cluster
    • crunching happens
    • then gets absorbed into html rendering in the notebook

back to brains

  • mouse VR
    • data from neurons as a mouse’s whiskers get closer or further from a wall
  • hexaworld


  • what do you do about describing the data? where did it come from? when was it measured?
    • almost no coordination of metadata right now in neuroscience
    • I don’t know how to get two postdocs in the same lab to coordinate on data

Serah Njambi Rono, Life/death decisions powered by CSVs

  • @CallMeAlien
  • developer advocate, @CodeForAfrica
    • a civic tech organization
    • works to empower citizens by giving them access to information
  • call for action: build more tools that directly impact the communities we live in

the problem

  • access to proper healthcare is a basic human right; but the WHO estimates about a third of the world’s population has no access to the most basic medicines
  • in Kenya, quack doctors are very common
    • story: my boss (from south africa) had a business trip to kenya
      • got really sick, sought medical advice, got treated, felt better, returned to SA
      • then got even worse
      • visited his regular family doctor
      • SA requested medical records from kenyan treatment
      • when the SA doctor’s office contacted the kenyan doctor’s office, it turned out the “doctor” was in fact a vet
    • a lot of people in rural africa or south east asia struggle to access doctors
    • how sure are they that they’re seeing a registered practitioner?

the solution

  • Code For Africa collaborated with The Star, the largest blue-collar newspaper
    • enter the name of the town you’re in
    • get a list of medical practitioners you can see, what their speciality is, what clinics they are in
  • story: a woman went to the police and reported she had been drugged and raped by an alleged gynaecologist
    • it hit the news, then many more women came forward
    • it turned out he was a quack doctor; he wasn’t even registered
    • just put up a sign
    • and people trusted him with their lives
    • public outcry
    • The Star started publicising the platform and people started using it

the data

  • Kenya Medical Practitioners and Dentists Board is the authority
    • published the list across >300 web pages
    • websites are not universally accessible
    • a lot of people still have feature phones
  • our service has an SMS interface
    • text us a request and we can tell you details about specific doctors
  • we don’t just take the data from the government; we also validate and report errors back to the government
  • it’s now been replicated by a newsroom in Nigeria
    • they’ve started adding medicine prices too


  • is the data available too?
    • yes it’s available, I can point you to the github
  • re: sms delivery: how do people submit the names?
    • people submit a name
    • we have to do some normalization to allow variability “D” “Dr” “Doctor” etc
    • another issue: the database only has 11,000 doctors
      • we have 44 million people in kenya!
      • either we have only 1 doctor per 4000 people (far too low!)
      • or there are many many unregistered doctors (also bad!)
  • could you look at geographical variability? eg pockets of countries with low coverage
    • yes, definitely
  • how do you keep the data up to date?
    • the scrapers are automated
    • re-scrape on a weekly basis
    • in January this year, we realised that our scrapers weren’t updating
    • it’s a constant gardening effort
  • have you reached out to the organization to see if you could get a data dump?
    • there’s a big trend in kenya (#dodgydoctors hashtag, and another swahili hashtag)
    • people are calling for all government services to have SMS interfaces
    • it’s a bit complicated to get the data from the government
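
The honorific normalization mentioned above might look like this (illustrative, not CodeForAfrica’s actual matching code):

```javascript
// strip a leading "D", "Dr", "Dr." or "Doctor" so all variants of an SMS
// query match the same registry entry, then lowercase for comparison
function normalizeName(query) {
  return query
    .trim()
    .replace(/^(d|dr|doctor)\.?\s+/i, "")
    .toLowerCase();
}

normalizeName("Dr. Jane Mwangi");    // -> "jane mwangi"
normalizeName("DOCTOR Jane Mwangi"); // -> "jane mwangi"
```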