Received: by with HTTP; Sat, 20 Mar 2010 07:17:28 -0700 (PDT)
Date: Sat, 20 Mar 2010 14:17:28 +0000
Subject: format verification preliminary results
From: Tom Morris <>
Content-Type: text/plain; charset=ISO-8859-1
Sorry it has taken so long, but here are the aggregate results of the format verification exercise.
HTML - 252
XML - 5
Word - 4
RTF - 1
OpenOffice - 1
Something odd - 85
JSON - 9
Nothing there! - 190
CSV - 12
Multiple formats - 1211
PDF - 468
RDF - 10
Excel - 408
TOTAL - 2656
Sadly, these figures are over-optimistic. I've manually checked some of
the entries that were categorised as JSON and RDF. Most of them are not
correctly categorised: either people clicked, say, 'RDF' when they
meant to click 'PDF', or they saw an RSS or Atom feed and
categorised it as RDF.
What this admittedly imperfect dataset is basically saying is that the
vast majority of the 'data' on is not actually
machine-readable data but human-readable documents.
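
As a rough check on that reading, the tally above can be turned into percentages. This is only a sketch: the counts are copied from the list, but the grouping of XML, JSON, CSV and RDF as "structured" formats is an assumption made for illustration, not a classification from the exercise itself. Excel and "Multiple formats" are left out of that group because the tally alone does not say how usable they are. Given the miscategorisation noted above, the resulting share is best read as an upper bound for those four formats.

```python
# Counts copied from the aggregate results above.
counts = {
    "HTML": 252,
    "XML": 5,
    "Word": 4,
    "RTF": 1,
    "OpenOffice": 1,
    "Something odd": 85,
    "JSON": 9,
    "Nothing there!": 190,
    "CSV": 12,
    "Multiple formats": 1211,
    "PDF": 468,
    "RDF": 10,
    "Excel": 408,
}

total = sum(counts.values())

# ASSUMPTION (illustrative only): treat these four as structured,
# machine-readable formats. The original message defines no such group.
STRUCTURED = ["XML", "JSON", "CSV", "RDF"]

structured_count = sum(counts[name] for name in STRUCTURED)
structured_share = 100 * structured_count / total

# Print each category with its share of all checked entries, largest first.
for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18}{n:>6}{100 * n / total:>7.1f}%")

print(f"{'TOTAL':<18}{total:>6}")
print(f"Structured (assumed group): {structured_count} entries, "
      f"{structured_share:.2f}% of total")
```

Under that assumed grouping, the four formats account for 36 of the 2656 entries.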
I'll publish the complete dataset later.
Tom Morris