@bibliotechy
Last active November 15, 2016 22:23

Highlights:

**Downloading Open Data Philly Datasets**

  • 1707 resources across 350 packages
  • 1446 not marked as an API, so possibly suitable for downloading / archiving
  • Attempting to download those 1446 resulted in 1327 successful downloads and 88 URLs that timed out or errored.
  • This produced about 19 GB of files
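The bulk-download step could be sketched roughly like this — a sketch, not the script actually used; the `download_all` helper, destination directory, and timeout are all assumptions:

```python
# Sketch: try to fetch each resource URL, tallying successes and failures.
# Filenames are just the list index; real runs would want something smarter.
import os
import urllib.error
import urllib.request

def download_all(urls, dest_dir="downloads", timeout=60):
    """Try each URL; return (succeeded, failed) lists of URLs."""
    os.makedirs(dest_dir, exist_ok=True)
    succeeded, failed = [], []
    for i, url in enumerate(urls):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp, \
                 open(os.path.join(dest_dir, str(i)), "wb") as out:
                out.write(resp.read())
            succeeded.append(url)
        except (urllib.error.URLError, OSError):
            # Covers timeouts, DNS failures, HTTP errors, and bad schemes.
            failed.append(url)
    return succeeded, failed
```

Running it over the 1446 non-API resource URLs and checking the lengths of the two returned lists would give the success/failure split reported above.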

**Are they already backed up in the Wayback Machine?**

  • 1141 appear to not have a crawl
  • 566 resource URLs have at least one crawl present
  • Of the 566, some were duplicate URLs, so only 325 unique resource URLs are present in Wayback
  • HUGE CAVEAT: Wayback's API (http://archive.org/wayback/available?url=http://someurl.tld) doesn't handle single-page apps or routing done with URL fragments, so this data is not totally accurate, especially for data in the Philly Data Catalog.
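Checking one resource URL against that availability endpoint can be sketched as follows; only the endpoint comes from the note above, the helper names are assumptions:

```python
# Sketch: query the Wayback Machine availability API for a URL and pull
# out the closest snapshot, if any crawl exists.
import json
import urllib.parse
import urllib.request

API = "http://archive.org/wayback/available?"

def closest_snapshot(payload):
    """Extract the 'closest' snapshot from an availability response dict."""
    return payload.get("archived_snapshots", {}).get("closest")

def check_wayback(url, timeout=30):
    """Return the closest snapshot dict for `url`, or None if uncrawled."""
    request_url = API + urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(request_url, timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

An uncrawled URL comes back with an empty `archived_snapshots` object, which is how the "appear to not have a crawl" count could be derived.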

Nov 11th: Used the ckanapi Python package to dump metadata for all packages in OpenDataPhilly.

ckanapi dump datasets --all -O opendataphilly.json1.gz -z -p 2 -r https://opendataphilly.org

This resulted in 350 individual packages, mirroring what is in the public interface of opendataphilly.org.

gunzip -c opendataphilly.json1.gz | wc -l # each package is a single line JSON object
350

Each package contains one or more resources. For example, the Vacant Properties Indicator package contains multiple resources which include:

  • Subsets of data within this broader package
  • Multiple data formats for each subset
  • Metadata describing the data, to assist usage and comprehension of the data

So, next I checked how many total resources existed across packages.

gunzip -c opendataphilly.json1.gz | jq '.resources | .[] | .url' | wc -l
1707

I knew we would want to exclude resources that are APIs, because I don't think it makes sense to try to estimate the size of an API. So, I narrowed the search a bit:

# Count resources that are not format "api". An earlier attempt with
# `label $res ... break $res` bailed out at the first "api" resource on each
# line and reported only 1016; `select` filters without stopping early.
gunzip -c opendataphilly.json1.gz | jq '.resources | .[] | select(.format != "api") | .url' | wc -l
1446

I was also interested in seeing which formats were included:

# List unique formats
gunzip -c opendataphilly.json1.gz | jq '.resources | .[] | .format' | sort | uniq
""
"api"
"CSV"
"data lens"
"gdb"
"geojson"
"geoservice"
"grid"
"gtfs"
"HTML"
"imagery"
"JSON"
"KML"
"odata"
"PDF"
"rss"
"shp"
"table"
"tsv"
"TXT"
"XLS"
"xlsx"
"XML"
"zip:csv"
"zip:xls"

A quick Python script shows counts by format:

{u'': 2,
 u'CSV': 329,
 u'HTML': 452,
 u'JSON': 23,
 u'KML': 126,
 u'PDF': 2,
 u'TXT': 3,
 u'XLS': 3,
 u'XML': 26,
 u'api': 261,
 u'data lens': 4,
 u'gdb': 0,
 u'geojson': 168,
 u'geoservice': 1,
 u'grid': 0,
 u'gtfs': 2,
 u'imagery': 21,
 u'odata': 3,
 u'rss': 17,
 u'shp': 220,
 u'table': 0,
 u'tsv': 2,
 u'xlsx': 0,
 u'zip:csv': 15,
 u'zip:xls': 2}
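The counts above could be produced with something like this — a sketch of that quick script, not the original; the filename is the dump from earlier:

```python
# Sketch: tally resource formats across the JSON-lines package dump.
import gzip
import json
from collections import Counter

def count_formats(lines):
    """Count resources by format across JSON-lines package records."""
    counts = Counter()
    for line in lines:
        for resource in json.loads(line).get("resources", []):
            counts[resource.get("format", "")] += 1
    return counts

# with gzip.open("opendataphilly.json1.gz", "rt") as f:
#     print(dict(count_formats(f)))
```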

Probably a few more of those data formats don't make sense to try to download, like "geoservice".
