@bibliotechy
Last active November 15, 2016 22:23

Highlights:

**Downloading Open Data Philly Datasets**

  • 1707 resources across 350 packages
  • 1446 not marked as an API, so possibly suitable for downloading / archiving
  • Attempting to download those 1446 resulted in 1327 successful downloads and 88 URLs that timed out or errored.
  • This produced about 19 GB of files
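The bulk-download step could be sketched roughly like this — a sketch, not the script actually used; the `download_all` helper, destination directory, and timeout are all assumptions:

```python
# Sketch: try to fetch each resource URL, tallying successes and failures.
# Filenames are just the list index; real runs would want something smarter.
import os
import urllib.error
import urllib.request

def download_all(urls, dest_dir="downloads", timeout=60):
    """Try each URL; return (succeeded, failed) lists of URLs."""
    os.makedirs(dest_dir, exist_ok=True)
    succeeded, failed = [], []
    for i, url in enumerate(urls):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp, \
                 open(os.path.join(dest_dir, str(i)), "wb") as out:
                out.write(resp.read())
            succeeded.append(url)
        except (urllib.error.URLError, OSError):
            # Covers timeouts, DNS failures, HTTP errors, and bad schemes.
            failed.append(url)
    return succeeded, failed
```

Running it over the 1446 non-API resource URLs and checking the lengths of the two returned lists would give the success/failure split reported above.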

**Are they already backed up in the Wayback Machine?**

  • 1141 appear to not have a crawl
  • 566 resource URLs have at least one crawl present
  • Of the 566, some were duplicate URLs, so only 325 unique resource URLs are present in Wayback
  • HUGE CAVEAT: Wayback's API (http://archive.org/wayback/available?url=http://someurl.tld) doesn't handle single-page apps or routing done with URL fragments, so this data is not totally accurate, especially for data in the Philly Data Catalog.
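Checking one resource URL against that availability endpoint can be sketched as follows; only the endpoint comes from the note above, the helper names are assumptions:

```python
# Sketch: query the Wayback Machine availability API for a URL and pull
# out the closest snapshot, if any crawl exists.
import json
import urllib.parse
import urllib.request

API = "http://archive.org/wayback/available?"

def closest_snapshot(payload):
    """Extract the 'closest' snapshot from an availability response dict."""
    return payload.get("archived_snapshots", {}).get("closest")

def check_wayback(url, timeout=30):
    """Return the closest snapshot dict for `url`, or None if uncrawled."""
    request_url = API + urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(request_url, timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

An uncrawled URL comes back with an empty `archived_snapshots` object, which is how the "appear to not have a crawl" count could be derived.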

Nov 11th: Used the ckanapi Python package to dump metadata for all packages in OpenDataPhilly.

ckanapi dump datasets --all -O opendataphilly.json1.gz -z -p 2 -r https://opendataphilly.org

This resulted in 350 individual packages, mirroring what is in the public interface of opendataphilly.org.

gunzip -c opendataphilly.json1.gz | wc -l # each package is a single line JSON object
350

Each package contains one or more resources. For example, the Vacant Properties Indicator package contains multiple resources which include:

  • Subsets of data within this broader package
  • Multiple data formats for each subset
  • Metadata describing the data, to assist usage and comprehension of the data

So, next I checked how many total resources existed across packages.

gunzip -c opendataphilly.json1.gz | jq '.resources | .[] | .url' | wc -l
1707

I knew we would want to exclude resources that are APIs, because I don't think it makes sense to try to estimate the size of an API. So, I narrowed the search a bit:

# Count resources that are not format "api". An earlier attempt with
# `label $res ... break $res` bailed out at the first "api" resource on each
# line and reported only 1016; `select` filters without stopping early.
gunzip -c opendataphilly.json1.gz | jq '.resources | .[] | select(.format != "api") | .url' | wc -l
1446

I was also interested in seeing which formats were included:

# List unique formats
gunzip -c opendataphilly.json1.gz | jq '.resources | .[] | .format' | sort | uniq
""
"api"
"CSV"
"data lens"
"gdb"
"geojson"
"geoservice"
"grid"
"gtfs"
"HTML"
"imagery"
"JSON"
"KML"
"odata"
"PDF"
"rss"
"shp"
"table"
"tsv"
"TXT"
"XLS"
"xlsx"
"XML"
"zip:csv"
"zip:xls"

A quick Python script shows counts by format:

{u'': 2,
 u'CSV': 329,
 u'HTML': 452,
 u'JSON': 23,
 u'KML': 126,
 u'PDF': 2,
 u'TXT': 3,
 u'XLS': 3,
 u'XML': 26,
 u'api': 261,
 u'data lens': 4,
 u'gdb': 0,
 u'geojson': 168,
 u'geoservice': 1,
 u'grid': 0,
 u'gtfs': 2,
 u'imagery': 21,
 u'odata': 3,
 u'rss': 17,
 u'shp': 220,
 u'table': 0,
 u'tsv': 2,
 u'xlsx': 0,
 u'zip:csv': 15,
 u'zip:xls': 2}
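The counts above could be produced with something like this — a sketch of that quick script, not the original; the filename is the dump from earlier:

```python
# Sketch: tally resource formats across the JSON-lines package dump.
import gzip
import json
from collections import Counter

def count_formats(lines):
    """Count resources by format across JSON-lines package records."""
    counts = Counter()
    for line in lines:
        for resource in json.loads(line).get("resources", []):
            counts[resource.get("format", "")] += 1
    return counts

# with gzip.open("opendataphilly.json1.gz", "rt") as f:
#     print(dict(count_formats(f)))
```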

Probably a few more of those data formats don't make sense to try to download, like "geoservice".
