Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save chriswhong/93ed4660ff23fdf483b41654290f0b18 to your computer and use it in GitHub Desktop.
Save chriswhong/93ed4660ff23fdf483b41654290f0b18 to your computer and use it in GitHub Desktop.
Socrata Download URL generates different files when accessed multiple times
## Problem
When troubleshooting another bug at qri.io, we noticed that hitting the same download URL from the NYC Open Data portal (powered by Socrata) generated csv files with different md5 hashes. This was causing our version control software to consider the csv "changed". On further investigation, the actual data was the same, but the ordering of the rows was different. This is likely due to the fact that the files are being generated on the fly from a database query (PostgreSQL in the case of Socrata open data portals) returning unpredictable row orders when no `ORDER BY` is specified in the query.
To reproduce:
- curl the same download URL several times and save to csv, such as this one for popular baby names:
```
$curl https://data.cityofnewyork.us/resource/25th-nujf.csv > 0.csv
$curl https://data.cityofnewyork.us/resource/25th-nujf.csv > 1.csv
$curl https://data.cityofnewyork.us/resource/25th-nujf.csv > 2.csv
...
```
- calculate the md5 hash of each CSV file
`$openssl md5 *.csv`
- not all of the hashes are identical
```
MD5(0.csv)= e91c5f61fce2d026b3e3a4d0456bc987
MD5(1.csv)= e91c5f61fce2d026b3e3a4d0456bc987
MD5(2.csv)= e91c5f61fce2d026b3e3a4d0456bc987
MD5(3.csv)= e91c5f61fce2d026b3e3a4d0456bc987
MD5(4.csv)= e91c5f61fce2d026b3e3a4d0456bc987
MD5(5.csv)= 6af3bbad8c49866a1fd934187b961de7
MD5(6.csv)= e91c5f61fce2d026b3e3a4d0456bc987
```
## Solution
Using the Socrata API to specify ordering, including all columns, resulted in consistent downloaded files:
`curl 'https://data.cityofnewyork.us/resource/25th-nujf.csv?$select=*&$order=brth_yr,gndr,ethcty,nm, cnt, rnk'`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment