Created
April 7, 2020 17:57
-
-
Save chriswhong/93ed4660ff23fdf483b41654290f0b18 to your computer and use it in GitHub Desktop.
Socrata Download URL generates different files when accessed multiple times
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
## Problem | |
When troubleshooting another bug at qri.io, we noticed that hitting the same download URL from the NYC Open Data portal (powered by Socrata) generated csv files with different md5 hashes. This was causing our version control software to consider the csv "changed". On further investigation, the actual data was the same, but the ordering of the rows was different. This is likely due to the fact that the files are being generated on the fly from a database query (PostgreSQL in the case of Socrata open data portals) returning unpredictable row orders when no `ORDER BY` is specified in the query. | |
To reproduce: | |
- curl the same download URL several times and save to csv, such as this one for popular baby names: | |
``` | |
$curl https://data.cityofnewyork.us/resource/25th-nujf.csv > 0.csv | |
$curl https://data.cityofnewyork.us/resource/25th-nujf.csv > 1.csv | |
$curl https://data.cityofnewyork.us/resource/25th-nujf.csv > 2.csv | |
... | |
``` | |
- calculate the md5 hash of each CSV file | |
`$openssl md5 *.csv` | |
- not all of the hashes are identical | |
``` | |
MD5(0.csv)= e91c5f61fce2d026b3e3a4d0456bc987 | |
MD5(1.csv)= e91c5f61fce2d026b3e3a4d0456bc987 | |
MD5(2.csv)= e91c5f61fce2d026b3e3a4d0456bc987 | |
MD5(3.csv)= e91c5f61fce2d026b3e3a4d0456bc987 | |
MD5(4.csv)= e91c5f61fce2d026b3e3a4d0456bc987 | |
MD5(5.csv)= 6af3bbad8c49866a1fd934187b961de7 | |
MD5(6.csv)= e91c5f61fce2d026b3e3a4d0456bc987 | |
``` | |
## Solution | |
Using the Socrata API to specify ordering, including all columns, resulted in consistent downloaded files: | |
`curl 'https://data.cityofnewyork.us/resource/25th-nujf.csv?$select=*&$order=brth_yr,gndr,ethcty,nm, cnt, rnk'` |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment