How to download, import, and analyze San Francisco restaurant inspection data using SQLite3 and csvkit from the command-line.
A quick example of doing data wrangling from the command-line, as well as getting to know one of San Francisco's data sets: the San Francisco restaurant inspections, courtesy of the SF Department of Public Health. I don't normally do database work from the command-line, but importing bulk data into SQLite is pretty frustrating using the available GUIs or just the shell.
So thank goodness for Christopher Groskopf's csvkit, a suite of Unix-like tools that use Python to robustly handle CSV files. There are a lot of great tools in csvkit, but for this gist, I just use csvsql, which can parse a CSV and turn it into properly-flavored SQL to pass directly into your database app of choice.
This gist is basically one giant Bash script (OS X 10.10).
`csvsql` has a convenient mode that will read in a CSV file and generate a `CREATE TABLE` statement:

```sh
$ csvsql some.csv --dialect sqlite --tables some_table_name
```
I ran that to generate the initial `CREATE TABLE` commands, which I adjusted to my own liking...since the restaurant inspection data is not that big, we don't have to worry too much about optimization. We could probably get by without even indexing it.
About the data
This is what you see if you do a search for the inspection records for the In-N-Out Burger:
(Thankfully, it's been doing fine...whew!)
The actual raw data is in an external directory:
That zip file holds 3 files:
- businesses_plus.csv - unique identifiers, names, addresses, geospatial coordinates, and other data related to each individual business
- inspections_plus.csv - contains a row for each inspection, including the business_id of the inspected business, the score it received (if applicable), the date of the inspection, and the reason for the inspection
- violations_plus.csv - contains a row for every violation, including the severity and a short description of the violation. It contains business_id and an inspection date, which can presumably be used to join against the inspections data.
Setting up your workspace and downloading the data
```sh
##############
# Setup your workspace
BASE_DIR=/tmp/whatever
DB_PATH=$BASE_DIR/sf_food_program_db.sqlite
ZIP_DIR=/tmp/sffoodzips
mkdir -p $BASE_DIR
mkdir -p $ZIP_DIR

################
# Download the files into $ZIP_DIR
curl https://extxfer.sfdph.org/food/SFFoodProgram_Complete_Data.zip \
  -o $ZIP_DIR/SFFoodProgram_Complete_Data.zip
unzip $ZIP_DIR/SFFoodProgram_Complete_Data.zip -d $ZIP_DIR
```
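After unzipping, it's worth a quick sanity check of what came down before importing anything. This sketch (assuming the `$ZIP_DIR` variable from the setup above) just prints each file's header row and a rough row count using plain POSIX tools:

```sh
# Peek at the column names and approximate size of each CSV.
# Note: `wc -l` counts physical lines, so it over-counts rows when
# fields contain embedded line breaks -- one reason we let csvkit
# do the real parsing later on.
for f in $ZIP_DIR/*_plus.csv; do
  echo "== $f =="
  head -n 1 "$f"   # the header row, i.e. the column names
  wc -l < "$f"     # approximate row count (includes the header)
done
```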
Create the database and tables
At this point, we just pipe raw SQL into the `sqlite3` program. I previously used `csvsql` with the `--dialect sqlite` option to generate the initial `CREATE TABLE` statements, which I then adjusted by hand:
```sh
#################
# Build the database

# First remove the existing database file, if any
rm -f $DB_PATH

## Create the businesses table
echo "CREATE TABLE businesses (
  business_id INTEGER NOT NULL,
  name VARCHAR,
  address VARCHAR,
  city VARCHAR,
  postal_code INTEGER,
  latitude FLOAT,
  longitude FLOAT,
  phone_number BIGINT,
  TaxCode VARCHAR,
  business_certificate INTEGER,
  application_date DATE,
  owner_name VARCHAR,
  owner_address VARCHAR,
  owner_city VARCHAR,
  owner_state VARCHAR,
  owner_zip VARCHAR
);" | sqlite3 $DB_PATH

## Create the inspections table
echo "CREATE TABLE inspections (
  business_id INTEGER NOT NULL,
  Score INTEGER,
  date DATE,
  type VARCHAR
);" | sqlite3 $DB_PATH

## Create the violations table
echo "CREATE TABLE violations (
  business_id VARCHAR NOT NULL,
  date DATE NOT NULL,
  ViolationTypeID INTEGER NOT NULL,
  risk_category VARCHAR(20),
  description VARCHAR(100)
);" | sqlite3 $DB_PATH
```
Insert the data with csvsql
This is where `csvsql` really helps. SQLite3 (as far as I know) does not have a robust CSV parser (i.e. it can't handle quoted fields and funky line breaks). So the `csvsql` command will read each CSV and convert it into SQL `INSERT` statements:
```sh
#####################
# Insert the data

## The businesses data
## Note that it's encoded as `windows-1252`, hence the `iconv` call to
## convert it to UTF-8
iconv -f windows-1252 -t UTF-8 \
  < $ZIP_DIR/businesses_plus.csv |
  csvsql --db sqlite:///$DB_PATH --insert --no-create \
    --tables businesses

## Insert the inspections data
csvsql $ZIP_DIR/inspections_plus.csv \
  --db sqlite:///$DB_PATH --insert --no-create \
  --tables inspections

## The dates come as `YYYYMMDD`; this update statement
## changes it to `YYYY-MM-DD`
echo "UPDATE inspections
  SET date = SUBSTR(date, 1, 4) || '-' ||
             SUBSTR(date, 5, 2) || '-' ||
             SUBSTR(date, 7, 2);
" | sqlite3 $DB_PATH

## Insert the violations data
## (this will take awhile)
csvsql $ZIP_DIR/violations_plus.csv \
  --db sqlite:///$DB_PATH --insert --no-create --tables violations

## The dates come as `YYYYMMDD`; this update statement
## changes it to `YYYY-MM-DD`
echo "UPDATE violations
  SET date = SUBSTR(date, 1, 4) || '-' ||
             SUBSTR(date, 5, 2) || '-' ||
             SUBSTR(date, 7, 2);
" | sqlite3 $DB_PATH
```
Add indexes to the tables
This is just more piping of raw SQL.
```sh
#############
# Add indexes to the tables
echo "
CREATE INDEX business_id_o_businesses_idx ON businesses(business_id);
CREATE INDEX name_o_businesses_idx ON businesses(name);
CREATE INDEX phone_o_businesses_idx ON businesses(phone_number);
CREATE INDEX business_id_date_o_inspections_idx ON inspections(business_id, date);
CREATE INDEX date_o_inspections_idx ON inspections(date);
CREATE INDEX business_id_date_o_violations_idx ON violations(business_id, date);
CREATE INDEX date_o_violations_idx ON violations(date);
CREATE INDEX date_o_violation_type_id_idx ON violations(ViolationTypeID);
" | sqlite3 $DB_PATH
```
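If you want to confirm that SQLite will actually use one of these indexes, you can ask it for a query plan. A quick sketch (assuming the database and indexes built above; the `business_id = 10` lookup is just an arbitrary example):

```sh
# Ask SQLite how it would execute a lookup by business_id and date.
# If the index took effect, the plan output mentions
# "USING INDEX business_id_date_o_inspections_idx".
echo "EXPLAIN QUERY PLAN
      SELECT * FROM inspections
      WHERE business_id = 10 AND date > '2014';
" | sqlite3 $DB_PATH
```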
Let's look for Starbucks!
When everything is done, we can write a query to test out the data.
The following query asks:
- Find every business with "Starbucks" in its name
- that has had an inspection with a date since 2014
- and that inspection resulted in a score below 90 points
- and that inspection found at least one "High Risk" violation
Then, for every violation that is found above, display:
- The name of the business
- The address of the business
- The score of the inspection in which the violation was found
- The date of that inspection
- The description of the violation
And list the violations in ascending order of the score of its respective inspection.
Passing that query via shell:
```sh
echo "
.headers on
SELECT businesses.name AS biz_name,
       businesses.address AS biz_address,
       inspections.Score AS score,
       CAST(inspections.date AS CHAR) AS inspection_date,
       violations.description AS violation_description
  FROM businesses
    INNER JOIN inspections
      ON businesses.business_id = inspections.business_id
    INNER JOIN violations
      ON inspections.business_id = violations.business_id
      AND inspections.date = violations.date
  WHERE inspection_date > '2014'
    AND biz_name LIKE '%STARBUCKS%'
    AND score < 90
    AND violations.risk_category = 'High Risk'
  ORDER BY score;
" | sqlite3 $DB_PATH
```
```
Starbucks Coffee Co|0264 Kearny St|85|2014-01-03|Improper cooling methods
Starbucks Coffee|201 Powell St|87|2015-03-17|Unclean hands or improper use of gloves
STARBUCKS|350 RHODE ISLAND|89|2014-03-19|No hot water or running water
STARBUCKS COFFEE #9219|333 MARKET St|89|2014-10-03|High risk food holding temperature
STARBUCKS COFFEE CO|1231 MARKET St|89|2014-12-30|Unclean hands or improper use of gloves
```
Note: I do not guarantee the accuracy of these results -- e.g. I didn't check to see if unique IDs were really kept unique, and so forth. I didn't manually double check all of these results against the online database. But it seems to check out for a couple of the places I eyeballed. It's also worth noting that the vast majority of Starbucks did very well...a handful of mishaps in a year-and-a-half isn't too bad at all.
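For instance, one quick check of the "unique IDs" assumption is to ask whether any `business_id` appears more than once in the `businesses` table. A sketch, assuming the database built above:

```sh
# List any business_id values that occur more than once in businesses,
# along with how many rows each one has. No output means the IDs
# really are unique.
echo "SELECT business_id, COUNT(*) AS n
      FROM businesses
      GROUP BY business_id
      HAVING n > 1;
" | sqlite3 $DB_PATH
```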
For example, here's the entry for Starbucks on 333 Market St, with the October 3, 2014 violation of "High risk food holding temperature":