
Fetching and collating the California Kindergarten immunization data in Python and Bash

by Dan Nguyen @dancow

tl;dr: a quick example of practicing reproducible data journalism, and somewhat timely given the recent school vaccination law signed by California Gov. Jerry Brown

These scripts are part of the mundaneprogramming.github.io repo for SRCCON 2015 and will soon have their own entry/explanation on that site. They aren't meant to be best or canonical practices (e.g. I felt like using csv.DictWriter, so there it is), nor do I guarantee that they work, but you're free to run them to see what happens. All they currently do is download the relevant spreadsheets and compile them into a single file, which turns out to be one of the most tedious parts of the entire investigation, thanks to how the files are organized on the homepage and how they're structured internally. For example, starting with the 2012-2013 school year, the column layout changes, which accounts for most of the code in collate.py.

Example pre-2012 spreadsheet

[screenshot of the pre-2012 spreadsheet layout]

Example post-2012 spreadsheet

[screenshot of the post-2012 spreadsheet layout]

Resulting spreadsheet after collate.py

[screenshot of the collated spreadsheet]

Bash preparation

When dealing with government data, it's always nice to create a complete mirror of the site, just in case it changes or gets removed at some point in the future. This is a good opportunity to practice some basic but highly reusable shell commands.

Mirror the site

Use wget to perform a recursive fetch, one level deep. In the fetched pages, convert the links to relative locations, and append .html to all webpage-like files (e.g. somepage.aspx becomes somepage.aspx.html). This partial mirror will be saved to the current local path, in a subdirectory named www.cdph.ca.gov:

wget --recursive --level=1 --adjust-extension --convert-links \
http://www.cdph.ca.gov/programs/immunize/pages/immunizationlevels.aspx

Use Amazon's AWS command-line interface tool to upload the wgotten files to S3:

aws s3 sync www.cdph.ca.gov/ s3://www.mundaneprogramming.com/sites/ --acl public-read

The resulting live Web address (if you've configured your S3 bucket to act as a static web server) would look something like this:

http://www.mundaneprogramming.com.s3.amazonaws.com/sites/programs/immunize/pages/immunizationlevels.aspx.html
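
If you haven't set that up yet, S3's static-website hosting can be enabled from the same AWS CLI. A minimal sketch, assuming index.html as the index document (adjust to your own bucket):

aws s3 website s3://www.mundaneprogramming.com/ --index-document index.html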

Description of the Python scripts

The scripts below are written for Python 3.x and were tested specifically on OS X 10.10 with Python 3.4.3 :: Anaconda 2.2.0 (x86_64). The Anaconda 3.x distribution includes all the necessary dependencies, including xlrd for reading Excel files.
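
If you're not using Anaconda, the third-party dependencies used by these scripts can be installed with pip (a minimal sketch; version pinning is up to you):

pip install xlrd requests lxml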

  • fetcher.py - Downloads the Excel spreadsheets containing kindergarten-level data from the California Department of Public Health's Immunization Levels in Child Care and Schools homepage. Files are saved into a local path, ./data-hold/xls/immunization, and are given a more uniform naming scheme, e.g. K--2013-2014.xls rather than 2013-2014%20CA%20Kindergarten%20Data.xls. The data currently dates back to the 2000-2001 school year.

  • collate.py - Assuming fetcher.py has been run, this opens each downloaded spreadsheet and compiles them all into a single dataset. Starting with the 2012-2013 school year, the column layout changes; this script reconciles both layouts and includes the union of possible columns (e.g., the mmr1_num column is simply left empty for post-2012 datasets). The data is written out in two different (but equivalent) formats: CSV and JSON. One caveat: if you open one of the downloaded spreadsheets with Excel, it creates a temporary, hidden .xlsx lock file while Excel is running, and if you then run collate.py, you'll get an error as it tries to open this shadow file. So don't open these files in Excel while running collate.py; a defensive workaround is sketched just below.
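
As a defensive tweak for that Excel caveat, the glob in collate.py could filter out Excel's lock files, whose names start with ~$. This is a hypothetical addition, not in the script as written:

# hypothetical guard: skip Excel's hidden lock files (names start with "~$")
xlsnames = [f for f in glob(os.path.join(XLS_DIR, '*.xls*'))
            if not os.path.basename(f).startswith('~$')]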

Output of collate.py

./data-hold/xls/immunization/K--2000-2001.xls has 7442 rows
./data-hold/xls/immunization/K--2001-2002.xls has 7527 rows
./data-hold/xls/immunization/K--2002-2003.xls has 7453 rows
./data-hold/xls/immunization/K--2003-2004.xls has 7377 rows
./data-hold/xls/immunization/K--2004-2005.xls has 7386 rows
./data-hold/xls/immunization/K--2005-2006.xls has 7396 rows
./data-hold/xls/immunization/K--2006-2007.xls has 7389 rows
./data-hold/xls/immunization/K--2007-2008.xls has 7334 rows
./data-hold/xls/immunization/K--2008-2009.xls has 7199 rows
./data-hold/xls/immunization/K--2009-2010.xls has 7150 rows
./data-hold/xls/immunization/K--2010-2011.xls has 7190 rows
./data-hold/xls/immunization/K--2011-2012.xls has 7351 rows
./data-hold/xls/immunization/K--2012-2013.xls has 7662 rows
./data-hold/xls/immunization/K--2013-2014.xls has 7404 rows
./data-hold/xls/immunization/K--2014-2015.xlsx has 7494 rows
There are 110264 data rows all together
Writing to JSON: ./data-hold/finished/k-immune.json
Writing to CSV: ./data-hold/finished/k-immune.csv

Sample CSV output

Sample JSON output

[
    {
        "school_type": "PUBLIC",
        "polio_num": 80.0,
        "mmr2_num": 80.0,
        "pme_num": 0.0,
        "district_code": "75101",
        "conditional_num": 0.0,
        "pbe_pct": 0.0,
        "hepb_num": 80.0,
        "pbe_num": 0.0,
        "dtp_num": 80.0,
        "uptodate_num": 80.0,
        "polio_pct": 100.0,
        "mmr2_pct": 100.0,
        "school_name": "ALISAL ELEM",
        "conditional_pct": 0.0,
        "dtp_pct": 100.0,
        "year": 2000,
        "enrollment": 80.0,
        "school_code": "6002315",
        "county": "ALAMEDA",
        "mmr1_num": 80.0,
        "hepb_pct": 100.0,
        "uptodate_pct": 100.0,
        "pme_pct": 0.0,
        "mmr1_pct": 100.0
    }
]

Reproducible data

These scripts are meant to be examples of reproducibility, specifically the kind of reproducible build that comes as part and parcel of programming. Obviously, you could write scripts to turn the collated data into visualizations. Or, more mundanely but just as importantly, back up the original files to a separate mirror, just in case the state of California changes its data offerings. Either way, the mindset is to describe the steps you have to do as code. It takes extra time at first, but then you can re-run the steps at any time to get exactly what you had before. This consistency is not just a hobgoblin of little minds: once the steps have been abstracted to code, it becomes easy, and nearly error-free, to re-collect and reorganize the data when, later in the investigation, you decide it'd be nice to include what in retrospect seemed like an irrelevant datapoint or dataset.
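
To make that concrete: as things stand, the entire pipeline can be re-run from a two-line shell script (a trivial sketch, not part of the gist itself):

python fetcher.py
python collate.py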

Case in point: the preschool data isn't currently included in these batch scripts, but it seems to be similar in structure and organization (though the earliest non-summarized data is for 2010-2011). It would be relatively easy to adjust fetcher.py and collate.py to create a dataset that includes both kinds of schools, even if you're already at the point where you've visualized or mapped the data; one possible tweak is sketched below.
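
For instance, the link-selecting XPath in fetcher.py is keyed to "Kinder"; a hypothetical adjustment would loop it over both kinds of files. The "ChildCare" href pattern here is a guess, so verify it against the live page:

# hypothetical: gather both kindergarten and child-care spreadsheet links;
# "ChildCare" is a guessed href pattern -- check the actual CDPH page first
all_urls = []
for pattern in ('Kinder', 'ChildCare'):
    all_urls += doc.xpath('//a[contains(@href, "{0}") and contains(@href, "xls")]/@href'.format(pattern))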

And speaking of mapping the data: after running collate.py, you would need to join the schools against the California Department of Education's Public Schools Data file, which contains addresses and geocoordinates. That, however, is just another script in the pipeline. And considering that the Department of Public Health names schools inconsistently, across the years and across the pre/post-2012 layouts, having a script like collate.py is pretty much essential to sanely joining the health data to the CDE's geodata.

In other words, for the immunization dataset, you need to derive the unique school identifier from this:

SCHOOL CODE  COUNTY   PUBLIC/PRIVATE  PUBLIC SCHOOL DISTRICT              CITY     SCHOOL NAME
0130419      ALAMEDA  PUBLIC          ALAMEDA COUNTY OFFICE OF EDUCATION  HAYWARD  ALAMEDA COUNTY COMMUNITY

To join against the corresponding foreign key in the public schools database:

     01100170130419

Why doesn't the immunization dataset just include the school's unique identifier, i.e. the CDSCode found in the public schools data file? Because the CDSCode is a convention created by the Department of Education, and the immunization data is generated by the Department of Public Health. Obvious, isn't it?
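
Here's a hedged sketch of how that lookup could be built from the CDE's tab-delimited pubschls.txt (fetched by the work-in-progress script at the bottom of this gist). The 'CDSCode' and 'County' column names are assumptions about that file, so check them before relying on this:

import csv

# build a lookup from (county name, 7-digit school code) to the 14-digit CDSCode
with open('./data-hold/cde/pubschls.txt', encoding='latin-1') as f:
    cds_lookup = {}
    for row in csv.DictReader(f, delimiter='\t'):
        cds = row['CDSCode']                  # e.g. '01100170130419'
        cds_lookup[(row['County'].upper(), cds[-7:])] = cds

# e.g. cds_lookup[('ALAMEDA', '0130419')]  # => '01100170130419'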

So reproducible data isn't merely an issue of transparency; it's a direct consequence of systematically expressing the data-gathering steps as code, a process that's necessary anyway for the kind of expansive data-joining operations common in investigative journalism. It's a win-win for journalist and audience.

It's also horribly boring to write that code. On the other hand, even if it takes you longer to write the code than to do the work manually, it's still a great opportunity to learn how to program, or to experiment with different libraries. For instance, I had almost never used Python's xlrd library for opening and reading Excel files. Not only is it essential for this scenario, it's an incredibly useful library that now makes it much, much easier for me to work with Excel data in the future -- and yet I would never have bothered looking at xlrd for a project on deadline.

collate.py

## This expects the fetching script (fetcher.py) to have been run.
## It compiles all of the downloaded spreadsheets into:
## - k-immune.csv
## - k-immune.json
import csv
import json
import os.path
import re
from glob import glob
from os import makedirs
from xlrd import open_workbook

XLS_DIR = "./data-hold/xls/immunization"
FINISHED_DIR = './data-hold/finished'

pre_2012_headers = ['school_code', 'county', 'school_type', 'district_code', 'school_name',
    'enrollment', 'uptodate_num', 'uptodate_pct', 'conditional_num', 'conditional_pct',
    'pme_num', 'pme_pct', 'pbe_num', 'pbe_pct', 'dtp_num', 'dtp_pct', 'polio_num',
    'polio_pct', 'mmr1_num', 'mmr1_pct', 'mmr2_num', 'mmr2_pct', 'hepb_num', 'hepb_pct',
    'vari_num', 'vari_pct']
post_2012_headers = ['school_code', 'county', 'school_type', 'district_name', 'city',
    'school_name', 'enrollment', 'uptodate_num', 'uptodate_pct', 'conditional_num', 'conditional_pct',
    'pme_num', 'pme_pct', 'pbe_num', 'pbe_pct', 'dtp_num', 'dtp_pct', 'polio_num',
    'polio_pct', 'mmr2_num', 'mmr2_pct', 'hepb_num', 'hepb_pct', 'vari_num', 'vari_pct', 'reported']
# main difference between the pre/post-2012 layouts:
# post-2012 only records 2-dose MMR, i.e. `mmr2_num`/`mmr2_pct` (no `mmr1_*` columns)
makedirs(XLS_DIR, exist_ok=True)
makedirs(FINISHED_DIR, exist_ok=True)
data = []
for xlsname in glob(os.path.join(XLS_DIR, '*.xls*')):
    # extract the year numbers from the file name,
    # e.g. "2006" and "2007" from "K--2006-2007.xls"
    yr_1, yr_2 = re.search(r'(\d{4})-(\d{4})', xlsname).groups()
    year = int(yr_1)
    headers = pre_2012_headers if year < 2012 else post_2012_headers
    # open the Excel workbook and use the first non-empty sheet
    book = open_workbook(xlsname)
    sheet = [s for s in book.sheets() if s.nrows > 0][0]
    print(xlsname, "has", sheet.nrows, "rows")
    # iterate the rows, skipping the header row (0) and the final row
    for x in range(1, sheet.nrows - 1):
        row = sheet.row_values(x)
        # keep only rows whose first cell looks like a 7-digit school code
        if re.search(r'\d{7}', str(row[0])):
            d = dict(zip(headers, row))
            d['year'] = year
            data.append(d)
print("There are", len(data), 'data rows all together')
# write the JSON file
# Note: fields that don't exist in a given layout are *not* included as null values;
# they are simply left out of each dict
jname = os.path.join(FINISHED_DIR, 'k-immune.json')
print("Writing to JSON:", jname)
with open(jname, "w") as jfile:
    jfile.write(json.dumps(data, indent=4))
# write the CSV file
cname = os.path.join(FINISHED_DIR, 'k-immune.csv')
print("Writing to CSV:", cname)
# sorted() makes the column order deterministic (a bare set would not be)
with open(cname, 'w', encoding='utf-8', newline='') as cfile:
    writer = csv.DictWriter(cfile,
                            fieldnames=sorted(set(pre_2012_headers + post_2012_headers + ['year'])),
                            delimiter=',')
    writer.writeheader()
    for d in data:
        writer.writerow(d)
fetcher.py

# Just fetches the kindergarten spreadsheets from the California CDPH site
import re
import os.path
import requests
from urllib.parse import urljoin
from lxml import html
from os import makedirs

XLS_DIR = "./data-hold/xls/immunization"
INDEX_URL = "http://www.cdph.ca.gov/programs/immunize/pages/immunizationlevels.aspx"
makedirs(XLS_DIR, exist_ok=True)
# Download the HTML listing and select every link to a Kindergarten .xls/.xlsx file
response = requests.get(INDEX_URL)
doc = html.fromstring(response.text)
all_urls = doc.xpath('//a[contains(@href, "Kinder") and contains(@href, "xls")]/@href')
for url in all_urls:
    # y1 and y2 are the two-digit years, e.g. "05" and "06" from
    # either a "2005-2006" or a "2005-06" style filename
    y1, y2 = re.search(r'(\d{2})-(?:\d{2})?(\d{2})', url).groups()
    ext = os.path.splitext(url)[1]
    # Now rename to a proper, uniform year-based filename,
    # e.g. ./data-hold/xls/immunization/K--2005-2006.xls
    oname = os.path.join(XLS_DIR, "K--20{0}-20{1}{2}".format(y1, y2, ext))
    full_url = urljoin(INDEX_URL, url)
    print("Downloading:\n {0}\n into: {1}".format(full_url, oname))
    # Sample output:
    # Downloading:
    #   http://www.cdph.ca.gov/programs/immunize/Documents/2007-2008%20CA%20Kindergarten%20Data.xls
    #   into: ./data-hold/xls/immunization/K--2007-2008.xls
    xlsfile = requests.get(full_url)
    with open(oname, 'wb') as ofile:
        ofile.write(xlsfile.content)

Work-in-progress script: CDE schools lookup

# In progress... when done, this script should produce a lookup table to find a CDS code given a school code and county name/code
# Ideally, this would run _after_ the collate.py step, which could attach CDScodes to the health data...however
# there are a number of complexities, foremost being that school districts change over the years, and the CDE only
# has the file for the current date. So a decision has to be made by the user on how much to care about tracking
# historical data across different schools.
# Landing page
# http://www.cde.ca.gov/ds/si/ds/pubschls.asp
# More about schools
# http://www.cde.ca.gov/ds/si/ds/
from os import makedirs
from urllib.request import urlretrieve
import csv
import os.path
import shutil

DATA_DIR = './data-hold/cde'
DATA_FILES = {
    'district': {
        'url': 'http://www.cde.ca.gov/ds/si/ds/documents/legdist2014.xls',
        'local': os.path.join(DATA_DIR, 'legdist2014.xls')
    },
    'schools': {
        'url': 'ftp://ftp.cde.ca.gov/demo/schlname/pubschls.txt',
        'local': os.path.join(DATA_DIR, 'pubschls.txt')
    }
}
# TODO: pubschls gets updated daily; might be worth timestamping it upon each download
makedirs(DATA_DIR, exist_ok=True)
for d in DATA_FILES.values():
    # urlretrieve downloads to a temporary file and returns its path;
    # copy that temp file into the desired local location
    resp = urlretrieve(d['url'])
    shutil.copy(resp[0], d['local'])
    print("Copied", d['url'], 'to:', d['local'])
# Now work with the local schools file:
# TODO
# txt = open(DATA_FILES['schools']['local'], encoding='latin-1').read()
# rows = list(csv.DictReader(txt.splitlines(), delimiter="\t"))