
Fetching and collating the California Kindergarten immunization data in Python and Bash

by Dan Nguyen @dancow

tl;dr: a quick example of practicing reproducible data journalism, and somewhat timely given the recent school vaccination law signed by California Gov. Jerry Brown

These scripts are part of the mundaneprogramming.github.io repo for SRCCON 2015 and will soon have their own entry/explanation on that site. They aren't meant to be best or canonical practices (e.g. I felt like using csv.DictWriter, so there it is), nor do I guarantee that they work, but you're free to run them to see what happens. All they currently do is download the relevant spreadsheets and compile them into a single file, which turns out to be one of the most tedious parts of the entire investigation, thanks to how the files are organized on the homepage and how they're structured internally. For example, starting with the 2012-2013 school year, the column layout changes, which accounts for most of the code in collate.py.

Example pre-2012 spreadsheet

[screenshot of the pre-2012 spreadsheet layout]

Example post-2012 spreadsheet

[screenshot of the post-2012 spreadsheet layout]

Resulting spreadsheet after collate.py

[screenshot of the collated spreadsheet]

Bash preparation

When dealing with government data, it's always nice to create a complete mirror of the site, just in case it changes or gets removed at some point in the future. This is a good opportunity to practice some basic but highly reusable shell commands.

Mirror the site

Use wget to perform a recursive fetch, one level deep. In the fetched pages, convert the links to relative locations, and append .html to all webpage-like files (e.g. somepage.aspx becomes somepage.aspx.html). This partial mirror will be saved to the current local path, in a subdirectory named www.cdph.ca.gov:

wget --recursive --level=1 --adjust-extension --convert-links \
http://www.cdph.ca.gov/programs/immunize/pages/immunizationlevels.aspx

Use Amazon's AWS command-line interface tool to upload the wgotten files to S3:

aws s3 sync www.cdph.ca.gov/ s3://www.mundaneprogramming.com/sites/ --acl public-read

The resulting live Web address (if you've configured your S3 bucket to act as a static web server) would look something like this:

http://www.mundaneprogramming.com.s3.amazonaws.com/sites/programs/immunize/pages/immunizationlevels.aspx.html
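
If you haven't set that up yet, S3's static-website hosting can be enabled from the same AWS CLI. A minimal sketch, assuming index.html as the index document (adjust to your own bucket):

aws s3 website s3://www.mundaneprogramming.com/ --index-document index.html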

Description of the Python scripts

The scripts below are written for Python 3.x and were tested specifically on OS X 10.10 with Python 3.4.3 :: Anaconda 2.2.0 (x86_64). The Anaconda 3.x distribution includes all the necessary dependencies, including xlrd for reading Excel files.
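
If you're not using Anaconda, the third-party dependencies used by these scripts can be installed with pip (a minimal sketch; version pinning is up to you):

pip install xlrd requests lxml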

  • fetcher.py - Downloads the Excel spreadsheets containing kindergarten-level data from the California Department of Public Health's Immunization Levels in Child Care and Schools homepage. Files are saved into a local path, ./data-hold/xls/immunization, and are given a more uniform naming scheme, e.g. K--2013-2014.xls rather than 2013-2014%20CA%20Kindergarten%20Data.xls. The data currently dates back to the 2000-2001 school year.

  • collate.py - Assuming fetcher.py has been run, this opens each downloaded spreadsheet and compiles them all into a single dataset. Starting with the 2012-2013 school year, the column layout changes; this script reconciles both layouts and includes the union of possible columns (e.g., the mmr1_num column is simply left empty for post-2012 datasets). The data is written out in two different (but equivalent) formats: CSV and JSON. One caveat: if you open one of the downloaded spreadsheets with Excel, it creates a temporary, hidden .xlsx lock file while Excel is running, and if you then run collate.py, you'll get an error as it tries to open this shadow file. So don't open these files in Excel while running collate.py; a defensive workaround is sketched just below.
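
As a defensive tweak for that Excel caveat, the glob in collate.py could filter out Excel's lock files, whose names start with ~$. This is a hypothetical addition, not in the script as written:

# hypothetical guard: skip Excel's hidden lock files (names start with "~$")
xlsnames = [f for f in glob(os.path.join(XLS_DIR, '*.xls*'))
            if not os.path.basename(f).startswith('~$')]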

Output of collate.py

./data-hold/xls/immunization/K--2000-2001.xls has 7442 rows
./data-hold/xls/immunization/K--2001-2002.xls has 7527 rows
./data-hold/xls/immunization/K--2002-2003.xls has 7453 rows
./data-hold/xls/immunization/K--2003-2004.xls has 7377 rows
./data-hold/xls/immunization/K--2004-2005.xls has 7386 rows
./data-hold/xls/immunization/K--2005-2006.xls has 7396 rows
./data-hold/xls/immunization/K--2006-2007.xls has 7389 rows
./data-hold/xls/immunization/K--2007-2008.xls has 7334 rows
./data-hold/xls/immunization/K--2008-2009.xls has 7199 rows
./data-hold/xls/immunization/K--2009-2010.xls has 7150 rows
./data-hold/xls/immunization/K--2010-2011.xls has 7190 rows
./data-hold/xls/immunization/K--2011-2012.xls has 7351 rows
./data-hold/xls/immunization/K--2012-2013.xls has 7662 rows
./data-hold/xls/immunization/K--2013-2014.xls has 7404 rows
./data-hold/xls/immunization/K--2014-2015.xlsx has 7494 rows
There are 110264 data rows all together
Writing to JSON: ./data-hold/finished/k-immune.json
Writing to CSV: ./data-hold/finished/k-immune.csv

Sample CSV output

Sample JSON output

[
    {
        "school_type": "PUBLIC",
        "polio_num": 80.0,
        "mmr2_num": 80.0,
        "pme_num": 0.0,
        "district_code": "75101",
        "conditional_num": 0.0,
        "pbe_pct": 0.0,
        "hepb_num": 80.0,
        "pbe_num": 0.0,
        "dtp_num": 80.0,
        "uptodate_num": 80.0,
        "polio_pct": 100.0,
        "mmr2_pct": 100.0,
        "school_name": "ALISAL ELEM",
        "conditional_pct": 0.0,
        "dtp_pct": 100.0,
        "year": 2000,
        "enrollment": 80.0,
        "school_code": "6002315",
        "county": "ALAMEDA",
        "mmr1_num": 80.0,
        "hepb_pct": 100.0,
        "uptodate_pct": 100.0,
        "pme_pct": 0.0,
        "mmr1_pct": 100.0
    }
]

Reproducible data

These scripts are meant to be examples of reproducibility, specifically the kind of reproducible build that comes as part and parcel of programming. Obviously, you could write scripts to turn the collated data into visualizations. Or, more mundanely but just as importantly, back up the original files to a separate mirror, just in case the state of California changes its data offerings. Either way, the mindset is to describe the steps you have to do as code. It takes extra time at first, but then you can re-run the steps at any time to get exactly what you had before. This consistency is not just a hobgoblin of little minds: once the steps have been abstracted to code, it becomes easy, and nearly error-free, to re-collect and reorganize the data when, later in the investigation, you decide it'd be nice to include what in retrospect seemed like an irrelevant datapoint or dataset.
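
To make that concrete: as things stand, the entire pipeline can be re-run from a two-line shell script (a trivial sketch, not part of the gist itself):

python fetcher.py
python collate.py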

Case in point: the preschool data isn't currently included in these batch scripts, but it seems to be similar in structure and organization (though the earliest non-summarized data is for 2010-2011). It would be relatively easy to adjust fetcher.py and collate.py to create a dataset that includes both kinds of schools, even if you're already at the point where you've visualized or mapped the data; one possible tweak is sketched below.
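
For instance, the link-selecting XPath in fetcher.py is keyed to "Kinder"; a hypothetical adjustment would loop it over both kinds of files. The "ChildCare" href pattern here is a guess, so verify it against the live page:

# hypothetical: gather both kindergarten and child-care spreadsheet links;
# "ChildCare" is a guessed href pattern -- check the actual CDPH page first
all_urls = []
for pattern in ('Kinder', 'ChildCare'):
    all_urls += doc.xpath('//a[contains(@href, "{0}") and contains(@href, "xls")]/@href'.format(pattern))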

And speaking of mapping the data: after running collate.py, you would need to join the schools against the California Department of Education's Public Schools Data file, which contains addresses and geocoordinates. That, however, is just another script in the pipeline. And considering that the Department of Public Health names schools inconsistently, across the years and across the pre/post-2012 layouts, having a script like collate.py is pretty much essential to sanely joining the health data to the CDE's geodata.

In other words, for the immunization dataset, you need to derive the unique school identifier from this:

SCHOOL CODE  COUNTY   PUBLIC/PRIVATE  PUBLIC SCHOOL DISTRICT              CITY     SCHOOL NAME
0130419      ALAMEDA  PUBLIC          ALAMEDA COUNTY OFFICE OF EDUCATION  HAYWARD  ALAMEDA COUNTY COMMUNITY

To join against the corresponding foreign key in the public schools database:

     01100170130419

Why doesn't the immunization dataset just include the school's unique identifier, i.e. the CDSCode found in the public schools data file? Because the CDSCode is a convention created by the Department of Education, and the immunization data is generated by the Department of Public Health. Obvious, isn't it?
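
Here's a hedged sketch of how that lookup could be built from the CDE's tab-delimited pubschls.txt (fetched by the work-in-progress script at the bottom of this gist). The 'CDSCode' and 'County' column names are assumptions about that file, so check them before relying on this:

import csv

# build a lookup from (county name, 7-digit school code) to the 14-digit CDSCode
with open('./data-hold/cde/pubschls.txt', encoding='latin-1') as f:
    cds_lookup = {}
    for row in csv.DictReader(f, delimiter='\t'):
        cds = row['CDSCode']                  # e.g. '01100170130419'
        cds_lookup[(row['County'].upper(), cds[-7:])] = cds

# e.g. cds_lookup[('ALAMEDA', '0130419')]  # => '01100170130419'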

So reproducible data isn't merely an issue of transparency; it's a direct consequence of systematically expressing the data-gathering steps as code, a process that's necessary anyway for the kind of expansive data-joining operations common in investigative journalism. It's a win-win for journalist and audience.

It's also horribly boring to write that code. On the other hand, even if it takes you longer to write the code than to do the work manually, it's still a great opportunity to learn how to program, or to experiment with different libraries. For instance, I had almost never used Python's xlrd library for opening and reading Excel files. Not only is it essential for this scenario, it's an incredibly useful library that now makes it much, much easier for me to work with Excel data in the future -- and yet I would never have bothered looking at xlrd for a project on deadline.

collate.py

## This expects the fetching script (fetcher.py) to have been run.
## It compiles all of the downloaded spreadsheets into:
## - k-immune.csv
## - k-immune.json
import csv
import json
import os.path
import re
from glob import glob
from os import makedirs
from xlrd import open_workbook

XLS_DIR = "./data-hold/xls/immunization"
FINISHED_DIR = './data-hold/finished'

pre_2012_headers = ['school_code', 'county', 'school_type', 'district_code', 'school_name',
    'enrollment', 'uptodate_num', 'uptodate_pct', 'conditional_num', 'conditional_pct',
    'pme_num', 'pme_pct', 'pbe_num', 'pbe_pct', 'dtp_num', 'dtp_pct', 'polio_num',
    'polio_pct', 'mmr1_num', 'mmr1_pct', 'mmr2_num', 'mmr2_pct', 'hepb_num', 'hepb_pct',
    'vari_num', 'vari_pct']
post_2012_headers = ['school_code', 'county', 'school_type', 'district_name', 'city',
    'school_name', 'enrollment', 'uptodate_num', 'uptodate_pct', 'conditional_num', 'conditional_pct',
    'pme_num', 'pme_pct', 'pbe_num', 'pbe_pct', 'dtp_num', 'dtp_pct', 'polio_num',
    'polio_pct', 'mmr2_num', 'mmr2_pct', 'hepb_num', 'hepb_pct', 'vari_num', 'vari_pct', 'reported']
# main difference between the pre/post-2012 layouts:
# post-2012 only records 2-dose MMR, i.e. `mmr2_num`/`mmr2_pct` (no `mmr1_*` columns)
makedirs(XLS_DIR, exist_ok=True)
makedirs(FINISHED_DIR, exist_ok=True)
data = []
for xlsname in glob(os.path.join(XLS_DIR, '*.xls*')):
    # extract the year numbers from the file name,
    # e.g. "2006" and "2007" from "K--2006-2007.xls"
    yr_1, yr_2 = re.search(r'(\d{4})-(\d{4})', xlsname).groups()
    year = int(yr_1)
    headers = pre_2012_headers if year < 2012 else post_2012_headers
    # open the Excel workbook and use the first non-empty sheet
    book = open_workbook(xlsname)
    sheet = [s for s in book.sheets() if s.nrows > 0][0]
    print(xlsname, "has", sheet.nrows, "rows")
    # iterate the rows, skipping the header row (0) and the final row
    for x in range(1, sheet.nrows - 1):
        row = sheet.row_values(x)
        # keep only rows whose first cell looks like a 7-digit school code
        if re.search(r'\d{7}', str(row[0])):
            d = dict(zip(headers, row))
            d['year'] = year
            data.append(d)
print("There are", len(data), 'data rows all together')
# write the JSON file
# Note: fields that don't exist in a given layout are *not* included as null values;
# they are simply left out of each dict
jname = os.path.join(FINISHED_DIR, 'k-immune.json')
print("Writing to JSON:", jname)
with open(jname, "w") as jfile:
    jfile.write(json.dumps(data, indent=4))
# write the CSV file
cname = os.path.join(FINISHED_DIR, 'k-immune.csv')
print("Writing to CSV:", cname)
# sorted() makes the column order deterministic (a bare set would not be)
with open(cname, 'w', encoding='utf-8', newline='') as cfile:
    writer = csv.DictWriter(cfile,
                            fieldnames=sorted(set(pre_2012_headers + post_2012_headers + ['year'])),
                            delimiter=',')
    writer.writeheader()
    for d in data:
        writer.writerow(d)
fetcher.py

# Just fetches the kindergarten spreadsheets from the California CDPH site
import re
import os.path
import requests
from urllib.parse import urljoin
from lxml import html
from os import makedirs

XLS_DIR = "./data-hold/xls/immunization"
INDEX_URL = "http://www.cdph.ca.gov/programs/immunize/pages/immunizationlevels.aspx"
makedirs(XLS_DIR, exist_ok=True)
# Download the HTML listing and select every link to a Kindergarten .xls/.xlsx file
response = requests.get(INDEX_URL)
doc = html.fromstring(response.text)
all_urls = doc.xpath('//a[contains(@href, "Kinder") and contains(@href, "xls")]/@href')
for url in all_urls:
    # y1 and y2 are the two-digit years, e.g. "05" and "06" from
    # either a "2005-2006" or a "2005-06" style filename
    y1, y2 = re.search(r'(\d{2})-(?:\d{2})?(\d{2})', url).groups()
    ext = os.path.splitext(url)[1]
    # Now rename to a proper, uniform year-based filename,
    # e.g. ./data-hold/xls/immunization/K--2005-2006.xls
    oname = os.path.join(XLS_DIR, "K--20{0}-20{1}{2}".format(y1, y2, ext))
    full_url = urljoin(INDEX_URL, url)
    print("Downloading:\n {0}\n into: {1}".format(full_url, oname))
    # Sample output:
    # Downloading:
    #   http://www.cdph.ca.gov/programs/immunize/Documents/2007-2008%20CA%20Kindergarten%20Data.xls
    #   into: ./data-hold/xls/immunization/K--2007-2008.xls
    xlsfile = requests.get(full_url)
    with open(oname, 'wb') as ofile:
        ofile.write(xlsfile.content)

Work-in-progress script: CDE schools lookup

# In progress... when done, this script should produce a lookup table to find a CDS code given a school code and county name/code
# Ideally, this would run _after_ the collate.py step, which could attach CDScodes to the health data...however
# there are a number of complexities, foremost being that school districts change over the years, and the CDE only
# has the file for the current date. So a decision has to be made by the user on how much to care about tracking
# historical data across different schools.
# Landing page
# http://www.cde.ca.gov/ds/si/ds/pubschls.asp
# More about schools
# http://www.cde.ca.gov/ds/si/ds/
from os import makedirs
from urllib.request import urlretrieve
import csv
import os.path
import shutil

DATA_DIR = './data-hold/cde'
DATA_FILES = {
    'district': {
        'url': 'http://www.cde.ca.gov/ds/si/ds/documents/legdist2014.xls',
        'local': os.path.join(DATA_DIR, 'legdist2014.xls')
    },
    'schools': {
        'url': 'ftp://ftp.cde.ca.gov/demo/schlname/pubschls.txt',
        'local': os.path.join(DATA_DIR, 'pubschls.txt')
    }
}
# TODO: pubschls gets updated daily; might be worth timestamping it upon each download
makedirs(DATA_DIR, exist_ok=True)
for d in DATA_FILES.values():
    # urlretrieve downloads to a temporary file and returns its path;
    # copy that temp file into the desired local location
    resp = urlretrieve(d['url'])
    shutil.copy(resp[0], d['local'])
    print("Copied", d['url'], 'to:', d['local'])
# Now work with the local schools file:
# TODO
# txt = open(DATA_FILES['schools']['local'], encoding='latin-1').read()
# rows = list(csv.DictReader(txt.splitlines(), delimiter="\t"))