Import libraries

import pandas as pd
import progressbar
import pprint
import requests
from bs4 import BeautifulSoup

Read postal codes database

Reference ZIP database obtained from simplemaps.com. The database contains the following fields:

  • zip: The 5-digit zip code assigned by the U.S. Post Office.
  • lat: The latitude of the zip code.
  • lng: The longitude of the zip code.
  • city: The official USPS city name.
  • state_id: The official USPS state abbreviation.
  • state_name: The state's name. Blank string for military zip codes where state_id is AA, AE, or AP.
  • zcta: TRUE if the zip code is a ZIP Code Tabulation Area.
  • parent_zcta: The ZCTA that contains this zip code. Only exists if zcta is FALSE. Useful for making inferences about a zip code that is a point from the ZCTA that contains it.
  • population: An estimate of the zip code's population. Only exists if zcta is TRUE.
  • county_fips: The primary county that contains the zip code in the FIPS format.
  • county_name: The name of the county_fips.
  • county_weight: The percentage of the zip code that is within the county (by population).
  • all_county_weights: A JSON dictionary listing all counties and weights associated with the zip code.
  • imprecise: TRUE if the lat/lng has been geolocated using the city (rare).
  • military: TRUE if the zip code is used by the US Military (lat/lng not available).
# when reading the dataframe, specify that the zip code is a string so the 0s at the beginning of the zip code are not lost
POSTAL_CODES_PATH = './postalcodes.csv'
zip_df = pd.read_csv(POSTAL_CODES_PATH, dtype={'zip': str})
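
A quick sanity check (a sketch) that reading zip as a string preserved the leading zeros:

# sketch: leading zeros should be kept (e.g. '00501' rather than 501)
print(zip_df['zip'].head(3).tolist())
print(zip_df['zip'].str.len().unique())  # all codes should be 5 characters long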

Check missing information

Looking at the data in the dataframe we can see that we have no missing information regarding the features: zip, city, state_id, imprecise and military.

pp = pprint.PrettyPrinter(indent=4)
print('Postal codes dataframe shape: ', zip_df.shape)
print('Missing information in % rounded to 2 decimals')
pp.pprint((zip_df.isna().sum() / zip_df.shape[0]).round(2))
Postal codes dataframe shape:  (41682, 14)
Missing information in % rounded to 2 decimals
zip                   0.00
lat                   0.01
lng                   0.01
city                  0.00
state_id              0.00
state_name            0.01
zcta                  0.00
parent_zcta           0.81
population            0.21
county_fips           0.21
county_name           0.21
all_county_weights    0.21
imprecise             0.00
military              0.00
dtype: float64

Remove unused columns

Most of the fields in the database will not be used and can therefore be removed from the dataframe. Some of them have no use for scraping the EWG Tap Water database, and others, such as county_name, are missing for too many zip codes (more than 20%).

unused_fields = ['zcta', 'parent_zcta', 'county_fips', 'county_name', 'all_county_weights', 'imprecise', 'military']
zip_df.drop(unused_fields, axis=1, inplace=True)

Remaining structure

zip_df.head()
zip lat lng city state_id state_name population
0 00501 40.8133 -73.0476 Holtsville NY New York NaN
1 00544 40.8133 -73.0476 Holtsville NY New York NaN
2 00601 18.1800 -66.7522 Adjuntas PR Puerto Rico 18570.0
3 00602 18.3607 -67.1752 Aguada PR Puerto Rico 41520.0
4 00603 18.4544 -67.1220 Aguadilla PR Puerto Rico 54689.0

State name correction

Although the state_id is available for all zip codes, the state_name is missing for some of them.

pp.pprint(zip_df.state_id.unique())
array(['NY', 'PR', 'VI', 'MA', 'RI', 'NH', 'ME', 'VT', 'CT', 'NJ', 'AE',
       'PA', 'DE', 'DC', 'VA', 'MD', 'WV', 'NC', 'SC', 'GA', 'FL', 'AA',
       'AL', 'TN', 'MS', 'KY', 'OH', 'IN', 'MI', 'IA', 'WI', 'MN', 'SD',
       'ND', 'MT', 'IL', 'MO', 'KS', 'NE', 'LA', 'AR', 'OK', 'TX', 'CO',
       'WY', 'ID', 'UT', 'AZ', 'NM', 'NV', 'CA', 'AP', 'HI', 'AS', 'GU',
       'PW', 'FM', 'MP', 'MH', 'OR', 'WA', 'AK'], dtype=object)
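
A more direct check (a sketch) lists the state_id values whose state_name is missing; per the dictionary below these should be the military codes AE, AA and AP:

# sketch: state abbreviations that have no state_name in the database
zip_df.loc[zip_df['state_name'].isna(), 'state_id'].unique()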

By creating a dictionary of state_id to state_name we can see which state names are missing. They are:

  • Armed Forces Africa, Canada, Europe, Middle East = AE
  • Armed Forces Americas (except Canada) = AA
  • Armed Forces Pacific = AP
states_dictionary = dict(zip(zip_df.state_id,zip_df.state_name))
pp.pprint(states_dictionary)
{   'AA': nan,
    'AE': nan,
    'AK': 'Alaska',
    'AL': 'Alabama',
    'AP': nan,
    'AR': 'Arkansas',
    'AS': 'American Samoa',
    'AZ': 'Arizona',
    'CA': 'California',
    'CO': 'Colorado',
    'CT': 'Connecticut',
    'DC': 'District of Columbia',
    'DE': 'Delaware',
    'FL': 'Florida',
    'FM': 'Federated States of Micronesia',
    'GA': 'Georgia',
    'GU': 'Guam',
    'HI': 'Hawaii',
    'IA': 'Iowa',
    'ID': 'Idaho',
    'IL': 'Illinois',
    'IN': 'Indiana',
    'KS': 'Kansas',
    'KY': 'Kentucky',
    'LA': 'Louisiana',
    'MA': 'Massachusetts',
    'MD': 'Maryland',
    'ME': 'Maine',
    'MH': 'Marshall Islands',
    'MI': 'Michigan',
    'MN': 'Minnesota',
    'MO': 'Missouri',
    'MP': 'Northern Mariana Islands',
    'MS': 'Mississippi',
    'MT': 'Montana',
    'NC': 'North Carolina',
    'ND': 'North Dakota',
    'NE': 'Nebraska',
    'NH': 'New Hampshire',
    'NJ': 'New Jersey',
    'NM': 'New Mexico',
    'NV': 'Nevada',
    'NY': 'New York',
    'OH': 'Ohio',
    'OK': 'Oklahoma',
    'OR': 'Oregon',
    'PA': 'Pennsylvania',
    'PR': 'Puerto Rico',
    'PW': 'Palau',
    'RI': 'Rhode Island',
    'SC': 'South Carolina',
    'SD': 'South Dakota',
    'TN': 'Tennessee',
    'TX': 'Texas',
    'UT': 'Utah',
    'VA': 'Virginia',
    'VI': 'Virgin Islands',
    'VT': 'Vermont',
    'WA': 'Washington',
    'WI': 'Wisconsin',
    'WV': 'West Virginia',
    'WY': 'Wyoming'}

Let's fill in the missing state_name values using a dictionary:

missing_state = {
    'AE': 'Armed Forces Africa, Canada, Europe, Middle East', 
    'AA': 'Armed Forces Americas (except Canada)',
    'AP': 'Armed Forces Pacific'
}

zip_df['state_name'] = zip_df['state_name'].fillna(zip_df['state_id'].map(missing_state))

We can check that the problem with state_name is solved. There are still a few missing values for lat and lng, around 1% of the total. We also have missing values in the population field, which we will not tackle for the moment.

zip_df.isna().sum()
zip              0
lat            579
lng            579
city             0
state_id         0
state_name       0
population    8583
dtype: int64
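
To see which rows are missing coordinates we can filter on lat (a sketch; these rows still have their zip, city and state information, as shown by the counts above):

# sketch: inspect the rows with missing lat/lng
zip_df[zip_df['lat'].isna()].head()
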
zip_df[zip_df['state_id'] == 'PR'].head()
zip lat lng city state_id state_name population
2 00601 18.1800 -66.7522 Adjuntas PR Puerto Rico 18570.0
3 00602 18.3607 -67.1752 Aguada PR Puerto Rico 41520.0
4 00603 18.4544 -67.1220 Aguadilla PR Puerto Rico 54689.0
5 00604 18.5006 -67.1359 Aguadilla PR Puerto Rico NaN
6 00605 18.4587 -67.1475 Aguadilla PR Puerto Rico NaN

Scraping the EWG Tap Water database

The request format used to gather the initial information about a zip code is as follows:

https://www.ewg.org/tapwater/search-results.php?zip5=ZIP_CODE&searchtype=zip
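
For example, a single search request can be built and fetched like this (a sketch; 00501 is one of the zip codes used later in this notebook):

# sketch: build and fetch the search URL for one zip code
zip_code = '00501'
url = 'https://www.ewg.org/tapwater/search-results.php?zip5=' + zip_code + '&searchtype=zip'
r = requests.get(url)
print(r.status_code)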

Handling missing data

Checking some of the zip codes in the dataframe, it seems they are not present in the EWG Tap Water database; in particular, those referring to Puerto Rico are missing. Therefore, we need to handle the case where there are no results for a specific zip code.

Inspecting the source code of the resulting error page, we look for:

<h2>No systems found that match your search</h2>

As the search can also be performed by selecting a state from a list, we can see that the only states which can be selected are the following:

Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, District Of Columbia, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin and Wyoming

Apparently there is no information for the rest of the states and territories included in this zip database.

valid_states = [
    'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 
    'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 
    'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 
    'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 
    'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 
    'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 
    'West Virginia', 'Wisconsin', 'Wyoming']
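
As a sanity check (a sketch), we can list the state names present in the zip database that are not among the valid_states and will therefore be discarded:

# sketch: states/territories in the zip database that cannot be searched on EWG
sorted(set(zip_df['state_name'].unique()) - set(valid_states))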

We can check how many rows we have for each of the valid_states:

for state in valid_states:
    print(state, str((zip_df['state_name'] == state).sum()))
Alabama 811
Alaska 273
Arizona 524
Arkansas 704
California 2589
Colorado 643
Connecticut 425
Delaware 96
District of Columbia 291
Florida 1472
Georgia 951
Hawaii 137
Idaho 319
Illinois 1569
Indiana 957
Iowa 1055
Kansas 747
Kentucky 945
Louisiana 719
Maine 485
Maryland 603
Massachusetts 681
Michigan 1158
Minnesota 963
Mississippi 531
Missouri 1155
Montana 404
Nebraska 617
Nevada 253
New Hampshire 281
New Jersey 722
New Mexico 427
New York 2149
North Carolina 1080
North Dakota 407
Ohio 1415
Oklahoma 763
Oregon 481
Pennsylvania 2174
Rhode Island 90
South Carolina 534
South Dakota 385
Tennessee 785
Texas 2595
Utah 344
Vermont 308
Virginia 1214
Washington 715
West Virginia 851
Wisconsin 882
Wyoming 194
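
The same counts can be obtained with a single pandas expression (a sketch equivalent to the loop above):

# sketch: row counts per state, restricted to the valid states
zip_df.loc[zip_df['state_name'].isin(valid_states), 'state_name'].value_counts()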

Everything looks reasonable, so we can proceed to discard all zip codes whose state_name is not in the list of valid_states.

print('Initial number of zip codes: ', zip_df.shape[0])
zip_df = zip_df[zip_df['state_name'].isin(valid_states)]
print('Final number of zip codes: ', zip_df.shape[0])
Initial number of zip codes:  41682
Final number of zip codes:  40873

So, we have discarded 809 zip codes whose state name was not in the list of valid states for the EWG Tap Water database. This does not guarantee that every remaining zip code will return results, but we have eliminated a known source of problems by discarding these zip codes beforehand.

Scraping process

BASE_URL = 'https://www.ewg.org/tapwater/'
SEARCH_URL_START = 'search-results.php?zip5='
SEARCH_URL_END = '&searchtype=zip'

url = 'https://www.ewg.org/tapwater/search-results.php?zip5=96799&searchtype=zip'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

Error handling getting data

The error condition occurs when we find the following h2 tag:

<h2>No systems found that match your search</h2>
def got_results_from_url(soup, url):
    # the error page contains this h2 tag when a zip code returns no results
    error = soup.find('h2', text='No systems found that match your search')
    return error is None
got_results_from_url(soup, url)
False

Processing valid results

zip_df.head()
zip lat lng city state_id state_name population
0 00501 40.8133 -73.0476 Holtsville NY New York NaN
1 00544 40.8133 -73.0476 Holtsville NY New York NaN
195 01001 42.0626 -72.6259 Agawam MA Massachusetts 16769.0
196 01002 42.3749 -72.4621 Amherst MA Massachusetts 29049.0
197 01003 42.3919 -72.5248 Amherst MA Massachusetts 10372.0

1. Search results processing

def generate_url_from_zip(zip_value):
    return BASE_URL + SEARCH_URL_START + zip_value + SEARCH_URL_END

def get_population(people_served_tag):
    return int(people_served_tag.replace('Population served:', '').replace(',',''))

def get_city(element):
    return element.text.split(',')[0].strip()

def get_state(element):
    return element.text.split(',')[1].strip()

def get_city_and_state(element):
    split_element = element.text.split(',')
    if len(split_element) == 2:
        return split_element[0].strip(), split_element[1].strip()
    else:
        return split_element[0].strip(), '-'

def extract_info_from_row(elements):
    row_info = {}
    row_info['url'] = BASE_URL + elements[0].find('a')['href']
    row_info['utility_name'] = elements[0].text
    row_info['city'], row_info['state'] = get_city_and_state(elements[1])
    row_info['people_served'] = get_population(elements[2].text)
    return row_info

def process_results(results, zip_value):
    zip_results = []
    result_rows = results.find_all('tr')
    for row in result_rows:
        elements = row.find_all('td')
        if elements:
            element = extract_info_from_row(elements)
            element['zip'] = zip_value
            zip_results.append(element)
    return zip_results

def process_zip(zip_value):
    url = generate_url_from_zip(zip_value)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')

    if got_results_from_url(soup, url):
        results = soup.find_all('table', {'class': 'search-results-table'})
        # NOTE: there are two search-results-table elements; the first one shows the
        # largest utilities serving the county, while the second one is more complete
        # and includes utilities serving the searched zip and the surrounding county.
        # The process is applied only to the LARGEST UTILITIES table, which is the
        # first result.
        return process_results(results[0], zip_value)
    else:
        return []
zip_results = process_zip('00501')
zip_results
[{'url': 'https://www.ewg.org/tapwater/system.php?pws=NY5110526',
  'utility_name': 'Suffolk County Water Authority',
  'city': 'Hauppauge',
  'state': 'NY',
  'people_served': 1137108,
  'zip': '00501'},
 {'url': 'https://www.ewg.org/tapwater/system.php?pws=NY5103263',
  'utility_name': 'South Huntington Water Department',
  'city': 'South Huntington',
  'state': 'NY',
  'people_served': 81760,
  'zip': '00501'},
 {'url': 'https://www.ewg.org/tapwater/system.php?pws=NY5103271',
  'utility_name': 'Greenlawn WD',
  'city': 'Greenlawn',
  'state': 'NY',
  'people_served': 42000,
  'zip': '00501'},
 {'url': 'https://www.ewg.org/tapwater/system.php?pws=NY5103705',
  'utility_name': 'Riverhead WD',
  'city': 'Riverhead',
  'state': 'NY',
  'people_served': 37050,
  'zip': '00501'},
 {'url': 'https://www.ewg.org/tapwater/system.php?pws=NY5103276',
  'utility_name': 'Dix Hills Water District',
  'city': 'Huntington',
  'state': 'NY',
  'people_served': 34522,
  'zip': '00501'}]

2. Processing each utility name to gather information about contaminants

The information we are looking for is in the following HTML tags:

<ul class="contaminants-list" id="contams_above_hbl">
<ul class="contaminants-list" id="contams_other">

The first one includes "chemicals detected in 2015 for which annual utility averages exceeded an EWG-selected health guideline established by a federal or state public health authority; chemicals detected under the EPA's Unregulated Contaminant Monitoring Rule (UCMR 3) program in 2013 to 2015, for which annual utility averages exceeded a health guideline established by a federal or state public health authority; perfluorinated chemicals."

The second section includes "chemicals detected under the EPA's Unregulated Contaminant Monitoring Rule (UCMR 3) program in 2013 to 2015, for which annual utility averages were lower than an EWG-selected health guideline established by a federal or state public health authority."

def get_contaminants(soup, contaminant_type):
    section = soup.find('ul', {'class': 'contaminants-list', 'id': contaminant_type})
    contaminants_type = section.find_all('div', {'class': 'contaminant-name'})
    contaminants = []
    for contaminant in contaminants_type:
        contaminants.append(contaminant.find('h3').text)
    return contaminants

def get_contaminants_above_hbl(soup):
    return get_contaminants(soup, 'contams_above_hbl')

def get_contaminants_other(soup):
    return get_contaminants(soup, 'contams_other')    
url = 'https://www.ewg.org/tapwater/system.php?pws=NY5110526'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

get_contaminants_above_hbl(soup)
['1,2,3-Trichloropropane', 'Chromium (hexavalent)', 'Perfluorinated chemicals']
get_contaminants_other(soup)
['1,1-Dichloroethane',
 '1,4-Dioxane',
 'Chlorate',
 'Chlorodifluoromethane',
 'Chromium (total)',
 'Cobalt',
 'Molybdenum',
 'Strontium',
 'Vanadium']

3. Try the full process on a subset of the dataframe

Let's try to apply the full process to a small number of randomly selected zip codes from the dataset:

zip_df_small = zip_df.sample(3, random_state=8)
def scrap_ewg_tap_water_database(df):
    data = []
    
    # Step 1: get information about the utilities in each zip code    
    for zip_code in df['zip']:
        utilities = process_zip(zip_code)
        data = data + utilities
        
    # Step 2: for each utility obtain the contaminants
    for utility in data:
        r = requests.get(utility['url'])
        soup = BeautifulSoup(r.content, 'html.parser')
        print('Getting contaminants from: ', utility['url'])
        utility['contaminants_above_hbl'] = get_contaminants_above_hbl(soup)
        utility['contaminants_other'] = get_contaminants_other(soup)
    return data
ewg_tap_water = scrap_ewg_tap_water_database(zip_df_small)
ewg_tap_water_df = pd.DataFrame(ewg_tap_water)
ewg_tap_water_df.head()
for contaminant in ewg_tap_water_df['contaminants_other']:
    print(contaminant)

4. Adjusting the process

Looking at the results of this data sample we can observe two things:

  • First, some of the state data is missing; we can fill it in using the zip database, which is complete, and omit the state information scraped from the website.
  • Second, the website encodes some notes using the same tags as the contaminants, e.g. 'The following chemicals were detected by Tell City Water Department, a supplier of water to Troy Township Water Association'; these should be cleaned up, for example by filtering out entries longer than a given length (e.g. > 80 characters).

Let's implement these two changes and put all important source code together

IMPORTANT: the scraping process takes a long time

def generate_url_from_zip(zip_value):
    return BASE_URL + SEARCH_URL_START + zip_value + SEARCH_URL_END

def get_population(people_served_tag):
    return int(people_served_tag.replace('Population served:', '').replace(',',''))

def get_city(element):
    return element.text.split(',')[0].strip()

def extract_info_from_row(elements):
    row_info = {}
    row_info['url'] = BASE_URL + elements[0].find('a')['href']
    row_info['utility_name'] = elements[0].text
    row_info['city'] = get_city(elements[1])
    row_info['people_served'] = get_population(elements[2].text)
    return row_info

def process_results(results, zip_value, state_id):
    zip_results = []
    result_rows = results.find_all('tr')
    for row in result_rows:
        elements = row.find_all('td')
        if elements:
            element = extract_info_from_row(elements)
            element['zip'] = zip_value
            element['state'] = state_id
            zip_results.append(element)
    return zip_results

def process_zip(zip_value, state_id):
    url = generate_url_from_zip(zip_value)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')

    if got_results_from_url(soup, url):
        results = soup.find_all('table', {'class': 'search-results-table'})
        # NOTE: there are two search-results-table elements; the first one shows the
        # largest utilities serving the county, while the second one is more complete
        # and includes utilities serving the searched zip and the surrounding county.
        # The process is applied only to the LARGEST UTILITIES table, which is the
        # first result.
        return process_results(results[0], zip_value, state_id)
    else:
        return []
    
def get_contaminants(soup, contaminant_type):
    section = soup.find('ul', {'class': 'contaminants-list', 'id': contaminant_type})
    if section:
        contaminants_type = section.find_all('div', {'class': 'contaminant-name'})
        contaminants = []
        for contaminant in contaminants_type:
            contaminant_name = contaminant.find('h3').text
            if len(contaminant_name) < 80:
                contaminants.append(contaminant.find('h3').text)
        return contaminants
    else:
        return []
    
def get_contaminants_above_hbl(soup):
    return get_contaminants(soup, 'contams_above_hbl')

def get_contaminants_other(soup):
    return get_contaminants(soup, 'contams_other')  

def get_all_contaminants(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    
    contaminants_above_hbl = get_contaminants_above_hbl(soup)
    contaminants_other = get_contaminants_other(soup)
    
    return (contaminants_above_hbl, contaminants_other)
    
def scrap_contaminants_from_df(df):
    contaminants_rows = []
   
    status = 0
    bar = progressbar.ProgressBar(max_value=df.shape[0])
    
    for index, utility in df.iterrows():
        # percentage of completion
        bar.update(status)        
        status = status + 1
        
        r = requests.get(utility['url'])
        soup = BeautifulSoup(r.content, 'html.parser')
        
        row = {}
        row['zip'] = utility['zip']
        row['city'] = utility['city']        
        row['contaminants_above_hbl'] = get_contaminants_above_hbl(soup)
        row['contaminants_other'] = get_contaminants_other(soup)
        contaminants_rows.append(row)
    bar.finish()
    
    return contaminants_rows
    
def scrap_ewg_tap_water_database(df):
    data = []
       
    status = 0
    bar = progressbar.ProgressBar(max_value=df.shape[0])
    
    # Step 1: get information about the utilities in each zip code    
    for index, row in df.iterrows():
        # percentage of completion
        bar.update(status)        
        status = status + 1
        
        utilities = process_zip(row['zip'], row['state_id'])
        data = data + utilities
    bar.finish()
    
    # Let's save this to a CSV just in case the second process does not work
    utilities_df = pd.DataFrame(data)
    utilities_df.to_csv('utilities.csv', index=False)
        
    # Step 2: for each utility obtain the contaminants
    status = 0
    bar = progressbar.ProgressBar(max_value=len(data))
    for utility in data:
        # percentage of completion
        bar.update(status)        
        status = status + 1
        
        r = requests.get(utility['url'])
        soup = BeautifulSoup(r.content, 'html.parser')
        utility['contaminants_above_hbl'] = get_contaminants_above_hbl(soup)
        utility['contaminants_other'] = get_contaminants_other(soup)
    bar.finish()
    
    return data

5. Scrape the full database

# IMPORTANT NOTE: THIS PROCESS TAKES A LONG TIME - UNCOMMENT IF YOU WANT TO PROCEED
# ewg_tap_water = scrap_ewg_tap_water_database(zip_df)
# ewg_tap_water_df = pd.DataFrame(ewg_tap_water)
# ewg_tap_water_df.to_csv('ewg.csv')
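
Once the scrape has finished, the saved files can be reloaded for analysis (a sketch; it assumes the CSV files above were written, and reads zip as a string as before):

# sketch: reload the saved results (uncomment once the files exist)
# utilities_df = pd.read_csv('utilities.csv', dtype={'zip': str})
# ewg_tap_water_df = pd.read_csv('ewg.csv', index_col=0, dtype={'zip': str})  # index_col=0 because ewg.csv was saved with its index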