# coding: utf-8
# # An exploration of the broadband speeds in Cali
#
# It is always interesting that data may exist but be prohibitively expensive or lack the detail required.
#
# Inspired by the recent post from the [Financial Times](https://ig.ft.com/gb-broadband-speed-map/#methodology), can we develop an analogous analysis for California?
#
# Our methodology is slightly different, following the steps below to disaggregate the data collected by the FCC.
#
# Could this data be a treasure trove for consumers and operators? [link](https://morningconsult.com/opinions/fcc-form-477-is-a-marketing-opportunity-not-a-regulatory-burden/)
#
# ## Create e2e test case for California
# 1. Get form 477 data
# 2. Source the census blocks
# 3. Source address points, streets, buildings etc
# 4. Join the Form 477 data to the polygon census blocks by ID (a join sketch follows this list)
# 5. Cut the data to each census block using ogr2ogr or a quadtree methodology
# 6. Develop feature extraction capability. This may be as simple as counts or as complex as distance from the nearest street for addresses.
# 7. Create train, validate, test data sets (Not sure how we should sample: areas of different sizes may be misrepresented, so we will need to examine the distribution. Might need to do a research review on this? [example approach](https://www.esri.com/esri-news/arcuser/spring-2013/unequal-probability-based-spatial-sampling))
# 8. Build test model
# 9. Iterate on features
# ...
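#
# A sketch of the step 4 join, assuming the Form 477 block FIPS lives in a column named `BlockCode` (check the CSV header) and the TIGER blocks carry it as `GEOID10`; both need to be 15-character zero-padded strings for the keys to match:
#
#     blocks_with_477 = census_blocks_gis.merge(
#         ca_dec_15_df, left_on="GEOID10", right_on="BlockCode", how="left")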
# In[1]:
# Step 1: Get the data
# Turns out there are multiple years of data. We are interested in the most
# recent, but we should also examine the trends and data quality.
# It's also possible to get the data that includes satellite service too. Let's ignore that.
# The data is also released with some lag, and there are only 2 sample years at the moment.
# Hopefully we can revisit and update soon, with new data!
all_2015_dec_v2 = "http://transition.fcc.gov/form477/BroadbandData/Fixed/Dec15/Version%202/US-Fixed-without-Satellite-Dec2015.zip"
ca_2015_dec_v2 = "https://www.fcc.gov/form477/BroadbandData/Fixed/Dec15/Version%202/CA-Fixed-Dec2015.zip"
ca_2015_jun_v2 = "https://www.fcc.gov/form477/BroadbandData/Fixed/Jun15/Version%202/CA-Fixed-Jun2015.zip"
ca_2014_dec_v2 = "https://www.fcc.gov/form477/BroadbandData/Fixed/Dec14/Version2/CA-Fixed-Dec2014.zip"
# In[2]:
get_ipython().system('curl -LO "$ca_2015_dec_v2" && unzip "CA-Fixed-Dec2015.zip" && curl -LO "$ca_2015_jun_v2" && unzip "CA-Fixed-Jun2015.zip" && curl -LO "$ca_2014_dec_v2" && unzip "CA-Fixed-Dec2014.zip"')
# In[1]:
import pandas as pd
import geopandas as gpd
# In[2]:
ca_dec_15_df = pd.read_csv("CA-Fixed-Dec2015-v2.csv")
ca_jun_15_df = pd.read_csv("CA-Fixed-Jun2015-v2.csv")
ca_dec_14_df = pd.read_csv("CA-Fixed-Dec2014-v2.csv")
# In[3]:
ca_dec_15_df.head()
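# In[ ]:
# One gotcha before any joins: California FIPS codes start with "06", and
# pandas drops the leading zero when it infers an integer dtype, which would
# silently break a merge against the TIGER GEOID10 strings. Assuming the
# block column is named "BlockCode" (verify with ca_dec_15_df.columns),
# zero-pad it back to the full 15 characters:
for df in (ca_dec_15_df, ca_jun_15_df, ca_dec_14_df):
    df["BlockCode"] = df["BlockCode"].astype(str).str.zfill(15)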
# In[4]:
# OK so now we have downloaded the block SHP file from the TIGER site
# https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2018&layergroup=Blocks+%282010%29
get_ipython().run_line_magic('matplotlib', 'inline')
census_blocks_gis = gpd.read_file("tl_2018_06_tabblock10.shp")
# In[5]:
# Let's make sure the geometries parsed into shapely objects
census_blocks_gis.geometry[0]
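# In[ ]:
# TIGER shapefiles come in geographic coordinates (NAD83, EPSG:4269), which
# won't do for the distance-based features planned in step 6. A sketch of
# reprojecting; EPSG:3310 (California Albers, in metres) is assumed here as
# a reasonable statewide choice.
print(census_blocks_gis.crs)
census_blocks_albers = census_blocks_gis.to_crs(epsg=3310)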
# In[6]:
# Next let's get the address data
add_west_url = "https://s3.amazonaws.com/data.openaddresses.io/openaddr-collected-us_west.zip"
get_ipython().system('curl -LO "$add_west_url"')
# In[7]:
get_ipython().system('unzip "openaddr-collected-us_west.zip"')
# In[17]:
import os
os.getcwd()
# In[16]:
# Grab the header row from one of the county files to start the combined CSV.
get_ipython().system('sed -n 1p ./us/ca/humboldt.csv > all_ca.csv')
# In[18]:
# Append every county file without its header. Plain "sed 1d" treats the
# globbed files as one stream and only drops the very first line, so the
# GNU sed -s flag is needed to strip the header from each file separately.
get_ipython().system('sed -s 1d ./us/ca/*.csv >> all_ca.csv')
# In[19]:
address_df = pd.read_csv("all_ca.csv")
# In[20]:
address_df.head()
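# In[ ]:
# Promote the address table to a GeoDataFrame so it can be spatially joined
# to the census blocks later. This assumes the OpenAddresses coordinate
# columns are named LON and LAT and are in WGS84 (verify with
# address_df.columns).
from shapely.geometry import Point
address_gdf = gpd.GeoDataFrame(
    address_df,
    geometry=[Point(xy) for xy in zip(address_df["LON"], address_df["LAT"])],
    crs="EPSG:4326",
)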
# In[24]:
# Now let's get some building footprint data
building_url = "https://usbuildingdata.blob.core.windows.net/usbuildings-v1-1/California.zip"
get_ipython().system('curl -LO "$building_url" && unzip "California.zip"')
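# In[ ]:
# The Microsoft building footprints ship as GeoJSON; the file inside the zip
# is assumed here to be named California.geojson (check the unzip output).
# It is a big file, so expect the read to be slow and memory-hungry.
buildings_gdf = gpd.read_file("California.geojson")
buildings_gdf.head()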
# In[27]:
# Now let's get the street data from the Geofabrik OSM extracts
north_ca_link = "https://download.geofabrik.de/north-america/us/california/norcal-latest-free.shp.zip"
south_ca_link = "https://download.geofabrik.de/north-america/us/california/socal-latest-free.shp.zip"
get_ipython().system('curl -LO "$north_ca_link" && curl -LO "$south_ca_link"')
# In[31]:
get_ipython().system('mkdir "norcal" && mkdir "socal" && unzip -o "norcal-latest-free.shp.zip" -d "norcal" && unzip -o "socal-latest-free.shp.zip" -d "socal"')
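# In[ ]:
# Load the OSM road layers from the two Geofabrik extracts and stack them
# into one statewide layer. The "free" shapefile bundles are assumed to
# include a roads layer named gis_osm_roads_free_1.shp (list the unzipped
# directories to confirm the name).
norcal_roads = gpd.read_file("norcal/gis_osm_roads_free_1.shp")
socal_roads = gpd.read_file("socal/gis_osm_roads_free_1.shp")
ca_roads = pd.concat([norcal_roads, socal_roads], ignore_index=True)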
# ## Data gathering complete
#
# We now have the following:
#
# 1. 477 data
# 2. census blocks
# 3. address points
# 4. building outlines
# 5. OSM streets etc.
#
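# In[ ]:
# A first feature-extraction sketch for step 6: count address points per
# census block with a spatial join, assuming address_gdf was built from the
# OpenAddresses table as above and the block IDs live in GEOID10. Treat
# this as a starting point, not a tuned pipeline.
joined = gpd.sjoin(
    address_gdf.to_crs(census_blocks_gis.crs),
    census_blocks_gis[["GEOID10", "geometry"]],
    how="inner",
    predicate="within",
)
addresses_per_block = joined.groupby("GEOID10").size().rename("address_count")
addresses_per_block.head()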