# coding: utf-8
# # An exploration of the broadband speeds in Cali
#
# It is always interesting that data may exist but be prohibitively expensive or lack the detail required.
#
# Inspired by the recent post from the [Financial Times](https://ig.ft.com/gb-broadband-speed-map/#methodology), can we develop an analogous analysis for California?
#
# Our methodology is slightly different, following the steps below to disaggregate the data collected by the FCC.
#
# Could this data be a treasure trove for consumers and operators? [link](https://morningconsult.com/opinions/fcc-form-477-is-a-marketing-opportunity-not-a-regulatory-burden/)
#
# ## Create e2e test case for California
# 1. Get form 477 data
# 2. Source the census blocks
# 3. Source address points, streets, buildings etc
# 4. Join the Form 477 data to the polygon census blocks by ID (a join sketch follows this list)
# 5. Cut the data to each census block using ogr2ogr or a quadtree methodology
# 6. Develop feature extraction capability. This may be as simple as counts or as complex as distance from the nearest street for addresses.
# 7. Create train, validate, test data sets (Not sure how we should sample: areas of different sizes may be misrepresented, so we will need to examine the distribution. Might need to do a research review on this? [example approach](https://www.esri.com/esri-news/arcuser/spring-2013/unequal-probability-based-spatial-sampling))
# 8. Build test model
# 9. Iterate on features
# ...
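#
# A sketch of the step 4 join, assuming the Form 477 block FIPS lives in a column named `BlockCode` (check the CSV header) and the TIGER blocks carry it as `GEOID10`; both need to be 15-character zero-padded strings for the keys to match:
#
#     blocks_with_477 = census_blocks_gis.merge(
#         ca_dec_15_df, left_on="GEOID10", right_on="BlockCode", how="left")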
# In[1]:
# Step 1: Get the data
# Turns out there are multiple years of data. We are interested in the most
# recent, but we should also examine the trends and data quality.
# It's also possible to get the data that includes satellite service too. Let's ignore that.
# The data is also released with some lag, and there are only 2 sample years at the moment.
# Hopefully we can revisit and update soon, with new data!
all_2015_dec_v2 = "http://transition.fcc.gov/form477/BroadbandData/Fixed/Dec15/Version%202/US-Fixed-without-Satellite-Dec2015.zip"
ca_2015_dec_v2 = "https://www.fcc.gov/form477/BroadbandData/Fixed/Dec15/Version%202/CA-Fixed-Dec2015.zip"
ca_2015_jun_v2 = "https://www.fcc.gov/form477/BroadbandData/Fixed/Jun15/Version%202/CA-Fixed-Jun2015.zip"
ca_2014_dec_v2 = "https://www.fcc.gov/form477/BroadbandData/Fixed/Dec14/Version2/CA-Fixed-Dec2014.zip"
# In[2]:
get_ipython().system('curl -LO "$ca_2015_dec_v2" && unzip "CA-Fixed-Dec2015.zip" && curl -LO "$ca_2015_jun_v2" && unzip "CA-Fixed-Jun2015.zip" && curl -LO "$ca_2014_dec_v2" && unzip "CA-Fixed-Dec2014.zip"')
# In[1]:
import pandas as pd
import geopandas as gpd
# In[2]:
ca_dec_15_df = pd.read_csv("CA-Fixed-Dec2015-v2.csv")
ca_jun_15_df = pd.read_csv("CA-Fixed-Jun2015-v2.csv")
ca_dec_14_df = pd.read_csv("CA-Fixed-Dec2014-v2.csv")
# In[3]:
ca_dec_15_df.head()
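# In[ ]:
# One gotcha before any joins: California FIPS codes start with "06", and
# pandas drops the leading zero when it infers an integer dtype, which would
# silently break a merge against the TIGER GEOID10 strings. Assuming the
# block column is named "BlockCode" (verify with ca_dec_15_df.columns),
# zero-pad it back to the full 15 characters:
for df in (ca_dec_15_df, ca_jun_15_df, ca_dec_14_df):
    df["BlockCode"] = df["BlockCode"].astype(str).str.zfill(15)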
# In[4]:
# OK so now we have downloaded the block SHP file from the TIGER site
# https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2018&layergroup=Blocks+%282010%29
get_ipython().run_line_magic('matplotlib', 'inline')
census_blocks_gis = gpd.read_file("tl_2018_06_tabblock10.shp")
# In[5]:
# Let's make sure the geometries parsed into shapely objects
census_blocks_gis.geometry[0]
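# In[ ]:
# TIGER shapefiles come in geographic coordinates (NAD83, EPSG:4269), which
# won't do for the distance-based features planned in step 6. A sketch of
# reprojecting; EPSG:3310 (California Albers, in metres) is assumed here as
# a reasonable statewide choice.
print(census_blocks_gis.crs)
census_blocks_albers = census_blocks_gis.to_crs(epsg=3310)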
# In[6]:
# Next let's get the address data
add_west_url = "https://s3.amazonaws.com/data.openaddresses.io/openaddr-collected-us_west.zip"
get_ipython().system('curl -LO "$add_west_url"')
# In[7]:
get_ipython().system('unzip "openaddr-collected-us_west.zip"')
# In[17]:
import os
os.getcwd()
# In[16]:
# Grab the header row from one of the county files to start the combined CSV.
get_ipython().system('sed -n 1p ./us/ca/humboldt.csv > all_ca.csv')
# In[18]:
# Append every county file without its header. Plain "sed 1d" treats the
# globbed files as one stream and only drops the very first line, so the
# GNU sed -s flag is needed to strip the header from each file separately.
get_ipython().system('sed -s 1d ./us/ca/*.csv >> all_ca.csv')
# In[19]:
address_df = pd.read_csv("all_ca.csv")
# In[20]:
address_df.head()
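# In[ ]:
# Promote the address table to a GeoDataFrame so it can be spatially joined
# to the census blocks later. This assumes the OpenAddresses coordinate
# columns are named LON and LAT and are in WGS84 (verify with
# address_df.columns).
from shapely.geometry import Point
address_gdf = gpd.GeoDataFrame(
    address_df,
    geometry=[Point(xy) for xy in zip(address_df["LON"], address_df["LAT"])],
    crs="EPSG:4326",
)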
# In[24]:
# Now let's get some building footprint data
building_url = "https://usbuildingdata.blob.core.windows.net/usbuildings-v1-1/California.zip"
get_ipython().system('curl -LO "$building_url" && unzip "California.zip"')
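# In[ ]:
# The Microsoft building footprints ship as GeoJSON; the file inside the zip
# is assumed here to be named California.geojson (check the unzip output).
# It is a big file, so expect the read to be slow and memory-hungry.
buildings_gdf = gpd.read_file("California.geojson")
buildings_gdf.head()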
# In[27]:
# Now let's get the street data from the Geofabrik OSM extracts
north_ca_link = "https://download.geofabrik.de/north-america/us/california/norcal-latest-free.shp.zip"
south_ca_link = "https://download.geofabrik.de/north-america/us/california/socal-latest-free.shp.zip"
get_ipython().system('curl -LO "$north_ca_link" && curl -LO "$south_ca_link"')
# In[31]:
get_ipython().system('mkdir "norcal" && mkdir "socal" && unzip -o "norcal-latest-free.shp.zip" -d "norcal" && unzip -o "socal-latest-free.shp.zip" -d "socal"')
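# In[ ]:
# Load the OSM road layers from the two Geofabrik extracts and stack them
# into one statewide layer. The "free" shapefile bundles are assumed to
# include a roads layer named gis_osm_roads_free_1.shp (list the unzipped
# directories to confirm the name).
norcal_roads = gpd.read_file("norcal/gis_osm_roads_free_1.shp")
socal_roads = gpd.read_file("socal/gis_osm_roads_free_1.shp")
ca_roads = pd.concat([norcal_roads, socal_roads], ignore_index=True)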
# ## Data gathering complete
#
# We now have the following:
#
# 1. 477 data
# 2. census blocks
# 3. address points
# 4. building outlines
# 5. OSM streets etc.
#
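# In[ ]:
# A first feature-extraction sketch for step 6: count address points per
# census block with a spatial join, assuming address_gdf was built from the
# OpenAddresses table as above and the block IDs live in GEOID10. Treat
# this as a starting point, not a tuned pipeline.
joined = gpd.sjoin(
    address_gdf.to_crs(census_blocks_gis.crs),
    census_blocks_gis[["GEOID10", "geometry"]],
    how="inner",
    predicate="within",
)
addresses_per_block = joined.groupby("GEOID10").size().rename("address_count")
addresses_per_block.head()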