Skip to content

Instantly share code, notes, and snippets.

@psychemedia
Last active October 25, 2016 17:20
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save psychemedia/4f564bbbbebbe9e4e827 to your computer and use it in GitHub Desktop.
Save psychemedia/4f564bbbbebbe9e4e827 to your computer and use it in GitHub Desktop.
Old demo how to work with Companies House data
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "",
"signature": "sha256:8b928cdacc201c92c02b83f555477eb453eede774625a65853004a073935c4f8"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Working With Companies House Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Note that the data ingest stage of this activity may take your computer an hour or more to process, so make sure you take this into account as you work through the activity.*\n",
"\n",
"In the UK, *Companies House* maintains a public register of registered companies. Details of \"live\" companies can be downloaded as open data from Companies House: [http://download.companieshouse.gov.uk/en_output.html](http://download.companieshouse.gov.uk/en_output.html).\n",
"\n",
"The file is currently bundled into a set of 5 files, containing something over three and a half million records (the lower numbered zip files have 850,000 rows, the final zip file the balance).\n",
"\n",
"Whilst *pandas* can cope with quite large dataframes, keeping such a large number of items in memory represents a large overhead. In addition, having to load the whole dataset into memory each time a new python session is run is also time consuming.\n",
"\n",
"Instead, it's much easier to store these records in a persistent database such as PostgreSQL, generate efficient indexes over column we are likely to one to query, and then pull out just the records we need as we need them."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first step requires downloading the data from the Companies House website. Let's create a directory in which to put the downloaded zip files and download the files to it."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!mkdir companydata"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#You may need to check the URLs\n",
"#http://download.companieshouse.gov.uk/en_output.html\n",
"#Download the data files and place them in the companydata directory\n",
"!wget http://download.companieshouse.gov.uk/BasicCompanyData-2014-06-01-part1_5.zip -P companydata\n",
"!wget http://download.companieshouse.gov.uk/BasicCompanyData-2014-06-01-part2_5.zip -P companydata\n",
"!wget http://download.companieshouse.gov.uk/BasicCompanyData-2014-06-01-part3_5.zip -P companydata\n",
"!wget http://download.companieshouse.gov.uk/BasicCompanyData-2014-06-01-part4_5.zip -P companydata\n",
"!wget http://download.companieshouse.gov.uk/BasicCompanyData-2014-06-01-part5_5.zip -P companydata\n",
"\n",
"#Check the files are there...\n",
"!ls companydata"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Unzip the files into the same directory\n",
"!unzip \"companydata/*.zip\" -d companydata"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Archive: companydata/BasicCompanyData-2014-06-01-part1_5.zip\r\n",
" inflating: companydata/BasicCompanyData-2014-06-01-part1_5.csv "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r\n",
"\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Archive: companydata/BasicCompanyData-2014-06-01-part2_5.zip\r\n",
" inflating: companydata/BasicCompanyData-2014-06-01-part2_5.csv "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r\n",
"\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Archive: companydata/BasicCompanyData-2014-06-01-part3_5.zip\r\n",
" inflating: companydata/BasicCompanyData-2014-06-01-part3_5.csv "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r\n",
"\r\n",
"Archive: companydata/BasicCompanyData-2014-06-01-part4_5.zip\r\n",
" inflating: companydata/BasicCompanyData-2014-06-01-part4_5.csv "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r\n",
"\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Archive: companydata/BasicCompanyData-2014-06-01-part5_5.zip\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" inflating: companydata/BasicCompanyData-2014-06-01-part5_5.csv "
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r\n",
"\r\n",
"5 archives were successfully processed.\r\n"
]
}
],
"prompt_number": 22
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Should we tidy up further and delete the zip files too?\n",
"#Or maybe do this after we know everything is has been loaded...\n",
"#!rm companydata/*.zip"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#How many rows are there in each file?\n",
"!wc -l companydata/*.csv"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Let's have a quick peak at the head of one of the files\n",
"!head -n 3 companydata/BasicCompanyData-2014-06-01-part1_5.csv"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"CompanyName, CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1, RegAddress.AddressLine2,RegAddress.PostTown,RegAddress.County,RegAddress.Country,RegAddress.PostCode,CompanyCategory,CompanyStatus,CountryOfOrigin,DissolutionDate,IncorporationDate,Accounts.AccountRefDay,Accounts.AccountRefMonth,Accounts.NextDueDate,Accounts.LastMadeUpDate,Accounts.AccountCategory,Returns.NextDueDate,Returns.LastMadeUpDate,Mortgages.NumMortCharges,Mortgages.NumMortOutstanding,Mortgages.NumMortPartSatisfied,Mortgages.NumMortSatisfied,SICCode.SicText_1,SICCode.SicText_2,SICCode.SicText_3,SICCode.SicText_4,LimitedPartnerships.NumGenPartners,LimitedPartnerships.NumLimPartners,URI,PreviousName_1.CONDATE, PreviousName_1.CompanyName, PreviousName_2.CONDATE, PreviousName_2.CompanyName,PreviousName_3.CONDATE, PreviousName_3.CompanyName,PreviousName_4.CONDATE, PreviousName_4.CompanyName,PreviousName_5.CONDATE, PreviousName_5.CompanyName,PreviousName_6.CONDATE, PreviousName_6.CompanyName,PreviousName_7.CONDATE, PreviousName_7.CompanyName,PreviousName_8.CONDATE, PreviousName_8.CompanyName,PreviousName_9.CONDATE, PreviousName_9.CompanyName,PreviousName_10.CONDATE, PreviousName_10.CompanyName\r\n",
"\"! LTD\",\"08209948\",\"\",\"\",\"METROHOUSE 57 PEPPER ROAD\",\"HUNSLET\",\"LEEDS\",\"YORKSHIRE\",\"UNITED KINGDOM\",\"LS10 2RU\",\"Private Limited Company\",\"Active\",\"United Kingdom\",\"\",\"11/09/2012\",\"30\",\"9\",\"30/06/2015\",\"30/09/2013\",\"DORMANT\",\"09/10/2014\",\"11/09/2013\",\"0\",\"0\",\"0\",\"0\",\"99999 - Dormant Company\",\"\",\"\",\"\",\"0\",\"0\",\"http://business.data.gov.uk/id/company/08209948\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\"\r\n",
"\"!BIG IMPACT GRAPHICS LIMITED\",\"07382019\",\"\",\"\",\"335 ROSDEN HOUSE\",\"372 OLD STREET\",\"LONDON\",\"\",\"\",\"EC1V 9AV\",\"Private Limited Company\",\"Active\",\"United Kingdom\",\"\",\"21/09/2010\",\"30\",\"9\",\"30/06/2014\",\"30/09/2012\",\"DORMANT\",\"19/10/2014\",\"21/09/2013\",\"0\",\"0\",\"0\",\"0\",\"59112 - Video production activities\",\"59113 - Television programme production activities\",\"74100 - specialised design activities\",\"74202 - Other specialist photography\",\"0\",\"0\",\"http://business.data.gov.uk/id/company/07382019\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\"\r\n"
]
}
],
"prompt_number": 23
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Let's see how pandas sees a file\n",
"import pandas as pd\n",
"\n",
"tmp=pd.read_csv('companydata/BasicCompanyData-2014-06-01-part1_5.csv',nrows=3)\n",
"tmp.head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>CompanyName</th>\n",
" <th> CompanyNumber</th>\n",
" <th>RegAddress.CareOf</th>\n",
" <th>RegAddress.POBox</th>\n",
" <th>RegAddress.AddressLine1</th>\n",
" <th> RegAddress.AddressLine2</th>\n",
" <th>RegAddress.PostTown</th>\n",
" <th>RegAddress.County</th>\n",
" <th>RegAddress.Country</th>\n",
" <th>RegAddress.PostCode</th>\n",
" <th>...</th>\n",
" <th>PreviousName_6.CONDATE</th>\n",
" <th> PreviousName_6.CompanyName</th>\n",
" <th>PreviousName_7.CONDATE</th>\n",
" <th> PreviousName_7.CompanyName</th>\n",
" <th>PreviousName_8.CONDATE</th>\n",
" <th> PreviousName_8.CompanyName</th>\n",
" <th>PreviousName_9.CONDATE</th>\n",
" <th> PreviousName_9.CompanyName</th>\n",
" <th>PreviousName_10.CONDATE</th>\n",
" <th> PreviousName_10.CompanyName</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> ! LTD</td>\n",
" <td> 8209948</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> METROHOUSE 57 PEPPER ROAD</td>\n",
" <td> HUNSLET</td>\n",
" <td> LEEDS</td>\n",
" <td> YORKSHIRE</td>\n",
" <td> UNITED KINGDOM</td>\n",
" <td> LS10 2RU</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> !BIG IMPACT GRAPHICS LIMITED</td>\n",
" <td> 7382019</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> 335 ROSDEN HOUSE</td>\n",
" <td> 372 OLD STREET</td>\n",
" <td> LONDON</td>\n",
" <td> NaN</td>\n",
" <td> NaN</td>\n",
" <td> EC1V 9AV</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> !K7 TOURING LIMITED</td>\n",
" <td> 8937297</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td> 5TH FLOOR, DURKAN HOUSE</td>\n",
" <td> 155 EAST BARNET ROAD</td>\n",
" <td> NEW BARNET</td>\n",
" <td> HERTS</td>\n",
" <td> UNITED KINGDOM</td>\n",
" <td> EN4 8QZ</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3 rows \u00d7 53 columns</p>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 24,
"text": [
" CompanyName CompanyNumber RegAddress.CareOf \\\n",
"0 ! LTD 8209948 NaN \n",
"1 !BIG IMPACT GRAPHICS LIMITED 7382019 NaN \n",
"2 !K7 TOURING LIMITED 8937297 NaN \n",
"\n",
" RegAddress.POBox RegAddress.AddressLine1 RegAddress.AddressLine2 \\\n",
"0 NaN METROHOUSE 57 PEPPER ROAD HUNSLET \n",
"1 NaN 335 ROSDEN HOUSE 372 OLD STREET \n",
"2 NaN 5TH FLOOR, DURKAN HOUSE 155 EAST BARNET ROAD \n",
"\n",
" RegAddress.PostTown RegAddress.County RegAddress.Country \\\n",
"0 LEEDS YORKSHIRE UNITED KINGDOM \n",
"1 LONDON NaN NaN \n",
"2 NEW BARNET HERTS UNITED KINGDOM \n",
"\n",
" RegAddress.PostCode ... PreviousName_6.CONDATE \\\n",
"0 LS10 2RU ... NaN \n",
"1 EC1V 9AV ... NaN \n",
"2 EN4 8QZ ... NaN \n",
"\n",
" PreviousName_6.CompanyName PreviousName_7.CONDATE \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"\n",
" PreviousName_7.CompanyName PreviousName_8.CONDATE \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"\n",
" PreviousName_8.CompanyName PreviousName_9.CONDATE \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"\n",
" PreviousName_9.CompanyName PreviousName_10.CONDATE \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"\n",
" PreviousName_10.CompanyName \n",
"0 NaN \n",
"1 NaN \n",
"2 NaN \n",
"\n",
"[3 rows x 53 columns]"
]
}
],
"prompt_number": 24
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!ls companydata"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"BasicCompanyData-2014-06-01-part1_5.zip\r\n",
"BasicCompanyData-2014-06-01-part2_5.zip\r\n",
"BasicCompanyData-2014-06-01-part3_5.zip\r\n",
"BasicCompanyData-2014-06-01-part4_5.zip\r\n",
"BasicCompanyData-2014-06-01-part5_5.zip\r\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Let's look at the column names\n",
"tmp.columns.values"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 25,
"text": [
"array(['CompanyName', ' CompanyNumber', 'RegAddress.CareOf',\n",
" 'RegAddress.POBox', 'RegAddress.AddressLine1',\n",
" ' RegAddress.AddressLine2', 'RegAddress.PostTown',\n",
" 'RegAddress.County', 'RegAddress.Country', 'RegAddress.PostCode',\n",
" 'CompanyCategory', 'CompanyStatus', 'CountryOfOrigin',\n",
" 'DissolutionDate', 'IncorporationDate', 'Accounts.AccountRefDay',\n",
" 'Accounts.AccountRefMonth', 'Accounts.NextDueDate',\n",
" 'Accounts.LastMadeUpDate', 'Accounts.AccountCategory',\n",
" 'Returns.NextDueDate', 'Returns.LastMadeUpDate',\n",
" 'Mortgages.NumMortCharges', 'Mortgages.NumMortOutstanding',\n",
" 'Mortgages.NumMortPartSatisfied', 'Mortgages.NumMortSatisfied',\n",
" 'SICCode.SicText_1', 'SICCode.SicText_2', 'SICCode.SicText_3',\n",
" 'SICCode.SicText_4', 'LimitedPartnerships.NumGenPartners',\n",
" 'LimitedPartnerships.NumLimPartners', 'URI',\n",
" 'PreviousName_1.CONDATE', ' PreviousName_1.CompanyName',\n",
" ' PreviousName_2.CONDATE', ' PreviousName_2.CompanyName',\n",
" 'PreviousName_3.CONDATE', ' PreviousName_3.CompanyName',\n",
" 'PreviousName_4.CONDATE', ' PreviousName_4.CompanyName',\n",
" 'PreviousName_5.CONDATE', ' PreviousName_5.CompanyName',\n",
" 'PreviousName_6.CONDATE', ' PreviousName_6.CompanyName',\n",
" 'PreviousName_7.CONDATE', ' PreviousName_7.CompanyName',\n",
" 'PreviousName_8.CONDATE', ' PreviousName_8.CompanyName',\n",
" 'PreviousName_9.CONDATE', ' PreviousName_9.CompanyName',\n",
" 'PreviousName_10.CONDATE', ' PreviousName_10.CompanyName'], dtype=object)"
]
}
],
"prompt_number": 25
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks like there may be extraneous white space in some of the column names, which we'll need to take care of..."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Setting up the PostgreSQL Database"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're going to be using PostgreSQL so we need to load in some drivers and handlers."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#deprecate this in favour of psqlx?\n",
"#import psycopg2 as pg\n",
"#If connecting to the default port, you can omit the port parameter\n",
"#conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test' , port=5432)\n",
"#cur = conn.cursor()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"pd.__version__"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": [
"'0.14.0'"
]
}
],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!pip3 install --upgrade"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\u001b[33mYou must give at least one requirement to install (see \"pip help install\")\r\n",
"\u001b[0m"
]
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import sys\n",
"sys.version"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
"'3.4.0 (default, Apr 11 2014, 13:05:11) \\n[GCC 4.8.2]'"
]
}
],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#psql lets us return tables as dataframes\n",
"from pandas import read_sql_query as psql\n",
"#psqlx is a convenience fucntion for executing raw queries where we aren't so bothered about the response...\n",
"#Errors should still be thrown...\n",
"from pandas.io.sql import execute as psqlx\n",
"\n",
"import pandas as pd\n",
"from sqlalchemy import create_engine"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"/usr/local/lib/python3.4/dist-packages/pandas/io/excel.py:626: UserWarning: Installed openpyxl is not supported at this time. Use >=1.6.1 and <2.0.0.\n",
" .format(openpyxl_compat.start_ver, openpyxl_compat.stop_ver))\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**For some reason, the psql wrapper requires LIKE % wildcard character to be escaped as %%**"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Create the database connection for pandas dataframe operations\n",
"engine = create_engine('postgresql://test:test@localhost:5432/tm351test')"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#cur.execute(\"DROP TABLE companydata;\")\n",
"#psqlx(\"DROP TABLE companydata;\",engine)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#cur.execute(\"DROP TABLE company_postcode_area;\")\n",
"#psqlx(\"DROP TABLE company_postcode_area;\",engine)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#cur.execute(\"DROP INDEX company_postcode_area_idx;\")\n",
"#psqlx(\"DROP INDEX company_postcode_area_idx;\",engine)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#To define the table, let's check the longest strings in column\n",
"#Note that this routine may take a few minutes to run\n",
"\n",
"cleanColNames=[\"CompanyName\",\"RegAddress.CareOf\",\"RegAddress.POBox\", \\\n",
" \"RegAddress.AddressLine1\",\"RegAddress.AddressLine2\",\"RegAddress.PostTown\",\"RegAddress.County\", \\\n",
" \"RegAddress.Country\",\"RegAddress.PostCode\",\"CompanyCategory\",\"CompanyStatus\", \\\n",
" \"CountryOfOrigin\"]\n",
"\n",
"def colMaxLen(df,maxlens):\n",
" df.rename(columns=lambda x: x.strip(), inplace=True)\n",
" for col in cleanColNames:\n",
" cml=df[col].astype(str).map(len).max()\n",
" if cml > maxlens[col]: maxlens[col]=cml\n",
"\n",
"import os\n",
"maxlens={}\n",
"for name in cleanColNames: maxlens[name]=0\n",
"for fname in os.listdir('companydata/'):\n",
" #if a file is a CSV file, process it\n",
" if fname.endswith('csv'):\n",
" fname=\"companydata/\"+fname\n",
" #Read in 10,000 rows at a time\n",
" chunks=pd.read_csv(fname,chunksize=10000)\n",
" for chunk in chunks:\n",
" colMaxLen(chunk,maxlens)\n",
"maxlens"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#I'm going to create a table over some of the columns - essentially, the company name, number, address and incoporation date\n",
"#We can use the maxlens to ensure we create large enough data fields\n",
"\n",
"#I wonder if we should just try to create the table from the dataframe?\n",
"#Perhaps make sure we cast the right type of each column in the dataframe first?\n",
"#It's easier using psycopg2 to just run the query (will sqlalchemy let us run a \"raw\" query? Trying psqlx over cur.execute)\n",
"#cur.execute\n",
"psqlx(\"CREATE TABLE companydata ( CompanyName VARCHAR (200), CompanyNumber VARCHAR (10) PRIMARY KEY, \\\n",
" RegAddress_CareOf VARCHAR (100), RegAddress_POBox VARCHAR (15), RegAddress_AddressLine1 VARCHAR (120), \\\n",
" RegAddress_AddressLine2 VARCHAR (100), RegAddress_PostTown VARCHAR (60), \\\n",
" RegAddress_County VARCHAR (60), RegAddress_Country VARCHAR (60), RegAddress_PostCode VARCHAR (20), \\\n",
" CompanyCategory VARCHAR (100), CompanyStatus VARCHAR (60), CountryOfOrigin VARCHAR (40), \\\n",
" IncorporationDate date);\",engine)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Even though I used mixed case column names, I didn't quote them - so PostgreSQL forces them to lower case\n",
"#The lack of quotes also meand I had to replace the . in the column names with something more acceptable... (an underscore)\n",
"psql(\"SELECT * FROM companydata LIMIT 1;\", engine)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What other tables could we create?\n",
"\n",
"SIC codes are sometimes interesting - they give a crude estimate of the libne of business a compnay may be involved in. Maybe create a table with companynumber and sic code - note that each companynumber may have several sic codes, so there may be more than one row per companynumber in a two col table of companynumber,sicCode."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Having created a crude database table, we're now going to load in some data - company records from the downloaded files for companies that have a well-specified incorporation date.\n",
"\n",
"Note that it may take some time to load the data into the database and index it, possibly an hour or more depending on the speed of your computer. So if you have some reading to do, now could be a good time to do it. Alternatively, go and catch up with what's happening in the course forums."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"companydata_cols=['companyname', 'companynumber', 'regaddress_careof', \n",
" 'regaddress_pobox', 'regaddress_addressline1',\n",
" 'regaddress_addressline2', 'regaddress_posttown',\n",
" 'regaddress_county', 'regaddress_country', 'regaddress_postcode',\n",
" 'companycategory', 'companystatus', 'countryoforigin',\n",
" 'incorporationdate']\n",
"\n",
"def addchunk(chunk):\n",
" '''Create a function to clean the companydata from the CSV file and enter it into the database table'''\n",
" \n",
" #NOTE IF WE HAVE MORE TABLES BEST TO PROCESS THEM ALL FROM A SINGLE FILE READ?\n",
" \n",
" #To conform to the column names in the table I defined, we need to make some changes to the column names\n",
" #Firstly, replace the points with underscores\n",
" chunk.rename(columns=lambda x: x.replace('.','_'), inplace=True)\n",
" #Make sure we strip off any whitespace left and right\n",
" chunk.rename(columns=lambda x: x.strip(), inplace=True)\n",
" #Force the column names to lowet case\n",
" chunk.rename(columns=lambda x: x.lower(), inplace=True)\n",
" \n",
" #This adds to the load time but it's often worth doing\n",
" for col in companydata_cols:\n",
" chunk[col] = chunk[col].str.strip()\n",
" \n",
" #We could have parsed the dates on load, but if we do it here we can keep all the cleaning steps together\n",
" chunk['incorporationdate']=pd.to_datetime(chunk['incorporationdate'], format='%d/%m/%Y',coerce=True)\n",
" #I'm actually only interested in companies with an incorporation date\n",
" chunk.dropna(subset=['incorporationdate'],inplace=True)\n",
" #NB PostgreSQL may also throw a wobbly if it doesn't see a date when it expects one!\n",
" \n",
" #We're not saving all the column names - just name, number, address, incorporation date as per the table definition\n",
" #If the table exists, which it does, append data to it\n",
" chunk[companydata_cols].to_sql('companydata', engine, index=False, if_exists='append')\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Here's our ingest loop\n",
"#THIS COULD TAKE SOME TIME TO RUN - MAYBE EVEN AN HOUR OR MORE...\n",
"#Maybe this is an argument in favour of having several passes for different tables?\n",
"\n",
"#Get the names of the files in the companydata directory\n",
"for fname in os.listdir('companydata/'):\n",
" #if a file is a CSV file, process it\n",
" if fname.endswith('csv'):\n",
" fname=\"companydata/\"+fname\n",
" #Read in 10,000 rows at a time\n",
" chunks=pd.read_csv(fname,chunksize=10000,dtype=str)\n",
" for chunk in chunks:\n",
" #Call the function that cleans the data and adds it to the database\n",
" addchunk(chunk)\n",
" #Do some housekeeping - remove the CSV\n",
" os.remove(fname)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#How many rows did we load in in the end?\n",
"psql(\"SELECT count(*) FROM companydata;\", engine)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"psql(\"SELECT * FROM companydata LIMIT 5;\", engine)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Getting Started Querying the Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A web search for */Milton Keynes postcode areas/* turns up a Wikipedia page - [MK Postcode Area](http://en.wikipedia.org/wiki/MK_postcode_area) - that suggests the postcode areas covering the city are *(MK1, MK2, MK3, MK4, MK5,MK6,MK7,MK8,MK9,MK19,MK11,MK12,MK13,MK14,MK15)*.\n",
"\n",
"If we are going to regularly search by postcode area, it might make sense to create a second table that includes the postcode area as an indexed term and the company number as a foreign key into the original company data table. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#cur.execute(\"CREATE TABLE company_postcode_area AS \\\n",
"# SELECT companynumber, split_part(regaddress_postcode,' ',1) as postcode_area FROM companydata;\")\n",
"psqlx(\"CREATE TABLE company_postcode_area AS \\\n",
" SELECT companynumber, split_part(regaddress_postcode,' ',1) as postcode_area FROM companydata;\",engine)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "ProgrammingError",
"evalue": "(ProgrammingError) relation \"company_postcode_area\" already exists\n \"CREATE TABLE company_postcode_area AS SELECT companynumber, split_part(regaddress_postcode,' ',1) as postcode_area FROM companydata;\" {}",
"output_type": "pyerr",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mProgrammingError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-35-f25f4c40174c>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;31m#cur.execute(\"CREATE TABLE company_postcode_area AS \\\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2\u001b[0m \u001b[1;31m# SELECT companynumber, split_part(regaddress_postcode,' ',1) as postcode_area FROM companydata;\")\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mpsqlx\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"CREATE TABLE company_postcode_area AS SELECT companynumber, split_part(regaddress_postcode,' ',1) as postcode_area FROM companydata;\"\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mengine\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;32m/usr/local/lib/python3.4/dist-packages/pandas/io/sql.py\u001b[0m in \u001b[0;36mexecute\u001b[1;34m(sql, con, cur, params)\u001b[0m\n\u001b[0;32m 102\u001b[0m \u001b[0mpandas_sql\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mpandasSQL_builder\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mcur\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mis_cursor\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 103\u001b[0m \u001b[0margs\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0m_convert_params\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msql\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mparams\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 104\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mpandas_sql\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mexecute\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 105\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 106\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m/usr/local/lib/python3.4/dist-packages/pandas/io/sql.py\u001b[0m in \u001b[0;36mexecute\u001b[1;34m(self, *args, **kwargs)\u001b[0m\n\u001b[0;32m 781\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mexecute\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 782\u001b[0m \u001b[1;34m\"\"\"Simple passthrough to SQLAlchemy engine\"\"\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 783\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mengine\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mexecute\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 784\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 785\u001b[0m def read_table(self, table_name, index_col=None, coerce_float=True,\n",
"\u001b[1;32m/usr/local/lib/python3.4/dist-packages/sqlalchemy/engine/base.py\u001b[0m in \u001b[0;36mexecute\u001b[1;34m(self, statement, *multiparams, **params)\u001b[0m\n\u001b[0;32m 1680\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1681\u001b[0m \u001b[0mconnection\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcontextual_connect\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mclose_with_result\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1682\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mconnection\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mexecute\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mstatement\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m*\u001b[0m\u001b[0mmultiparams\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mparams\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1683\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1684\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mscalar\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mstatement\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m*\u001b[0m\u001b[0mmultiparams\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mparams\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m/usr/local/lib/python3.4/dist-packages/sqlalchemy/engine/base.py\u001b[0m in \u001b[0;36mexecute\u001b[1;34m(self, object, *multiparams, **params)\u001b[0m\n\u001b[0;32m 710\u001b[0m \"\"\"\n\u001b[0;32m 711\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mobject\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mutil\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mstring_types\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 712\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_execute_text\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mobject\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mmultiparams\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mparams\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 713\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 714\u001b[0m \u001b[0mmeth\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mobject\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_execute_on_connection\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m/usr/local/lib/python3.4/dist-packages/sqlalchemy/engine/base.py\u001b[0m in \u001b[0;36m_execute_text\u001b[1;34m(self, statement, multiparams, params)\u001b[0m\n\u001b[0;32m 859\u001b[0m \u001b[0mstatement\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 860\u001b[0m \u001b[0mparameters\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 861\u001b[1;33m \u001b[0mstatement\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mparameters\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 862\u001b[0m )\n\u001b[0;32m 863\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_has_events\u001b[0m \u001b[1;32mor\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mengine\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_has_events\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m/usr/local/lib/python3.4/dist-packages/sqlalchemy/engine/base.py\u001b[0m in \u001b[0;36m_execute_context\u001b[1;34m(self, dialect, constructor, statement, parameters, *args)\u001b[0m\n\u001b[0;32m 945\u001b[0m \u001b[0mparameters\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 946\u001b[0m \u001b[0mcursor\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 947\u001b[1;33m context)\n\u001b[0m\u001b[0;32m 948\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 949\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_has_events\u001b[0m \u001b[1;32mor\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mengine\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_has_events\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m/usr/local/lib/python3.4/dist-packages/sqlalchemy/engine/base.py\u001b[0m in \u001b[0;36m_handle_dbapi_exception\u001b[1;34m(self, e, statement, parameters, cursor, context)\u001b[0m\n\u001b[0;32m 1106\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdialect\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdbapi\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mError\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1107\u001b[0m connection_invalidated=self._is_disconnect),\n\u001b[1;32m-> 1108\u001b[1;33m \u001b[0mexc_info\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1109\u001b[0m )\n\u001b[0;32m 1110\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m/usr/local/lib/python3.4/dist-packages/sqlalchemy/util/compat.py\u001b[0m in \u001b[0;36mraise_from_cause\u001b[1;34m(exception, exc_info)\u001b[0m\n\u001b[0;32m 172\u001b[0m \u001b[0mexc_info\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0msys\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mexc_info\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 173\u001b[0m \u001b[0mexc_type\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mexc_value\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mexc_tb\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mexc_info\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 174\u001b[1;33m \u001b[0mreraise\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtype\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mexception\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mexception\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtb\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mexc_tb\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcause\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mexc_value\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 175\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 176\u001b[0m exec(\"def reraise(tp, value, tb=None, cause=None):\\n\"\n",
"\u001b[1;32m/usr/local/lib/python3.4/dist-packages/sqlalchemy/util/compat.py\u001b[0m in \u001b[0;36mreraise\u001b[1;34m(tp, value, tb, cause)\u001b[0m\n\u001b[0;32m 165\u001b[0m \u001b[0mvalue\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m__cause__\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcause\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 166\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mvalue\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m__traceback__\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mtb\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 167\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mvalue\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mwith_traceback\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtb\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 168\u001b[0m \u001b[1;32mraise\u001b[0m \u001b[0mvalue\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 169\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m/usr/local/lib/python3.4/dist-packages/sqlalchemy/engine/base.py\u001b[0m in \u001b[0;36m_execute_context\u001b[1;34m(self, dialect, constructor, statement, parameters, *args)\u001b[0m\n\u001b[0;32m 938\u001b[0m \u001b[0mstatement\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 939\u001b[0m \u001b[0mparameters\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 940\u001b[1;33m context)\n\u001b[0m\u001b[0;32m 941\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mException\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 942\u001b[0m self._handle_dbapi_exception(\n",
"\u001b[1;32m/usr/local/lib/python3.4/dist-packages/sqlalchemy/engine/default.py\u001b[0m in \u001b[0;36mdo_execute\u001b[1;34m(self, cursor, statement, parameters, context)\u001b[0m\n\u001b[0;32m 433\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 434\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mdo_execute\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcursor\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mstatement\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mparameters\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcontext\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 435\u001b[1;33m \u001b[0mcursor\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mexecute\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mstatement\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mparameters\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 436\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 437\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mdo_execute_no_params\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcursor\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mstatement\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcontext\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mProgrammingError\u001b[0m: (ProgrammingError) relation \"company_postcode_area\" already exists\n \"CREATE TABLE company_postcode_area AS SELECT companynumber, split_part(regaddress_postcode,' ',1) as postcode_area FROM companydata;\" {}"
]
}
],
"prompt_number": 35
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#cur.execute(\"CREATE INDEX company_postcode_area_idx ON company_postcode_area (postcode_area);\")\n",
"psqlx(\"CREATE INDEX company_postcode_area_idx ON company_postcode_area (postcode_area);\",engine)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"mkCos=psql(\"SELECT * FROM company_postcode_area \\\n",
" WHERE postcode_area IN ('MK1','MK2','MK3','MK4','MK5','MK6','MK7','MK8','MK9','MK10','MK11','MK12','MK13','MK14','MK15');\",engine)\n",
"mkCos.groupby('postcode_area').companynumber.agg(len).order(ascending=False)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"psql(\"SELECT postcode_area, COUNT(*) FROM company_postcode_area \\\n",
" WHERE postcode_area IN ('MK1','MK2','MK3','MK4','MK5','MK6','MK7','MK8','MK9','MK10','MK11','MK12','MK13','MK14','MK15') \\\n",
" GROUP BY postcode_area;\",engine)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"psql(\"SELECT regaddress_postcode, COUNT(*) FROM companydata \\\n",
" GROUP BY regaddress_postcode ORDER BY count DESC LIMIT 20;\",engine)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>regaddress_postcode</th>\n",
" <th>count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0 </th>\n",
" <td> NaN</td>\n",
" <td> 36940</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1 </th>\n",
" <td> EC1V 4PW</td>\n",
" <td> 28359</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2 </th>\n",
" <td> N12 0DR</td>\n",
" <td> 9458</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3 </th>\n",
" <td> B18 6EW</td>\n",
" <td> 9032</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4 </th>\n",
" <td> WA1 1RG</td>\n",
" <td> 8932</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5 </th>\n",
" <td> SE1 7HR</td>\n",
" <td> 7056</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6 </th>\n",
" <td> SW1Y 5EA</td>\n",
" <td> 6961</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7 </th>\n",
" <td> W1B 3HH</td>\n",
" <td> 4980</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8 </th>\n",
" <td> BN3 6HA</td>\n",
" <td> 4582</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9 </th>\n",
" <td> N14 5BP</td>\n",
" <td> 3592</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td> SK7 2DH</td>\n",
" <td> 3568</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td> SW6 3JH</td>\n",
" <td> 3509</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td> EH7 5JA</td>\n",
" <td> 3446</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td> BH12 1JY</td>\n",
" <td> 3258</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td> EC1V 4PY</td>\n",
" <td> 2872</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td> B2 4BG</td>\n",
" <td> 2734</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td> W1G 9QR</td>\n",
" <td> 2734</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td> EH3 9WJ</td>\n",
" <td> 2591</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td> BS9 3BY</td>\n",
" <td> 2470</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td> RG7 8NN</td>\n",
" <td> 2456</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 36,
"text": [
" regaddress_postcode count\n",
"0 NaN 36940\n",
"1 EC1V 4PW 28359\n",
"2 N12 0DR 9458\n",
"3 B18 6EW 9032\n",
"4 WA1 1RG 8932\n",
"5 SE1 7HR 7056\n",
"6 SW1Y 5EA 6961\n",
"7 W1B 3HH 4980\n",
"8 BN3 6HA 4582\n",
"9 N14 5BP 3592\n",
"10 SK7 2DH 3568\n",
"11 SW6 3JH 3509\n",
"12 EH7 5JA 3446\n",
"13 BH12 1JY 3258\n",
"14 EC1V 4PY 2872\n",
"15 B2 4BG 2734\n",
"16 W1G 9QR 2734\n",
"17 EH3 9WJ 2591\n",
"18 BS9 3BY 2470\n",
"19 RG7 8NN 2456"
]
}
],
"prompt_number": 36
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"psql(\"SELECT regaddress_postcode FROM companydata WHERE regaddress_postcode LIKE 'BS9%%' LIMIT 20;\",engine)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>regaddress_postcode</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0 </th>\n",
" <td> BS9 2JT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1 </th>\n",
" <td> BS9 3BY</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2 </th>\n",
" <td> BS9 2DR</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3 </th>\n",
" <td> BS9 3BY</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4 </th>\n",
" <td> BS9 3BY</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5 </th>\n",
" <td> BS9 3TQ</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6 </th>\n",
" <td> BS9 3BH</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7 </th>\n",
" <td> BS9 3BY</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8 </th>\n",
" <td> BS9 3BY</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9 </th>\n",
" <td> BS9 1BE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td> BS9 3HQ</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td> BS9 3BY</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td> BS9 3BY</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td> BS9 3DZ</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td> BS9 1LE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td> BS9 3BY</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td> BS9 3BY</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td> BS9 3BY</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td> BS9 3BY</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td> BS9 2BB</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 48,
"text": [
" regaddress_postcode\n",
"0 BS9 2JT\n",
"1 BS9 3BY\n",
"2 BS9 2DR\n",
"3 BS9 3BY\n",
"4 BS9 3BY\n",
"5 BS9 3TQ\n",
"6 BS9 3BH\n",
"7 BS9 3BY\n",
"8 BS9 3BY\n",
"9 BS9 1BE\n",
"10 BS9 3HQ\n",
"11 BS9 3BY\n",
"12 BS9 3BY\n",
"13 BS9 3DZ\n",
"14 BS9 1LE\n",
"15 BS9 3BY\n",
"16 BS9 3BY\n",
"17 BS9 3BY\n",
"18 BS9 3BY\n",
"19 BS9 2BB"
]
}
],
"prompt_number": 48
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment