Skip to content

Instantly share code, notes, and snippets.

@psychemedia
Created August 7, 2017 23:04
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save psychemedia/924a1fcaa1663b6d0d622462b02304c5 to your computer and use it in GitHub Desktop.
Save psychemedia/924a1fcaa1663b6d0d622462b02304c5 to your computer and use it in GitHub Desktop.
Reconciling CQC and NHS DIgital GP Practice Codes
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CQC and NHS Code Reconciliation\n",
"\n",
"CQC and NHS both maintain administrative datasets regarding a range of health and social care providers, but the identifier schemes are independent to each of them.\n",
"\n",
"This notebook explores a rough and ready way of rying to reconcile them.\n",
"\n",
"NHS administrative data retrieved from NHS Digital and added to a local database using the recipe described here: [NHS and GP Administrative Data.ipynb](https://github.com/psychemedia/openHealthDataDoodles/blob/master/notebooks/NHS%20and%20GP%20Administrative%20Data.ipynb)\n",
"\n",
"Note: for the purposes of this notebook, we could equally just download the NHS adminstrative data file `epraccurr` from [NHS Digital: GP and GP practice related data](https://digital.nhs.uk/organisation-data-service/data-downloads/gp-data)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#!mkdir -p data\n",
"#!wget http://www.cqc.org.uk/sites/default/files/HSCA%20Active%20Locations.xlsx -P data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Download the *CQC Active Locations* bulk data file:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Location ID</th>\n",
" <th>HSCA start date</th>\n",
" <th>Care home?</th>\n",
" <th>Location Name</th>\n",
" <th>Telephone Number</th>\n",
" <th>Registered manager (note; where there is more than one manager at a location, only one is included here for ease of presentation. The full list is available if required).</th>\n",
" <th>Web Address</th>\n",
" <th>Care homes beds</th>\n",
" <th>Location Type/Sector</th>\n",
" <th>Location Primary Inspection Category</th>\n",
" <th>...</th>\n",
" <th>Service user band - Learning disabilities or autistic spectrum disorder</th>\n",
" <th>Service user band - Mental Health</th>\n",
" <th>Service user band - Older People</th>\n",
" <th>Service user band - People detained under the Mental Health Act</th>\n",
" <th>Service user band - People who misuse drugs and alcohol</th>\n",
" <th>Service user band - People with an eating disorder</th>\n",
" <th>Service user band - Physical Disability</th>\n",
" <th>Service user band - Sensory Impairment</th>\n",
" <th>Service user band - Whole Population</th>\n",
" <th>Service user band - Younger Adults</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1-1000210669</td>\n",
" <td>2013-12-12</td>\n",
" <td>Y</td>\n",
" <td>Kingswood House Nursing Home</td>\n",
" <td>01424716303</td>\n",
" <td>Turner, Patricia Anne</td>\n",
" <td>NaN</td>\n",
" <td>22.0</td>\n",
" <td>Social Care Org</td>\n",
" <td>Residential social care</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>Y</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Y</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1-1000270393</td>\n",
" <td>2013-10-16</td>\n",
" <td>N</td>\n",
" <td>Red Kite Home Care</td>\n",
" <td>NaN</td>\n",
" <td>Hall, Pearl</td>\n",
" <td>NaN</td>\n",
" <td>0.0</td>\n",
" <td>Social Care Org</td>\n",
" <td>Community based adult social care services</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>Y</td>\n",
" <td>Y</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Y</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Y</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1-1000312641</td>\n",
" <td>2013-10-18</td>\n",
" <td>N</td>\n",
" <td>Human Support Group Limited - Sale</td>\n",
" <td>01619429490</td>\n",
" <td>*</td>\n",
" <td>www.homecaresupport.co.uk</td>\n",
" <td>0.0</td>\n",
" <td>Social Care Org</td>\n",
" <td>Community based adult social care services</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>Y</td>\n",
" <td>Y</td>\n",
" <td>NaN</td>\n",
" <td>Y</td>\n",
" <td>Y</td>\n",
" <td>Y</td>\n",
" <td>Y</td>\n",
" <td>NaN</td>\n",
" <td>Y</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3 rows × 94 columns</p>\n",
"</div>"
],
"text/plain": [
" Location ID HSCA start date Care home? \\\n",
"0 1-1000210669 2013-12-12 Y \n",
"1 1-1000270393 2013-10-16 N \n",
"2 1-1000312641 2013-10-18 N \n",
"\n",
" Location Name Telephone Number \\\n",
"0 Kingswood House Nursing Home 01424716303 \n",
"1 Red Kite Home Care NaN \n",
"2 Human Support Group Limited - Sale 01619429490 \n",
"\n",
" Registered manager (note; where there is more than one manager at a location, only one is included here for ease of presentation. The full list is available if required). \\\n",
"0 Turner, Patricia Anne \n",
"1 Hall, Pearl \n",
"2 * \n",
"\n",
" Web Address Care homes beds Location Type/Sector \\\n",
"0 NaN 22.0 Social Care Org \n",
"1 NaN 0.0 Social Care Org \n",
"2 www.homecaresupport.co.uk 0.0 Social Care Org \n",
"\n",
" Location Primary Inspection Category \\\n",
"0 Residential social care \n",
"1 Community based adult social care services \n",
"2 Community based adult social care services \n",
"\n",
" ... \\\n",
"0 ... \n",
"1 ... \n",
"2 ... \n",
"\n",
" Service user band - Learning disabilities or autistic spectrum disorder \\\n",
"0 NaN \n",
"1 NaN \n",
"2 NaN \n",
"\n",
" Service user band - Mental Health Service user band - Older People \\\n",
"0 Y NaN \n",
"1 Y Y \n",
"2 Y Y \n",
"\n",
" Service user band - People detained under the Mental Health Act \\\n",
"0 NaN \n",
"1 NaN \n",
"2 NaN \n",
"\n",
" Service user band - People who misuse drugs and alcohol \\\n",
"0 NaN \n",
"1 NaN \n",
"2 Y \n",
"\n",
" Service user band - People with an eating disorder \\\n",
"0 NaN \n",
"1 NaN \n",
"2 Y \n",
"\n",
" Service user band - Physical Disability \\\n",
"0 NaN \n",
"1 Y \n",
"2 Y \n",
"\n",
" Service user band - Sensory Impairment Service user band - Whole Population \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 Y NaN \n",
"\n",
" Service user band - Younger Adults \n",
"0 Y \n",
"1 Y \n",
"2 Y \n",
"\n",
"[3 rows x 94 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cqc_df=pd.read_excel('data/HSCA Active Locations.xlsx',\n",
" skiprows=6,\n",
" parse_dates=['HSCA start date','Provider HSCA start date'])\n",
"cqc_df.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Postal Code', 'Provider - Postal Code']"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Identify the name of the postcode column\n",
"[c for c in cqc_df.columns if 'post' in c.lower()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Access the NHS GP practice list:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Organisation Code</th>\n",
" <th>Name</th>\n",
" <th>National Grouping</th>\n",
" <th>High Level Health Geography</th>\n",
" <th>Address Line 1</th>\n",
" <th>Address Line 2</th>\n",
" <th>Address Line 3</th>\n",
" <th>Address Line 4</th>\n",
" <th>Address Line 5</th>\n",
" <th>Postcode</th>\n",
" <th>...</th>\n",
" <th>Close Date</th>\n",
" <th>Status Code</th>\n",
" <th>Organisation Sub-Type code</th>\n",
" <th>Commissioner</th>\n",
" <th>Join Provider/Purchaser Date</th>\n",
" <th>Left Provider/Purchaser Date</th>\n",
" <th>Contact Telephone Number</th>\n",
" <th>Amended Record Indicator</th>\n",
" <th>Provider/Purchaser</th>\n",
" <th>Prescribing Setting</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A81001</td>\n",
" <td>THE DENSHAM SURGERY</td>\n",
" <td>Y54</td>\n",
" <td>Q74</td>\n",
" <td>THE HEALTH CENTRE</td>\n",
" <td>LAWSON STREET</td>\n",
" <td>STOCKTON-ON-TEES</td>\n",
" <td>CLEVELAND</td>\n",
" <td>None</td>\n",
" <td>TS18 1HU</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>A</td>\n",
" <td>B</td>\n",
" <td>00K</td>\n",
" <td>2013-04-01 00:00:00</td>\n",
" <td>None</td>\n",
" <td>01642 672351</td>\n",
" <td>0</td>\n",
" <td>00K</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A81002</td>\n",
" <td>QUEENS PARK MEDICAL CENTRE</td>\n",
" <td>Y54</td>\n",
" <td>Q74</td>\n",
" <td>QUEENS PARK MEDICAL CTR</td>\n",
" <td>FARRER STREET</td>\n",
" <td>STOCKTON ON TEES</td>\n",
" <td>CLEVELAND</td>\n",
" <td>None</td>\n",
" <td>TS18 2AW</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>A</td>\n",
" <td>B</td>\n",
" <td>00K</td>\n",
" <td>2013-04-01 00:00:00</td>\n",
" <td>None</td>\n",
" <td>01642 679681</td>\n",
" <td>0</td>\n",
" <td>00K</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A81003</td>\n",
" <td>VICTORIA MEDICAL PRACTICE</td>\n",
" <td>Y54</td>\n",
" <td>Q74</td>\n",
" <td>THE HEALTH CENTRE</td>\n",
" <td>VICTORIA ROAD</td>\n",
" <td>HARTLEPOOL</td>\n",
" <td>CLEVELAND</td>\n",
" <td>None</td>\n",
" <td>TS26 8DB</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>D</td>\n",
" <td>B</td>\n",
" <td>00K</td>\n",
" <td>2013-04-01 00:00:00</td>\n",
" <td>None</td>\n",
" <td>01429 272945</td>\n",
" <td>0</td>\n",
" <td>00K</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A81004</td>\n",
" <td>WOODLANDS ROAD SURGERY</td>\n",
" <td>Y54</td>\n",
" <td>Q74</td>\n",
" <td>6 WOODLANDS ROAD</td>\n",
" <td>None</td>\n",
" <td>MIDDLESBROUGH</td>\n",
" <td>CLEVELAND</td>\n",
" <td>None</td>\n",
" <td>TS1 3BE</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>A</td>\n",
" <td>B</td>\n",
" <td>00M</td>\n",
" <td>2013-04-01 00:00:00</td>\n",
" <td>None</td>\n",
" <td>01642 247982</td>\n",
" <td>0</td>\n",
" <td>00M</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A81005</td>\n",
" <td>SPRINGWOOD SURGERY</td>\n",
" <td>Y54</td>\n",
" <td>Q74</td>\n",
" <td>SPRINGWOOD SURGERY</td>\n",
" <td>RECTORY LANE</td>\n",
" <td>GUISBOROUGH</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>TS14 7DJ</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>A</td>\n",
" <td>B</td>\n",
" <td>00M</td>\n",
" <td>2013-04-01 00:00:00</td>\n",
" <td>None</td>\n",
" <td>01287 619611</td>\n",
" <td>0</td>\n",
" <td>00M</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 21 columns</p>\n",
"</div>"
],
"text/plain": [
" Organisation Code Name National Grouping \\\n",
"0 A81001 THE DENSHAM SURGERY Y54 \n",
"1 A81002 QUEENS PARK MEDICAL CENTRE Y54 \n",
"2 A81003 VICTORIA MEDICAL PRACTICE Y54 \n",
"3 A81004 WOODLANDS ROAD SURGERY Y54 \n",
"4 A81005 SPRINGWOOD SURGERY Y54 \n",
"\n",
" High Level Health Geography Address Line 1 Address Line 2 \\\n",
"0 Q74 THE HEALTH CENTRE LAWSON STREET \n",
"1 Q74 QUEENS PARK MEDICAL CTR FARRER STREET \n",
"2 Q74 THE HEALTH CENTRE VICTORIA ROAD \n",
"3 Q74 6 WOODLANDS ROAD None \n",
"4 Q74 SPRINGWOOD SURGERY RECTORY LANE \n",
"\n",
" Address Line 3 Address Line 4 Address Line 5 Postcode \\\n",
"0 STOCKTON-ON-TEES CLEVELAND None TS18 1HU \n",
"1 STOCKTON ON TEES CLEVELAND None TS18 2AW \n",
"2 HARTLEPOOL CLEVELAND None TS26 8DB \n",
"3 MIDDLESBROUGH CLEVELAND None TS1 3BE \n",
"4 GUISBOROUGH None None TS14 7DJ \n",
"\n",
" ... Close Date Status Code Organisation Sub-Type code \\\n",
"0 ... None A B \n",
"1 ... None A B \n",
"2 ... None D B \n",
"3 ... None A B \n",
"4 ... None A B \n",
"\n",
" Commissioner Join Provider/Purchaser Date Left Provider/Purchaser Date \\\n",
"0 00K 2013-04-01 00:00:00 None \n",
"1 00K 2013-04-01 00:00:00 None \n",
"2 00K 2013-04-01 00:00:00 None \n",
"3 00M 2013-04-01 00:00:00 None \n",
"4 00M 2013-04-01 00:00:00 None \n",
"\n",
" Contact Telephone Number Amended Record Indicator Provider/Purchaser \\\n",
"0 01642 672351 0 00K \n",
"1 01642 679681 0 00K \n",
"2 01429 272945 0 00K \n",
"3 01642 247982 0 00M \n",
"4 01287 619611 0 00M \n",
"\n",
" Prescribing Setting \n",
"0 4 \n",
"1 4 \n",
"2 4 \n",
"3 4 \n",
"4 4 \n",
"\n",
"[5 rows x 21 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sqlite3\n",
"con = sqlite3.connect(\"nhsadmin.sqlite\")\n",
"EPRACCUR='epraccur'\n",
"nhs_df=pd.read_sql_query('SELECT * FROM {typ}'.format(typ=EPRACCUR), con)\n",
"nhs_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Pass 1 - Postcode + Exact Name Match\n",
"\n",
"Use the postcode as a crib and then also match on exact name, albeit case insensitive.\n",
"\n",
"*epraccurr* status code - `A` is *active*."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Organisation Code</th>\n",
" <th>Name</th>\n",
" <th>Postcode</th>\n",
" <th>Location ID</th>\n",
" <th>Location Name</th>\n",
" <th>Postal Code</th>\n",
" <th>_merge</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A81001</td>\n",
" <td>THE DENSHAM SURGERY</td>\n",
" <td>TS18 1HU</td>\n",
" <td>1-540731286</td>\n",
" <td>THE DENSHAM SURGERY</td>\n",
" <td>TS18 1HU</td>\n",
" <td>both</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A81002</td>\n",
" <td>QUEENS PARK MEDICAL CENTRE</td>\n",
" <td>TS18 2AW</td>\n",
" <td>1-540740218</td>\n",
" <td>QUEENS PARK MEDICAL CENTRE</td>\n",
" <td>TS18 2AW</td>\n",
" <td>both</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A81004</td>\n",
" <td>WOODLANDS ROAD SURGERY</td>\n",
" <td>TS1 3BE</td>\n",
" <td>1-549119442</td>\n",
" <td>WOODLANDS ROAD SURGERY</td>\n",
" <td>TS1 3BE</td>\n",
" <td>both</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A81005</td>\n",
" <td>SPRINGWOOD SURGERY</td>\n",
" <td>TS14 7DJ</td>\n",
" <td>1-540766364</td>\n",
" <td>SPRINGWOOD SURGERY</td>\n",
" <td>TS14 7DJ</td>\n",
" <td>both</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A81006</td>\n",
" <td>TENNANT STREET MEDICAL PRACTICE</td>\n",
" <td>TS18 2AT</td>\n",
" <td>1-540785056</td>\n",
" <td>TENNANT STREET MEDICAL PRACTICE</td>\n",
" <td>TS18 2AT</td>\n",
" <td>both</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Organisation Code Name Postcode Location ID \\\n",
"0 A81001 THE DENSHAM SURGERY TS18 1HU 1-540731286 \n",
"1 A81002 QUEENS PARK MEDICAL CENTRE TS18 2AW 1-540740218 \n",
"2 A81004 WOODLANDS ROAD SURGERY TS1 3BE 1-549119442 \n",
"3 A81005 SPRINGWOOD SURGERY TS14 7DJ 1-540766364 \n",
"4 A81006 TENNANT STREET MEDICAL PRACTICE TS18 2AT 1-540785056 \n",
"\n",
" Location Name Postal Code _merge \n",
"0 THE DENSHAM SURGERY TS18 1HU both \n",
"1 QUEENS PARK MEDICAL CENTRE TS18 2AW both \n",
"2 WOODLANDS ROAD SURGERY TS1 3BE both \n",
"3 SPRINGWOOD SURGERY TS14 7DJ both \n",
"4 TENNANT STREET MEDICAL PRACTICE TS18 2AT both "
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cqc_df['Location Name']=cqc_df['Location Name'].str.upper()\n",
"merge1=pd.merge(nhs_df[nhs_df['Status Code']=='A'][['Organisation Code','Name','Postcode']],\n",
" cqc_df[['Location ID','Location Name','Postal Code']],\n",
" left_on=['Postcode','Name'], right_on=['Postal Code','Location Name'],indicator=True,how='left')\n",
"m1=merge1\n",
"m1.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `left join` means we retain all the current practices from the NHS Digital listing.\n",
"\n",
"We can filter out the rows that didn't match by searching for records with a `left_only` desginator in the `merge` column of the merged dataframe."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of unmatched rows: 7408\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Organisation Code</th>\n",
" <th>Name</th>\n",
" <th>Postcode</th>\n",
" <th>Location ID</th>\n",
" <th>Location Name</th>\n",
" <th>Postal Code</th>\n",
" <th>_merge</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>A81007</td>\n",
" <td>BANKHOUSE SURGERY</td>\n",
" <td>TS24 7PW</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>A81026</td>\n",
" <td>THE LINTHORPE SURGERY</td>\n",
" <td>TS5 6HA</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>A81030</td>\n",
" <td>BOROUGH ROAD &amp; NUNTHORPE MEDICAL GROUP</td>\n",
" <td>TS1 3RY</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>A81032</td>\n",
" <td>THE GARTH</td>\n",
" <td>TS14 7DJ</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>A81034</td>\n",
" <td>THORNABY &amp; BARWICK MEDICAL GROUP</td>\n",
" <td>TS17 0EE</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>left_only</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Organisation Code Name Postcode \\\n",
"5 A81007 BANKHOUSE SURGERY TS24 7PW \n",
"21 A81026 THE LINTHORPE SURGERY TS5 6HA \n",
"24 A81030 BOROUGH ROAD & NUNTHORPE MEDICAL GROUP TS1 3RY \n",
"26 A81032 THE GARTH TS14 7DJ \n",
"28 A81034 THORNABY & BARWICK MEDICAL GROUP TS17 0EE \n",
"\n",
" Location ID Location Name Postal Code _merge \n",
"5 NaN NaN NaN left_only \n",
"21 NaN NaN NaN left_only \n",
"24 NaN NaN NaN left_only \n",
"26 NaN NaN NaN left_only \n",
"28 NaN NaN NaN left_only "
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print('Number of unmatched rows: {}'.format(len(m1[m1['_merge']=='left_only'])))\n",
"m1[m1['_merge']=='left_only'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pass 2 - Postcode + Fuzzy Name Match\n",
"\n",
"Use the postcode as a crib and then try to fuzzy match on name, again case insensitive."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see if we can do some partial/fuzzy matching with a relatively high degree of confidence:"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from fuzzywuzzy import fuzz\n",
"\n",
"m2=pd.merge(m1[m1['_merge']=='left_only'][['Organisation Code','Name','Postcode']],\n",
" cqc_df[['Location ID','Location Name','Postal Code']],\n",
" left_on=['Postcode'], right_on=['Postal Code'],indicator=True,how='left')\n",
"\n",
"m2['fuzz']=m2.dropna().apply(lambda x: fuzz.ratio(x['Name'],x['Location Name']),axis=1)\n",
"m2['fuzz2']=m2.dropna().apply(lambda x: fuzz.partial_ratio(x['Name'],x['Location Name']),axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Organisation Code</th>\n",
" <th>Name</th>\n",
" <th>Postcode</th>\n",
" <th>Location ID</th>\n",
" <th>Location Name</th>\n",
" <th>Postal Code</th>\n",
" <th>_merge</th>\n",
" <th>fuzz</th>\n",
" <th>fuzz2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>A81026</td>\n",
" <td>THE LINTHORPE SURGERY</td>\n",
" <td>TS5 6HA</td>\n",
" <td>1-2251705571</td>\n",
" <td>LINTHORPE SURGERY</td>\n",
" <td>TS5 6HA</td>\n",
" <td>both</td>\n",
" <td>89.0</td>\n",
" <td>100.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>A81030</td>\n",
" <td>BOROUGH ROAD &amp; NUNTHORPE MEDICAL GROUP</td>\n",
" <td>TS1 3RY</td>\n",
" <td>1-557690214</td>\n",
" <td>BOROUGH ROAD AND NUNTHORPE MEDICAL GROUP</td>\n",
" <td>TS1 3RY</td>\n",
" <td>both</td>\n",
" <td>95.0</td>\n",
" <td>92.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>A81034</td>\n",
" <td>THORNABY &amp; BARWICK MEDICAL GROUP</td>\n",
" <td>TS17 0EE</td>\n",
" <td>1-544613260</td>\n",
" <td>THORNABY AND BARWICK MEDICAL GROUP</td>\n",
" <td>TS17 0EE</td>\n",
" <td>both</td>\n",
" <td>94.0</td>\n",
" <td>91.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>A81037</td>\n",
" <td>THE ERIMUS PRACTICE</td>\n",
" <td>TS1 2NX</td>\n",
" <td>1-2265850518</td>\n",
" <td>ERIMUS PRACTICE</td>\n",
" <td>TS1 2NX</td>\n",
" <td>both</td>\n",
" <td>88.0</td>\n",
" <td>100.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>A81039</td>\n",
" <td>EAGLESCLIFFE MEDICAL PRACTICE</td>\n",
" <td>TS16 9EA</td>\n",
" <td>1-542237827</td>\n",
" <td>THE EAGLESCLIFFE MEDICAL PRACTICE</td>\n",
" <td>TS16 9EA</td>\n",
" <td>both</td>\n",
" <td>94.0</td>\n",
" <td>100.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Organisation Code Name Postcode \\\n",
"5 A81026 THE LINTHORPE SURGERY TS5 6HA \n",
"7 A81030 BOROUGH ROAD & NUNTHORPE MEDICAL GROUP TS1 3RY \n",
"12 A81034 THORNABY & BARWICK MEDICAL GROUP TS17 0EE \n",
"14 A81037 THE ERIMUS PRACTICE TS1 2NX \n",
"24 A81039 EAGLESCLIFFE MEDICAL PRACTICE TS16 9EA \n",
"\n",
" Location ID Location Name Postal Code _merge \\\n",
"5 1-2251705571 LINTHORPE SURGERY TS5 6HA both \n",
"7 1-557690214 BOROUGH ROAD AND NUNTHORPE MEDICAL GROUP TS1 3RY both \n",
"12 1-544613260 THORNABY AND BARWICK MEDICAL GROUP TS17 0EE both \n",
"14 1-2265850518 ERIMUS PRACTICE TS1 2NX both \n",
"24 1-542237827 THE EAGLESCLIFFE MEDICAL PRACTICE TS16 9EA both \n",
"\n",
" fuzz fuzz2 \n",
"5 89.0 100.0 \n",
"7 95.0 92.0 \n",
"12 94.0 91.0 \n",
"14 88.0 100.0 \n",
"24 94.0 100.0 "
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def confdisp(minconf=85,maxconf=100,typ='fuzz',lim=10):\n",
" return m2[(m2[typ]>=minconf) & (m2[typ]<=maxconf)].head(lim)\n",
"\n",
"\n",
"print(len(confdisp(85)))\n",
" \n",
"confdisp(85).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can make an interactive explorer using `ipywidgets` to explore different levels of confidence:"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9d6c82179a1e4121ba4f68204e84c89e"
}
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from ipywidgets import interact\n",
"import ipywidgets\n",
"\n",
"\n",
"interact(confdisp,\n",
" minconf=ipywidgets.IntSlider(min=0,max=100,step=1,value=85),\n",
" maxconf=ipywidgets.IntSlider(min=0,max=100,step=1,value=100),\n",
" typ=ipywidgets.RadioButtons(options=['fuzz', 'fuzz2']));"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also explore other fuzzy match combinations - note how we might be able to improve matters if we remove subject specific stop words e.g. *Suregry*, *Medical Centre* etc."
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Organisation Code</th>\n",
" <th>Name</th>\n",
" <th>Postcode</th>\n",
" <th>Location ID</th>\n",
" <th>Location Name</th>\n",
" <th>Postal Code</th>\n",
" <th>_merge</th>\n",
" <th>fuzz</th>\n",
" <th>fuzz2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>A81032</td>\n",
" <td>THE GARTH</td>\n",
" <td>TS14 7DJ</td>\n",
" <td>1-571280521</td>\n",
" <td>THE GARTH SURGERY</td>\n",
" <td>TS14 7DJ</td>\n",
" <td>both</td>\n",
" <td>69.0</td>\n",
" <td>100.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>A81038</td>\n",
" <td>HIRSEL MEDICAL CENTRE</td>\n",
" <td>TS3 6AL</td>\n",
" <td>1-566800129</td>\n",
" <td>KINGS MEDICAL CENTRE</td>\n",
" <td>TS3 6AL</td>\n",
" <td>both</td>\n",
" <td>83.0</td>\n",
" <td>85.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>A81040</td>\n",
" <td>MARSH HOUSE MEDICAL PRACTICE</td>\n",
" <td>TS23 2DG</td>\n",
" <td>1-543965910</td>\n",
" <td>MARSH HOUSE MEDICAL CENTRE</td>\n",
" <td>TS23 2DG</td>\n",
" <td>both</td>\n",
" <td>85.0</td>\n",
" <td>85.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>A81044</td>\n",
" <td>MCKENZIE HOUSE SURGERY</td>\n",
" <td>TS25 1QU</td>\n",
" <td>1-549723762</td>\n",
" <td>MCKENZIE HOUSE</td>\n",
" <td>TS25 1QU</td>\n",
" <td>both</td>\n",
" <td>78.0</td>\n",
" <td>100.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>61</th>\n",
" <td>A82021</td>\n",
" <td>CASTLEGATE AND DERWENT SURGERY</td>\n",
" <td>CA13 9HT</td>\n",
" <td>1-1522911257</td>\n",
" <td>DERWENT SURGERY</td>\n",
" <td>CA13 9HT</td>\n",
" <td>both</td>\n",
" <td>67.0</td>\n",
" <td>100.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Organisation Code Name Postcode Location ID \\\n",
"9 A81032 THE GARTH TS14 7DJ 1-571280521 \n",
"21 A81038 HIRSEL MEDICAL CENTRE TS3 6AL 1-566800129 \n",
"27 A81040 MARSH HOUSE MEDICAL PRACTICE TS23 2DG 1-543965910 \n",
"28 A81044 MCKENZIE HOUSE SURGERY TS25 1QU 1-549723762 \n",
"61 A82021 CASTLEGATE AND DERWENT SURGERY CA13 9HT 1-1522911257 \n",
"\n",
" Location Name Postal Code _merge fuzz fuzz2 \n",
"9 THE GARTH SURGERY TS14 7DJ both 69.0 100.0 \n",
"21 KINGS MEDICAL CENTRE TS3 6AL both 83.0 85.0 \n",
"27 MARSH HOUSE MEDICAL CENTRE TS23 2DG both 85.0 85.0 \n",
"28 MCKENZIE HOUSE TS25 1QU both 78.0 100.0 \n",
"61 DERWENT SURGERY CA13 9HT both 67.0 100.0 "
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m2[(m2['fuzz']<=85) & (m2['fuzz2']>=85)].head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment