Skip to content

Instantly share code, notes, and snippets.

@rebeccabilbro
Last active February 22, 2016 01:22
Show Gist options
  • Save rebeccabilbro/d55857793011028fb3ff to your computer and use it in GitHub Desktop.
Save rebeccabilbro/d55857793011028fb3ff to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploring Entity Resolution with Dedupe in Python \n",
"\n",
"This walk-through uses [Jupyter Notebook](http://jupyter.readthedocs.org/en/latest/install.html) and [Pandas](https://readthedocs.org/projects/pandas/) (and of course, [Dedupe](https://dedupe.readthedocs.org/en/latest/)) to explore some initial approaches to deduplication and entity resolution with the Python library Dedupe.\n",
"\n",
"Please make sure you have Jupyter and Pandas installed before we move on.\n",
"\n",
"```bash\n",
"pip install jupyter\n",
"pip install pandas\n",
"```\n",
"\n",
"## Clone repo and get started \n",
"\n",
"To get started, we'll clone a git repository with some sample text files and deduplication scripts: \n",
"\n",
"```bash\n",
"git clone https://github.com/DistrictDataLabs/dedupe-examples.git\n",
"cd dedupe-examples\n",
"```\n",
"\n",
"This first example has us working with a list of early childhood education sites in Chicago from 10 different sources.\n",
"```bash\n",
"cd csv_example \n",
"```\n",
"\n",
"## Data exploration\n",
"Before we get any further, let's use a Pandas `dataframe` to explore the dataset we'll be working with:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\"../dedupe-examples/csv_example/csv_example_messy_input.csv\", index_col=\"Id\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the `shape` method to find out how many rows and columns we're dealing with:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(3337, 31)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we can use the `list` method to get the column names of our dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['Source',\n",
" 'Site name',\n",
" 'Address',\n",
" 'Zip',\n",
" 'Phone',\n",
" 'Fax',\n",
" 'Program Name',\n",
" 'Length of Day',\n",
" 'IDHS Provider ID',\n",
" 'Agency',\n",
" 'Neighborhood',\n",
" 'Funded Enrollment',\n",
" 'Program Option',\n",
" 'Number per Site EHS',\n",
" 'Number per Site HS',\n",
" 'Director',\n",
" 'Head Start Fund',\n",
" 'Eearly Head Start Fund',\n",
" 'CC fund',\n",
" 'Progmod',\n",
" 'Website',\n",
" 'Executive Director',\n",
" 'Center Director',\n",
" 'ECE Available Programs',\n",
" 'NAEYC Valid Until',\n",
" 'NAEYC Program Id',\n",
" 'Email Address',\n",
" 'Ounce of Prevention Description',\n",
" 'Purple binder service type',\n",
" 'Column',\n",
" 'Column2']"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calling `head` will give us the first few rows of the dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Source</th>\n",
" <th>Site name</th>\n",
" <th>Address</th>\n",
" <th>Zip</th>\n",
" <th>Phone</th>\n",
" <th>Fax</th>\n",
" <th>Program Name</th>\n",
" <th>Length of Day</th>\n",
" <th>IDHS Provider ID</th>\n",
" <th>Agency</th>\n",
" <th>...</th>\n",
" <th>Executive Director</th>\n",
" <th>Center Director</th>\n",
" <th>ECE Available Programs</th>\n",
" <th>NAEYC Valid Until</th>\n",
" <th>NAEYC Program Id</th>\n",
" <th>Email Address</th>\n",
" <th>Ounce of Prevention Description</th>\n",
" <th>Purple binder service type</th>\n",
" <th>Column</th>\n",
" <th>Column2</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Salvation Army - Temple / Salvation Army</td>\n",
" <td>1 N Ogden Ave</td>\n",
" <td>NaN</td>\n",
" <td>2262649</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Salvation Army - Temple / Salvation Army</td>\n",
" <td>1 N Ogden Ave</td>\n",
" <td>NaN</td>\n",
" <td>2262649</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>National Louis University - Dr. Effie O. Elli...</td>\n",
" <td>10 S Kedzie Ave</td>\n",
" <td>NaN</td>\n",
" <td>5339011</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>National Louis University - Dr. Effie O. Elli...</td>\n",
" <td>10 S Kedzie Ave</td>\n",
" <td>NaN</td>\n",
" <td>5339011</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Board Trustees-City Colleges of Chicago - Oli...</td>\n",
" <td>10001 S Woodlawn Ave</td>\n",
" <td>NaN</td>\n",
" <td>2916100</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Board Trustees-City Colleges of Chicago - Oli...</td>\n",
" <td>10001 S Woodlawn Ave</td>\n",
" <td>NaN</td>\n",
" <td>2916100</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Easter Seals Society of Metropolitan Chicago ...</td>\n",
" <td>1001 W Roosevelt Rd</td>\n",
" <td>NaN</td>\n",
" <td>9395115</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Easter Seals Society of Metropolitan Chicago ...</td>\n",
" <td>1001 W Roosevelt Rd</td>\n",
" <td>NaN</td>\n",
" <td>9395115</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Hull House Association - Uptown Head Start / ...</td>\n",
" <td>1020 W Bryn Mawr Ave</td>\n",
" <td>NaN</td>\n",
" <td>7695753</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>CPS_Early_Childhood_Portal_scrape.csv</td>\n",
" <td>Hull House Association - Child Dev. Central O...</td>\n",
" <td>1030 W Van Buren St</td>\n",
" <td>NaN</td>\n",
" <td>9068600</td>\n",
" <td>NaN</td>\n",
" <td>Child Care</td>\n",
" <td>EXTENDED DAY</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows × 31 columns</p>\n",
"</div>"
],
"text/plain": [
" Source \\\n",
"Id \n",
"0 CPS_Early_Childhood_Portal_scrape.csv \n",
"1 CPS_Early_Childhood_Portal_scrape.csv \n",
"2 CPS_Early_Childhood_Portal_scrape.csv \n",
"3 CPS_Early_Childhood_Portal_scrape.csv \n",
"4 CPS_Early_Childhood_Portal_scrape.csv \n",
"5 CPS_Early_Childhood_Portal_scrape.csv \n",
"6 CPS_Early_Childhood_Portal_scrape.csv \n",
"7 CPS_Early_Childhood_Portal_scrape.csv \n",
"8 CPS_Early_Childhood_Portal_scrape.csv \n",
"9 CPS_Early_Childhood_Portal_scrape.csv \n",
"\n",
" Site name Address \\\n",
"Id \n",
"0 Salvation Army - Temple / Salvation Army 1 N Ogden Ave \n",
"1 Salvation Army - Temple / Salvation Army 1 N Ogden Ave \n",
"2 National Louis University - Dr. Effie O. Elli... 10 S Kedzie Ave \n",
"3 National Louis University - Dr. Effie O. Elli... 10 S Kedzie Ave \n",
"4 Board Trustees-City Colleges of Chicago - Oli... 10001 S Woodlawn Ave \n",
"5 Board Trustees-City Colleges of Chicago - Oli... 10001 S Woodlawn Ave \n",
"6 Easter Seals Society of Metropolitan Chicago ... 1001 W Roosevelt Rd \n",
"7 Easter Seals Society of Metropolitan Chicago ... 1001 W Roosevelt Rd \n",
"8 Hull House Association - Uptown Head Start / ... 1020 W Bryn Mawr Ave \n",
"9 Hull House Association - Child Dev. Central O... 1030 W Van Buren St \n",
"\n",
" Zip Phone Fax Program Name Length of Day IDHS Provider ID Agency \\\n",
"Id \n",
"0 NaN 2262649 NaN Child Care EXTENDED DAY NaN NaN \n",
"1 NaN 2262649 NaN Child Care EXTENDED DAY NaN NaN \n",
"2 NaN 5339011 NaN Child Care EXTENDED DAY NaN NaN \n",
"3 NaN 5339011 NaN Child Care EXTENDED DAY NaN NaN \n",
"4 NaN 2916100 NaN Child Care EXTENDED DAY NaN NaN \n",
"5 NaN 2916100 NaN Child Care EXTENDED DAY NaN NaN \n",
"6 NaN 9395115 NaN Child Care EXTENDED DAY NaN NaN \n",
"7 NaN 9395115 NaN Child Care EXTENDED DAY NaN NaN \n",
"8 NaN 7695753 NaN Child Care EXTENDED DAY NaN NaN \n",
"9 NaN 9068600 NaN Child Care EXTENDED DAY NaN NaN \n",
"\n",
" ... Executive Director Center Director ECE Available Programs \\\n",
"Id ... \n",
"0 ... NaN NaN NaN \n",
"1 ... NaN NaN NaN \n",
"2 ... NaN NaN NaN \n",
"3 ... NaN NaN NaN \n",
"4 ... NaN NaN NaN \n",
"5 ... NaN NaN NaN \n",
"6 ... NaN NaN NaN \n",
"7 ... NaN NaN NaN \n",
"8 ... NaN NaN NaN \n",
"9 ... NaN NaN NaN \n",
"\n",
" NAEYC Valid Until NAEYC Program Id Email Address \\\n",
"Id \n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 NaN NaN NaN \n",
"3 NaN NaN NaN \n",
"4 NaN NaN NaN \n",
"5 NaN NaN NaN \n",
"6 NaN NaN NaN \n",
"7 NaN NaN NaN \n",
"8 NaN NaN NaN \n",
"9 NaN NaN NaN \n",
"\n",
" Ounce of Prevention Description Purple binder service type Column Column2 \n",
"Id \n",
"0 NaN NaN NaN NaN \n",
"1 NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN \n",
"3 NaN NaN NaN NaN \n",
"4 NaN NaN NaN NaN \n",
"5 NaN NaN NaN NaN \n",
"6 NaN NaN NaN NaN \n",
"7 NaN NaN NaN NaN \n",
"8 NaN NaN NaN NaN \n",
"9 NaN NaN NaN NaN \n",
"\n",
"[10 rows x 31 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It appears that our entities are childcare sites, and at first glance, there do appear to be some possible duplicates. \n",
"\n",
"We can search for duplicates within the rows of a column. Let's check for duplicates in the \"Site name\" column:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Id\n",
"0 False\n",
"1 True\n",
"2 False\n",
"3 True\n",
"4 False\n",
"5 True\n",
"6 False\n",
"7 True\n",
"8 False\n",
"9 False\n",
"10 True\n",
"11 False\n",
"12 True\n",
"13 False\n",
"14 False\n",
"15 True\n",
"16 False\n",
"17 False\n",
"18 False\n",
"19 False\n",
"20 True\n",
"21 False\n",
"22 False\n",
"23 True\n",
"24 False\n",
"25 True\n",
"26 False\n",
"27 False\n",
"28 True\n",
"29 False\n",
" ... \n",
"3307 False\n",
"3308 False\n",
"3309 True\n",
"3310 False\n",
"3311 True\n",
"3312 True\n",
"3313 True\n",
"3314 True\n",
"3315 False\n",
"3316 True\n",
"3317 True\n",
"3318 False\n",
"3319 False\n",
"3320 False\n",
"3321 False\n",
"3322 False\n",
"3323 True\n",
"3324 True\n",
"3325 True\n",
"3326 True\n",
"3327 False\n",
"3328 False\n",
"3329 False\n",
"3330 False\n",
"3331 False\n",
"3332 True\n",
"3333 False\n",
"3334 True\n",
"3335 True\n",
"3336 False\n",
"Name: Site name, dtype: bool"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"Site name\"].duplicated() "
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Looks like a lot of duplicates! \n",
"\n",
"## Testing out `dedupe`\n",
"\n",
"Let's experiment with using the `dedupe` library to try cleaning up our file. \n",
"\n",
"To get `dedupe` running, we'll need to install [Unidecode](https://pypi.python.org/pypi/Unidecode), [Future](https://pypi.python.org/pypi/future), and [Dedupe](https://dedupe.readthedocs.org/en/latest/).\n",
"\n",
"In your terminal:\n",
"```bash\n",
"pip install unidecode\n",
"pip install future\n",
"pip install dedupe\n",
"``` \n",
"\n",
"Then we'll run the csv_example.py file to see what dedupe can do:\n",
"\n",
"```bash\n",
"python csv_example.py\n",
"```\n",
"\n",
"You can see that `dedupe` is a command line application that will prompt the user to engage in active learning by showing pairs of entities and asking if they are the same or different.\n",
"\n",
" Do these records refer to the same thing?\n",
" (y)es / (n)o / (u)nsure / (f)inished\n",
"\n",
"Let's start training! \n",
"Use 'y', 'n' and 'u' keys to flag duplicates for active learning. \n",
"\n",
"When you are finished, enter 'f' to quit. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Questions for the DDRL Entity Resolution Lab\n",
"\n",
"1. What is dedupe doing with our [csv_example.py file](http://datamade.github.com/dedupe-examples/docs/csv_example.html)?\n",
"2. In general, how does [dedupe](http://datamade.github.io/dedupe-examples/docs/csv_example.html) work?\n",
"2. How does the active learning method work? \n",
"3. How does `dedupe` decide [which data fields](https://github.com/DistrictDataLabs/dedupe-examples/blob/master/csv_example/csv_example.py#L91) to include in the training? How would this work differently/better/worse with the dataset you have been working with for the last two weeks? \n",
"4. How does `dedupe` treat yeses, nos, and unsures? \n",
"5. What do you like about `dedupe`? \n",
"6. What would make `dedupe` better? (And what do we mean by \"better\"?) "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment