Skip to content

Instantly share code, notes, and snippets.

@pybokeh
Created June 1, 2014 14:51
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save pybokeh/6b737ce074412da5a6f7 to your computer and use it in GitHub Desktop.
Save pybokeh/6b737ce074412da5a6f7 to your computer and use it in GitHub Desktop.
Text_Mining
{
"metadata": {
"name": "",
"signature": "sha256:aa070c6abdf02e19a07747ee3236db36c253b12b62bfb02a6b6201ec51cf4a08"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# <center>Text Mining Vehicle Warranty Data</center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This IPython notebook demonstrates how I perform text mining on vehicle warranty claims. I am often tasked with answering questions like \"what problems are customers experiencing with their vehicles?\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get started, I have sample warranty data in a csv file that I will import using the Pandas package. I will also make use of the NLTK (Natural Language ToolKit) to help me filter out irrelevant, high frequency words like THE, AND, IS, HIS, HER, etc (prepositions and pronouns) and also the string module to remove punctuation marks. I will also use the convenient Counter class from the collections module to count the number of unique words. With only 10 claims or records, the processes that I am illustrating would be unnecessary or overkill. But if my data set was in the magnitude of hundreds or several thousands, then these processes will be more relevant. So the sample size of 10 was used for simplicity in demonstrating my processes."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import nltk\n",
"from collections import Counter\n",
"from nltk.corpus import stopwords\n",
"import string\n",
"\n",
"df = pd.read_csv('/home/john_doe/Downloads/sample.csv')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 263
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is what the data looks like:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VIN</th>\n",
" <th>CONTDESC</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> VIN1</td>\n",
" <td> CUSTOMER STATES NOISE FROM STEERING WHEEL</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> VIN2</td>\n",
" <td> CUSTOMER STATES SRS LIGHT IS ON</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> VIN3</td>\n",
" <td> RATTLE FROM THE AIRBAG</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> VIN4</td>\n",
" <td> TICKING SOUND CAN BE HEARD FROM THE DRIVER'S A...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td> VIN5</td>\n",
" <td> RATTLE NOISE FROM THE STEERING WHEEL AREA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td> VIN6</td>\n",
" <td> SRS LIGHT IS ON</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td> VIN7</td>\n",
" <td> NOISE FROM L/FRT AIRBAG</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td> VIN8</td>\n",
" <td> CUSTOMER STATES NOISE CAN BE HEARD FROM STEERI...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td> VIN9</td>\n",
" <td> TICKING NOISE FROM STEERING WHEEL</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td> VIN10</td>\n",
" <td> SRS LIGHT IS ON</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 264,
"text": [
" VIN CONTDESC\n",
"0 VIN1 CUSTOMER STATES NOISE FROM STEERING WHEEL\n",
"1 VIN2 CUSTOMER STATES SRS LIGHT IS ON\n",
"2 VIN3 RATTLE FROM THE AIRBAG\n",
"3 VIN4 TICKING SOUND CAN BE HEARD FROM THE DRIVER'S A...\n",
"4 VIN5 RATTLE NOISE FROM THE STEERING WHEEL AREA\n",
"5 VIN6 SRS LIGHT IS ON\n",
"6 VIN7 NOISE FROM L/FRT AIRBAG\n",
"7 VIN8 CUSTOMER STATES NOISE CAN BE HEARD FROM STEERI...\n",
"8 VIN9 TICKING NOISE FROM STEERING WHEEL\n",
"9 VIN10 SRS LIGHT IS ON"
]
}
],
"prompt_number": 264
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I will make a list of the words contained in the \"CONTDESC\" column:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"contdesc = df['CONTDESC']"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 265
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"words = [word for word in contdesc.values]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 266
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"words"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 267,
"text": [
"['CUSTOMER STATES NOISE FROM STEERING WHEEL',\n",
" 'CUSTOMER STATES SRS LIGHT IS ON',\n",
" 'RATTLE FROM THE AIRBAG',\n",
" \"TICKING SOUND CAN BE HEARD FROM THE DRIVER'S AIRBAG\",\n",
" 'RATTLE NOISE FROM THE STEERING WHEEL AREA',\n",
" 'SRS LIGHT IS ON',\n",
" 'NOISE FROM L/FRT AIRBAG',\n",
" 'CUSTOMER STATES NOISE CAN BE HEARD FROM STEERING WHEEL AIRBAG',\n",
" 'TICKING NOISE FROM STEERING WHEEL',\n",
" 'SRS LIGHT IS ON']"
]
}
],
"prompt_number": 267
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see from above, the trouble is, I have a list of sentences, when I just want a list with all the individual words from all the sentences. With some looping, I was able to accomplish this:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"words_list = []\n",
"for line in words:\n",
" for character in line:\n",
" if character in string.punctuation: # string.punctuation = !\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~\n",
" line = line.replace(character,\"\")\n",
" words_list.extend([word for word in line.split()])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 268
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, I should have a list of all the words from the sentences:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"words_list"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 269,
"text": [
"['CUSTOMER',\n",
" 'STATES',\n",
" 'NOISE',\n",
" 'FROM',\n",
" 'STEERING',\n",
" 'WHEEL',\n",
" 'CUSTOMER',\n",
" 'STATES',\n",
" 'SRS',\n",
" 'LIGHT',\n",
" 'IS',\n",
" 'ON',\n",
" 'RATTLE',\n",
" 'FROM',\n",
" 'THE',\n",
" 'AIRBAG',\n",
" 'TICKING',\n",
" 'SOUND',\n",
" 'CAN',\n",
" 'BE',\n",
" 'HEARD',\n",
" 'FROM',\n",
" 'THE',\n",
" 'DRIVERS',\n",
" 'AIRBAG',\n",
" 'RATTLE',\n",
" 'NOISE',\n",
" 'FROM',\n",
" 'THE',\n",
" 'STEERING',\n",
" 'WHEEL',\n",
" 'AREA',\n",
" 'SRS',\n",
" 'LIGHT',\n",
" 'IS',\n",
" 'ON',\n",
" 'NOISE',\n",
" 'FROM',\n",
" 'LFRT',\n",
" 'AIRBAG',\n",
" 'CUSTOMER',\n",
" 'STATES',\n",
" 'NOISE',\n",
" 'CAN',\n",
" 'BE',\n",
" 'HEARD',\n",
" 'FROM',\n",
" 'STEERING',\n",
" 'WHEEL',\n",
" 'AIRBAG',\n",
" 'TICKING',\n",
" 'NOISE',\n",
" 'FROM',\n",
" 'STEERING',\n",
" 'WHEEL',\n",
" 'SRS',\n",
" 'LIGHT',\n",
" 'IS',\n",
" 'ON']"
]
}
],
"prompt_number": 269
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But, I need to filter out the irrelevant, high frequency words:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"stop = stopwords.words('english')\n",
"filtered_words = [word for word in words_list if not word.lower() in stop]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 270
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"filtered_words"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 271,
"text": [
"['CUSTOMER',\n",
" 'STATES',\n",
" 'NOISE',\n",
" 'STEERING',\n",
" 'WHEEL',\n",
" 'CUSTOMER',\n",
" 'STATES',\n",
" 'SRS',\n",
" 'LIGHT',\n",
" 'RATTLE',\n",
" 'AIRBAG',\n",
" 'TICKING',\n",
" 'SOUND',\n",
" 'HEARD',\n",
" 'DRIVERS',\n",
" 'AIRBAG',\n",
" 'RATTLE',\n",
" 'NOISE',\n",
" 'STEERING',\n",
" 'WHEEL',\n",
" 'AREA',\n",
" 'SRS',\n",
" 'LIGHT',\n",
" 'NOISE',\n",
" 'LFRT',\n",
" 'AIRBAG',\n",
" 'CUSTOMER',\n",
" 'STATES',\n",
" 'NOISE',\n",
" 'HEARD',\n",
" 'STEERING',\n",
" 'WHEEL',\n",
" 'AIRBAG',\n",
" 'TICKING',\n",
" 'NOISE',\n",
" 'STEERING',\n",
" 'WHEEL',\n",
" 'SRS',\n",
" 'LIGHT']"
]
}
],
"prompt_number": 271
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that I have filtered out the irrelevant, high frequency words, I can now see what are the top most common words:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"counted = Counter(filtered_words)\n",
"top10 = counted.most_common(10)\n",
"top10"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 272,
"text": [
"[('NOISE', 5),\n",
" ('STEERING', 4),\n",
" ('AIRBAG', 4),\n",
" ('WHEEL', 4),\n",
" ('STATES', 3),\n",
" ('LIGHT', 3),\n",
" ('SRS', 3),\n",
" ('CUSTOMER', 3),\n",
" ('TICKING', 2),\n",
" ('HEARD', 2)]"
]
}
],
"prompt_number": 272
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I can see that it appears that we have a noise problem. With only 10 claims, this can be easily seen by just reading the claims, but let's say you have several hundreds or thousands of claims, then this exercise becomes more relevant."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Out of curiosity, I wanted to see all the unique words that have been used:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"counted"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 273,
"text": [
"Counter({'NOISE': 5, 'STEERING': 4, 'AIRBAG': 4, 'WHEEL': 4, 'STATES': 3, 'LIGHT': 3, 'SRS': 3, 'CUSTOMER': 3, 'TICKING': 2, 'HEARD': 2, 'RATTLE': 2, 'DRIVERS': 1, 'AREA': 1, 'LFRT': 1, 'SOUND': 1})"
]
}
],
"prompt_number": 273
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, there are quite a few different words being used to describe noise like RATTLE, TICKING, or SOUND. Wouldn't it be nice to lump these together and just label or categorize them as just noise? I see that we also have a SRS light coming on problem also. Again, I would like to lump those together in their own separate category like LIGHT ON. Then, I can summarize based on those categories."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To accomplish this, I decided to us Pandas' apply() method which allows me to create a new column and apply a function to it. I will create a \"COMPLAINT\" column whereby it will contain the label or category based on words found in the \"CONTDESC\" or complaint field."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"noise_words = ['NOIS','NIOS','RATT','RATL','HEARD','HEARS','SQ','VIB','TICK','TAP','WHIN','HOWL','WHIRL','LOUD','SOUND']\n",
"light_on_words = ['LIGHT','LAMP','INDICATOR','LITE',' LIT', 'LT ON']\n",
"inop_words = ['INOP','T WORK','T COME','T TURN']\n",
"stalled_words = ['STALLED WH','STALLS WH']\n",
"\n",
"def complaint(series):\n",
" for word in noise_words:\n",
" if word in series[\"CONTDESC\"]:\n",
" return \"NOISE\"\n",
" \n",
" for word in inop_words:\n",
" if word in series[\"CONTDESC\"]:\n",
" return \"INOP\"\n",
" \n",
" for word in light_on_words:\n",
" if word in series[\"CONTDESC\"]:\n",
" return \"LIGHT ON\""
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 274
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df[\"COMPLAINT\"] = df.apply(complaint,axis=1)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 275
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is what the new \"COMPLAINT\" column looks like:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VIN</th>\n",
" <th>CONTDESC</th>\n",
" <th>COMPLAINT</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> VIN1</td>\n",
" <td> CUSTOMER STATES NOISE FROM STEERING WHEEL</td>\n",
" <td> NOISE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> VIN2</td>\n",
" <td> CUSTOMER STATES SRS LIGHT IS ON</td>\n",
" <td> LIGHT ON</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> VIN3</td>\n",
" <td> RATTLE FROM THE AIRBAG</td>\n",
" <td> NOISE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> VIN4</td>\n",
" <td> TICKING SOUND CAN BE HEARD FROM THE DRIVER'S A...</td>\n",
" <td> NOISE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td> VIN5</td>\n",
" <td> RATTLE NOISE FROM THE STEERING WHEEL AREA</td>\n",
" <td> NOISE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td> VIN6</td>\n",
" <td> SRS LIGHT IS ON</td>\n",
" <td> LIGHT ON</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td> VIN7</td>\n",
" <td> NOISE FROM L/FRT AIRBAG</td>\n",
" <td> NOISE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td> VIN8</td>\n",
" <td> CUSTOMER STATES NOISE CAN BE HEARD FROM STEERI...</td>\n",
" <td> NOISE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td> VIN9</td>\n",
" <td> TICKING NOISE FROM STEERING WHEEL</td>\n",
" <td> NOISE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td> VIN10</td>\n",
" <td> SRS LIGHT IS ON</td>\n",
" <td> LIGHT ON</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 276,
"text": [
" VIN CONTDESC COMPLAINT\n",
"0 VIN1 CUSTOMER STATES NOISE FROM STEERING WHEEL NOISE\n",
"1 VIN2 CUSTOMER STATES SRS LIGHT IS ON LIGHT ON\n",
"2 VIN3 RATTLE FROM THE AIRBAG NOISE\n",
"3 VIN4 TICKING SOUND CAN BE HEARD FROM THE DRIVER'S A... NOISE\n",
"4 VIN5 RATTLE NOISE FROM THE STEERING WHEEL AREA NOISE\n",
"5 VIN6 SRS LIGHT IS ON LIGHT ON\n",
"6 VIN7 NOISE FROM L/FRT AIRBAG NOISE\n",
"7 VIN8 CUSTOMER STATES NOISE CAN BE HEARD FROM STEERI... NOISE\n",
"8 VIN9 TICKING NOISE FROM STEERING WHEEL NOISE\n",
"9 VIN10 SRS LIGHT IS ON LIGHT ON"
]
}
],
"prompt_number": 276
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now I will use Pandas' groupby() to group the records by COMPLAINT:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df1 = df.groupby(['COMPLAINT'])[\"VIN\"].count()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 278
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df1"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 279,
"text": [
"COMPLAINT\n",
"LIGHT ON 3\n",
"NOISE 7\n",
"Name: VIN, dtype: int64"
]
}
],
"prompt_number": 279
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now I will make a separate Pandas dataframe"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"d = {\"complaint\":df1.values}\n",
"df2 = pd.DataFrame(d,index=df1.index)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 280
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>complaint</th>\n",
" </tr>\n",
" <tr>\n",
" <th>COMPLAINT</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>LIGHT ON</th>\n",
" <td> 3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>NOISE</th>\n",
" <td> 7</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 281,
"text": [
" complaint\n",
"COMPLAINT \n",
"LIGHT ON 3\n",
"NOISE 7"
]
}
],
"prompt_number": 281
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"...and I will add a percent of total column also:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df2[\"PercOfTotal\"] = df2[\"complaint\"] / df2[\"complaint\"].sum() * 100"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 282
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>complaint</th>\n",
" <th>PercOfTotal</th>\n",
" </tr>\n",
" <tr>\n",
" <th>COMPLAINT</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>LIGHT ON</th>\n",
" <td> 3</td>\n",
" <td> 30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>NOISE</th>\n",
" <td> 7</td>\n",
" <td> 70</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 283,
"text": [
" complaint PercOfTotal\n",
"COMPLAINT \n",
"LIGHT ON 3 30\n",
"NOISE 7 70"
]
}
],
"prompt_number": 283
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, I can say that 70% of the complaints are due to some kind of noise and 30% are due to SRS light coming on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fellow data analysts using Python, is there a better way?"
]
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment