Skip to content

Instantly share code, notes, and snippets.

@DouglasSherk
Created September 6, 2017 18:45
Show Gist options
  • Save DouglasSherk/2249493d8ccff242e153180096f987df to your computer and use it in GitHub Desktop.
Save DouglasSherk/2249493d8ccff242e153180096f987df to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Craigslist Post Classifier\n",
"\n",
"In this post, we're going to build a classifier to \n",
"\n",
"### Outline\n",
"1. Introduction\n",
" 1. Motivation for building this\n",
" 2. Exact goals\n",
"2. The challenges that I went through\n",
" 1. Rushing to deep learning and TensorFlow\n",
" 2. Andrew Ng machine learning course\n",
" 3. Using eBay data\n",
"3. \n",
" \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import goop\n",
"\n",
"Most of the lines below are imports for functions and libraries that we'll be using. We import two libraries for this project from the `lib` folder in the repository."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from __future__ import print_function\n",
"\n",
"import sys\n",
"sys.path.append('..')\n",
"\n",
"# Libraries functions that were built for this project\n",
"# or copied and pasted from elsewhere\n",
"from lib.item_selector import ItemSelector\n",
"from lib.model_performance_plotter import plot_learning_curve\n",
"\n",
"import json\n",
"import pandas\n",
"from pprint import pprint\n",
"from sklearn.base import BaseEstimator\n",
"from sklearn.externals import joblib\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.pipeline import FeatureUnion\n",
"from sklearn.pipeline import Pipeline\n",
"from time import time\n",
"\n",
"\"\"\"File to load category mapping from\"\"\"\n",
"CATEGORY_FILE = 'data/categories.json'\n",
"\"\"\"File to load data set from\"\"\"\n",
"DATA_FILE = 'data/cl_posts.csv'\n",
"\"\"\"File to save the complete model into\"\"\"\n",
"MODEL_FILE = 'out/cl_model.pkl'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load and explore the data\n",
"\n",
"Use *pandas* to load Craigslist posts from a CSV file, then drop all examples within that have any `N/A` or `NaN` fields."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"data = pandas.read_csv(DATA_FILE)\n",
"data = data.dropna()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" <th>description</th>\n",
" <th>category</th>\n",
" <th>url</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Are You a Married Woman Looking for Two Guys?</td>\n",
" <td>We're two fun, discreet married white professi...</td>\n",
" <td>men seeking women</td>\n",
" <td>https://chicago.craigslist.org/chc/m4w/d/are-y...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Producing Consultant</td>\n",
" <td>Building effective and collaborative relations...</td>\n",
" <td>business/mgmt</td>\n",
" <td>https://iowacity.craigslist.org/bus/d/producin...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Need ride to tampa thursday for court!!</td>\n",
" <td>I am a single mother fighting for custody of m...</td>\n",
" <td>rideshare</td>\n",
" <td>https://ocala.craigslist.org/rid/d/need-ride-t...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Corsair GS 800 Desktop Power Supply</td>\n",
" <td>Selling my Corsair GS 800 Desktop Power Supply...</td>\n",
" <td>computer parts - by owner</td>\n",
" <td>https://blacksburg.craigslist.org/sop/d/corsai...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Free MCAT Quiz for premed students: Can you th...</td>\n",
" <td>Free MCAT Quiz for premed students: Can you th...</td>\n",
" <td>lessons &amp; tutoring</td>\n",
" <td>https://albuquerque.craigslist.org/lss/d/free-...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Wanted Classic Cars and Trucks Any Condition..</td>\n",
" <td>Call/text 1.765.613.313one Price Pending Condi...</td>\n",
" <td>wanted - by owner</td>\n",
" <td>https://richmondin.craigslist.org/wan/d/wanted...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Massage Therapist Wanted</td>\n",
" <td>Ontario Family Chiropractic is a holistic base...</td>\n",
" <td>healthcare</td>\n",
" <td>https://rochester.craigslist.org/hea/d/massage...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Lease Take Over at Manchester Motorworks 1 bed...</td>\n",
" <td>Manchester Motorworks is offering a 1 bedroom ...</td>\n",
" <td>sublets &amp; temporary</td>\n",
" <td>https://richmond.craigslist.org/sub/d/lease-ta...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>🚗 DENVER CAR OWNERS: PAY FOR YOUR CAR BY RENTI...</td>\n",
" <td>Turo is a peer-to-peer car sharing marketplace...</td>\n",
" <td>et cetera</td>\n",
" <td>https://denver.craigslist.org/etc/d/denver-car...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Trunk Mounted Bike Rack w/ 3 Spaces - Universa...</td>\n",
" <td>Trunk Mounted Bike Rack w/ 3 Spaces - Universa...</td>\n",
" <td>bicycle parts - by owner</td>\n",
" <td>https://cosprings.craigslist.org/bop/d/trunk-m...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" title \\\n",
"0 Are You a Married Woman Looking for Two Guys? \n",
"1 Producing Consultant \n",
"2 Need ride to tampa thursday for court!! \n",
"3 Corsair GS 800 Desktop Power Supply \n",
"4 Free MCAT Quiz for premed students: Can you th... \n",
"5 Wanted Classic Cars and Trucks Any Condition.. \n",
"6 Massage Therapist Wanted \n",
"7 Lease Take Over at Manchester Motorworks 1 bed... \n",
"8 🚗 DENVER CAR OWNERS: PAY FOR YOUR CAR BY RENTI... \n",
"9 Trunk Mounted Bike Rack w/ 3 Spaces - Universa... \n",
"\n",
" description \\\n",
"0 We're two fun, discreet married white professi... \n",
"1 Building effective and collaborative relations... \n",
"2 I am a single mother fighting for custody of m... \n",
"3 Selling my Corsair GS 800 Desktop Power Supply... \n",
"4 Free MCAT Quiz for premed students: Can you th... \n",
"5 Call/text 1.765.613.313one Price Pending Condi... \n",
"6 Ontario Family Chiropractic is a holistic base... \n",
"7 Manchester Motorworks is offering a 1 bedroom ... \n",
"8 Turo is a peer-to-peer car sharing marketplace... \n",
"9 Trunk Mounted Bike Rack w/ 3 Spaces - Universa... \n",
"\n",
" category \\\n",
"0 men seeking women \n",
"1 business/mgmt \n",
"2 rideshare \n",
"3 computer parts - by owner \n",
"4 lessons & tutoring \n",
"5 wanted - by owner \n",
"6 healthcare \n",
"7 sublets & temporary \n",
"8 et cetera \n",
"9 bicycle parts - by owner \n",
"\n",
" url \n",
"0 https://chicago.craigslist.org/chc/m4w/d/are-y... \n",
"1 https://iowacity.craigslist.org/bus/d/producin... \n",
"2 https://ocala.craigslist.org/rid/d/need-ride-t... \n",
"3 https://blacksburg.craigslist.org/sop/d/corsai... \n",
"4 https://albuquerque.craigslist.org/lss/d/free-... \n",
"5 https://richmondin.craigslist.org/wan/d/wanted... \n",
"6 https://rochester.craigslist.org/hea/d/massage... \n",
"7 https://richmond.craigslist.org/sub/d/lease-ta... \n",
"8 https://denver.craigslist.org/etc/d/denver-car... \n",
"9 https://cosprings.craigslist.org/bop/d/trunk-m... "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display the first few examples of the data set\n",
"data.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['title', 'description', 'category', 'url']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display the fields of the data set\n",
"list(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Map Craigslist categories to our application categories\n",
"\n",
"The categories that Craigslist "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"\"\"\"\n",
"Load category map to convert from Craigslist categories to our own\n",
"local app categories.\n",
"\"\"\"\n",
"with open(CATEGORY_FILE) as handle:\n",
" category_map = json.loads(handle.read())\n",
"\n",
"\"\"\"Load example data using Pandas\"\"\"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"# data, _ = train_test_split(data, test_size=0.5)\n",
"\n",
"\"\"\"Remove all examples with null fields\"\"\"\n",
"\n",
"\n",
"\"\"\"Strip out all \"X - by owner\", etc. text.\"\"\"\n",
"data['category'], _ = data['category'].str.split(' -', 1).str\n",
"\n",
"\"\"\"Remap all Craigslist categories to the categories for our use case\"\"\"\n",
"data['category'].replace(to_replace=category_map, inplace=True)\n",
"\n",
"\"\"\"\n",
"Drop all examples with null fields again; this time the categories that\n",
"we're skipping.\n",
"\"\"\"\n",
"data = data.dropna()\n",
"\n",
"print('All categories:\\n', data.category.value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training and test data split\n",
"\n",
"GridSearchCV already splits a cross validation data set from the training set."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train, test = train_test_split(data, test_size=0.1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data pipeline\n",
"\n",
"Pipeline the process to make it more clear what's going on, use less\n",
"memory, and enable faster insertion of new steps.\n",
"\n",
"### FeatureUnion\n",
"\n",
"A FeatureUnion allows for unifying multiple input features so that\n",
"the model trains itself on all of them.\n",
"\n",
"### selector\n",
"Select this column only for the purposes of this step of the\n",
"pipeline.\n",
"\n",
"Example:\n",
"```json\n",
"{\n",
" 'title': 'Lagavulin 16',\n",
" 'description': 'A fine bottle this is.',\n",
" 'category': 'Alcohol & Spirits'\n",
"}\n",
"```\n",
"=> `'Lagavulin 16'`\n",
"\n",
"### vect\n",
"Embed the words in text using a matrix of token counts.\n",
"\n",
"Example:\n",
"```json\n",
"[\"dog cat fish\", \"dog cat\", \"fish bird\", \"bird\"]\n",
"```\n",
"=>\n",
"```json\n",
"[[0, 1, 1, 1],\n",
" [0, 2, 1, 0],\n",
" [1, 0, 0, 1],\n",
" [1, 0, 0, 1]]\n",
"```\n",
"\n",
"### tfidf\n",
"Deprioritize words that appear very often, such as \"the\", \"an\", \"craigslist\", etc.\n",
"\n",
"Example:\n",
"```json\n",
"[[3, 0, 1],\n",
" [2, 0, 0],\n",
" [3, 0, 0]]\n",
"```\n",
"=>\n",
"```json\n",
"[[ 0.81940995, 0. , 0.57320793],\n",
" [ 1. , 0. , 0. ],\n",
" [ 1. , 0. , 0. ]]\n",
"```\n",
"\n",
"### clf\n",
"`clf` is the classifier that we feed the data from the data pipeline into. In this case, we choose `LogisticRegression` since it's one of the known best ones for text classification. The others are 1) `LinearSVC`, which is effectively just a linear regression and is similar to `LogisticRegression`, and 2) neural nets, which without very complicated convolutional and recurrent networks don't give us much of an advantage over more classic methods."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pipeline = Pipeline([\n",
" ('union', FeatureUnion(\n",
" transformer_list=[\n",
" ('title', Pipeline([\n",
" ('selector', ItemSelector(key='title')),\n",
" ('vect', CountVectorizer(stop_words='english',\n",
" decode_error='replace',\n",
" strip_accents='ascii',\n",
" max_df=0.8)),\n",
" ('tfidf', TfidfTransformer(smooth_idf=False))\n",
" ])),\n",
" ('description', Pipeline([\n",
" ('selector', ItemSelector(key='description')),\n",
" ('vect', CountVectorizer(stop_words='english',\n",
" decode_error='replace',\n",
" strip_accents='ascii',\n",
" binary=True,\n",
" max_df=0.8,\n",
" min_df=10)),\n",
" ('tfidf', TfidfTransformer(smooth_idf=False))\n",
" ]))\n",
" ]\n",
" )),\n",
" ('clf', LogisticRegression(C=5, dual=False, class_weight='balanced'))\n",
"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pipeline parameters\n",
"\n",
"We can optionally set our pipeline parameters to get more control over each step. In the code above, the optimal parameters are already filled out."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"parameters = {\n",
" # Controls on regression model.\n",
" # 'clf__C': [0.1, 0.3, 1, 3, 5, 10, 30, 100, 300, 1000]\n",
" # 'clf__class_weight': [None, 'balanced'],\n",
" # 'clf__dual': [True, False],\n",
"\n",
" # Controls on word vectorization.\n",
" # 'union__title__vect__max_df': [0.8, 0.85, 0.9, 0.95, 1],\n",
" # 'union__title__vect__min_df': [1, 10],\n",
" # 'union__title__vect__ngram_range': [(1, 1), (1, 2)],\n",
" # 'union__description__vect__ngram_range': [(1, 1), (1, 2)],\n",
" # 'union__description__vect__max_df': [0.8, 0.85, 0.9, 0.95, 1],\n",
" # 'union__description__vect__min_df': [1, 10, 100],\n",
"\n",
" # Controls on TfIdf normalization.\n",
" # 'union__title__tfidf__norm': [None, 'l1', 'l2'],\n",
" # 'union__title__tfidf__use_idf': [True, False],\n",
" # 'union__title__tfidf__smooth_idf': [True, False],\n",
" # 'union__title__tfidf__sublinear_tf': [False, True],\n",
" # 'union__description__tfidf__norm': [None, 'l1', 'l2'],\n",
" # 'union__description__tfidf__use_idf': [True, False],\n",
" # 'union__description__tfidf__smooth_idf': [True, False],\n",
" # 'union__description__tfidf__sublinear_tf': [False, True],\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train the model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Performing grid search...\n",
"Pipeline: ['union', 'clf']\n",
"Parameters: \n",
"{}\n",
"Fitting 3 folds for each of 1 candidates, totalling 3 fits\n",
"[CV] ................................................................\n",
"[CV] ................................................................\n",
"[CV] ................................................................\n",
"[CV] ....................... , score=0.7874693401652609, total=33.0min\n",
"[CV] ....................... , score=0.7887705122605603, total=33.1min\n",
"[CV] ....................... , score=0.7877126558241279, total=34.9min\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 36.5min remaining: 0.0s\n",
"[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 36.5min finished\n"
]
}
],
"source": [
"grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=10)\n",
"\n",
"print('Performing grid search...')\n",
"print('Pipeline: ', [name for name, __ in pipeline.steps])\n",
"print('Parameters: ')\n",
"pprint(parameters)\n",
"t0 = time()\n",
"grid_search.fit(train[['title', 'description']], train['category'])\n",
"print(\"Done in %0.3fs\" % (time() - t0))\n",
"print()\n",
"\n",
"print(\"Best score: %0.3f\" % grid_search.best_score_)\n",
"print(\"Best parameters set:\")\n",
"best_parameters = grid_search.best_estimator_.get_params()\n",
"for param_name in sorted(parameters.keys()):\n",
" print(\"\\t%s: %r\" % (param_name, best_parameters[param_name]))\n",
"\n",
"score = grid_search.score(test[['title', 'description']], test['category'])\n",
"print(\"Test accuracy: %f\" % score)\n",
"\n",
"joblib.dump(grid_search, MODEL_FILE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plot the data "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"plot_learning_curve(grid_search.best_estimator_,\n",
" 'Item Categorizer',\n",
" train[['title', 'description']],\n",
" train['category'])\n",
"plt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment