@maheshakya
Last active February 3, 2016 08:06
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# [KDD cup 2014 - Predict excitement of projects](https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Target: Identify projects that are exceptionally exciting to the business, at the time of posting.\n",
"##### Category: Binary classification\n",
"##### Evaluation metric: Area under ROC curve"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Works with:\n",
"- scikit-learn - version 0.17\n",
"- numpy - version 1.10 \n",
"- pandas - version 0.17.1\n",
"- matplotlib - version 2.0\n",
"- Jupyter (latest IPython notebook)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The following features will be covered in this session\n",
"- fit/predict/transform model\n",
"- train test split\n",
"- K-fold cross validation\n",
"- Hyper-parameter tuning with grid search\n",
"- Behavior of random forest and logistic regression classifiers\n",
"- Incremental learning with SGD classifier\n",
"- Label encoding\n",
"- One hot encoding\n",
"- Digitization of numerical attributes\n",
"- Area under ROC curve, scoring\n",
"- Simple plotting with matplotlib\n",
"- Term frequency - inverse document frequency vectorizer\n",
"- PCA\n",
"- Data manipulation with pandas and numpy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Download projects.csv, essays.csv and outcomes.csv from the [Get the data](https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data) page and save them in a data/ folder inside the working directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# To plot inline \n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Importing required features and libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.cross_validation import train_test_split, KFold, cross_val_score\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.linear_model import LogisticRegression, SGDClassifier\n",
"from sklearn.metrics import roc_auc_score, roc_curve, auc\n",
"from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n",
"from sklearn.grid_search import GridSearchCV\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.decomposition import PCA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Helper functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Get difference between 2 lists\n",
"def diff(a, b):\n",
" b = set(b)\n",
" return [aa for aa in a if aa not in b]\n",
"\n",
"# Plot ROC curve\n",
"def plot_roc(false_positive_rate, true_positive_rate, auc):\n",
" plt.title('Receiver Operating Characteristic')\n",
" plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% auc)\n",
" plt.legend(loc='lower right')\n",
" plt.plot([0,1],[0,1],'r--')\n",
" plt.xlim([-0.1,1.2])\n",
" plt.ylim([-0.1,1.2])\n",
" plt.ylabel('True Positive Rate')\n",
" plt.xlabel('False Positive Rate')\n",
" plt.show()\n",
" \n",
"# Evaluation metrics\n",
"def evaluate_model(y_true, y_preds, y_preds_proba):\n",
" # Calculate parameters for ROC curve\n",
" fpr, tpr, thresholds = roc_curve(y_true, y_preds_proba[:, 1])\n",
" auc_score = auc(fpr, tpr)\n",
"\n",
" # Plot ROC curve\n",
" plot_roc(fpr, tpr, auc_score)\n",
"\n",
" # Area under ROC curve score with actual probabilities\n",
" print \"ROC AUC score with probabilities: \", roc_auc_score(y_true, y_preds_proba[:, 1])"
]
},
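{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal numpy sketch on made-up labels and scores (the `toy_` names and values are hypothetical); it is meant as an illustration, not as a replacement for `roc_auc_score`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical labels and predicted scores\n",
"toy_labels = np.array([0, 0, 1, 1])\n",
"toy_scores = np.array([0.1, 0.4, 0.35, 0.8])\n",
"\n",
"toy_pos = toy_scores[toy_labels == 1]\n",
"toy_neg = toy_scores[toy_labels == 0]\n",
"\n",
"# Compare every positive score with every negative score\n",
"toy_diffs = toy_pos[:, None] - toy_neg[None, :]\n",
"\n",
"# Fraction of correctly ordered pairs, ties count as one half\n",
"toy_auc = (np.sum(toy_diffs > 0) + 0.5 * np.sum(toy_diffs == 0)) / float(toy_diffs.size)\n",
"\n",
"# 3 of the 4 positive/negative pairs are ordered correctly, so toy_auc is 0.75\n",
"toy_auc"
]
},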
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Loading CSV files\n",
"projects = pd.read_csv('data/projects.csv')\n",
"outcomes = pd.read_csv('data/outcomes.csv')\n",
"\n",
"# Sort by project ID\n",
"projects = projects.sort('projectid')\n",
"outcomes = outcomes.sort('projectid')"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"We will analyze only training data from the data set. Training data will be divided into a train set and a test set. Evaluations will be carried out on the test set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Fill missing values\n",
"projects = projects.fillna(method='pad') # 'pad' (forward fill) is a naive strategy; better imputation methods exist.\n",
"\n",
"# Extracting training data indices\n",
"dates = np.array(projects.date_posted)\n",
"train_idx = np.where((dates < '2014-01-01') & (dates > '2012-01-01'))[0]\n",
"\n",
"# Get training data\n",
"training_data = projects.iloc[train_idx].sort('projectid')\n",
"training_outcomes = outcomes[outcomes.projectid.isin(training_data.projectid)].sort('projectid')\n",
"\n",
"# Get labels\n",
"labels = np.array(training_outcomes.is_exciting) == 't'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Train test split\n",
"X_train_ids, X_test_ids, y_train, y_test = train_test_split(training_data.projectid, labels,\n",
" test_size=0.33, random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A simple random forest model with only categorical attributes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Checking attribute information of the training data\n",
"training_data.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Extract only categorical columns\n",
"projects_numeric_columns = ['school_latitude', 'school_longitude',\n",
" 'fulfillment_labor_materials',\n",
" 'total_price_excluding_optional_support',\n",
" 'total_price_including_optional_support']\n",
"\n",
"\n",
"projects_id_columns = ['projectid' ,'teacher_acctid', 'schoolid', 'school_ncesid']\n",
"projects_categorial_columns = diff(diff(diff(list(training_data.columns), projects_id_columns),\n",
" projects_numeric_columns), ['date_posted'])\n",
"\n",
"projects_categorial_values = np.array(training_data[projects_categorial_columns])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Encode labels\n",
"label_encoder = LabelEncoder()\n",
"categorical_data = label_encoder.fit_transform(projects_categorial_values[:, 0])\n",
"\n",
"for i in range(1, projects_categorial_values.shape[1]):\n",
" label_encoder = LabelEncoder()\n",
" categorical_data = np.column_stack((categorical_data, label_encoder.fit_transform(projects_categorial_values[:,i])))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Keep project ID to track train test split\n",
"project_ids = np.array(training_data.projectid)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"X_train = categorical_data[np.searchsorted(project_ids, X_train_ids)]\n",
"X_test = categorical_data[np.searchsorted(project_ids, X_test_ids)]"
]
},
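{
"cell_type": "markdown",
"metadata": {},
"source": [
"The lookup above works because `project_ids` is sorted, so `np.searchsorted` returns the exact position of every ID that is present in the array. A minimal sketch with made-up IDs and rows (the `toy_` names and values are hypothetical, not taken from the data set):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical sorted IDs with one feature row per ID\n",
"toy_ids = np.array(['a1', 'b2', 'c3', 'd4'])\n",
"toy_rows = np.array([[10], [20], [30], [40]])\n",
"\n",
"# IDs in shuffled order, as train_test_split would return them\n",
"toy_wanted = np.array(['c3', 'a1'])\n",
"\n",
"# Position of each wanted ID inside the sorted ID array: [2, 0]\n",
"toy_positions = np.searchsorted(toy_ids, toy_wanted)\n",
"\n",
"# Rows in the order of toy_wanted: [[30], [10]]\n",
"toy_rows[toy_positions]"
]
},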
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Train a random forest classifier (default parameters) with the training set\n",
"clf = RandomForestClassifier()\n",
"clf.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Predict values and probabilities\n",
"preds = clf.predict(X_test)\n",
"pred_probs = clf.predict_proba(X_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Get evaluations of the random forest model\n",
"evaluate_model(y_test, preds, pred_probs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Initialize K-fold CV and run\n",
"kfold_cv = KFold(X_train.shape[0], n_folds=5, shuffle=True, random_state=42)\n",
"print \"n-jobs = 1\"\n",
"cross_val_score(clf, X_train, y_train, scoring='roc_auc', cv=kfold_cv, n_jobs = 1, verbose=3)\n",
"print \"n-jobs = 4\"\n",
"cross_val_score(clf, X_train, y_train, scoring='roc_auc', cv=kfold_cv, n_jobs = 4, verbose=3)\n",
"print \"n-jobs = -1\"\n",
"cross_val_score(clf, X_train, y_train, scoring='roc_auc', cv=kfold_cv, n_jobs = -1, verbose=3)\n",
"print \"end...\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Model Selection - Find optimal hyper-parameters for Random Forest classifier with grid search"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Initialize parameters grid\n",
"param_grid = {'n_estimators': [5, 10, 25]}\n",
"\n",
"# Initialize grid search CV\n",
"grid_search = GridSearchCV(RandomForestClassifier(), param_grid=param_grid, scoring='roc_auc', verbose=1, n_jobs=4)\n",
"\n",
"# Fit data to grid search\n",
"grid_search.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Best hyper parameters\n",
"print \"Best n_estimators: \", grid_search.best_estimator_.n_estimators"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Verify with test data\n",
"clf = RandomForestClassifier(n_estimators=25)\n",
"clf.fit(X_train, y_train)\n",
"preds = clf.predict(X_test)\n",
"pred_probs = clf.predict_proba(X_test)\n",
"evaluate_model(y_test, preds, pred_probs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Logistic regression with the same features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Train a logistic regression classifier (default parameters) with the training set\n",
"clf = LogisticRegression()\n",
"clf.fit(X_train, y_train)\n",
"\n",
"# Predict values and probabilities\n",
"preds = clf.predict(X_test)\n",
"pred_probs = clf.predict_proba(X_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Get evaluations of the logistic regression model\n",
"evaluate_model(y_test, preds, pred_probs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Logistic regression with one hot encoded features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# One hot encoding!\n",
"enc = OneHotEncoder()\n",
"enc.fit(categorical_data)\n",
"X_train_ohe = enc.transform(X_train)\n",
"X_test_ohe = enc.transform(X_test)"
]
},
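{
"cell_type": "markdown",
"metadata": {},
"source": [
"Label encoding maps each category to an integer, which a linear model would read as an ordered quantity; one hot encoding replaces each integer with one indicator column per category. A small numpy sketch of the idea, using hypothetical `toy_` codes rather than `OneHotEncoder` itself:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical label-encoded column with three categories (0, 1, 2)\n",
"toy_codes = np.array([0, 2, 1, 2])\n",
"\n",
"# Row i of the identity matrix is the indicator vector for category i\n",
"toy_onehot = np.eye(toy_codes.max() + 1)[toy_codes]\n",
"\n",
"# One integer column became three indicator columns: shape (4, 3)\n",
"toy_onehot.shape"
]
},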
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print \"Number of features before one hot encoding: \", X_train.shape[1]\n",
"print \"Number of features after one hot encoding: \", X_train_ohe.shape[1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Train a logistic regression classifier (default parameters) with the one hot encoded training set\n",
"clf = LogisticRegression()\n",
"clf.fit(X_train_ohe, y_train)\n",
"\n",
"# Predict values and probabilities\n",
"preds = clf.predict(X_test_ohe)\n",
"pred_probs = clf.predict_proba(X_test_ohe)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Get evaluations of the logistic regression model with one hot encoded data\n",
"evaluate_model(y_test, preds, pred_probs)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### Handling numerical columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### PCA"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print projects_numeric_columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"numerical_data = np.array(training_data[projects_numeric_columns])\n",
"\n",
"# Initialize and fit PCA\n",
"pca = PCA(n_components=3)\n",
"pca.fit(numerical_data)\n",
"\n",
"X_train = pca.transform(numerical_data[np.searchsorted(project_ids,\n",
" X_train_ids)])\n",
"X_test = pca.transform(numerical_data[np.searchsorted(project_ids,\n",
" X_test_ids)])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print \"number of features after dimensionality reduction: \", X_train.shape[1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Train a logistic regression classifier (default parameters) with the training set\n",
"clf = LogisticRegression()\n",
"clf.fit(X_train, y_train)\n",
"\n",
"# Predict values and probabilities\n",
"preds = clf.predict(X_test)\n",
"pred_probs = clf.predict_proba(X_test)\n",
"\n",
"# Get evaluations of the logistic regression model\n",
"evaluate_model(y_test, preds, pred_probs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Binning"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Binning numerical data\n",
"numerical_dataframe = training_data[projects_numeric_columns]\n",
"numerical_data = np.empty(shape=numerical_dataframe.shape[0])\n",
"\n",
"# Number of bins = 20\n",
"number_of_bins = 20\n",
"for col in projects_numeric_columns:\n",
" digitized_column = np.digitize(numerical_dataframe[col],\n",
" bins=np.linspace(np.min(numerical_dataframe[col]),\n",
" np.max(numerical_dataframe[col]), num=number_of_bins))\n",
" numerical_data = np.column_stack((numerical_data, digitized_column))\n",
"\n",
"numerical_data = numerical_data[:, 1:]"
]
},
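{
"cell_type": "markdown",
"metadata": {},
"source": [
"`np.digitize` returns, for each value, the index of the bin it falls into. With edges taken from `np.linspace(min, max, num)`, the minimum lands in bin 1 and the maximum lands in bin `num`, because it is not smaller than the last edge. A sketch on hypothetical `toy_` values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical numeric column\n",
"toy_values = np.array([0.0, 2.5, 5.0, 10.0])\n",
"\n",
"# Five evenly spaced edges between min and max: [0, 2.5, 5, 7.5, 10]\n",
"toy_edges = np.linspace(toy_values.min(), toy_values.max(), num=5)\n",
"\n",
"# Bin index of each value: [1, 2, 3, 5]\n",
"toy_bins = np.digitize(toy_values, bins=toy_edges)\n",
"toy_bins"
]
},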
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"X_train = numerical_data[np.searchsorted(project_ids, X_train_ids)]\n",
"X_test = numerical_data[np.searchsorted(project_ids, X_test_ids)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# One hot encoding!\n",
"enc = OneHotEncoder()\n",
"enc.fit(numerical_data)\n",
"X_train_ohe = enc.transform(X_train)\n",
"X_test_ohe = enc.transform(X_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Train a logistic regression classifier (default parameters) with the one hot encoded training set\n",
"clf = LogisticRegression()\n",
"clf.fit(X_train_ohe, y_train)\n",
"\n",
"# Predict values and probabilities\n",
"preds = clf.predict(X_test_ohe)\n",
"pred_probs = clf.predict_proba(X_test_ohe)\n",
"\n",
"# Get evaluations of the logistic regression model with one hot encoded data\n",
"evaluate_model(y_test, preds, pred_probs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# delete overall projects data (to save memory)\n",
"# reset_selective projects"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Essay data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Load essays data file\n",
"essays = pd.read_csv('data/essays.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Extract training data from essays data file\n",
"training_essays = essays[essays.projectid.isin(training_data.projectid)].sort('projectid').fillna(method='pad')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# delete overall essay data (to save memory)\n",
"# reset_selective essays"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Initialize and fit TF-IDF vectorizer\n",
"tfidf_vectorizer = TfidfVectorizer()\n",
"tfidf_vectorizer.fit(training_essays.essay)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Transform training and test data of essays\n",
"X_train = tfidf_vectorizer.transform(np.array(training_essays.essay)[np.searchsorted(project_ids,\n",
" X_train_ids)])\n",
"X_test = tfidf_vectorizer.transform(np.array(training_essays.essay)[np.searchsorted(project_ids,\n",
" X_test_ids)])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print \"number of features of vectorized essays: \", X_train.shape[1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Train a logistic regression classifier (default parameters) with the training set\n",
"clf = LogisticRegression()\n",
"clf.fit(X_train, y_train)\n",
"\n",
"# Predict values and probabilities\n",
"preds = clf.predict(X_test)\n",
"pred_probs = clf.predict_proba(X_test)\n",
"\n",
"# Get evaluations of the logistic regression model\n",
"evaluate_model(y_test, preds, pred_probs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Incremental learning"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Divide data into two sections (// keeps the slice index an integer on Python 3 as well)\n",
"X_train_1 = X_train[:X_train.shape[0]//2, :]\n",
"X_train_2 = X_train[X_train.shape[0]//2:, :]\n",
"y_train_1 = y_train[:y_train.shape[0]//2]\n",
"y_train_2 = y_train[y_train.shape[0]//2:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Train an SGD classifier with logistic loss on the first half of the training set\n",
"clf = SGDClassifier(loss='log')\n",
"clf.fit(X_train_1, y_train_1)\n",
"\n",
"# Predict values and probabilities\n",
"preds = clf.predict(X_test)\n",
"pred_probs = clf.predict_proba(X_test)\n",
"\n",
"# Get evaluations of the SGD model trained on the first half\n",
"evaluate_model(y_test, preds, pred_probs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Partial fit to the trained classifier\n",
"clf.partial_fit(X_train_2, y_train_2)\n",
"\n",
"# Predict values and probabilities\n",
"preds = clf.predict(X_test)\n",
"pred_probs = clf.predict_proba(X_test)\n",
"\n",
"# Get evaluations of the SGD model after the partial fit\n",
"evaluate_model(y_test, preds, pred_probs)"
]
},
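{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same idea extends to any number of chunks. When a classifier is trained with `partial_fit` only, without a prior call to `fit`, scikit-learn needs the full list of classes on the first call. A minimal sketch on synthetic data (all `toy_` names, the sizes and the default loss are assumptions made for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import SGDClassifier\n",
"\n",
"# Synthetic data: the label depends only on the sign of the first feature\n",
"toy_rng = np.random.RandomState(0)\n",
"toy_X = toy_rng.randn(200, 3)\n",
"toy_y = (toy_X[:, 0] > 0).astype(int)\n",
"\n",
"toy_clf = SGDClassifier(random_state=0)\n",
"toy_classes = np.unique(toy_y)\n",
"\n",
"# Feed the data in four chunks of 50 rows; classes is required on the first call\n",
"for toy_start in range(0, 200, 50):\n",
" toy_batch = slice(toy_start, toy_start + 50)\n",
" toy_clf.partial_fit(toy_X[toy_batch], toy_y[toy_batch], classes=toy_classes)\n",
"\n",
"# Accuracy on the data seen so far\n",
"toy_clf.score(toy_X, toy_y)"
]
},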
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}