# careeningspace/Generic Hiring Analysis.ipynb

Created Nov 13, 2018
Hiring Recommendation Analyst
 { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Hiring Data Analysis Exercise

\n", "
Author: Elliott Klug
\n", "

For this exercise, a sample data file is provided which contains information about \"recommendations\". A \"recommendation\" is the group of workers from which the Client can choose one to hire for a Job.

\n", "

Executive Summary

\n", "
\n", "
1. There are 2100 unique recommendation sets
2. \n", "
3. There are 1 to 15 Workers per Recommendation:
4. \n", "
\n", "
• The Average (Mean) number of Workers shown per recommendation is 14.29
• \n", "
• The Median number of Workers shown per recommendation is 15
• \n", "
\n", "
5. There are 830 unique Workers
6. \n", "
7. Worker Exposure:
8. \n", "
\n", "
• Worker #2016519064 was shown the most at 608 times
• \n", "
• 68 Workers were only recommended once
• \n", "
\n", "
9. Worker Hiring:
10. \n", "
\n", "
• Worker #2014053337 was hired the most at 59 times
• \n", "
• 518 Workers were not hired at all
• \n", "
\n", "
11. 6 Workers have a 100% conversion rate (# times hired/# shown)
12. \n", "
13. Can every Worker have a 100% conversion rate?
14. \n", "
\n", " For every recommended Worker to be hired for each job they are recommended for, the platform would need to\n", " either: (a) allow a client multiple hires per job or (b) reduce each recommendation \n", " to one Worker.\n", "
\n", "
15. Average Position in Recommendation Results by Category for Hired Workers
16. \n", "
\n", "
• Handywork: 3.61
• \n", "
• Painting: 4.60
• \n", "
• Moving: 4.15
• \n", "
\n", "
17. Average Hourly Rates and Completed Jobs by Category for Hired Workers:
18. \n", "
\n", "
• Average Hourly Rate:
• \n", "
\n", "
• Handywork: 38.70
• \n", "
• Painting: 50.15
• \n", "
• Moving: 63.01
• \n", "
\n", "
• Average Number of Completed Tasks:
• \n", "
\n", "
• Handywork: 249
• \n", "
• Painting: 284
• \n", "
• Moving: 274
• \n", "
\n", "
\n", "
19. The Question: How do we go about using this market data to suggest hourly \n", " rates to Workers that would maximize their opportunity to be hired?
20. \n", "
\n", " This data set has insufficient information to categorically provide a \n", " recommendation on hourly rates\n", "
\n", "
• Worker features such as customer generated performance ratings, and \n", " 'Worker' or 'best value' rankings are not present\n", "
• \n", "
• Some jobs in the categories of Handywork and Moving may require access\n", " to a vehicle, and a Worker having a vehicle should allow for a higher hourly rate. \n", " This dataset does not provide this information.\n", "
• \n", "
• Some workers represent additional labor in their bid. For example: some\n", " Workers bidding on Moving jobs may be providing support labor. \n", " If the Worker bids 90 for a job and three people will be working the job, \n", " the effective hourly rate is 30 per Worker.\n", "
• \n", "
• \n", " Position in Recommendation appears to be highly correlated with being \n", " hired, but it is not clear what the client actually sees. The recorded position could \n", " come from the final sort selected by the client or from the initial \n", " recommendation results provided.\n", "
• \n", "
\n", " In order to identify ideal hourly rates, we will need to do some \n", " Exploratory Data Analysis (EDA) and light statistical \n", " analysis to see if there are any visible trends or correlations within the data.\n", "
\n", "
• Hourly rate appears to have a sweet spot of 30-60. The Moving \n", " category may be skewed by some Workers employing additional labor in their bid.\n", "
• \n", "
• Most Workers hired are in the top four recommendation slots
• \n", "
• There is a moderate correlation between Job experience and hourly rate, \n", " with the Mounting category having the highest correlation. Individual Workers\n", " seem to vary their hourly rate without respect to how many jobs they have \n", " completed. This implies that either our sample is too small or that there are other \n", " factors driving hourly rate decisions not captured in this data.\n", "
• \n", "
• There is a relationship between recommendation ranking and hourly rate, \n", " but as the hourly rate increases, the relationship appears to weaken.\n", "
• \n", "
\n", " Each Category required a unique approach in providing recommendations.\n", "
\n", "
• Moving: polynomial fit of degree 3, \n", " for hourly rates less than 60\n", "
• \n", "
• Handywork: polynomial fit of degree 4\n", "
• \n", "
• Painting: polynomial fit of degree 3\n", "
• \n", "
\n", "
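A polynomial fit of hourly rate against experience, as listed per category above, could look roughly like the sketch below. It runs on synthetic data and is not the notebook's own fitting code: the arrays `experience` and `rate` and the helper `fit_rate_curve` are assumptions made for illustration.

```python
import numpy as np

# Synthetic stand-in for (experience_count, hourly_rate) pairs of hired Workers.
rng = np.random.default_rng(0)
experience = rng.integers(0, 800, size=200).astype(float)
rate = 30 + 0.05 * experience - 0.00004 * experience**2 + rng.normal(0, 2, size=200)

def fit_rate_curve(experience, rate, degree):
    '''Least-squares polynomial fit of rate against experience (hypothetical helper).'''
    # Polynomial.fit rescales x internally, which keeps higher degrees well conditioned.
    return np.polynomial.Polynomial.fit(experience, rate, deg=degree)

curve = fit_rate_curve(experience, rate, degree=3)

# Evaluate the fitted curve at a few experience levels.
for completed_tasks in (0, 100, 400, 800):
    print(completed_tasks, round(float(curve(completed_tasks)), 2))
```

To restrict a fit to hourly rates below 60, as described for Moving, the same helper can be called on a boolean-masked subset such as `experience[rate < 60]` and `rate[rate < 60]`.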

Hourly Rate Recommendations by Category

\n", " \n", "
Experience is the number of completed tasks\n", "
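| Experience | Moving | Handywork | Painting |
|-----------:|-------:|----------:|---------:|
| 0 | 35.59 | 28.39 | 33.94 |
| 10 | 36.38 | 29.25 | 36.31 |
| 50 | 39.17 | 32.44 | 43.78 |
| 100 | 41.86 | 35.91 | 49.46 |
| 200 | 44.97 | 41.23 | 53.16 |
| 300 | 45.84 | 44.54 | 52.50 |
| 400 | 45.37 | 46.17 | 51.83 |
| 500 | 44.45 | 46.57 | 53.05 |
| 600 | 44.00 | 46.35 | 56.26 |
| 700 | 44.91 | 46.25 | 60.37 |
| 800 | 48.10 | 47.15 | 63.78 |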
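
For experience levels that fall between the tabulated rows, one simple option is linear interpolation. This is an assumption for illustration, not something the notebook does; the helper `lookup_rate` is hypothetical, and the numbers are copied from the table above.

```python
import numpy as np

# Experience levels and suggested hourly rates, copied from the table.
experience = [0, 10, 50, 100, 200, 300, 400, 500, 600, 700, 800]
rates = {
    'Moving':    [35.59, 36.38, 39.17, 41.86, 44.97, 45.84, 45.37, 44.45, 44.00, 44.91, 48.10],
    'Handywork': [28.39, 29.25, 32.44, 35.91, 41.23, 44.54, 46.17, 46.57, 46.35, 46.25, 47.15],
    'Painting':  [33.94, 36.31, 43.78, 49.46, 53.16, 52.50, 51.83, 53.05, 56.26, 60.37, 63.78],
}

def lookup_rate(category, completed_tasks):
    '''Linearly interpolate a suggested rate; values outside 0-800 are clamped.'''
    return float(np.interp(completed_tasks, experience, rates[category]))

print(round(lookup_rate('Handywork', 150), 2))
```

Note that `np.interp` clamps to the first or last table value outside the tabulated range, so no rate is extrapolated beyond 800 completed tasks.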
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " \n", " \n", "

Sample.csv Import/Check

\n", "

\n", " Setting up environment, basic file check.
\n", " The columns are as follows:
\n", "

• recommendation_id unique identifier for this recommendation, or set of Workers shown
• \n", "
• created_at when this recommendation was shown to the client
• \n", "
• worker_id unique identifier for the Workers
• \n", "
• position the position of the Worker in the recommendation set, 1 - first, 2 - second, etc.
• \n", "
• hourly_rate the hourly rate for the Worker when they were shown
• \n", "
• experience_count the number of Jobs the Worker had completed in that category, when they were shown
• \n", "
• hired was the Worker hired or not? Only 1 worker out of a set of recommendations can be hired
• \n", "
• category the category of work the Client needs help with
• \n", "
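
The column list above can be turned into a quick sanity check before any analysis. The sketch below uses a few made-up rows in the documented layout (the values and IDs are illustrative only) and verifies that every expected column is present and that no recommendation set has more than one hire.

```python
import io
import pandas as pd

# Made-up rows in the documented column layout; not taken from the real file.
sample_csv = '''recommendation_id,created_at,worker_id,position,hourly_rate,experience_count,hired,category
rec-a,2012-06-20 0:32:25,1,1,38,151,0,Handywork
rec-a,2012-06-20 0:32:25,2,2,40,193,1,Handywork
rec-b,2012-06-20 1:10:00,1,1,38,151,0,Handywork
'''
df = pd.read_csv(io.StringIO(sample_csv))

expected = ['recommendation_id', 'created_at', 'worker_id', 'position',
            'hourly_rate', 'experience_count', 'hired', 'category']
missing = [col for col in expected if col not in df.columns]

# Only one Worker per recommendation set may be hired.
hires_per_set = df.groupby('recommendation_id')['hired'].sum()
too_many = hires_per_set[hires_per_set > 1]

print('missing columns:', missing)
print('sets with more than one hire:', len(too_many))
```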

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Return to Summary" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of: hiring_data.csv\n", "(30000, 8)\n", "\n", "Types of: hiring_data.csv\n", "recommendation_id object\n", "created_at object\n", "worker_id int64\n", "position int64\n", "hourly_rate int64\n", "experience_count int64\n", "hired int64\n", "category object\n", "dtype: object\n", "\n", "Header of: hiring_data.csv\n", " recommendation_id created_at worker_id position hourly_rate \\\n", "0 37af-70cf97d7-901c 2012-06-20 0:32:25 2011195661 1 38 \n", "1 37af-70cf97d7-901c 2012-06-20 0:32:25 2008902668 2 40 \n", "2 37af-70cf97d7-901c 2012-06-20 0:32:25 2014034265 3 28 \n", "3 37af-70cf97d7-901c 2012-06-20 0:32:25 2011743826 4 43 \n", "4 37af-70cf97d7-901c 2012-06-20 0:32:25 2015589582 5 29 \n", "\n", " experience_count hired category \n", "0 151 0 Handywork \n", "1 193 0 Handywork \n", "2 0 0 Handywork \n", "3 303 0 Handywork \n", "4 39 0 Handywork \n" ] } ], "source": [ "import random\n", "import operator\n", "import thinkstats2\n", "import thinkplot\n", "import numpy as np\n", "import pylab as pl\n", "import pandas as pd\n", "import seaborn as sns\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "from collections import Counter\n", "from scipy import stats, optimize\n", "from scipy.stats import norm\n", "from sklearn.decomposition import PCA, KernelPCA\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler, RobustScaler,\\\n", "Normalizer, PowerTransformer, QuantileTransformer, PolynomialFeatures\n", "\n", "file = 'hiring_data.csv'\n", "df = pd.read_csv(file)\n", "\n", "print(\"Shape of: \" + file)\n", "print(df.shape)\n", "print()\n", "print(\"Types of: \" + file)\n", "print(df.dtypes)\n", "print()\n", 
"print(\"Header of: \" + file)\n", "print(df.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Unique index count

" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "recommendation_id 2100\n", "created_at 2093\n", "worker_id 830\n", "position 15\n", "hourly_rate 82\n", "experience_count 964\n", "hired 2\n", "category 3\n", "dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", " Based on a simple unique count of the sample.csv, there are:\n", "

• 2100 unique recommendation sets (Jobs)
• \n", "
• 830 unique Workers
• \n", "
• 3 categories
• \n", "

Worker Recommendation Stats:

" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean # of Workers Shown per recommendation: position 14.291905\n", "dtype: float64\n", "Median # of Workers Shown per recommendation: position 15.0\n", "dtype: float64\n" ] } ], "source": [ "print(\"Mean # of Workers Shown per recommendation: \"\\\n", " , (df.groupby('recommendation_id').agg({'position':'max'}).mean()))\n", "print(\"Median # of Workers Shown per recommendation: \"\\\n", " , df.groupby('recommendation_id').agg({'position':'max'}).median())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "

• The Average (Mean) number of Workers shown per recommendation is 14.29
• \n", "
• The Median number of Workers shown per recommendation is 15
• \n", "
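
The mean and median above are taken over the largest `position` in each recommendation set. A minimal sketch on a small made-up frame, selecting the column before aggregating so that the results are plain scalars rather than one-element Series:

```python
import pandas as pd

# Made-up data: set a shows 3 Workers, set b shows 2, set c shows 1.
df = pd.DataFrame({
    'recommendation_id': ['a', 'a', 'a', 'b', 'b', 'c'],
    'position': [1, 2, 3, 1, 2, 1],
})

# Largest position per set, assuming positions run 1..n without gaps.
workers_shown = df.groupby('recommendation_id')['position'].max()

print('Mean # of Workers shown per recommendation:', round(workers_shown.mean(), 2))
print('Median # of Workers shown per recommendation:', workers_shown.median())
```

If positions could have gaps, counting rows per set with `.size()` would be the safer measure.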

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Worker Stats:

" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Worker showed the most: \n", "hired 7.00\n", "recommendations 608.00\n", "conversion_rate 0.01\n", "ave_position 5.06\n", "ave_hourly_rate 49.10\n", "Name: 2016519064, dtype: float64\n", "\n", "Number of Workers who have only been recommended once: 68\n", "\n", "Worker hired the most: \n", "hired 59.00\n", "recommendations 438.00\n", "conversion_rate 0.13\n", "ave_position 5.82\n", "ave_hourly_rate 34.83\n", "Name: 2014053337, dtype: float64\n", "\n", "Number of Workers who have not been hired: 518\n", "\n", "Number of Workers with 100% conversion rate: 6\n", " hired recommendations conversion_rate ave_position \\\n", "worker_id \n", "2009491221 1 1 1.0 1.00 \n", "2010104729 2 2 1.0 1.00 \n", "2010872050 9 9 1.0 4.67 \n", "2013996277 1 1 1.0 3.00 \n", "2014379995 2 2 1.0 1.50 \n", "2016489082 1 1 1.0 14.00 \n", "\n", " ave_hourly_rate \n", "worker_id \n", "2009491221 35.0 \n", "2010104729 68.0 \n", "2010872050 70.0 \n", "2013996277 32.0 \n", "2014379995 35.0 \n", "2016489082 40.0 \n" ] } ], "source": [ "df_recommended= pd.DataFrame(list(zip(df.worker_id.values, df.hired))\\\n", " , columns=['worker_id', 'hired'])\n", "worker_df = df_recommended.groupby('worker_id').sum()\n", "worker_df['recommendations'] = df['worker_id'].value_counts()\n", "worker_df['conversion_rate'] = worker_df.hired / worker_df.recommendations\n", "worker_df['ave_position'] = df.groupby('worker_id').agg({'position':'mean'})\n", "worker_df['ave_hourly_rate'] = df.groupby('worker_id')\\\n", ".agg({'hourly_rate':'mean'})\n", "min_hired = worker_df[worker_df.hired == 0]\n", "min_recommended = worker_df[worker_df.recommendations == 1]\n", "max_conversion = worker_df[worker_df.conversion_rate == 1]\n", "\n", "print(\"Worker showed the most: \")\n", "print(worker_df.loc[worker_df.recommendations.idxmax()].round(2))\n", "print()\n", "print(\"Number of Workers who 
have only been recommended once: \"\\\n", " , len(min_recommended.index))\n", "print()\n", "print(\"Worker hired the most: \")\n", "print(worker_df.loc[worker_df.hired.idxmax()].round(2))\n", "print()\n", "print(\"Number of Workers who have not been hired: \"\\\n", " , len(min_hired.index))\n", "print()\n", "print(\"Number of Workers with 100% conversion rate: \"\\\n", " , len(max_conversion.index))\n", "print(max_conversion.round(2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Return to Summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Worker Stats Summary

\n", "

\n", "

• Worker #2016519064 was shown the most at 608 times
• \n", "
• 68 Workers were only recommended once
• \n", "
• Worker #2014053337 was hired the most at 59 times
• \n", "
• 518 Workers were not hired at all
• \n", "
• 6 Workers have a conversion rate of 100%
• \n", "

Can every Worker have a 100% conversion rate?

\n", " Every recommended Worker would have to be hired for each job, which would require either \n", " multiple hires per task or reducing the recommendations to one Worker per task recommendation.\n", "
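
The constraint can be made concrete with the counts reported earlier: the file has 30000 rows (one per Worker shown) and 2100 recommendation sets, and at most one Worker per set can be hired. A short arithmetic sketch of the resulting ceiling on the overall conversion rate:

```python
total_showings = 30000       # rows in hiring_data.csv, from df.shape
recommendation_sets = 2100   # unique recommendation_id values, from df.nunique()

# At most one hire per recommendation set.
max_hires = recommendation_sets * 1
ceiling = max_hires / total_showings

print(round(ceiling, 3))  # 0.07
```

So even if every recommendation ended in a hire, only about 7% of all showings could convert, which is why a 100% rate for every Worker is out of reach under the current rules.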

Category Stats

\n", "

Average (Mean) by Category

" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "
| category | position | hourly_rate | experience_count |
|----------|---------:|------------:|-----------------:|
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Handywork | 3.61 | 38.70 | 249.02 |
| Moving | 4.15 | 63.01 | 273.88 |
| Painting | 4.60 | 50.15 | 284.10 |
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "" ], "text/plain": [ " position hourly_rate experience_count\n", "category \n", "Handywork 3.61 38.70 249.02\n", "Moving 4.15 63.01 273.88\n", "Painting 4.60 50.15 284.10" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hired_df = df[df.hired == 1]\n", "hired_df.groupby('category').agg({'position':'mean'\\\n", " , 'hourly_rate':'mean', 'experience_count':'mean'}).round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Median by Category

" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "
| category | position | hourly_rate | experience_count |
|----------|---------:|------------:|-----------------:|
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Handywork | 2 | 38 | 131.5 |
| Moving | 3 | 49 | 147.0 |
| Painting | 3 | 50 | 190.0 |
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "" ], "text/plain": [ " position hourly_rate experience_count\n", "category \n", "Handywork 2 38 131.5\n", "Moving 3 49 147.0\n", "Painting 3 50 190.0" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hired_df.groupby('category').agg({'position':'median'\\\n", " , 'hourly_rate':'median', 'experience_count':'median'}).round(2)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " position hourly_rate experience_count\n", "PC-1 0.182817 0.861871 0.473030\n", "PC-2 0.198531 0.438864 -0.876347\n", "PC-3 -0.962893 0.254122 -0.090876\n", " 0\n", "position 0.627685\n", "hourly_rate 0.197281\n", "experience_count 0.175034\n", "[0.62768528 0.19728121 0.1750335 ]\n" ] } ], "source": [ "x = hired_df.drop(['hired', 'category', 'recommendation_id'\\\n", " , 'worker_id', 'created_at'], axis=1)\n", "y = hired_df['category']\n", "#print(x)\n", "\n", "X_train, X_test, y_train, y_test = train_test_split\\\n", " (x, y, test_size=0.2, random_state=0)\n", "\n", "rs = RobustScaler()\n", "X_train = rs.fit_transform(X_train) \n", "X_test = rs.transform(X_test) \n", "\n", "pca = PCA()\n", "X_train = pca.fit_transform(X_train) \n", "X_test = pca.transform(X_test) \n", "\n", "print(pd.DataFrame(pca.components_,columns=x.columns\\\n", " , index = ['PC-1','PC-2', 'PC-3']))\n", "explained_variance = pca.explained_variance_ratio_\n", "print(pd.DataFrame(pca.explained_variance_ratio_\\\n", " ,index=x.columns, columns = [0]))\n", "print(explained_variance)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Return to Summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

The Big Question: How to Improve Worker Hiring Probability

\n", "

How can we use market data to suggest hourly rates to Workers that would\n", " maximize their opportunity to be hired?\n", "

\n", "

Performing Exploratory Data Analysis (EDA) on the hiring data