@patternproject
Created March 18, 2020 12:35
Wk4_Submission.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Wk4_Submission.ipynb",
"provenance": [],
"toc_visible": true,
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/patternproject/d6e6dc65ff7c3b8048a74d0a44843c25/wk4_submission.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ExrcEKOJqRMS",
"colab_type": "text"
},
"source": [
"Manning LP \"Using Online Job Postings to Improve Your Data Science Resume\" "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GTXpVFDmqdGi",
"colab_type": "text"
},
"source": [
"Week 4 - Finding Missing Skills From Our Resume"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cA2s2jr6uJhM",
"colab_type": "text"
},
"source": [
"# 4.1 Determine Missing Resume Skills"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6EWGJZG3uhs2",
"colab_type": "text"
},
"source": [
"## Objective: \n",
"\n",
"Optimize our resume by finding which skills we are missing. We will do this by finding skills missing from our resume that are in the least-similar skill requirement clusters."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jpAg086UuvRn",
"colab_type": "text"
},
"source": [
"## Workflow:\n",
"\n",
"1. Combine skill requirement cluster texts and calculate cosine similarity between our resume skills and the skill requirement clusters.\n",
"2. Rank the skill clusters by similarity to our resume skill list and visualize the results of how similar the clusters are to our skills.\n",
"3. Determine the skills that are missing from our resume that we could learn and/or add to our resume.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-GSEEXdWOKpl",
"colab_type": "text"
},
"source": [
"# 4.2 Action Starts Here ... "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "X1LYPAywxnbp",
"colab_type": "text"
},
"source": [
"## 1.Importing Libraries"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RkZ4_KV2ZlPL",
"colab_type": "text"
},
"source": [
"Note that we are installing the tqdm package here. This is a package that shows progress bars for loops. It's useful to see how long it takes to loop through various k-means cluster values. Since it's not in the environment file, we install it on-the-fly in the notebook. Preceeding a command with an exclamation point (!) runs the command on the underlying OS. Giving the flag -y ignores the prompt from conda asking if we want to install the package and it's prerequisites."
]
},
{
"cell_type": "code",
"metadata": {
"id": "jGwlBbYnZtQG",
"colab_type": "code",
"outputId": "ff529be1-340a-4754-a29d-053a957c6b74",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"#!pip install tqdm -y\n",
"!pip install tqdm "
],
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"text": [
"Requirement already satisfied: tqdm in /usr/local/lib/python3.6/dist-packages (4.28.1)\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "xXHEd5UKxnFe",
"colab_type": "code",
"outputId": "23dea7b4-6342-4a92-b5fa-939c98c06fd0",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 87
}
},
"source": [
"#from tqdm.notebook import tqdm\n",
"from tqdm import tqdm\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"\n",
"# string processing, reg expressions\n",
"import re, string\n",
"\n",
"# for freq vectorizer\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"# for DIM Red\n",
"from sklearn.decomposition import TruncatedSVD\n",
"\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import Normalizer\n",
"\n",
"# for Metrics\n",
"from sklearn import metrics\n",
"from sklearn.metrics import silhouette_score, silhouette_samples\n",
"\n",
"# for Clustering\n",
"from sklearn.cluster import KMeans\n",
"\n",
"# for Text Processing \n",
"import nltk\n",
"from nltk.stem import PorterStemmer\n",
"from nltk.corpus import stopwords\n",
"nltk.download('stopwords')\n",
"nltk.download('punkt')\n",
"\n",
"from pylab import *\n",
"\n",
"# for elbow plot (Visualizing)\n",
"import scipy.spatial.distance as scdist\n",
"\n",
"# for Counting the objects in each K-Mean Cluster\n",
"from collections import Counter\n",
"\n",
"# for Word Cloud Visulization\n",
"from wordcloud import WordCloud\n",
"\n",
"# Compute Cosine Similarity\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.manifold import TSNE"
],
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to /root/nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n",
"[nltk_data] Downloading package punkt to /root/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GBWZzknYOPLJ",
"colab_type": "text"
},
"source": [
"## 2.Loading the Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cJLIzNX6fbmo",
"colab_type": "text"
},
"source": [
"### Downloaded Resume"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WGEuvw8jaBC1",
"colab_type": "text"
},
"source": [
"We also load the DataFrame from step 2 which holds the most similar job postings to our resume."
]
},
{
"cell_type": "code",
"metadata": {
"id": "asdB9SEnOOs-",
"colab_type": "code",
"colab": {}
},
"source": [
"df_pkl = pd.read_pickle('step2_df.pk')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "KQ3RKcloxzw9",
"colab_type": "code",
"outputId": "b06bbbf3-682e-492f-b425-afeb296d7fe5",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 197
}
},
"source": [
"df_pkl.head()"
],
"execution_count": 6,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" <th>body</th>\n",
" <th>bullets</th>\n",
" <th>cosine_similarity</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Institutional Data and Research Analyst (6948U...</td>\n",
" <td>Institutional Data and Research Analyst (6948U...</td>\n",
" <td>()</td>\n",
" <td>0.143349</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Data Science Health Innovation Fellow Job - BI...</td>\n",
" <td>Data Science Health Innovation Fellow Job - BI...</td>\n",
" <td>(Demonstrated ability to propose, initiate, an...</td>\n",
" <td>0.125523</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Machine Learning Postdoctoral Fellow - San Fra...</td>\n",
" <td>Machine Learning Postdoctoral Fellow - San Fra...</td>\n",
" <td>(Design and develop distributed machine learni...</td>\n",
" <td>0.121162</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Data Analyst (6256U) 1737 - 1737 - Berkeley, C...</td>\n",
" <td>Data Analyst (6256U) 1737 - 1737 - Berkeley, C...</td>\n",
" <td>()</td>\n",
" <td>0.117481</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Senior Data Systems Analyst (0599U) - 1668 - 1...</td>\n",
" <td>Senior Data Systems Analyst (0599U) - 1668 - 1...</td>\n",
" <td>()</td>\n",
" <td>0.113083</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" title ... cosine_similarity\n",
"0 Institutional Data and Research Analyst (6948U... ... 0.143349\n",
"1 Data Science Health Innovation Fellow Job - BI... ... 0.125523\n",
"2 Machine Learning Postdoctoral Fellow - San Fra... ... 0.121162\n",
"3 Data Analyst (6256U) 1737 - 1737 - Berkeley, C... ... 0.117481\n",
"4 Senior Data Systems Analyst (0599U) - 1668 - 1... ... 0.113083\n",
"\n",
"[5 rows x 4 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 6
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "xgZXH3gfx-up",
"colab_type": "code",
"outputId": "f55fb3df-d034-4820-d598-e86c073689ba",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 265
}
},
"source": [
"print(df_pkl.describe)"
],
"execution_count": 7,
"outputs": [
{
"output_type": "stream",
"text": [
"<bound method NDFrame.describe of title ... cosine_similarity\n",
"0 Institutional Data and Research Analyst (6948U... ... 0.143349\n",
"1 Data Science Health Innovation Fellow Job - BI... ... 0.125523\n",
"2 Machine Learning Postdoctoral Fellow - San Fra... ... 0.121162\n",
"3 Data Analyst (6256U) 1737 - 1737 - Berkeley, C... ... 0.117481\n",
"4 Senior Data Systems Analyst (0599U) - 1668 - 1... ... 0.113083\n",
".. ... ... ...\n",
"70 Data Scientist - Pittsburgh, PA 15206 ... 0.057230\n",
"71 AI Research Scientist - Natural Language Proce... ... 0.056983\n",
"72 Senior Data Scientist - Authorship - Oakland, CA ... 0.056819\n",
"73 Senior Data Scientist - Palo Alto, CA ... 0.056781\n",
"74 Data Scientist - Denver, CO 80221 ... 0.056697\n",
"\n",
"[75 rows x 4 columns]>\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "IpfKfXv7yFis",
"colab_type": "code",
"outputId": "0e029ef5-71b8-43d4-974b-3c8071d39da7",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 105
}
},
"source": [
"df_pkl.dtypes"
],
"execution_count": 8,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"title object\n",
"body object\n",
"bullets object\n",
"cosine_similarity float64\n",
"dtype: object"
]
},
"metadata": {
"tags": []
},
"execution_count": 8
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "_0N5_-yfyMn2",
"colab_type": "code",
"outputId": "e7a688fb-27c9-4350-81ba-8222c60ce84e",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 176
}
},
"source": [
"df_pkl.info()"
],
"execution_count": 9,
"outputs": [
{
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 75 entries, 0 to 74\n",
"Data columns (total 4 columns):\n",
"title 75 non-null object\n",
"body 75 non-null object\n",
"bullets 75 non-null object\n",
"cosine_similarity 75 non-null float64\n",
"dtypes: float64(1), object(3)\n",
"memory usage: 2.5+ KB\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "CxCcmszPyXq4",
"colab_type": "code",
"outputId": "470557ac-a045-4018-b416-b0464674da48",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 105
}
},
"source": [
"df_pkl.infer_objects().dtypes"
],
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"title object\n",
"body object\n",
"bullets object\n",
"cosine_similarity float64\n",
"dtype: object"
]
},
"metadata": {
"tags": []
},
"execution_count": 10
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "Xdln35hF0d2y",
"colab_type": "code",
"outputId": "a10e1401-9cbb-4805-9c39-830f19de7755",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 303
}
},
"source": [
"df_pkl['bullets'][1]"
],
"execution_count": 11,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"('Demonstrated ability to propose, initiate, and carry out ambitious data-intensive research projects.',\n",
" 'Entrepreneurial abilities: demonstrated skills in finding unmet needs, translating those into tractable solutions that can be implemented, and working with few specifically assigned resources.',\n",
" 'Self-motivated and works well both independently and as part of a team.',\n",
" 'Strong collaboration skills with highly technical researchers and ability to engage across a variety of methodological fields (e.g., computer science, mathematics, and statistics), computational platforms, and ideally, research domains (e.g., health sciences, life sciences, social sciences, and medicine).',\n",
" 'Excellent and demonstrated ability to regularly, effectively communicate with management teams.',\n",
" 'Ability to communicate data insights that translate into significant impact in a clear and effective manner to technical and non-technical personnel at various levels in the organization and to external research and education audiences.',\n",
" 'In depth skills and experience with independently resolving complex computing / data / CI problems using introductory and / or intermediate principles.',\n",
" 'Ability to curate/clean/organize large and messy datasets; to write code to query and transform both unstructured and structured data.',\n",
" 'Strong programming skills in scripting languages such as Python, Java/Scala, and SQL and comfort with advanced analytics tools such as R, Spark, and/or Tableau, in addition to in programming languages (e.g. C/C++). Strong working knowledge of Hadoop, extraction/transformation/loading (ETL). Record of prototyping, developing and scaling software.',\n",
" 'Fluent in using scientific computing environments, e.g. HPC cluster (CPUs and GPUs) or cloud (AWS, Azure, Google, Salesforce, IBM, or VMWare).',\n",
" 'Highly advanced skills, and extensive experience associated multiple of the following: data modeling; data mining; mathematical modeling; artificial intelligence; machine learning; deep learning models (e.g., 2/3d CNN, LSTM/GRU), architecture (e.g., Resnet, U-net), and frameworks (e.g.,Tensorflow, pytorch, keras); natural language processing; knowledge graphs; reinforcement learning; data representation, and optimization.',\n",
" 'BS with 7+ years or MS with 6+ years or PhD with 3+ years of applicable experience is expected. Degree(s) should be in a technical discipline such as Computer Science, Engineering, Statistics, Physic, Math or other related fields.',\n",
" 'Experience working in health and biomedical sectors preferred, but not required.',\n",
" 'This is a two-year contract position. Contract positions may be extended based on operational demand. Contract positions are eligible to participate in the health and welfare programs offered by UC Berkeley.',\n",
" 'The salary range designated for this position: $117,800 - $130,000; however, starting salary will be commensurate with experience.')"
]
},
"metadata": {
"tags": []
},
"execution_count": 11
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tAudyKjTfgpn",
"colab_type": "text"
},
"source": [
"### Standard Resume"
]
},
{
"cell_type": "code",
"metadata": {
"id": "75kAw5vkfkYC",
"colab_type": "code",
"colab": {}
},
"source": [
"f_p = '/content/Liveproject Resume.txt'\n",
"\n",
"list_of_lists = []\n",
"\n",
"with open(f_p) as f:\n",
" for line in f:\n",
" inner_list = [line.strip() for line in line.split(' ')]\n",
" list_of_lists.append(inner_list)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "R7YnsI2UfvZX",
"colab_type": "code",
"colab": {}
},
"source": [
"l_resume = []\n",
"\n",
"# removing all non-alpha numeric\n",
"\n",
"# flattening all into a single list\n",
"\n",
"for l in list_of_lists:\n",
" for e in l:\n",
" l_resume.append(re.sub('[\\W_]+', '', e))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "FHU8ZRuJf00c",
"colab_type": "code",
"colab": {}
},
"source": [
"# # removing empty strings\n",
"\n",
"l_test = [x for x in l_resume if x != '']\n",
"#l_test"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "508XTEeKf7Ec",
"colab_type": "code",
"colab": {}
},
"source": [
"# combining all strings to form a long sentence\n",
"l_nonempty = ' '.join(l_test)\n",
"#l_nonempty"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "_VIF0ahT2-lU",
"colab_type": "text"
},
"source": [
"## Helper Functions"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZAxfIXr_3Qpu",
"colab_type": "text"
},
"source": [
"### cluster_to_skills()"
]
},
{
"cell_type": "code",
"metadata": {
"id": "1ef-SBtD3I9o",
"colab_type": "code",
"colab": {}
},
"source": [
"def cluster_to_skills(df_cluster, max_words=15): \n",
" \n",
" # From our previous IDF values, compute the tf-idf scores for documents within this cluster\n",
" s_temp = df_cluster['Bullet'].str.cat(sep=',')\n",
" s_temp=[s_temp]\n",
" \n",
" tfidf_matrix_all = bullet_vectorizer.transform(s_temp)\n",
"\n",
" num_samples, num_features = tfidf_matrix_all.shape\n",
" print(\"#samples: %d, #features: %d\" % (num_samples, num_features))\n",
" \n",
" #get the top n scores\n",
" df = pd.DataFrame(tfidf_matrix_all.T.todense(), index=bullet_vectorizer.get_feature_names(), columns=[\"tfidf\"])\n",
" \n",
" df_top_n = df.sort_values(by=[\"tfidf\"],ascending=False).head(max_words)\n",
" df_top_n = df_top_n.apply(lambda x: np.round(x, decimals=2))\n",
" \n",
" # setting index name explicity, required by to_dict()\n",
" df_top_n.index.name = 'feat'\n",
"\n",
" # getting a dictionary from df\n",
" dict_from_df = df_top_n.T.to_dict('list')\n",
"\n",
" # from list to float\n",
" words_to_score = {k:v[0] for k, v in dict_from_df.items()}\n",
" #words_to_score\n",
"\n",
" # transposing it\n",
" df_transposed = df_top_n.T \n",
" \n",
" # Word Cloud Generator\n",
" # cloud_generator = WordCloud(background_color='white', color_func=_color_func, random_state=1)\n",
" # wordcloud_image = cloud_generator.fit_words(words_to_score)\n",
" \n",
" # calculate cosine\n",
" v_cos_sim = cosine_similarity(df_transposed,df_resume_skills_transposed)\n",
"\n",
" # somehow the returned type from cosine_similarity is overlycomplex, simplyfying using squeeze\n",
" v_simple_cos = v_cos_sim.squeeze()\n",
"\n",
" #return wordcloud_image\n",
" return v_simple_cos"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "HDjzreg-DIPD",
"colab_type": "text"
},
"source": [
"### resume_to_skills()"
]
},
{
"cell_type": "code",
"metadata": {
"id": "az3_-73zCTJA",
"colab_type": "code",
"colab": {}
},
"source": [
"def resume_to_skills(resume, max_words=15): \n",
" \n",
" \n",
" # From our previous IDF values, compute the tf-idf scores for documents within this cluster\n",
" #s_temp = resume.str.cat(sep=',')\n",
" #s_temp=[s_temp]\n",
" #tfidf_matrix_all = bullet_vectorizer.fit_transform(s_temp)\n",
"\n",
" #num_samples, num_features = tfidf_matrix_all.shape\n",
" #print(\"#samples: %d, #features: %d\" % (num_samples, num_features))\n",
" \n",
" tfidf_matrix_all = resume\n",
"\n",
" #get the top n scores\n",
" df = pd.DataFrame(tfidf_matrix_all.T.todense(), index=bullet_vectorizer.get_feature_names(), columns=[\"tfidf\"])\n",
" #df_top_n = df.sort_values(by=[\"tfidf\"],ascending=False)\n",
" df_top_n = df.sort_values(by=[\"tfidf\"],ascending=False).head(max_words)\n",
" df_top_n = df_top_n.apply(lambda x: np.round(x, decimals=2))\n",
" #print(df_top_n.head(2))\n",
" \n",
" # setting index name explicity, required by to_dict()\n",
" df_top_n.index.name = 'feat'\n",
"\n",
" # getting a dictionary from df\n",
" dict_from_df = df_top_n.T.to_dict('list')\n",
"\n",
" # from list to float\n",
" words_to_score = {k:v[0] for k, v in dict_from_df.items()}\n",
" #words_to_score\n",
"\n",
"\n",
" # Word Cloud Generator\n",
" # cloud_generator = WordCloud(background_color='white', color_func=_color_func, random_state=1)\n",
" # wordcloud_image = cloud_generator.fit_words(words_to_score)\n",
" \n",
" #return wordcloud_image\n",
" return df_top_n"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "lo7U4coKaX39",
"colab_type": "text"
},
"source": [
"## 3.Transform bullet points into TFIDF vectors "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TOEWaxdVgFus",
"colab_type": "text"
},
"source": [
"### Downloaded Resume"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SVMRWg6BahCv",
"colab_type": "text"
},
"source": [
"We want to cluster the individual bullet points, so we create one large list of them here. This nested list comprehension in the next code cell is equivalent to:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "8eXSQKnPakFk",
"colab_type": "code",
"colab": {}
},
"source": [
"bullet_points = []\n",
"for sublist in df_pkl['bullets']:\n",
" for item in sublist:\n",
" bullet_points.append(item)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "sYsv805Va6fz",
"colab_type": "code",
"outputId": "b0fa9299-eb09-4a59-8b92-dcd82ceaebda",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"len(bullet_points)"
],
"execution_count": 19,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"1123"
]
},
"metadata": {
"tags": []
},
"execution_count": 19
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SokjMYYSa93B",
"colab_type": "text"
},
"source": [
"Once again, it's important to remove stopwords from our TFIDF vectors. Otherwise, the top words in our clusters will be words like 'and'."
]
},
{
"cell_type": "code",
"metadata": {
"id": "f-hB8L8oa_9y",
"colab_type": "code",
"colab": {}
},
"source": [
"bullet_vectorizer = TfidfVectorizer(stop_words='english')\n",
"tfidf_skills = bullet_vectorizer.fit_transform(bullet_points)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "9Sx27b72bEgZ",
"colab_type": "code",
"outputId": "17d3bb40-13c6-449f-ca16-0427285e4f19",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"tfidf_skills.shape"
],
"execution_count": 21,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(1123, 2050)"
]
},
"metadata": {
"tags": []
},
"execution_count": 21
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "gAq-BOdIx1w3",
"colab_type": "code",
"outputId": "7df0ac97-1cee-4404-f749-348ff00aa869",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 426
}
},
"source": [
"# Convert Sparse Matrix to Pandas Dataframe to see the word frequencies.\n",
"doc_term_matrix = tfidf_skills.todense()\n",
"df_temp_skills = pd.DataFrame(doc_term_matrix, \n",
" columns=bullet_vectorizer.get_feature_names()) \n",
"df_temp_skills"
],
"execution_count": 22,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>000</th>\n",
" <th>10</th>\n",
" <th>100mm</th>\n",
" <th>10pm</th>\n",
" <th>117</th>\n",
" <th>12</th>\n",
" <th>130</th>\n",
" <th>15</th>\n",
" <th>15am</th>\n",
" <th>15pm</th>\n",
" <th>20</th>\n",
" <th>200</th>\n",
" <th>24</th>\n",
" <th>25</th>\n",
" <th>2800</th>\n",
" <th>30</th>\n",
" <th>30am</th>\n",
" <th>30pm</th>\n",
" <th>3d</th>\n",
" <th>401</th>\n",
" <th>45pm</th>\n",
" <th>800</th>\n",
" <th>87653</th>\n",
" <th>87654</th>\n",
" <th>87655</th>\n",
" <th>abilities</th>\n",
" <th>ability</th>\n",
" <th>able</th>\n",
" <th>abreast</th>\n",
" <th>absence</th>\n",
" <th>academic</th>\n",
" <th>acceleration</th>\n",
" <th>acceptable</th>\n",
" <th>access</th>\n",
" <th>accessible</th>\n",
" <th>accomplishments</th>\n",
" <th>according</th>\n",
" <th>accountability</th>\n",
" <th>accredited</th>\n",
" <th>accuracy</th>\n",
" <th>...</th>\n",
" <th>vitae</th>\n",
" <th>vmware</th>\n",
" <th>voice</th>\n",
" <th>volume</th>\n",
" <th>volumes</th>\n",
" <th>walk</th>\n",
" <th>walnut</th>\n",
" <th>warehouse</th>\n",
" <th>way</th>\n",
" <th>ways</th>\n",
" <th>web</th>\n",
" <th>wed</th>\n",
" <th>week</th>\n",
" <th>weekday</th>\n",
" <th>weighing</th>\n",
" <th>weka</th>\n",
" <th>welcome</th>\n",
" <th>welfare</th>\n",
" <th>willing</th>\n",
" <th>willingness</th>\n",
" <th>windows</th>\n",
" <th>word</th>\n",
" <th>work</th>\n",
" <th>workers</th>\n",
" <th>workflow</th>\n",
" <th>workflows</th>\n",
" <th>working</th>\n",
" <th>works</th>\n",
" <th>world</th>\n",
" <th>wrangle</th>\n",
" <th>write</th>\n",
" <th>writing</th>\n",
" <th>written</th>\n",
" <th>wsj</th>\n",
" <th>xgboost</th>\n",
" <th>xponent</th>\n",
" <th>year</th>\n",
" <th>years</th>\n",
" <th>zeppelin</th>\n",
" <th>zr</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.00000</td>\n",
" <td>0.213497</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.27894</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.182367</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.52339</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.00000</td>\n",
" <td>0.115922</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.00000</td>\n",
" <td>0.260376</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1118</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1119</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.539787</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1120</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1121</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1122</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1123 rows × 2050 columns</p>\n",
"</div>"
],
"text/plain": [
" 000 10 100mm 10pm 117 ... xponent year years zeppelin zr\n",
"0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0\n",
"1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0\n",
"2 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0\n",
"3 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0\n",
"4 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0\n",
"... ... ... ... ... ... ... ... ... ... ... ...\n",
"1118 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0\n",
"1119 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0\n",
"1120 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0\n",
"1121 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0\n",
"1122 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0\n",
"\n",
"[1123 rows x 2050 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 22
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Ramf7RhCgI-_",
"colab_type": "text"
},
"source": [
"### Standard Resume"
]
},
{
"cell_type": "code",
"metadata": {
"id": "eTA8quFkgLjK",
"colab_type": "code",
"colab": {}
},
"source": [
"# note the [] around the input string, as transform expects a list\n",
"\n",
"my_resume = bullet_vectorizer.transform([l_nonempty])"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "bpKjqDcXjDy8",
"colab_type": "code",
"outputId": "7c5205df-2ca9-4b5b-cd01-91da10f0d921",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"my_resume.shape"
],
"execution_count": 24,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(1, 2050)"
]
},
"metadata": {
"tags": []
},
"execution_count": 24
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "9NWiQK7IxnRo",
"colab_type": "code",
"outputId": "bbfa4a7b-04b2-4f69-dd83-8237a02256ac",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 126
}
},
"source": [
"# Convert Sparse Matrix to Pandas Dataframe to see the word frequencies.\n",
"doc_term_matrix = my_resume.todense()\n",
"df_temp_resume = pd.DataFrame(doc_term_matrix, \n",
" columns=bullet_vectorizer.get_feature_names()) \n",
"df_temp_resume"
],
"execution_count": 25,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>000</th>\n",
" <th>10</th>\n",
" <th>100mm</th>\n",
" <th>10pm</th>\n",
" <th>117</th>\n",
" <th>12</th>\n",
" <th>130</th>\n",
" <th>15</th>\n",
" <th>15am</th>\n",
" <th>15pm</th>\n",
" <th>20</th>\n",
" <th>200</th>\n",
" <th>24</th>\n",
" <th>25</th>\n",
" <th>2800</th>\n",
" <th>30</th>\n",
" <th>30am</th>\n",
" <th>30pm</th>\n",
" <th>3d</th>\n",
" <th>401</th>\n",
" <th>45pm</th>\n",
" <th>800</th>\n",
" <th>87653</th>\n",
" <th>87654</th>\n",
" <th>87655</th>\n",
" <th>abilities</th>\n",
" <th>ability</th>\n",
" <th>able</th>\n",
" <th>abreast</th>\n",
" <th>absence</th>\n",
" <th>academic</th>\n",
" <th>acceleration</th>\n",
" <th>acceptable</th>\n",
" <th>access</th>\n",
" <th>accessible</th>\n",
" <th>accomplishments</th>\n",
" <th>according</th>\n",
" <th>accountability</th>\n",
" <th>accredited</th>\n",
" <th>accuracy</th>\n",
" <th>...</th>\n",
" <th>vitae</th>\n",
" <th>vmware</th>\n",
" <th>voice</th>\n",
" <th>volume</th>\n",
" <th>volumes</th>\n",
" <th>walk</th>\n",
" <th>walnut</th>\n",
" <th>warehouse</th>\n",
" <th>way</th>\n",
" <th>ways</th>\n",
" <th>web</th>\n",
" <th>wed</th>\n",
" <th>week</th>\n",
" <th>weekday</th>\n",
" <th>weighing</th>\n",
" <th>weka</th>\n",
" <th>welcome</th>\n",
" <th>welfare</th>\n",
" <th>willing</th>\n",
" <th>willingness</th>\n",
" <th>windows</th>\n",
" <th>word</th>\n",
" <th>work</th>\n",
" <th>workers</th>\n",
" <th>workflow</th>\n",
" <th>workflows</th>\n",
" <th>working</th>\n",
" <th>works</th>\n",
" <th>world</th>\n",
" <th>wrangle</th>\n",
" <th>write</th>\n",
" <th>writing</th>\n",
" <th>written</th>\n",
" <th>wsj</th>\n",
" <th>xgboost</th>\n",
" <th>xponent</th>\n",
" <th>year</th>\n",
" <th>years</th>\n",
" <th>zeppelin</th>\n",
" <th>zr</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1 rows × 2050 columns</p>\n",
"</div>"
],
"text/plain": [
" 000 10 100mm 10pm 117 ... xponent year years zeppelin zr\n",
"0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0\n",
"\n",
"[1 rows x 2050 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 25
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FmF_QPI6bNsV",
"colab_type": "text"
},
"source": [
"## 4.Reduce dimensions of the TFIDF vectors with SVD/LSA"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rknw9MDRg9pV",
"colab_type": "text"
},
"source": [
"### Downloaded Resume"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dsKssvVub-dB",
"colab_type": "text"
},
"source": [
"\n",
"We'll use 100 for the n_components as we found this accounts for almost 50% of the explained variance in the data with the SVD, and is the value suggested by the documentation. However, you can also use a lower value of 50, which should make the 'topics' in the LSA a bit more broad."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oLWJY50bbxNh",
"colab_type": "text"
},
"source": [
"We also normalize the results of the SVD as recommended in the example:\n",
"\n",
"\"[TFIDF] Vectorizer results are normalized, which makes KMeans behave as spherical k-means for better results. Since LSA/SVD results are not normalized, we have to redo the normalization."
]
},
{
"cell_type": "code",
"metadata": {
"id": "mnSEmgb-bw5Z",
"colab_type": "code",
"colab": {}
},
"source": [
"\n",
"svd = TruncatedSVD(n_components=100)\n",
"#lsa = svd.fit_transform(tfidf_skills)\n",
"#norm = Normalizer().fit_transform(lsa)\n",
"\n",
"\n",
"normalizer = Normalizer(copy=False)\n",
"lsa = make_pipeline(svd, normalizer)\n",
"norm = lsa.fit_transform(tfidf_skills)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "2vNsjHsaeUgT",
"colab_type": "code",
"colab": {}
},
"source": [
"#norm"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "H0Fw2oclYpBW",
"colab_type": "code",
"outputId": "1b2af515-8015-4563-aa44-e5ce58f51295",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"print(np.shape(norm))"
],
"execution_count": 28,
"outputs": [
{
"output_type": "stream",
"text": [
"(1123, 100)\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "05JD1EpYhBNz",
"colab_type": "text"
},
"source": [
"### Standard Resume"
]
},
{
"cell_type": "code",
"metadata": {
"id": "AbqAZFq_hD1m",
"colab_type": "code",
"outputId": "4f7fb8cd-b310-4ca6-cc6d-7343041f17e3",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
}
},
"source": [
"norm_standard = lsa.fit_transform(my_resume)"
],
"execution_count": 29,
"outputs": [
{
"output_type": "stream",
"text": [
"/usr/local/lib/python3.6/dist-packages/sklearn/decomposition/_truncated_svd.py:194: RuntimeWarning: invalid value encountered in true_divide\n",
" self.explained_variance_ratio_ = exp_var / full_var\n"
],
"name": "stderr"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "m_5uuPoSkTiB",
"colab_type": "text"
},
"source": [
"It is just one sample and the algorithm has no idea how to decompose it into a lower dimension because there are not other samples to compare it with."
]
},
{
"cell_type": "code",
"metadata": {
"id": "sCa2SMg9jgP_",
"colab_type": "code",
"outputId": "92c3679c-5b8f-4dd2-bb5a-7abe728cec6a",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"print(np.shape(norm_standard))"
],
"execution_count": 30,
"outputs": [
{
"output_type": "stream",
"text": [
"(1, 1)\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TVaORR-jnuz_",
"colab_type": "text"
},
"source": [
"Not required to process standard resume"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "t3NAFPSHcF3k",
"colab_type": "text"
},
"source": [
"## 5.Use k-means to cluster the SVD-transformed data, and decide on an optimal number of clusters"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yGpvCdWNhKv4",
"colab_type": "text"
},
"source": [
"### Downloaded Resume"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gj3gmIRqc7Fs",
"colab_type": "text"
},
"source": [
"We create a DataFrame with the cluster labels in order to index our TFIDF matrix for further processing"
]
},
{
"cell_type": "code",
"metadata": {
"id": "t1KR06Ngc9tJ",
"colab_type": "code",
"outputId": "0d6fcd52-b446-4b74-9fa7-9c39a71fcb9d",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 0
}
},
"source": [
"clusters = 6\n",
"km = KMeans(n_clusters=clusters, random_state=42)\n",
"km.fit(norm)\n",
"#cluster_labels_df = pd.DataFrame({'Cluster': km.labels_})"
],
"execution_count": 31,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,\n",
" n_clusters=6, n_init=10, n_jobs=None, precompute_distances='auto',\n",
" random_state=42, tol=0.0001, verbose=0)"
]
},
"metadata": {
"tags": []
},
"execution_count": 31
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JDKvC5SmhSqD",
"colab_type": "text"
},
"source": [
"### Standard Resume"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Sb6GbaCmhVO3",
"colab_type": "code",
"colab": {}
},
"source": [
"# Not required"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "U-6kGDmJjlEV",
"colab_type": "text"
},
"source": [
"## 6.Combine skill requirement cluster texts "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lk7NYdUkdGL7",
"colab_type": "text"
},
"source": [
"### Setting Things ... "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wiEOPq1Mbyk2",
"colab_type": "text"
},
"source": [
"Combining bullets (as these are set as individual docs for tf-idf) with their corresponding cluster ID"
]
},
{
"cell_type": "code",
"metadata": {
"id": "F1VsFBQ0aWoK",
"colab_type": "code",
"colab": {}
},
"source": [
"labels_df = pd.DataFrame({'Bullet': bullet_points,'Cluster': km.labels_})"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "TxCQepEPai2B",
"colab_type": "code",
"outputId": "596c47c4-884d-4a4b-a852-bf33da88b6b4",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"labels_df.shape"
],
"execution_count": 34,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(1123, 2)"
]
},
"metadata": {
"tags": []
},
"execution_count": 34
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "pnKPWHvlamLH",
"colab_type": "code",
"outputId": "0f83f8e3-f285-455a-9f2b-6006058b7bb9",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 107
}
},
"source": [
"labels_df.head(2)"
],
"execution_count": 35,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Bullet</th>\n",
" <th>Cluster</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Demonstrated ability to propose, initiate, and...</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Entrepreneurial abilities: demonstrated skills...</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Bullet Cluster\n",
"0 Demonstrated ability to propose, initiate, and... 4\n",
"1 Entrepreneurial abilities: demonstrated skills... 4"
]
},
"metadata": {
"tags": []
},
"execution_count": 35
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qrb60rGRXRVq",
"colab_type": "text"
},
"source": [
"Extracting Cluster 0 "
]
},
{
"cell_type": "code",
"metadata": {
"id": "92hqMsYmH5zn",
"colab_type": "code",
"colab": {}
},
"source": [
"# cluster 0\n",
"df_0 = list(labels_df.groupby('Cluster'))[0][1]\n",
"\n",
"# cluster 2\n",
"df_2 = list(labels_df.groupby('Cluster'))[2][1]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "8ZQLKYvcW4V_",
"colab_type": "text"
},
"source": [
"### Converting Std Resume for Cosine Sim"
]
},
{
"cell_type": "code",
"metadata": {
"id": "NufP2JlCEbBS",
"colab_type": "code",
"outputId": "fcca3eb1-880f-4f41-98f6-e57a3e05b9d7",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"np.shape(bullet_vectorizer.get_feature_names())"
],
"execution_count": 37,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(2050,)"
]
},
"metadata": {
"tags": []
},
"execution_count": 37
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "XW3O2j0PC_M0",
"colab_type": "code",
"colab": {}
},
"source": [
"df_resume_skills = resume_to_skills(my_resume)\n",
"#df_resume_skills"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "_XFrTB8-GdKV",
"colab_type": "code",
"colab": {}
},
"source": [
"# transposing it\n",
"df_resume_skills_transposed = df_resume_skills.T # or df1.transpose()\n",
"#df_resume_skills_transposed"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "BCjhNmhQgR0f",
"colab_type": "text"
},
"source": [
"## 7.Calculate cosine similarity between our resume skills and the skill requirement clusters."
]
},
{
"cell_type": "code",
"metadata": {
"id": "qVFG0LmWTzpX",
"colab_type": "code",
"outputId": "32a838ff-1cdb-4e47-e6ca-d7d4b3844837",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
}
},
"source": [
"cluster_to_skills(df_0)"
],
"execution_count": 40,
"outputs": [
{
"output_type": "stream",
"text": [
"#samples: 1, #features: 2050\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array(0.71436174)"
]
},
"metadata": {
"tags": []
},
"execution_count": 40
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MVmrjqB6jrYC",
"colab_type": "text"
},
"source": [
"Grouping by Cluster column and using split/apply/combine loop"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Pwqias0JLI4C",
"colab_type": "code",
"colab": {}
},
"source": [
"gb_df = labels_df.groupby('Cluster',as_index=False)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "S-K2R7_5N196",
"colab_type": "code",
"outputId": "48eef7ed-a296-45b0-bb3d-52cf4e60dc45",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 123
}
},
"source": [
"cos_result = gb_df.apply(cluster_to_skills)"
],
"execution_count": 42,
"outputs": [
{
"output_type": "stream",
"text": [
"#samples: 1, #features: 2050\n",
"#samples: 1, #features: 2050\n",
"#samples: 1, #features: 2050\n",
"#samples: 1, #features: 2050\n",
"#samples: 1, #features: 2050\n",
"#samples: 1, #features: 2050\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "_gBF6CN-Ycu3",
"colab_type": "code",
"outputId": "b8b09737-98be-439c-e40a-aabe8c4b7398",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 141
}
},
"source": [
"cos_result"
],
"execution_count": 43,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 0.7143617445537217\n",
"1 0.8811555211886989\n",
"2 0.9769363805300825\n",
"3 0.8506619288194702\n",
"4 0.930230370636883\n",
"5 0.9575181329314076\n",
"dtype: object"
]
},
"metadata": {
"tags": []
},
"execution_count": 43
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "71ypVKDQj2nc",
"colab_type": "text"
},
"source": [
"output is a pd.series"
]
},
{
"cell_type": "code",
"metadata": {
"id": "NLeUKB0wYj0d",
"colab_type": "code",
"outputId": "9424256c-e92c-42ba-f551-9bf0d27aad4e",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"print(cos_result.shape)"
],
"execution_count": 44,
"outputs": [
{
"output_type": "stream",
"text": [
"(6,)\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "Iw0LuhShZSG8",
"colab_type": "code",
"outputId": "2e9eb250-cb68-415b-8778-c35c7f7f84b6",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"type(cos_result)"
],
"execution_count": 45,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"metadata": {
"tags": []
},
"execution_count": 45
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sEUgBAAmj6qW",
"colab_type": "text"
},
"source": [
"convert from pd.series to pd.dataframe"
]
},
{
"cell_type": "code",
"metadata": {
"id": "QUpGDYlxc2j8",
"colab_type": "code",
"colab": {}
},
"source": [
"df_cosine = cos_result.to_frame(name=\"cosine\")"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "-jn-AWIHdCGH",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 227
},
"outputId": "15138c8e-53ad-4efa-b14b-7fb7ada1f4f0"
},
"source": [
"df_cosine"
],
"execution_count": 47,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>cosine</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.7143617445537217</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.8811555211886989</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.9769363805300825</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.8506619288194702</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.930230370636883</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0.9575181329314076</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" cosine\n",
"0 0.7143617445537217\n",
"1 0.8811555211886989\n",
"2 0.9769363805300825\n",
"3 0.8506619288194702\n",
"4 0.930230370636883\n",
"5 0.9575181329314076"
]
},
"metadata": {
"tags": []
},
"execution_count": 47
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "QuvGtBSEdgo1",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "ca6ff383-53fb-49c7-96ad-890e7b0a530f"
},
"source": [
"df_cosine.index"
],
"execution_count": 48,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"RangeIndex(start=0, stop=6, step=1)"
]
},
"metadata": {
"tags": []
},
"execution_count": 48
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "at0MnYHOj-zT",
"colab_type": "text"
},
"source": [
"adding cluster as index"
]
},
{
"cell_type": "code",
"metadata": {
"id": "oYlOVyJqdfKy",
"colab_type": "code",
"colab": {}
},
"source": [
"df_cosine['cluster'] = df_cosine.index"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "cmT0fexMdsHM",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 227
},
"outputId": "a3406cd4-3576-4857-a3b9-54eafffc23f4"
},
"source": [
"df_cosine"
],
"execution_count": 50,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>cosine</th>\n",
" <th>cluster</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.7143617445537217</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.8811555211886989</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.9769363805300825</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.8506619288194702</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.930230370636883</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0.9575181329314076</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" cosine cluster\n",
"0 0.7143617445537217 0\n",
"1 0.8811555211886989 1\n",
"2 0.9769363805300825 2\n",
"3 0.8506619288194702 3\n",
"4 0.930230370636883 4\n",
"5 0.9575181329314076 5"
]
},
"metadata": {
"tags": []
},
"execution_count": 50
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gJRezKBIgZ9v",
"colab_type": "text"
},
"source": [
"## 8. Rank the skill clusters by similarity to our resume skill list"
]
},
{
"cell_type": "code",
"metadata": {
"id": "h2_UsWFMgbFd",
"colab_type": "code",
"outputId": "c5eb6701-248d-4537-c3b0-aa6d0c531d9d",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 227
}
},
"source": [
"# Rank clusters from most to least similar to the resume\n",
"df_cosine.sort_values(by=['cosine'], ascending=False)"
],
"execution_count": 51,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>cosine</th>\n",
" <th>cluster</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.9769363805300825</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0.9575181329314076</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.930230370636883</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.8811555211886989</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.8506619288194702</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.7143617445537217</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" cosine cluster\n",
"2 0.9769363805300825 2\n",
"5 0.9575181329314076 5\n",
"4 0.930230370636883 4\n",
"1 0.8811555211886989 1\n",
"3 0.8506619288194702 3\n",
"0 0.7143617445537217 0"
]
},
"metadata": {
"tags": []
},
"execution_count": 51
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kpNMv82YlVOZ",
"colab_type": "text"
},
"source": [
"The resume is most similar to cluster 2 (cosine of about 0.977) and least similar to cluster 0 (cosine of about 0.714)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jbD5rcLGgdz8",
"colab_type": "text"
},
"source": [
"## 9. Visualize how similar the clusters are to our skills"
]
},
{
"cell_type": "code",
"metadata": {
"id": "UzDrHYaYkFuN",
"colab_type": "code",
"colab": {}
},
"source": [
"# TBD"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "FmWoieUKggwb",
"colab_type": "text"
},
"source": [
"## 10. Determine which skills are missing from our resume that we could learn and/or add to it"
]
},
{
"cell_type": "code",
"metadata": {
"id": "7WaR6Js-ghwI",
"colab_type": "code",
"colab": {}
},
"source": [
"# TBD"
],
"execution_count": 0,
"outputs": []
}
]
}