patternproject/wk4_submission.ipynb

## wk4_submission.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Wk4_Submission.ipynb",
      "provenance": [],
      "toc_visible": true,
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/patternproject/d6e6dc65ff7c3b8048a74d0a44843c25/wk4_submission.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ExrcEKOJqRMS",
        "colab_type": "text"
      },
      "source": [
        "Manning LP \"Using Online Job Postings to Improve Your Data Science Resume\" "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GTXpVFDmqdGi",
        "colab_type": "text"
      },
      "source": [
        "Week 4 - Finding Missing Skills From Our Resume"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cA2s2jr6uJhM",
        "colab_type": "text"
      },
      "source": [
        "# 4.1 Determine Missing Resume Skills"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6EWGJZG3uhs2",
        "colab_type": "text"
      },
      "source": [
        "## Objective: \n",
        "\n",
        "Optimize our resume by finding which skills we are missing. We will do this by finding skills missing from our resume that are in the least-similar skill requirement clusters."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jpAg086UuvRn",
        "colab_type": "text"
      },
      "source": [
        "## Workflow:\n",
        "\n",
        "1.   Combine skill requirement cluster texts and calculate cosine similarity between our resume skills and the skill requirement clusters.\n",
        "2.   Rank the skill clusters by similarity to our resume skill list and visualize the results of how similar the clusters are to our skills.\n",
        "3.   Determine the skills that are missing from our resume that we could learn and/or add to our resume.\n"
      ]
    },
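    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "A minimal sketch of steps 1 and 2, assuming scikit-learn is available. The resume text and the two cluster texts are made-up toy strings, and the names `toy_resume`, `toy_clusters`, and `ranked` are hypothetical; the notebook's own data and helper functions appear further down.\n",
        "\n",
        "```python\n",
        "from sklearn.feature_extraction.text import TfidfVectorizer\n",
        "from sklearn.metrics.pairwise import cosine_similarity\n",
        "\n",
        "# Toy stand-ins (assumptions, not the notebook's data)\n",
        "toy_resume = 'python pandas machine learning statistics'\n",
        "toy_clusters = {\n",
        "    'ml': 'machine learning python deep learning models',\n",
        "    'comms': 'communicate clearly with management teams',\n",
        "}\n",
        "\n",
        "# Fit one vocabulary over the cluster texts, then reuse it for the resume\n",
        "vectorizer = TfidfVectorizer(stop_words='english')\n",
        "cluster_vecs = vectorizer.fit_transform(list(toy_clusters.values()))\n",
        "resume_vec = vectorizer.transform([toy_resume])\n",
        "\n",
        "# One similarity score per cluster, sorted from least to most similar\n",
        "scores = cosine_similarity(resume_vec, cluster_vecs).ravel()\n",
        "ranked = sorted(zip(toy_clusters, scores), key=lambda pair: pair[1])\n",
        "for name, score in ranked:\n",
        "    print(f'{name}: {score:.3f}')\n",
        "```\n",
        "\n",
        "The clusters at the top of this ranking (lowest similarity) are the ones most likely to contain skills missing from the resume."
      ]
    },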
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-GSEEXdWOKpl",
        "colab_type": "text"
      },
      "source": [
        "# 4.2 Action Starts Here ... "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "X1LYPAywxnbp",
        "colab_type": "text"
      },
      "source": [
        "## 1.Importing Libraries"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RkZ4_KV2ZlPL",
        "colab_type": "text"
      },
      "source": [
        "Note that we are installing the tqdm package here. This is a package that shows progress bars for loops. It's useful to see how long it takes to loop through various k-means cluster values. Since it's not in the environment file, we install it on-the-fly in the notebook. Preceeding a command with an exclamation point (!) runs the command on the underlying OS. Giving the flag -y ignores the prompt from conda asking if we want to install the package and it's prerequisites."
      ]
    },
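    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "For illustration, a minimal sketch of wrapping a loop in `tqdm`. The loop body is a placeholder computation and the name `squares` is hypothetical; it is not the notebook's k-means search.\n",
        "\n",
        "```python\n",
        "from tqdm import tqdm\n",
        "\n",
        "squares = []\n",
        "# tqdm wraps any iterable and prints a progress bar as the loop advances\n",
        "for k in tqdm(range(2, 11), desc='k values'):\n",
        "    squares.append(k * k)  # placeholder for fitting a model with k clusters\n",
        "```"
      ]
    },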
    {
      "cell_type": "code",
      "metadata": {
        "id": "jGwlBbYnZtQG",
        "colab_type": "code",
        "outputId": "ff529be1-340a-4754-a29d-053a957c6b74",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "#!pip install tqdm -y\n",
        "!pip install tqdm "
      ],
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Requirement already satisfied: tqdm in /usr/local/lib/python3.6/dist-packages (4.28.1)\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "xXHEd5UKxnFe",
        "colab_type": "code",
        "outputId": "23dea7b4-6342-4a92-b5fa-939c98c06fd0",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 87
        }
      },
      "source": [
        "#from tqdm.notebook import tqdm\n",
        "from tqdm import tqdm\n",
        "import pandas as pd\n",
        "import numpy as np\n",
        "\n",
        "import matplotlib.pyplot as plt\n",
        "%matplotlib inline\n",
        "\n",
        "\n",
        "# string processing, reg expressions\n",
        "import re, string\n",
        "\n",
        "# for freq vectorizer\n",
        "from sklearn.feature_extraction.text import TfidfVectorizer\n",
        "\n",
        "# for DIM Red\n",
        "from sklearn.decomposition import TruncatedSVD\n",
        "\n",
        "from sklearn.pipeline import make_pipeline\n",
        "from sklearn.preprocessing import Normalizer\n",
        "\n",
        "# for Metrics\n",
        "from sklearn import metrics\n",
        "from sklearn.metrics import silhouette_score, silhouette_samples\n",
        "\n",
        "# for Clustering\n",
        "from sklearn.cluster import KMeans\n",
        "\n",
        "# for Text Processing \n",
        "import nltk\n",
        "from nltk.stem import PorterStemmer\n",
        "from nltk.corpus import stopwords\n",
        "nltk.download('stopwords')\n",
        "nltk.download('punkt')\n",
        "\n",
        "from pylab import *\n",
        "\n",
        "# for elbow plot (Visualizing)\n",
        "import scipy.spatial.distance as scdist\n",
        "\n",
        "# for Counting the objects in each K-Mean Cluster\n",
        "from collections import Counter\n",
        "\n",
        "# for Word Cloud Visulization\n",
        "from wordcloud import WordCloud\n",
        "\n",
        "# Compute Cosine Similarity\n",
        "from sklearn.metrics.pairwise import cosine_similarity\n",
        "\n",
        "from sklearn.decomposition import PCA\n",
        "from sklearn.manifold import TSNE"
      ],
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "[nltk_data] Downloading package stopwords to /root/nltk_data...\n",
            "[nltk_data]   Package stopwords is already up-to-date!\n",
            "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
            "[nltk_data]   Package punkt is already up-to-date!\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GBWZzknYOPLJ",
        "colab_type": "text"
      },
      "source": [
        "## 2.Loading the Dataset"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cJLIzNX6fbmo",
        "colab_type": "text"
      },
      "source": [
        "### Downloaded Resume"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WGEuvw8jaBC1",
        "colab_type": "text"
      },
      "source": [
        "We also load the DataFrame from step 2 which holds the most similar job postings to our resume."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "asdB9SEnOOs-",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "df_pkl = pd.read_pickle('step2_df.pk')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "KQ3RKcloxzw9",
        "colab_type": "code",
        "outputId": "b06bbbf3-682e-492f-b425-afeb296d7fe5",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 197
        }
      },
      "source": [
        "df_pkl.head()"
      ],
      "execution_count": 6,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>title</th>\n",
              "      <th>body</th>\n",
              "      <th>bullets</th>\n",
              "      <th>cosine_similarity</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Institutional Data and Research Analyst (6948U...</td>\n",
              "      <td>Institutional Data and Research Analyst (6948U...</td>\n",
              "      <td>()</td>\n",
              "      <td>0.143349</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Data Science Health Innovation Fellow Job - BI...</td>\n",
              "      <td>Data Science Health Innovation Fellow Job - BI...</td>\n",
              "      <td>(Demonstrated ability to propose, initiate, an...</td>\n",
              "      <td>0.125523</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Machine Learning Postdoctoral Fellow - San Fra...</td>\n",
              "      <td>Machine Learning Postdoctoral Fellow - San Fra...</td>\n",
              "      <td>(Design and develop distributed machine learni...</td>\n",
              "      <td>0.121162</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Data Analyst (6256U) 1737 - 1737 - Berkeley, C...</td>\n",
              "      <td>Data Analyst (6256U) 1737 - 1737 - Berkeley, C...</td>\n",
              "      <td>()</td>\n",
              "      <td>0.117481</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Senior Data Systems Analyst (0599U) - 1668 - 1...</td>\n",
              "      <td>Senior Data Systems Analyst (0599U) - 1668 - 1...</td>\n",
              "      <td>()</td>\n",
              "      <td>0.113083</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                               title  ... cosine_similarity\n",
              "0  Institutional Data and Research Analyst (6948U...  ...          0.143349\n",
              "1  Data Science Health Innovation Fellow Job - BI...  ...          0.125523\n",
              "2  Machine Learning Postdoctoral Fellow - San Fra...  ...          0.121162\n",
              "3  Data Analyst (6256U) 1737 - 1737 - Berkeley, C...  ...          0.117481\n",
              "4  Senior Data Systems Analyst (0599U) - 1668 - 1...  ...          0.113083\n",
              "\n",
              "[5 rows x 4 columns]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 6
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "xgZXH3gfx-up",
        "colab_type": "code",
        "outputId": "f55fb3df-d034-4820-d598-e86c073689ba",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 265
        }
      },
      "source": [
        "print(df_pkl.describe)"
      ],
      "execution_count": 7,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "<bound method NDFrame.describe of                                                 title  ... cosine_similarity\n",
            "0   Institutional Data and Research Analyst (6948U...  ...          0.143349\n",
            "1   Data Science Health Innovation Fellow Job - BI...  ...          0.125523\n",
            "2   Machine Learning Postdoctoral Fellow - San Fra...  ...          0.121162\n",
            "3   Data Analyst (6256U) 1737 - 1737 - Berkeley, C...  ...          0.117481\n",
            "4   Senior Data Systems Analyst (0599U) - 1668 - 1...  ...          0.113083\n",
            "..                                                ...  ...               ...\n",
            "70              Data Scientist - Pittsburgh, PA 15206  ...          0.057230\n",
            "71  AI Research Scientist - Natural Language Proce...  ...          0.056983\n",
            "72   Senior Data Scientist - Authorship - Oakland, CA  ...          0.056819\n",
            "73              Senior Data Scientist - Palo Alto, CA  ...          0.056781\n",
            "74                  Data Scientist - Denver, CO 80221  ...          0.056697\n",
            "\n",
            "[75 rows x 4 columns]>\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "IpfKfXv7yFis",
        "colab_type": "code",
        "outputId": "0e029ef5-71b8-43d4-974b-3c8071d39da7",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 105
        }
      },
      "source": [
        "df_pkl.dtypes"
      ],
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "title                 object\n",
              "body                  object\n",
              "bullets               object\n",
              "cosine_similarity    float64\n",
              "dtype: object"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 8
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "_0N5_-yfyMn2",
        "colab_type": "code",
        "outputId": "e7a688fb-27c9-4350-81ba-8222c60ce84e",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 176
        }
      },
      "source": [
        "df_pkl.info()"
      ],
      "execution_count": 9,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "<class 'pandas.core.frame.DataFrame'>\n",
            "RangeIndex: 75 entries, 0 to 74\n",
            "Data columns (total 4 columns):\n",
            "title                75 non-null object\n",
            "body                 75 non-null object\n",
            "bullets              75 non-null object\n",
            "cosine_similarity    75 non-null float64\n",
            "dtypes: float64(1), object(3)\n",
            "memory usage: 2.5+ KB\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "CxCcmszPyXq4",
        "colab_type": "code",
        "outputId": "470557ac-a045-4018-b416-b0464674da48",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 105
        }
      },
      "source": [
        "df_pkl.infer_objects().dtypes"
      ],
      "execution_count": 10,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "title                 object\n",
              "body                  object\n",
              "bullets               object\n",
              "cosine_similarity    float64\n",
              "dtype: object"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 10
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Xdln35hF0d2y",
        "colab_type": "code",
        "outputId": "a10e1401-9cbb-4805-9c39-830f19de7755",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 303
        }
      },
      "source": [
        "df_pkl['bullets'][1]"
      ],
      "execution_count": 11,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "('Demonstrated ability to propose, initiate, and carry out ambitious data-intensive research projects.',\n",
              " 'Entrepreneurial abilities: demonstrated skills in finding unmet needs, translating those into tractable solutions that can be implemented, and working with few specifically assigned resources.',\n",
              " 'Self-motivated and works well both independently and as part of a team.',\n",
              " 'Strong collaboration skills with highly technical researchers and ability to engage across a variety of methodological fields (e.g., computer science, mathematics, and statistics), computational platforms, and ideally, research domains (e.g., health sciences, life sciences, social sciences, and medicine).',\n",
              " 'Excellent and demonstrated ability to regularly, effectively communicate with management teams.',\n",
              " 'Ability to communicate data insights that translate into significant impact in a clear and effective manner to technical and non-technical personnel at various levels in the organization and to external research and education audiences.',\n",
              " 'In depth skills and experience with independently resolving complex computing / data / CI problems using introductory and / or intermediate principles.',\n",
              " 'Ability to curate/clean/organize large and messy datasets; to write code to query and transform both unstructured and structured data.',\n",
              " 'Strong programming skills in scripting languages such as Python, Java/Scala, and SQL and comfort with advanced analytics tools such as R, Spark, and/or Tableau, in addition to in programming languages (e.g. C/C++). Strong working knowledge of Hadoop, extraction/transformation/loading (ETL). Record of prototyping, developing and scaling software.',\n",
              " 'Fluent in using scientific computing environments, e.g. HPC cluster (CPUs and GPUs) or cloud (AWS, Azure, Google, Salesforce, IBM, or VMWare).',\n",
              " 'Highly advanced skills, and extensive experience associated multiple of the following: data modeling; data mining; mathematical modeling; artificial intelligence; machine learning; deep learning models (e.g., 2/3d CNN, LSTM/GRU), architecture (e.g., Resnet, U-net), and frameworks (e.g.,Tensorflow, pytorch, keras); natural language processing; knowledge graphs; reinforcement learning; data representation, and optimization.',\n",
              " 'BS with 7+ years or MS with 6+ years or PhD with 3+ years of applicable experience is expected. Degree(s) should be in a technical discipline such as Computer Science, Engineering, Statistics, Physic, Math or other related fields.',\n",
              " 'Experience working in health and biomedical sectors preferred, but not required.',\n",
              " 'This is a two-year contract position. Contract positions may be extended based on operational demand. Contract positions are eligible to participate in the health and welfare programs offered by UC Berkeley.',\n",
              " 'The salary range designated for this position: $117,800 - $130,000; however, starting salary will be commensurate with experience.')"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 11
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tAudyKjTfgpn",
        "colab_type": "text"
      },
      "source": [
        "### Standard Resume"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "75kAw5vkfkYC",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "f_p = '/content/Liveproject Resume.txt'\n",
        "\n",
        "list_of_lists = []\n",
        "\n",
        "with open(f_p) as f:\n",
        "  for line in f:\n",
        "    inner_list = [line.strip() for line in line.split(' ')]\n",
        "    list_of_lists.append(inner_list)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "R7YnsI2UfvZX",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "l_resume = []\n",
        "\n",
        "# removing all non-alpha numeric\n",
        "\n",
        "# flattening all into a single list\n",
        "\n",
        "for l in list_of_lists:\n",
        "  for e in l:\n",
        "   l_resume.append(re.sub('[\\W_]+', '', e))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "FHU8ZRuJf00c",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# # removing empty strings\n",
        "\n",
        "l_test = [x for x in l_resume if x != '']\n",
        "#l_test"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "508XTEeKf7Ec",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# combining all strings to form a long sentence\n",
        "l_nonempty = ' '.join(l_test)\n",
        "#l_nonempty"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_VIF0ahT2-lU",
        "colab_type": "text"
      },
      "source": [
        "## Helper Functions"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZAxfIXr_3Qpu",
        "colab_type": "text"
      },
      "source": [
        "### cluster_to_skills()"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1ef-SBtD3I9o",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def cluster_to_skills(df_cluster, max_words=15): \n",
        "  \n",
        "  # From our previous IDF values, compute the tf-idf scores for documents within this cluster\n",
        "  s_temp = df_cluster['Bullet'].str.cat(sep=',')\n",
        "  s_temp=[s_temp]\n",
        "  \n",
        "  tfidf_matrix_all = bullet_vectorizer.transform(s_temp)\n",
        "\n",
        "  num_samples, num_features = tfidf_matrix_all.shape\n",
        "  print(\"#samples: %d, #features: %d\" % (num_samples, num_features))\n",
        " \n",
        "  #get the top n scores\n",
        "  df = pd.DataFrame(tfidf_matrix_all.T.todense(), index=bullet_vectorizer.get_feature_names(), columns=[\"tfidf\"])\n",
        " \n",
        "  df_top_n = df.sort_values(by=[\"tfidf\"],ascending=False).head(max_words)\n",
        "  df_top_n = df_top_n.apply(lambda x: np.round(x, decimals=2))\n",
        "   \n",
        "  # setting index name explicity, required by to_dict()\n",
        "  df_top_n.index.name = 'feat'\n",
        "\n",
        "  # getting a dictionary from df\n",
        "  dict_from_df = df_top_n.T.to_dict('list')\n",
        "\n",
        "  # from list to float\n",
        "  words_to_score = {k:v[0] for k, v in dict_from_df.items()}\n",
        "  #words_to_score\n",
        "\n",
        "  # transposing it\n",
        "  df_transposed = df_top_n.T \n",
        "  \n",
        "  # Word Cloud Generator\n",
        "  # cloud_generator = WordCloud(background_color='white', color_func=_color_func, random_state=1)\n",
        "  # wordcloud_image = cloud_generator.fit_words(words_to_score)\n",
        "  \n",
        "  # calculate cosine\n",
        "  v_cos_sim = cosine_similarity(df_transposed,df_resume_skills_transposed)\n",
        "\n",
        "  # somehow the returned type from cosine_similarity is overlycomplex, simplyfying using squeeze\n",
        "  v_simple_cos = v_cos_sim.squeeze()\n",
        "\n",
        "  #return wordcloud_image\n",
        "  return v_simple_cos"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HDjzreg-DIPD",
        "colab_type": "text"
      },
      "source": [
        "### resume_to_skills()"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "az3_-73zCTJA",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def resume_to_skills(resume, max_words=15): \n",
        "  \n",
        " \n",
        "  # From our previous IDF values, compute the tf-idf scores for documents within this cluster\n",
        "  #s_temp = resume.str.cat(sep=',')\n",
        "  #s_temp=[s_temp]\n",
        "  #tfidf_matrix_all = bullet_vectorizer.fit_transform(s_temp)\n",
        "\n",
        "  #num_samples, num_features = tfidf_matrix_all.shape\n",
        "  #print(\"#samples: %d, #features: %d\" % (num_samples, num_features))\n",
        " \n",
        "  tfidf_matrix_all = resume\n",
        "\n",
        "  #get the top n scores\n",
        "  df = pd.DataFrame(tfidf_matrix_all.T.todense(), index=bullet_vectorizer.get_feature_names(), columns=[\"tfidf\"])\n",
        "  #df_top_n = df.sort_values(by=[\"tfidf\"],ascending=False)\n",
        "  df_top_n = df.sort_values(by=[\"tfidf\"],ascending=False).head(max_words)\n",
        "  df_top_n = df_top_n.apply(lambda x: np.round(x, decimals=2))\n",
        "  #print(df_top_n.head(2))\n",
        "  \n",
        "  # setting index name explicity, required by to_dict()\n",
        "  df_top_n.index.name = 'feat'\n",
        "\n",
        "  # getting a dictionary from df\n",
        "  dict_from_df = df_top_n.T.to_dict('list')\n",
        "\n",
        "  # from list to float\n",
        "  words_to_score = {k:v[0] for k, v in dict_from_df.items()}\n",
        "  #words_to_score\n",
        "\n",
        "\n",
        "  # Word Cloud Generator\n",
        "  # cloud_generator = WordCloud(background_color='white', color_func=_color_func, random_state=1)\n",
        "  # wordcloud_image = cloud_generator.fit_words(words_to_score)\n",
        "  \n",
        "  #return wordcloud_image\n",
        "  return df_top_n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lo7U4coKaX39",
        "colab_type": "text"
      },
      "source": [
        "## 3.Transform bullet points into TFIDF vectors "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TOEWaxdVgFus",
        "colab_type": "text"
      },
      "source": [
        "### Downloaded Resume"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SVMRWg6BahCv",
        "colab_type": "text"
      },
      "source": [
        "We want to cluster the individual bullet points, so we create one large list of them here. This nested list comprehension in the next code cell is equivalent to:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8eXSQKnPakFk",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "bullet_points = []\n",
        "for sublist in df_pkl['bullets']:\n",
        "    for item in sublist:\n",
        "        bullet_points.append(item)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "sYsv805Va6fz",
        "colab_type": "code",
        "outputId": "b0fa9299-eb09-4a59-8b92-dcd82ceaebda",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "len(bullet_points)"
      ],
      "execution_count": 19,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "1123"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 19
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SokjMYYSa93B",
        "colab_type": "text"
      },
      "source": [
        "Once again, it's important to remove stopwords from our TFIDF vectors. Otherwise, the top words in our clusters will be words like 'and'."
      ]
    },
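    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "A small sketch of the effect, using two made-up bullet strings (an assumption for illustration; `toy_bullets`, `with_stop`, and `without_stop` are hypothetical names). With `stop_words='english'`, common words such as 'and' are dropped from the vocabulary.\n",
        "\n",
        "```python\n",
        "from sklearn.feature_extraction.text import TfidfVectorizer\n",
        "\n",
        "toy_bullets = ['Python and SQL', 'statistics and machine learning']\n",
        "\n",
        "# with_stop keeps stop words, without_stop filters them out\n",
        "with_stop = TfidfVectorizer().fit(toy_bullets)\n",
        "without_stop = TfidfVectorizer(stop_words='english').fit(toy_bullets)\n",
        "\n",
        "# vocabulary_ maps each kept term to its column index\n",
        "print('and' in with_stop.vocabulary_)\n",
        "print('and' in without_stop.vocabulary_)\n",
        "```"
      ]
    },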
    {
      "cell_type": "code",
      "metadata": {
        "id": "f-hB8L8oa_9y",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "bullet_vectorizer = TfidfVectorizer(stop_words='english')\n",
        "tfidf_skills = bullet_vectorizer.fit_transform(bullet_points)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "9Sx27b72bEgZ",
        "colab_type": "code",
        "outputId": "17d3bb40-13c6-449f-ca16-0427285e4f19",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "tfidf_skills.shape"
      ],
      "execution_count": 21,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(1123, 2050)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 21
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "gAq-BOdIx1w3",
        "colab_type": "code",
        "outputId": "7df0ac97-1cee-4404-f749-348ff00aa869",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 426
        }
      },
      "source": [
        "# Convert Sparse Matrix to Pandas Dataframe to see the word frequencies.\n",
        "doc_term_matrix = tfidf_skills.todense()\n",
        "df_temp_skills = pd.DataFrame(doc_term_matrix, \n",
        "                  columns=bullet_vectorizer.get_feature_names()) \n",
        "df_temp_skills"
      ],
      "execution_count": 22,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>000</th>\n",
              "      <th>10</th>\n",
              "      <th>100mm</th>\n",
              "      <th>10pm</th>\n",
              "      <th>117</th>\n",
              "      <th>12</th>\n",
              "      <th>130</th>\n",
              "      <th>15</th>\n",
              "      <th>15am</th>\n",
              "      <th>15pm</th>\n",
              "      <th>20</th>\n",
              "      <th>200</th>\n",
              "      <th>24</th>\n",
              "      <th>25</th>\n",
              "      <th>2800</th>\n",
              "      <th>30</th>\n",
              "      <th>30am</th>\n",
              "      <th>30pm</th>\n",
              "      <th>3d</th>\n",
              "      <th>401</th>\n",
              "      <th>45pm</th>\n",
              "      <th>800</th>\n",
              "      <th>87653</th>\n",
              "      <th>87654</th>\n",
              "      <th>87655</th>\n",
              "      <th>abilities</th>\n",
              "      <th>ability</th>\n",
              "      <th>able</th>\n",
              "      <th>abreast</th>\n",
              "      <th>absence</th>\n",
              "      <th>academic</th>\n",
              "      <th>acceleration</th>\n",
              "      <th>acceptable</th>\n",
              "      <th>access</th>\n",
              "      <th>accessible</th>\n",
              "      <th>accomplishments</th>\n",
              "      <th>according</th>\n",
              "      <th>accountability</th>\n",
              "      <th>accredited</th>\n",
              "      <th>accuracy</th>\n",
              "      <th>...</th>\n",
              "      <th>vitae</th>\n",
              "      <th>vmware</th>\n",
              "      <th>voice</th>\n",
              "      <th>volume</th>\n",
              "      <th>volumes</th>\n",
              "      <th>walk</th>\n",
              "      <th>walnut</th>\n",
              "      <th>warehouse</th>\n",
              "      <th>way</th>\n",
              "      <th>ways</th>\n",
              "      <th>web</th>\n",
              "      <th>wed</th>\n",
              "      <th>week</th>\n",
              "      <th>weekday</th>\n",
              "      <th>weighing</th>\n",
              "      <th>weka</th>\n",
              "      <th>welcome</th>\n",
              "      <th>welfare</th>\n",
              "      <th>willing</th>\n",
              "      <th>willingness</th>\n",
              "      <th>windows</th>\n",
              "      <th>word</th>\n",
              "      <th>work</th>\n",
              "      <th>workers</th>\n",
              "      <th>workflow</th>\n",
              "      <th>workflows</th>\n",
              "      <th>working</th>\n",
              "      <th>works</th>\n",
              "      <th>world</th>\n",
              "      <th>wrangle</th>\n",
              "      <th>write</th>\n",
              "      <th>writing</th>\n",
              "      <th>written</th>\n",
              "      <th>wsj</th>\n",
              "      <th>xgboost</th>\n",
              "      <th>xponent</th>\n",
              "      <th>year</th>\n",
              "      <th>years</th>\n",
              "      <th>zeppelin</th>\n",
              "      <th>zr</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.213497</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>...</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.27894</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>...</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.182367</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>...</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.52339</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.115922</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>...</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.260376</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>...</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>...</th>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1118</th>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>...</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1119</th>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.539787</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>...</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1120</th>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>...</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1121</th>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>...</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1122</th>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>...</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>0.00000</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>1123 rows × 2050 columns</p>\n",
              "</div>"
            ],
            "text/plain": [
              "      000   10  100mm  10pm  117  ...  xponent  year  years  zeppelin   zr\n",
              "0     0.0  0.0    0.0   0.0  0.0  ...      0.0   0.0    0.0       0.0  0.0\n",
              "1     0.0  0.0    0.0   0.0  0.0  ...      0.0   0.0    0.0       0.0  0.0\n",
              "2     0.0  0.0    0.0   0.0  0.0  ...      0.0   0.0    0.0       0.0  0.0\n",
              "3     0.0  0.0    0.0   0.0  0.0  ...      0.0   0.0    0.0       0.0  0.0\n",
              "4     0.0  0.0    0.0   0.0  0.0  ...      0.0   0.0    0.0       0.0  0.0\n",
              "...   ...  ...    ...   ...  ...  ...      ...   ...    ...       ...  ...\n",
              "1118  0.0  0.0    0.0   0.0  0.0  ...      0.0   0.0    0.0       0.0  0.0\n",
              "1119  0.0  0.0    0.0   0.0  0.0  ...      0.0   0.0    0.0       0.0  0.0\n",
              "1120  0.0  0.0    0.0   0.0  0.0  ...      0.0   0.0    0.0       0.0  0.0\n",
              "1121  0.0  0.0    0.0   0.0  0.0  ...      0.0   0.0    0.0       0.0  0.0\n",
              "1122  0.0  0.0    0.0   0.0  0.0  ...      0.0   0.0    0.0       0.0  0.0\n",
              "\n",
              "[1123 rows x 2050 columns]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 22
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ramf7RhCgI-_",
        "colab_type": "text"
      },
      "source": [
        "### Standard Resume"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "eTA8quFkgLjK",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# note the [] around the input string, as transform expects a list\n",
        "\n",
        "my_resume = bullet_vectorizer.transform([l_nonempty])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "bpKjqDcXjDy8",
        "colab_type": "code",
        "outputId": "7c5205df-2ca9-4b5b-cd01-91da10f0d921",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "my_resume.shape"
      ],
      "execution_count": 24,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(1, 2050)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 24
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "9NWiQK7IxnRo",
        "colab_type": "code",
        "outputId": "bbfa4a7b-04b2-4f69-dd83-8237a02256ac",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 126
        }
      },
      "source": [
        "# Convert the sparse matrix to a pandas DataFrame to inspect the TF-IDF weights.\n",
        "doc_term_matrix = my_resume.todense()\n",
        "df_temp_resume = pd.DataFrame(doc_term_matrix, \n",
        "                  columns=bullet_vectorizer.get_feature_names()) \n",
        "df_temp_resume"
      ],
      "execution_count": 25,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>000</th>\n",
              "      <th>10</th>\n",
              "      <th>100mm</th>\n",
              "      <th>10pm</th>\n",
              "      <th>117</th>\n",
              "      <th>12</th>\n",
              "      <th>130</th>\n",
              "      <th>15</th>\n",
              "      <th>15am</th>\n",
              "      <th>15pm</th>\n",
              "      <th>20</th>\n",
              "      <th>200</th>\n",
              "      <th>24</th>\n",
              "      <th>25</th>\n",
              "      <th>2800</th>\n",
              "      <th>30</th>\n",
              "      <th>30am</th>\n",
              "      <th>30pm</th>\n",
              "      <th>3d</th>\n",
              "      <th>401</th>\n",
              "      <th>45pm</th>\n",
              "      <th>800</th>\n",
              "      <th>87653</th>\n",
              "      <th>87654</th>\n",
              "      <th>87655</th>\n",
              "      <th>abilities</th>\n",
              "      <th>ability</th>\n",
              "      <th>able</th>\n",
              "      <th>abreast</th>\n",
              "      <th>absence</th>\n",
              "      <th>academic</th>\n",
              "      <th>acceleration</th>\n",
              "      <th>acceptable</th>\n",
              "      <th>access</th>\n",
              "      <th>accessible</th>\n",
              "      <th>accomplishments</th>\n",
              "      <th>according</th>\n",
              "      <th>accountability</th>\n",
              "      <th>accredited</th>\n",
              "      <th>accuracy</th>\n",
              "      <th>...</th>\n",
              "      <th>vitae</th>\n",
              "      <th>vmware</th>\n",
              "      <th>voice</th>\n",
              "      <th>volume</th>\n",
              "      <th>volumes</th>\n",
              "      <th>walk</th>\n",
              "      <th>walnut</th>\n",
              "      <th>warehouse</th>\n",
              "      <th>way</th>\n",
              "      <th>ways</th>\n",
              "      <th>web</th>\n",
              "      <th>wed</th>\n",
              "      <th>week</th>\n",
              "      <th>weekday</th>\n",
              "      <th>weighing</th>\n",
              "      <th>weka</th>\n",
              "      <th>welcome</th>\n",
              "      <th>welfare</th>\n",
              "      <th>willing</th>\n",
              "      <th>willingness</th>\n",
              "      <th>windows</th>\n",
              "      <th>word</th>\n",
              "      <th>work</th>\n",
              "      <th>workers</th>\n",
              "      <th>workflow</th>\n",
              "      <th>workflows</th>\n",
              "      <th>working</th>\n",
              "      <th>works</th>\n",
              "      <th>world</th>\n",
              "      <th>wrangle</th>\n",
              "      <th>write</th>\n",
              "      <th>writing</th>\n",
              "      <th>written</th>\n",
              "      <th>wsj</th>\n",
              "      <th>xgboost</th>\n",
              "      <th>xponent</th>\n",
              "      <th>year</th>\n",
              "      <th>years</th>\n",
              "      <th>zeppelin</th>\n",
              "      <th>zr</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>...</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "      <td>0.0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>1 rows × 2050 columns</p>\n",
              "</div>"
            ],
            "text/plain": [
              "   000   10  100mm  10pm  117  ...  xponent  year  years  zeppelin   zr\n",
              "0  0.0  0.0    0.0   0.0  0.0  ...      0.0   0.0    0.0       0.0  0.0\n",
              "\n",
              "[1 rows x 2050 columns]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 25
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FmF_QPI6bNsV",
        "colab_type": "text"
      },
      "source": [
        "## 4. Reduce dimensions of the TFIDF vectors with SVD/LSA"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rknw9MDRg9pV",
        "colab_type": "text"
      },
      "source": [
        "### Downloaded Resume"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dsKssvVub-dB",
        "colab_type": "text"
      },
      "source": [
        "\n",
        "We use `n_components=100` because we found that this accounts for almost 50% of the explained variance in the data with the SVD, and it is the value suggested by the documentation. A lower value such as 50 also works and should make the 'topics' in the LSA a bit broader."
      ]
    },
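    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "One way to check this kind of claim is to look at the cumulative explained variance ratio of a fitted `TruncatedSVD`. The sketch below is only an illustration: it uses a random sparse matrix `X_demo` as a hypothetical stand-in for the TF-IDF matrix, so the printed percentages will not match the figure quoted above.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "from scipy.sparse import random as sparse_random\n",
        "from sklearn.decomposition import TruncatedSVD\n",
        "\n",
        "# Hypothetical stand-in for the TF-IDF matrix: 200 documents x 500 terms.\n",
        "X_demo = sparse_random(200, 500, density=0.05, format='csr', random_state=0)\n",
        "\n",
        "svd_demo = TruncatedSVD(n_components=100, random_state=0)\n",
        "svd_demo.fit(X_demo)\n",
        "\n",
        "# Running total of the variance captured by the first k components.\n",
        "cumulative = np.cumsum(svd_demo.explained_variance_ratio_)\n",
        "print(f'variance explained by 100 components: {cumulative[-1]:.2%}')\n",
        "print(f'variance explained by 50 components: {cumulative[49]:.2%}')\n",
        "```\n",
        "\n",
        "Running the same two `print` lines on an SVD fitted to the real TF-IDF matrix would show how much variance is given up by dropping from 100 to 50 components."
      ]
    },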
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oLWJY50bbxNh",
        "colab_type": "text"
      },
      "source": [
        "We also normalize the results of the SVD as recommended in the example:\n",
        "\n",
        "\"[TFIDF] Vectorizer results are normalized, which makes KMeans behave as spherical k-means for better results. Since LSA/SVD results are not normalized, we have to redo the normalization.\""
      ]
    },
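    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The reason re-normalizing matters can be seen numerically: for unit-length vectors, the squared Euclidean distance equals `2 - 2 * cosine similarity`, so k-means on normalized rows effectively groups points by cosine similarity. A minimal sketch, assuming a small random array `vectors` as a hypothetical stand-in for the SVD output:\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "from sklearn.preprocessing import Normalizer\n",
        "\n",
        "rng = np.random.default_rng(0)\n",
        "vectors = rng.normal(size=(5, 3))  # hypothetical SVD output, rows are not unit length\n",
        "\n",
        "# Scale every row to unit L2 norm, as the pipeline's Normalizer does.\n",
        "unit = Normalizer().fit_transform(vectors)\n",
        "row_norms = np.linalg.norm(unit, axis=1)\n",
        "\n",
        "# Compare squared Euclidean distance with 2 - 2 * cosine for one pair of rows.\n",
        "a, b = unit[0], unit[1]\n",
        "squared_distance = np.sum((a - b) ** 2)\n",
        "cosine = a @ b\n",
        "\n",
        "print(row_norms)\n",
        "print(squared_distance, 2 - 2 * cosine)\n",
        "```\n",
        "\n",
        "The two numbers on the last line should agree up to floating-point error, and every entry of `row_norms` should be 1."
      ]
    },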
    {
      "cell_type": "code",
      "metadata": {
        "id": "mnSEmgb-bw5Z",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "\n",
        "svd = TruncatedSVD(n_components=100)\n",
        "#lsa = svd.fit_transform(tfidf_skills)\n",
        "#norm = Normalizer().fit_transform(lsa)\n",
        "\n",
        "\n",
        "normalizer = Normalizer(copy=False)\n",
        "lsa = make_pipeline(svd, normalizer)\n",
        "norm = lsa.fit_transform(tfidf_skills)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "2vNsjHsaeUgT",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#norm"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "H0Fw2oclYpBW",
        "colab_type": "code",
        "outputId": "1b2af515-8015-4563-aa44-e5ce58f51295",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "print(np.shape(norm))"
      ],
      "execution_count": 28,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "(1123, 100)\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "05JD1EpYhBNz",
        "colab_type": "text"
      },
      "source": [
        "### Standard Resume"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "AbqAZFq_hD1m",
        "colab_type": "code",
        "outputId": "4f7fb8cd-b310-4ca6-cc6d-7343041f17e3",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 52
        }
      },
      "source": [
        "norm_standard = lsa.fit_transform(my_resume)"
      ],
      "execution_count": 29,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "/usr/local/lib/python3.6/dist-packages/sklearn/decomposition/_truncated_svd.py:194: RuntimeWarning: invalid value encountered in true_divide\n",
            "  self.explained_variance_ratio_ = exp_var / full_var\n"
          ],
          "name": "stderr"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "m_5uuPoSkTiB",
        "colab_type": "text"
      },
      "source": [
        "`fit_transform` refits the SVD on the resume alone. With just one sample there are no other samples to compare it with, so the algorithm cannot decompose it into a lower dimension, hence the warning above."
      ]
    },
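    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "A common alternative, not used in this notebook, is to fit the LSA pipeline on the job-posting bullets only and then call `transform` (rather than `fit_transform`) on the resume, which projects it into the same reduced space. The sketch below assumes a tiny made-up corpus and made-up names (`corpus`, `resume_text`, `lsa_demo`):\n",
        "\n",
        "```python\n",
        "from sklearn.decomposition import TruncatedSVD\n",
        "from sklearn.feature_extraction.text import TfidfVectorizer\n",
        "from sklearn.pipeline import make_pipeline\n",
        "from sklearn.preprocessing import Normalizer\n",
        "\n",
        "# Hypothetical stand-ins for the skill bullets and the resume text.\n",
        "corpus = [\n",
        "    'build machine learning models in python',\n",
        "    'write sql queries for data analysis',\n",
        "    'communicate results to business stakeholders',\n",
        "    'experience with statistics and data visualization',\n",
        "    'deploy models to production with docker',\n",
        "    'strong written and verbal communication skills',\n",
        "]\n",
        "resume_text = 'python developer with machine learning and sql experience'\n",
        "\n",
        "vectorizer = TfidfVectorizer()\n",
        "tfidf = vectorizer.fit_transform(corpus)\n",
        "\n",
        "# Fit SVD + normalization on the corpus only.\n",
        "lsa_demo = make_pipeline(TruncatedSVD(n_components=3, random_state=0),\n",
        "                         Normalizer(copy=False))\n",
        "corpus_lsa = lsa_demo.fit_transform(tfidf)\n",
        "\n",
        "# Reuse the fitted pipeline: transform, not fit_transform.\n",
        "resume_lsa = lsa_demo.transform(vectorizer.transform([resume_text]))\n",
        "\n",
        "print(corpus_lsa.shape, resume_lsa.shape)\n",
        "```\n",
        "\n",
        "Because the pipeline is not refitted, the resume ends up with the same number of columns as the corpus vectors."
      ]
    },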
    {
      "cell_type": "code",
      "metadata": {
        "id": "sCa2SMg9jgP_",
        "colab_type": "code",
        "outputId": "92c3679c-5b8f-4dd2-bb5a-7abe728cec6a",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "print(np.shape(norm_standard))"
      ],
      "execution_count": 30,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "(1, 1)\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TVaORR-jnuz_",
        "colab_type": "text"
      },
      "source": [
        "The standard resume does not need to be processed in this step."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "t3NAFPSHcF3k",
        "colab_type": "text"
      },
      "source": [
        "## 5. Use k-means to cluster the SVD-transformed data, and decide on an optimal number of clusters"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yGpvCdWNhKv4",
        "colab_type": "text"
      },
      "source": [
        "### Downloaded Resume"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gj3gmIRqc7Fs",
        "colab_type": "text"
      },
      "source": [
        "We fit k-means with 6 clusters. The cluster labels later go into a DataFrame that we use to index our TFIDF matrix for further processing."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "t1KR06Ngc9tJ",
        "colab_type": "code",
        "outputId": "0d6fcd52-b446-4b74-9fa7-9c39a71fcb9d",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 0
        }
      },
      "source": [
        "clusters = 6\n",
        "km = KMeans(n_clusters=clusters, random_state=42)\n",
        "km.fit(norm)\n",
        "#cluster_labels_df = pd.DataFrame({'Cluster': km.labels_})"
      ],
      "execution_count": 31,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,\n",
              "       n_clusters=6, n_init=10, n_jobs=None, precompute_distances='auto',\n",
              "       random_state=42, tol=0.0001, verbose=0)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 31
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JDKvC5SmhSqD",
        "colab_type": "text"
      },
      "source": [
        "### Standard Resume"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Sb6GbaCmhVO3",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Not required"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "U-6kGDmJjlEV",
        "colab_type": "text"
      },
      "source": [
        "## 6. Combine skill requirement cluster texts"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lk7NYdUkdGL7",
        "colab_type": "text"
      },
      "source": [
        "### Setting Things ... "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wiEOPq1Mbyk2",
        "colab_type": "text"
      },
      "source": [
        "Pair each bullet (each one was treated as an individual document for TF-IDF) with its corresponding cluster ID."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "F1VsFBQ0aWoK",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "labels_df = pd.DataFrame({'Bullet': bullet_points,'Cluster': km.labels_})"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "TxCQepEPai2B",
        "colab_type": "code",
        "outputId": "596c47c4-884d-4a4b-a852-bf33da88b6b4",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "labels_df.shape"
      ],
      "execution_count": 34,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(1123, 2)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 34
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pnKPWHvlamLH",
        "colab_type": "code",
        "outputId": "0f83f8e3-f285-455a-9f2b-6006058b7bb9",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 107
        }
      },
      "source": [
        "labels_df.head(2)"
      ],
      "execution_count": 35,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Bullet</th>\n",
              "      <th>Cluster</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Demonstrated ability to propose, initiate, and...</td>\n",
              "      <td>4</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Entrepreneurial abilities: demonstrated skills...</td>\n",
              "      <td>4</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                              Bullet  Cluster\n",
              "0  Demonstrated ability to propose, initiate, and...        4\n",
              "1  Entrepreneurial abilities: demonstrated skills...        4"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 35
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qrb60rGRXRVq",
        "colab_type": "text"
      },
      "source": [
        "Extracting clusters 0 and 2"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "92hqMsYmH5zn",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Group the bullets once, then pull out individual clusters by label\n",
        "clusters_gb = labels_df.groupby('Cluster')\n",
        "\n",
        "df_0 = clusters_gb.get_group(0)  # cluster 0\n",
        "df_2 = clusters_gb.get_group(2)  # cluster 2"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8ZQLKYvcW4V_",
        "colab_type": "text"
      },
      "source": [
        "### Converting the Standard Resume for Cosine Similarity"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "NufP2JlCEbBS",
        "colab_type": "code",
        "outputId": "fcca3eb1-880f-4f41-98f6-e57a3e05b9d7",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "np.shape(bullet_vectorizer.get_feature_names())"
      ],
      "execution_count": 37,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(2050,)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 37
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "XW3O2j0PC_M0",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "df_resume_skills = resume_to_skills(my_resume)\n",
        "#df_resume_skills"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "_XFrTB8-GdKV",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# transpose so that each skill (term) becomes a row\n",
        "df_resume_skills_transposed = df_resume_skills.T  # or df_resume_skills.transpose()\n",
        "#df_resume_skills_transposed"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BCjhNmhQgR0f",
        "colab_type": "text"
      },
      "source": [
        "## 7. Calculate cosine similarity between our resume skills and the skill requirement clusters."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qVFG0LmWTzpX",
        "colab_type": "code",
        "outputId": "32a838ff-1cdb-4e47-e6ca-d7d4b3844837",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 52
        }
      },
      "source": [
        "cluster_to_skills(df_0)"
      ],
      "execution_count": 40,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "#samples: 1, #features: 2050\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array(0.71436174)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 40
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MVmrjqB6jrYC",
        "colab_type": "text"
      },
      "source": [
        "Group by the Cluster column and apply `cluster_to_skills` to each group (split/apply/combine)."
      ]
    },
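    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The split/apply/combine step can be sketched without the notebook's helper functions. This is an illustration under stated assumptions, not the implementation of `cluster_to_skills`: `demo_df`, `resume_text`, and the vectorizer below are made up, and each cluster's bullets are simply joined into one document before it is compared with the resume.\n",
        "\n",
        "```python\n",
        "import pandas as pd\n",
        "from sklearn.feature_extraction.text import TfidfVectorizer\n",
        "from sklearn.metrics.pairwise import cosine_similarity\n",
        "\n",
        "# Hypothetical bullets with cluster labels, plus a hypothetical resume.\n",
        "demo_df = pd.DataFrame({\n",
        "    'Bullet': ['python and sql experience',\n",
        "               'strong communication skills',\n",
        "               'machine learning with python',\n",
        "               'written communication and teamwork'],\n",
        "    'Cluster': [0, 1, 0, 1],\n",
        "})\n",
        "resume_text = 'python machine learning and sql'\n",
        "\n",
        "# Fit the vocabulary on the bullets, then reuse it for the resume.\n",
        "vectorizer = TfidfVectorizer().fit(demo_df['Bullet'])\n",
        "resume_vec = vectorizer.transform([resume_text])\n",
        "\n",
        "# Split by cluster, join each cluster's bullets into one document.\n",
        "cluster_text = demo_df.groupby('Cluster')['Bullet'].apply(' '.join)\n",
        "cluster_vecs = vectorizer.transform(cluster_text)\n",
        "\n",
        "# One cosine similarity per cluster, indexed by cluster label.\n",
        "similarities = pd.Series(cosine_similarity(cluster_vecs, resume_vec).ravel(),\n",
        "                         index=cluster_text.index)\n",
        "print(similarities.sort_values())\n",
        "```\n",
        "\n",
        "Sorting in ascending order puts the least-similar cluster first, which is the cluster to look at when searching for skills missing from the resume."
      ]
    },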
    {
      "cell_type": "code",
      "metadata": {
        "id": "Pwqias0JLI4C",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "gb_df = labels_df.groupby('Cluster',as_index=False)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "S-K2R7_5N196",
        "colab_type": "code",
        "outputId": "48eef7ed-a296-45b0-bb3d-52cf4e60dc45",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 123
        }
      },
      "source": [
        "cos_result = gb_df.apply(cluster_to_skills)"
      ],
      "execution_count": 42,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "#samples: 1, #features: 2050\n",
            "#samples: 1, #features: 2050\n",
            "#samples: 1, #features: 2050\n",
            "#samples: 1, #features: 2050\n",
            "#samples: 1, #features: 2050\n",
            "#samples: 1, #features: 2050\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "_gBF6CN-Ycu3",
        "colab_type": "code",
        "outputId": "b8b09737-98be-439c-e40a-aabe8c4b7398",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 141
        }
      },
      "source": [
        "cos_result"
      ],
      "execution_count": 43,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "0    0.7143617445537217\n",
              "1    0.8811555211886989\n",
              "2    0.9769363805300825\n",
              "3    0.8506619288194702\n",
              "4     0.930230370636883\n",
              "5    0.9575181329314076\n",
              "dtype: object"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 43
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "71ypVKDQj2nc",
        "colab_type": "text"
      },
      "source": [
        "The output is a `pd.Series` with one cosine value per cluster (note the `object` dtype)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "NLeUKB0wYj0d",
        "colab_type": "code",
        "outputId": "9424256c-e92c-42ba-f551-9bf0d27aad4e",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "print(cos_result.shape)"
      ],
      "execution_count": 44,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "(6,)\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Iw0LuhShZSG8",
        "colab_type": "code",
        "outputId": "2e9eb250-cb68-415b-8778-c35c7f7f84b6",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "type(cos_result)"
      ],
      "execution_count": 45,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "pandas.core.series.Series"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 45
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sEUgBAAmj6qW",
        "colab_type": "text"
      },
      "source": [
        "Convert the `pd.Series` to a `pd.DataFrame`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "QUpGDYlxc2j8",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "df_cosine = cos_result.to_frame(name=\"cosine\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-jn-AWIHdCGH",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 227
        },
        "outputId": "15138c8e-53ad-4efa-b14b-7fb7ada1f4f0"
      },
      "source": [
        "df_cosine"
      ],
      "execution_count": 47,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>cosine</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>0.7143617445537217</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>0.8811555211886989</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>0.9769363805300825</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>0.8506619288194702</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>0.930230370636883</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>0.9575181329314076</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "               cosine\n",
              "0  0.7143617445537217\n",
              "1  0.8811555211886989\n",
              "2  0.9769363805300825\n",
              "3  0.8506619288194702\n",
              "4   0.930230370636883\n",
              "5  0.9575181329314076"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 47
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "QuvGtBSEdgo1",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "ca6ff383-53fb-49c7-96ad-890e7b0a530f"
      },
      "source": [
        "df_cosine.index"
      ],
      "execution_count": 48,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "RangeIndex(start=0, stop=6, step=1)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 48
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "at0MnYHOj-zT",
        "colab_type": "text"
      },
      "source": [
        "Add the cluster label as a column, copied from the index."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "oYlOVyJqdfKy",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Assumes the positional index (0-5) lines up with the sorted cluster labels\n",
        "df_cosine['cluster'] = df_cosine.index"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "cmT0fexMdsHM",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 227
        },
        "outputId": "a3406cd4-3576-4857-a3b9-54eafffc23f4"
      },
      "source": [
        "df_cosine"
      ],
      "execution_count": 50,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>cosine</th>\n",
              "      <th>cluster</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>0.7143617445537217</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>0.8811555211886989</td>\n",
              "      <td>1</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>0.9769363805300825</td>\n",
              "      <td>2</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>0.8506619288194702</td>\n",
              "      <td>3</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>0.930230370636883</td>\n",
              "      <td>4</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>0.9575181329314076</td>\n",
              "      <td>5</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "               cosine  cluster\n",
              "0  0.7143617445537217        0\n",
              "1  0.8811555211886989        1\n",
              "2  0.9769363805300825        2\n",
              "3  0.8506619288194702        3\n",
              "4   0.930230370636883        4\n",
              "5  0.9575181329314076        5"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 50
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gJRezKBIgZ9v",
        "colab_type": "text"
      },
      "source": [
        "## 8. Rank the skill clusters by similarity to our resume skill list"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "h2_UsWFMgbFd",
        "colab_type": "code",
        "outputId": "c5eb6701-248d-4537-c3b0-aa6d0c531d9d",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 227
        }
      },
      "source": [
        "# Most similar clusters first\n",
        "df_cosine.sort_values(by='cosine', ascending=False)"
      ],
      "execution_count": 51,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>cosine</th>\n",
              "      <th>cluster</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>0.9769363805300825</td>\n",
              "      <td>2</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>0.9575181329314076</td>\n",
              "      <td>5</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>0.930230370636883</td>\n",
              "      <td>4</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>0.8811555211886989</td>\n",
              "      <td>1</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>0.8506619288194702</td>\n",
              "      <td>3</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>0.7143617445537217</td>\n",
              "      <td>0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "               cosine  cluster\n",
              "2  0.9769363805300825        2\n",
              "5  0.9575181329314076        5\n",
              "4   0.930230370636883        4\n",
              "1  0.8811555211886989        1\n",
              "3  0.8506619288194702        3\n",
              "0  0.7143617445537217        0"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 51
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kpNMv82YlVOZ",
        "colab_type": "text"
      },
      "source": [
        "The highest similarity is with cluster 2 (cosine of about 0.977); the lowest is with cluster 0 (about 0.714)."
      ]
    },
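    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "Because the objective targets the least-similar clusters, the same ranking can also be read from the bottom. The cell below is a sketch that rebuilds the table from the cosine values printed above and picks the lowest-scoring clusters; `df_cosine_sketch`, `least_similar`, and the cutoff of 2 clusters are assumptions made for illustration."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import pandas as pd\n",
        "\n",
        "# Cosine values copied from the ranking output above\n",
        "df_cosine_sketch = pd.DataFrame({\n",
        "    'cluster': [0, 1, 2, 3, 4, 5],\n",
        "    'cosine': [0.7143617445537217, 0.8811555211886989, 0.9769363805300825,\n",
        "               0.8506619288194702, 0.930230370636883, 0.9575181329314076],\n",
        "})\n",
        "\n",
        "# Make sure the column is numeric before ranking\n",
        "df_cosine_sketch['cosine'] = df_cosine_sketch['cosine'].astype(float)\n",
        "\n",
        "# The two least-similar clusters (the cutoff of 2 is an arbitrary choice)\n",
        "least_similar = df_cosine_sketch.nsmallest(2, 'cosine')\n",
        "least_similar"
      ],
      "execution_count": 0,
      "outputs": []
    },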
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jbD5rcLGgdz8",
        "colab_type": "text"
      },
      "source": [
        "## 9. Visualize the results of how similar the clusters are to our skills."
      ]
    },
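    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "One possible visualization is a bar chart of cosine similarity per cluster. The cell below is a sketch only: it assumes matplotlib is available and rebuilds the values from the ranking output above under the hypothetical name `df_plot` rather than reusing `df_cosine`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import matplotlib.pyplot as plt\n",
        "import pandas as pd\n",
        "\n",
        "# Cosine values copied from the ranking output above\n",
        "df_plot = pd.DataFrame({\n",
        "    'cluster': [0, 1, 2, 3, 4, 5],\n",
        "    'cosine': [0.7143617445537217, 0.8811555211886989, 0.9769363805300825,\n",
        "               0.8506619288194702, 0.930230370636883, 0.9575181329314076],\n",
        "}).sort_values('cosine', ascending=False)\n",
        "\n",
        "fig, ax = plt.subplots(figsize=(6, 3))\n",
        "# Cast labels to str so the bars keep the sorted order on a categorical axis\n",
        "ax.bar(df_plot['cluster'].astype(str), df_plot['cosine'])\n",
        "ax.set_xlabel('Cluster')\n",
        "ax.set_ylabel('Cosine similarity to resume')\n",
        "ax.set_ylim(0, 1)\n",
        "ax.set_title('Resume similarity by skill cluster')\n",
        "plt.show()"
      ],
      "execution_count": 0,
      "outputs": []
    },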
    {
      "cell_type": "code",
      "metadata": {
        "id": "UzDrHYaYkFuN",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# TBD"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FmWoieUKggwb",
        "colab_type": "text"
      },
      "source": [
        "## 10. Determine the skills that are missing from our resume that we could learn and/or add to our resume."
      ]
    },
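    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "A simple way to approach this step is to check which skill keywords appear in a low-similarity cluster but not in the resume text. The cell below is a sketch on made-up data: `cluster_skills`, `resume_text`, and `keywords` are hypothetical stand-ins, and in the real notebook they would have to come from the clustered skill lines and the actual resume."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Made-up inputs for illustration only\n",
        "cluster_skills = [\n",
        "    'Experience with Spark and Hadoop',\n",
        "    'Knowledge of SQL',\n",
        "    'Deep learning with TensorFlow',\n",
        "]\n",
        "resume_text = 'Python, SQL, pandas, scikit-learn'\n",
        "keywords = ['spark', 'hadoop', 'sql', 'tensorflow']\n",
        "\n",
        "resume_lower = resume_text.lower()\n",
        "cluster_lower = [s.lower() for s in cluster_skills]\n",
        "\n",
        "# Keep keywords that occur in the cluster's skill lines but not in the resume\n",
        "missing = sorted(\n",
        "    k for k in keywords\n",
        "    if k not in resume_lower and any(k in s for s in cluster_lower)\n",
        ")\n",
        "missing"
      ],
      "execution_count": 0,
      "outputs": []
    },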
    {
      "cell_type": "code",
      "metadata": {
        "id": "7WaR6Js-ghwI",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# TBD"
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}