@dannymorris
Created May 18, 2021 18:36
EDM-SparkML-Clustering.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "EDM-SparkML-Clustering.ipynb",
"provenance": [],
"toc_visible": true,
"authorship_tag": "ABX9TyPd2LB0mK3Ru2UrcbGM/t/J",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/dannymorris/dd19a7d6a7132e7d7470df13d0da75bd/edm-sparkml-clustering.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Lhme3LFAYEj_"
},
"source": [
"## Overview\n",
"\n",
"This notebook implements K-Means clustering using PySpark and MLlib, the Spark Machine Learning API. \n",
"\n",
"The steps taken in this notebook include the following:\n",
"\n",
"- Install Spark and PySpark\n",
"- Create a SparkSession\n",
"- Read a CSV file from the web and load into Spark\n",
"- Select features for clustering\n",
"- Assemble an [ML Pipeline](https://spark.apache.org/docs/latest/ml-pipeline.html) that defines the clustering workflow, including:\n",
" - Assemble the features into a vector\n",
" - Scale the features to have mean=0 and sd=1\n",
" - Initialize the K-Means algorithm\n",
"- Fit the ML Pipeline to the training data\n",
"- Generate predictions (i.e. cluster labels) for the training data\n",
"- Compute Silhouette score to evalute the fit\n",
"- Compute the cluster centers\n",
"- Compute cluster frequencies\n",
"- Define a reusable function for testing multiple values for *K*"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Uaiz0fDEL1x-"
},
"source": [
"# Install PySpark"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "-iOTzB1gd7LE",
"outputId": "0857f0b0-e379-472d-921a-c8ede2ec8885"
},
"source": [
"!pip install pyspark"
],
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": [
"Collecting pyspark\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/45/b0/9d6860891ab14a39d4bddf80ba26ce51c2f9dc4805e5c6978ac0472c120a/pyspark-3.1.1.tar.gz (212.3MB)\n",
"\u001b[K |████████████████████████████████| 212.3MB 69kB/s \n",
"\u001b[?25hCollecting py4j==0.10.9\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)\n",
"\u001b[K |████████████████████████████████| 204kB 17.8MB/s \n",
"\u001b[?25hBuilding wheels for collected packages: pyspark\n",
" Building wheel for pyspark (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for pyspark: filename=pyspark-3.1.1-py2.py3-none-any.whl size=212767604 sha256=942148c17b0652ab8a0c9ad8a5a592fa09447a2993cad8f35ee22e3026f761d2\n",
" Stored in directory: /root/.cache/pip/wheels/0b/90/c0/01de724414ef122bd05f056541fb6a0ecf47c7ca655f8b3c0f\n",
"Successfully built pyspark\n",
"Installing collected packages: py4j, pyspark\n",
"Successfully installed py4j-0.10.9 pyspark-3.1.1\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oeBvkC9UMyok"
},
"source": [
"## Load packages and modules"
]
},
{
"cell_type": "code",
"metadata": {
"id": "mLUKdc60jkBz"
},
"source": [
"from pyspark.sql import SparkSession\n",
"from pyspark import SparkFiles\n",
"from pyspark.ml import Pipeline\n",
"from pyspark.ml.feature import VectorAssembler\n",
"from pyspark.ml.feature import StandardScaler\n",
"from pyspark.ml.clustering import KMeans\n",
"from pyspark.ml.evaluation import ClusteringEvaluator\n",
"import pandas as pd"
],
"execution_count": 12,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "FMS5JD-kPQsI"
},
"source": [
"## Create SparkSession\n",
"\n",
"`SparkSession` provides a single point of entry to interact with underlying Spark functionality and allows programming Spark with DataFrame and Dataset APIs."
]
},
{
"cell_type": "code",
"metadata": {
"id": "9Gz1Z6ici7KB"
},
"source": [
"spark = SparkSession. \\\n",
" builder. \\\n",
" appName(\"MLlib Clustering\"). \\\n",
" getOrCreate()"
],
"execution_count": 7,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "lg_Hnxw8Pgu1"
},
"source": [
"## Read data\n",
"\n",
"In this example, the data resides on the internet in CSV format. Use `addFile` to load the file into Spark and read as a DataFrame using `read.csv`."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 222
},
"id": "M3uGR6qaieqF",
"outputId": "4cbc48a0-dc58-4ef3-dcd3-daeca27440b6"
},
"source": [
"url = \"https://gist.githubusercontent.com/dannymorris/1bd95ddda1cfe7fd518e5cda01f4ac03/raw/295636f511f0afbd544aece7cbfe771edc01182c/county_data.csv\"\n",
"\n",
"# upload file to Spark\n",
"spark.sparkContext.addFile(url)\n",
"\n",
"# Read file\n",
"df = spark.read.csv(\"file://\"+SparkFiles.get(\"county_data.csv\"), header=True, inferSchema= True)\n",
"\n",
"df.limit(5).toPandas().head()"
],
"execution_count": 74,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>fips</th>\n",
" <th>state</th>\n",
" <th>name</th>\n",
" <th>party_winner</th>\n",
" <th>trump_pct</th>\n",
" <th>margin</th>\n",
" <th>POPULATION_Total</th>\n",
" <th>AGE_18_29</th>\n",
" <th>AGE_30_44</th>\n",
" <th>AGE_45_59</th>\n",
" <th>AGE_60_Plus</th>\n",
" <th>AGE_18_Plus</th>\n",
" <th>RACE_Total__Asian_alone</th>\n",
" <th>RACE_Total__Black_or_African_American_alone</th>\n",
" <th>RACE_Total__Hispanic_or_Latino</th>\n",
" <th>RACE_Total__White_alone</th>\n",
" <th>GINI_Gini_Index</th>\n",
" <th>INCOME_PER_CAPITA_INCOME_IN_THE_PAST_12_MONTHS__IN_2018_INFLATION_ADJUSTED_DOLLARS_</th>\n",
" <th>UNEMPLOY__16_YEARS_AND_OVER__ASIAN_ALONE_</th>\n",
" <th>UNEMPLOY__16_YEARS_AND_OVER__BLACK_OR_AFRICAN_AMERICAN_ALONE_</th>\n",
" <th>UNEMPLOY__16_YEARS_AND_OVER__HISPANIC_OR_LATINO_</th>\n",
" <th>UNEMPLOY__16_YEARS_AND_OVER__WHITE_ALONE_</th>\n",
" <th>UNEMPLOY_Total_16_YEARS_AND_OVER</th>\n",
" <th>EDU_ATTAIN_Total__Bachelor_s_degree_or_higher</th>\n",
" <th>EDU_ATTAIN_Total__High_school_graduate__includes_equivalency_</th>\n",
" <th>EDU_ATTAIN_Total__Less_than_high_school_diploma</th>\n",
" <th>EDU_ATTAIN_Total__Some_college_or_associate_s_degree</th>\n",
" <th>INDUSTRY_Total__Agriculture__forestry__fishing_and_hunting__and_mining__Agriculture__forestry__fishing_and_hunting</th>\n",
" <th>INDUSTRY_Total__Agriculture__forestry__fishing_and_hunting__and_mining__Mining__quarrying__and_oil_and_gas_extraction</th>\n",
" <th>INDUSTRY_Total__Arts__entertainment__and_recreation__and_accommodation_and_food_services__Accommodation_and_food_services</th>\n",
" <th>INDUSTRY_Total__Arts__entertainment__and_recreation__and_accommodation_and_food_services__Arts__entertainment__and_recreation</th>\n",
" <th>INDUSTRY_Total__Construction</th>\n",
" <th>INDUSTRY_Total__Educational_services__and_health_care_and_social_assistance__Educational_services</th>\n",
" <th>INDUSTRY_Total__Educational_services__and_health_care_and_social_assistance__Health_care_and_social_assistance</th>\n",
" <th>INDUSTRY_Total__Information</th>\n",
" <th>INDUSTRY_Total__Manufacturing</th>\n",
" <th>INDUSTRY_Total__Professional__scientific__and_management__and_administrative__and_waste_management_services__Administrative_and_support_and_waste_management_services</th>\n",
" <th>INDUSTRY_Total__Professional__scientific__and_management__and_administrative__and_waste_management_services__Management_of_companies_and_enterprises</th>\n",
" <th>INDUSTRY_Total__Professional__scientific__and_management__and_administrative__and_waste_management_services__Professional__scientific__and_technical_services</th>\n",
" <th>INDUSTRY_Total__Public_administration</th>\n",
" <th>INDUSTRY_Total__Retail_trade</th>\n",
" <th>CITIZEN_Estimate__Total__Not_a_U_S__citizen</th>\n",
" <th>CITIZEN_Estimate__Total__U_S__citizen_by_naturalization</th>\n",
" <th>CITIZEN_Estimate__Total__U_S__citizen__born_in_the_United_States</th>\n",
" <th>HEALTH_INSURANCE_No_health_insurance_coverage</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_direct_purchase_health_insurance_only</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_employer_based_health_insurance_only</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_Medicaid_means_tested_public_coverage_only</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_Medicare_coverage_only</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_TRICARE_military_health_coverage_only</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_VA_Health_Care_only</th>\n",
" <th>VETERANS_Estimate__Total__Veteran</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>48001</td>\n",
" <td>texas</td>\n",
" <td>Anderson</td>\n",
" <td>republican</td>\n",
" <td>0.786119</td>\n",
" <td>0.580355</td>\n",
" <td>57863</td>\n",
" <td>0.192527</td>\n",
" <td>0.308352</td>\n",
" <td>0.246227</td>\n",
" <td>0.252894</td>\n",
" <td>46648</td>\n",
" <td>0.005513</td>\n",
" <td>0.209823</td>\n",
" <td>0.175276</td>\n",
" <td>0.735340</td>\n",
" <td>0.4225</td>\n",
" <td>16868</td>\n",
" <td>0.0</td>\n",
" <td>0.004287</td>\n",
" <td>0.002272</td>\n",
" <td>0.007760</td>\n",
" <td>0.014320</td>\n",
" <td>0.105299</td>\n",
" <td>0.359608</td>\n",
" <td>0.232636</td>\n",
" <td>0.308481</td>\n",
" <td>0.007846</td>\n",
" <td>0.028490</td>\n",
" <td>0.022359</td>\n",
" <td>0.002165</td>\n",
" <td>0.022380</td>\n",
" <td>0.029905</td>\n",
" <td>0.057473</td>\n",
" <td>0.002037</td>\n",
" <td>0.025424</td>\n",
" <td>0.021373</td>\n",
" <td>0.000000</td>\n",
" <td>0.010912</td>\n",
" <td>0.038608</td>\n",
" <td>0.077860</td>\n",
" <td>0.041546</td>\n",
" <td>0.020946</td>\n",
" <td>0.928400</td>\n",
" <td>0.114324</td>\n",
" <td>0.028040</td>\n",
" <td>0.306080</td>\n",
" <td>0.028040</td>\n",
" <td>0.065833</td>\n",
" <td>0.005638</td>\n",
" <td>0.006088</td>\n",
" <td>0.086006</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>48003</td>\n",
" <td>texas</td>\n",
" <td>Andrews</td>\n",
" <td>republican</td>\n",
" <td>0.843084</td>\n",
" <td>0.698107</td>\n",
" <td>17818</td>\n",
" <td>0.245792</td>\n",
" <td>0.287747</td>\n",
" <td>0.246605</td>\n",
" <td>0.219855</td>\n",
" <td>12299</td>\n",
" <td>0.003536</td>\n",
" <td>0.019811</td>\n",
" <td>0.560052</td>\n",
" <td>0.924009</td>\n",
" <td>0.4506</td>\n",
" <td>31190</td>\n",
" <td>0.0</td>\n",
" <td>0.012196</td>\n",
" <td>0.015611</td>\n",
" <td>0.018863</td>\n",
" <td>0.046670</td>\n",
" <td>0.102529</td>\n",
" <td>0.461338</td>\n",
" <td>0.368810</td>\n",
" <td>0.319457</td>\n",
" <td>0.008537</td>\n",
" <td>0.167900</td>\n",
" <td>0.035125</td>\n",
" <td>0.003659</td>\n",
" <td>0.051955</td>\n",
" <td>0.036751</td>\n",
" <td>0.065696</td>\n",
" <td>0.009513</td>\n",
" <td>0.036100</td>\n",
" <td>0.022766</td>\n",
" <td>0.003252</td>\n",
" <td>0.014798</td>\n",
" <td>0.013172</td>\n",
" <td>0.077323</td>\n",
" <td>0.098047</td>\n",
" <td>0.047368</td>\n",
" <td>0.850208</td>\n",
" <td>0.175055</td>\n",
" <td>0.043256</td>\n",
" <td>0.511342</td>\n",
" <td>0.030084</td>\n",
" <td>0.055940</td>\n",
" <td>0.000000</td>\n",
" <td>0.003334</td>\n",
" <td>0.055696</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>48005</td>\n",
" <td>texas</td>\n",
" <td>Angelina</td>\n",
" <td>republican</td>\n",
" <td>0.723981</td>\n",
" <td>0.460148</td>\n",
" <td>87607</td>\n",
" <td>0.208537</td>\n",
" <td>0.245155</td>\n",
" <td>0.258604</td>\n",
" <td>0.287704</td>\n",
" <td>64914</td>\n",
" <td>0.011495</td>\n",
" <td>0.147956</td>\n",
" <td>0.218864</td>\n",
" <td>0.791855</td>\n",
" <td>0.4495</td>\n",
" <td>22322</td>\n",
" <td>0.0</td>\n",
" <td>0.012416</td>\n",
" <td>0.009582</td>\n",
" <td>0.032027</td>\n",
" <td>0.054025</td>\n",
" <td>0.156638</td>\n",
" <td>0.314755</td>\n",
" <td>0.215177</td>\n",
" <td>0.306251</td>\n",
" <td>0.011092</td>\n",
" <td>0.010768</td>\n",
" <td>0.044012</td>\n",
" <td>0.005099</td>\n",
" <td>0.037681</td>\n",
" <td>0.049589</td>\n",
" <td>0.092831</td>\n",
" <td>0.002927</td>\n",
" <td>0.064624</td>\n",
" <td>0.024525</td>\n",
" <td>0.000308</td>\n",
" <td>0.018578</td>\n",
" <td>0.023323</td>\n",
" <td>0.073836</td>\n",
" <td>0.058420</td>\n",
" <td>0.026197</td>\n",
" <td>0.909745</td>\n",
" <td>0.204409</td>\n",
" <td>0.058092</td>\n",
" <td>0.351111</td>\n",
" <td>0.057784</td>\n",
" <td>0.073559</td>\n",
" <td>0.001833</td>\n",
" <td>0.006085</td>\n",
" <td>0.091336</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>48007</td>\n",
" <td>texas</td>\n",
" <td>Aransas</td>\n",
" <td>republican</td>\n",
" <td>0.751811</td>\n",
" <td>0.514525</td>\n",
" <td>24763</td>\n",
" <td>0.147476</td>\n",
" <td>0.171872</td>\n",
" <td>0.252445</td>\n",
" <td>0.428208</td>\n",
" <td>20044</td>\n",
" <td>0.019707</td>\n",
" <td>0.015386</td>\n",
" <td>0.272826</td>\n",
" <td>0.892622</td>\n",
" <td>0.5351</td>\n",
" <td>30939</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.006186</td>\n",
" <td>0.031581</td>\n",
" <td>0.037767</td>\n",
" <td>0.211684</td>\n",
" <td>0.299242</td>\n",
" <td>0.185242</td>\n",
" <td>0.339453</td>\n",
" <td>0.005887</td>\n",
" <td>0.018010</td>\n",
" <td>0.057025</td>\n",
" <td>0.023249</td>\n",
" <td>0.062862</td>\n",
" <td>0.043205</td>\n",
" <td>0.052784</td>\n",
" <td>0.002644</td>\n",
" <td>0.025244</td>\n",
" <td>0.019407</td>\n",
" <td>0.000000</td>\n",
" <td>0.022051</td>\n",
" <td>0.030682</td>\n",
" <td>0.054131</td>\n",
" <td>0.039979</td>\n",
" <td>0.031579</td>\n",
" <td>0.914671</td>\n",
" <td>0.206945</td>\n",
" <td>0.079824</td>\n",
" <td>0.252744</td>\n",
" <td>0.026542</td>\n",
" <td>0.108162</td>\n",
" <td>0.007334</td>\n",
" <td>0.004690</td>\n",
" <td>0.128916</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>48009</td>\n",
" <td>texas</td>\n",
" <td>Archer</td>\n",
" <td>republican</td>\n",
" <td>0.896580</td>\n",
" <td>0.803586</td>\n",
" <td>8789</td>\n",
" <td>0.163153</td>\n",
" <td>0.208085</td>\n",
" <td>0.281809</td>\n",
" <td>0.346954</td>\n",
" <td>6877</td>\n",
" <td>0.005348</td>\n",
" <td>0.009216</td>\n",
" <td>0.082717</td>\n",
" <td>0.948231</td>\n",
" <td>0.4316</td>\n",
" <td>31806</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.002472</td>\n",
" <td>0.020503</td>\n",
" <td>0.022975</td>\n",
" <td>0.210121</td>\n",
" <td>0.305511</td>\n",
" <td>0.096990</td>\n",
" <td>0.310310</td>\n",
" <td>0.034026</td>\n",
" <td>0.028210</td>\n",
" <td>0.014977</td>\n",
" <td>0.008579</td>\n",
" <td>0.046968</td>\n",
" <td>0.042751</td>\n",
" <td>0.123164</td>\n",
" <td>0.006107</td>\n",
" <td>0.049149</td>\n",
" <td>0.021085</td>\n",
" <td>0.000000</td>\n",
" <td>0.027774</td>\n",
" <td>0.033881</td>\n",
" <td>0.080413</td>\n",
" <td>0.019229</td>\n",
" <td>0.006713</td>\n",
" <td>0.971100</td>\n",
" <td>0.127963</td>\n",
" <td>0.077359</td>\n",
" <td>0.422132</td>\n",
" <td>0.025447</td>\n",
" <td>0.073724</td>\n",
" <td>0.011778</td>\n",
" <td>0.000145</td>\n",
" <td>0.090883</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" fips ... VETERANS_Estimate__Total__Veteran\n",
"0 48001 ... 0.086006\n",
"1 48003 ... 0.055696\n",
"2 48005 ... 0.091336\n",
"3 48007 ... 0.128916\n",
"4 48009 ... 0.090883\n",
"\n",
"[5 rows x 52 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 74
}
]
},
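{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check that is not part of the original workflow, the sketch below inspects what was loaded. It assumes the `df` created in the previous cell: `printSchema` lists the column types chosen by `inferSchema=True`, and `count` returns the number of rows."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Optional checks on the loaded DataFrame (assumes `df` from the cell above)\n",
"df.printSchema()                   # column names and inferred types\n",
"print(df.count(), \"rows\")          # number of rows read from the CSV\n",
"print(len(df.columns), \"columns\")  # number of columns"
],
"execution_count": null,
"outputs": []
},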
{
"cell_type": "markdown",
"metadata": {
"id": "1bcZqbGVQJYa"
},
"source": [
"## Select columns for clustering\n",
"\n",
"Use `.select` to select columns by name."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 222
},
"id": "UDMlKUGrkLjU",
"outputId": "d8aa1a25-ff02-4729-f5b5-a0957965bf96"
},
"source": [
"col_tags = ('AGE', 'RACE', 'GINI', 'INCOME', 'UNEMPLOY', 'EDU', 'INDUSTRY', 'CITIZEN', 'HEALTH', 'VETERANS')\n",
"\n",
"cols = list(filter(lambda x: x.startswith(col_tags), df.columns))\n",
"\n",
"cluster_df = df.select(cols)\n",
"\n",
"cluster_df.limit(5).toPandas().head()"
],
"execution_count": 75,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>AGE_18_29</th>\n",
" <th>AGE_30_44</th>\n",
" <th>AGE_45_59</th>\n",
" <th>AGE_60_Plus</th>\n",
" <th>AGE_18_Plus</th>\n",
" <th>RACE_Total__Asian_alone</th>\n",
" <th>RACE_Total__Black_or_African_American_alone</th>\n",
" <th>RACE_Total__Hispanic_or_Latino</th>\n",
" <th>RACE_Total__White_alone</th>\n",
" <th>GINI_Gini_Index</th>\n",
" <th>INCOME_PER_CAPITA_INCOME_IN_THE_PAST_12_MONTHS__IN_2018_INFLATION_ADJUSTED_DOLLARS_</th>\n",
" <th>UNEMPLOY__16_YEARS_AND_OVER__ASIAN_ALONE_</th>\n",
" <th>UNEMPLOY__16_YEARS_AND_OVER__BLACK_OR_AFRICAN_AMERICAN_ALONE_</th>\n",
" <th>UNEMPLOY__16_YEARS_AND_OVER__HISPANIC_OR_LATINO_</th>\n",
" <th>UNEMPLOY__16_YEARS_AND_OVER__WHITE_ALONE_</th>\n",
" <th>UNEMPLOY_Total_16_YEARS_AND_OVER</th>\n",
" <th>EDU_ATTAIN_Total__Bachelor_s_degree_or_higher</th>\n",
" <th>EDU_ATTAIN_Total__High_school_graduate__includes_equivalency_</th>\n",
" <th>EDU_ATTAIN_Total__Less_than_high_school_diploma</th>\n",
" <th>EDU_ATTAIN_Total__Some_college_or_associate_s_degree</th>\n",
" <th>INDUSTRY_Total__Agriculture__forestry__fishing_and_hunting__and_mining__Agriculture__forestry__fishing_and_hunting</th>\n",
" <th>INDUSTRY_Total__Agriculture__forestry__fishing_and_hunting__and_mining__Mining__quarrying__and_oil_and_gas_extraction</th>\n",
" <th>INDUSTRY_Total__Arts__entertainment__and_recreation__and_accommodation_and_food_services__Accommodation_and_food_services</th>\n",
" <th>INDUSTRY_Total__Arts__entertainment__and_recreation__and_accommodation_and_food_services__Arts__entertainment__and_recreation</th>\n",
" <th>INDUSTRY_Total__Construction</th>\n",
" <th>INDUSTRY_Total__Educational_services__and_health_care_and_social_assistance__Educational_services</th>\n",
" <th>INDUSTRY_Total__Educational_services__and_health_care_and_social_assistance__Health_care_and_social_assistance</th>\n",
" <th>INDUSTRY_Total__Information</th>\n",
" <th>INDUSTRY_Total__Manufacturing</th>\n",
" <th>INDUSTRY_Total__Professional__scientific__and_management__and_administrative__and_waste_management_services__Administrative_and_support_and_waste_management_services</th>\n",
" <th>INDUSTRY_Total__Professional__scientific__and_management__and_administrative__and_waste_management_services__Management_of_companies_and_enterprises</th>\n",
" <th>INDUSTRY_Total__Professional__scientific__and_management__and_administrative__and_waste_management_services__Professional__scientific__and_technical_services</th>\n",
" <th>INDUSTRY_Total__Public_administration</th>\n",
" <th>INDUSTRY_Total__Retail_trade</th>\n",
" <th>CITIZEN_Estimate__Total__Not_a_U_S__citizen</th>\n",
" <th>CITIZEN_Estimate__Total__U_S__citizen_by_naturalization</th>\n",
" <th>CITIZEN_Estimate__Total__U_S__citizen__born_in_the_United_States</th>\n",
" <th>HEALTH_INSURANCE_No_health_insurance_coverage</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_direct_purchase_health_insurance_only</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_employer_based_health_insurance_only</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_Medicaid_means_tested_public_coverage_only</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_Medicare_coverage_only</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_TRICARE_military_health_coverage_only</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_VA_Health_Care_only</th>\n",
" <th>VETERANS_Estimate__Total__Veteran</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.192527</td>\n",
" <td>0.308352</td>\n",
" <td>0.246227</td>\n",
" <td>0.252894</td>\n",
" <td>46648</td>\n",
" <td>0.005513</td>\n",
" <td>0.209823</td>\n",
" <td>0.175276</td>\n",
" <td>0.735340</td>\n",
" <td>0.4225</td>\n",
" <td>16868</td>\n",
" <td>0.0</td>\n",
" <td>0.004287</td>\n",
" <td>0.002272</td>\n",
" <td>0.007760</td>\n",
" <td>0.014320</td>\n",
" <td>0.105299</td>\n",
" <td>0.359608</td>\n",
" <td>0.232636</td>\n",
" <td>0.308481</td>\n",
" <td>0.007846</td>\n",
" <td>0.028490</td>\n",
" <td>0.022359</td>\n",
" <td>0.002165</td>\n",
" <td>0.022380</td>\n",
" <td>0.029905</td>\n",
" <td>0.057473</td>\n",
" <td>0.002037</td>\n",
" <td>0.025424</td>\n",
" <td>0.021373</td>\n",
" <td>0.000000</td>\n",
" <td>0.010912</td>\n",
" <td>0.038608</td>\n",
" <td>0.077860</td>\n",
" <td>0.041546</td>\n",
" <td>0.020946</td>\n",
" <td>0.928400</td>\n",
" <td>0.114324</td>\n",
" <td>0.028040</td>\n",
" <td>0.306080</td>\n",
" <td>0.028040</td>\n",
" <td>0.065833</td>\n",
" <td>0.005638</td>\n",
" <td>0.006088</td>\n",
" <td>0.086006</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.245792</td>\n",
" <td>0.287747</td>\n",
" <td>0.246605</td>\n",
" <td>0.219855</td>\n",
" <td>12299</td>\n",
" <td>0.003536</td>\n",
" <td>0.019811</td>\n",
" <td>0.560052</td>\n",
" <td>0.924009</td>\n",
" <td>0.4506</td>\n",
" <td>31190</td>\n",
" <td>0.0</td>\n",
" <td>0.012196</td>\n",
" <td>0.015611</td>\n",
" <td>0.018863</td>\n",
" <td>0.046670</td>\n",
" <td>0.102529</td>\n",
" <td>0.461338</td>\n",
" <td>0.368810</td>\n",
" <td>0.319457</td>\n",
" <td>0.008537</td>\n",
" <td>0.167900</td>\n",
" <td>0.035125</td>\n",
" <td>0.003659</td>\n",
" <td>0.051955</td>\n",
" <td>0.036751</td>\n",
" <td>0.065696</td>\n",
" <td>0.009513</td>\n",
" <td>0.036100</td>\n",
" <td>0.022766</td>\n",
" <td>0.003252</td>\n",
" <td>0.014798</td>\n",
" <td>0.013172</td>\n",
" <td>0.077323</td>\n",
" <td>0.098047</td>\n",
" <td>0.047368</td>\n",
" <td>0.850208</td>\n",
" <td>0.175055</td>\n",
" <td>0.043256</td>\n",
" <td>0.511342</td>\n",
" <td>0.030084</td>\n",
" <td>0.055940</td>\n",
" <td>0.000000</td>\n",
" <td>0.003334</td>\n",
" <td>0.055696</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.208537</td>\n",
" <td>0.245155</td>\n",
" <td>0.258604</td>\n",
" <td>0.287704</td>\n",
" <td>64914</td>\n",
" <td>0.011495</td>\n",
" <td>0.147956</td>\n",
" <td>0.218864</td>\n",
" <td>0.791855</td>\n",
" <td>0.4495</td>\n",
" <td>22322</td>\n",
" <td>0.0</td>\n",
" <td>0.012416</td>\n",
" <td>0.009582</td>\n",
" <td>0.032027</td>\n",
" <td>0.054025</td>\n",
" <td>0.156638</td>\n",
" <td>0.314755</td>\n",
" <td>0.215177</td>\n",
" <td>0.306251</td>\n",
" <td>0.011092</td>\n",
" <td>0.010768</td>\n",
" <td>0.044012</td>\n",
" <td>0.005099</td>\n",
" <td>0.037681</td>\n",
" <td>0.049589</td>\n",
" <td>0.092831</td>\n",
" <td>0.002927</td>\n",
" <td>0.064624</td>\n",
" <td>0.024525</td>\n",
" <td>0.000308</td>\n",
" <td>0.018578</td>\n",
" <td>0.023323</td>\n",
" <td>0.073836</td>\n",
" <td>0.058420</td>\n",
" <td>0.026197</td>\n",
" <td>0.909745</td>\n",
" <td>0.204409</td>\n",
" <td>0.058092</td>\n",
" <td>0.351111</td>\n",
" <td>0.057784</td>\n",
" <td>0.073559</td>\n",
" <td>0.001833</td>\n",
" <td>0.006085</td>\n",
" <td>0.091336</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.147476</td>\n",
" <td>0.171872</td>\n",
" <td>0.252445</td>\n",
" <td>0.428208</td>\n",
" <td>20044</td>\n",
" <td>0.019707</td>\n",
" <td>0.015386</td>\n",
" <td>0.272826</td>\n",
" <td>0.892622</td>\n",
" <td>0.5351</td>\n",
" <td>30939</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.006186</td>\n",
" <td>0.031581</td>\n",
" <td>0.037767</td>\n",
" <td>0.211684</td>\n",
" <td>0.299242</td>\n",
" <td>0.185242</td>\n",
" <td>0.339453</td>\n",
" <td>0.005887</td>\n",
" <td>0.018010</td>\n",
" <td>0.057025</td>\n",
" <td>0.023249</td>\n",
" <td>0.062862</td>\n",
" <td>0.043205</td>\n",
" <td>0.052784</td>\n",
" <td>0.002644</td>\n",
" <td>0.025244</td>\n",
" <td>0.019407</td>\n",
" <td>0.000000</td>\n",
" <td>0.022051</td>\n",
" <td>0.030682</td>\n",
" <td>0.054131</td>\n",
" <td>0.039979</td>\n",
" <td>0.031579</td>\n",
" <td>0.914671</td>\n",
" <td>0.206945</td>\n",
" <td>0.079824</td>\n",
" <td>0.252744</td>\n",
" <td>0.026542</td>\n",
" <td>0.108162</td>\n",
" <td>0.007334</td>\n",
" <td>0.004690</td>\n",
" <td>0.128916</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.163153</td>\n",
" <td>0.208085</td>\n",
" <td>0.281809</td>\n",
" <td>0.346954</td>\n",
" <td>6877</td>\n",
" <td>0.005348</td>\n",
" <td>0.009216</td>\n",
" <td>0.082717</td>\n",
" <td>0.948231</td>\n",
" <td>0.4316</td>\n",
" <td>31806</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.002472</td>\n",
" <td>0.020503</td>\n",
" <td>0.022975</td>\n",
" <td>0.210121</td>\n",
" <td>0.305511</td>\n",
" <td>0.096990</td>\n",
" <td>0.310310</td>\n",
" <td>0.034026</td>\n",
" <td>0.028210</td>\n",
" <td>0.014977</td>\n",
" <td>0.008579</td>\n",
" <td>0.046968</td>\n",
" <td>0.042751</td>\n",
" <td>0.123164</td>\n",
" <td>0.006107</td>\n",
" <td>0.049149</td>\n",
" <td>0.021085</td>\n",
" <td>0.000000</td>\n",
" <td>0.027774</td>\n",
" <td>0.033881</td>\n",
" <td>0.080413</td>\n",
" <td>0.019229</td>\n",
" <td>0.006713</td>\n",
" <td>0.971100</td>\n",
" <td>0.127963</td>\n",
" <td>0.077359</td>\n",
" <td>0.422132</td>\n",
" <td>0.025447</td>\n",
" <td>0.073724</td>\n",
" <td>0.011778</td>\n",
" <td>0.000145</td>\n",
" <td>0.090883</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" AGE_18_29 ... VETERANS_Estimate__Total__Veteran\n",
"0 0.192527 ... 0.086006\n",
"1 0.245792 ... 0.055696\n",
"2 0.208537 ... 0.091336\n",
"3 0.147476 ... 0.128916\n",
"4 0.163153 ... 0.090883\n",
"\n",
"[5 rows x 45 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 75
}
]
},
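{
"cell_type": "markdown",
"metadata": {},
"source": [
"The column filter relies on `str.startswith` accepting a tuple of prefixes and returning `True` when the string starts with any of them. A minimal standalone illustration follows; `tags` and `names` are shortened example values introduced here, not the notebook's `col_tags` or `df.columns`."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Standalone illustration of prefix filtering with a tuple of prefixes\n",
"tags = (\"AGE\", \"RACE\")  # shortened, illustrative prefix tuple\n",
"names = [\"fips\", \"AGE_18_29\", \"RACE_Total__White_alone\"]\n",
"\n",
"# Keep only the names that start with one of the prefixes\n",
"print([n for n in names if n.startswith(tags)])"
],
"execution_count": null,
"outputs": []
},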
{
"cell_type": "markdown",
"metadata": {
"id": "rD1yc70LQcsf"
},
"source": [
"## Assemble and fit an ML Pipeline\n",
"\n",
"An [ML Pipeline](https://spark.apache.org/docs/latest/ml-pipeline.html) contains a sequence of transformations to apply to data. Each stage in the Pipeline is either a `Transformer` or an `Estimator`.\n",
"\n",
"The following Pipeline contains three stages: \n",
"\n",
"1. `VectorAssembler`: assemble features into a vector\n",
"2. `StandardScaler`: scale the features to have mean=0, sd=1\n",
"3. `KMeans`: initalize the K-Means algorithm"
]
},
{
"cell_type": "code",
"metadata": {
"id": "-jRW-GzIQfSI"
},
"source": [
"## 1. Assemble features into a vector\n",
"vecAssembler = VectorAssembler(inputCols=cols, outputCol=\"vecfeatures\")\n",
"\n",
"## 2. Scale the features to have mean 0 and standard deviation 1\n",
"scaler = StandardScaler(inputCol=\"vecfeatures\", outputCol=\"features\",\n",
" withStd=True, withMean=True)\n",
"\n",
"## 3. Initialize the K-Means algorithm\n",
"kmeans = KMeans(k=3, seed=1)\n",
"\n",
"# Assemble Pipeline\n",
"pipeline = Pipeline(stages=[vecAssembler, scaler, kmeans])"
],
"execution_count": 19,
"outputs": []
},
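{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the scaling step concrete, the sketch below standardizes the five `AGE_18_29` values shown earlier using only the Python standard library. It illustrates the formula `(x - mean) / sd` and is not Spark's implementation. The results will differ from the `features` column produced by the Pipeline, because the Pipeline computes the mean and standard deviation over all rows rather than these five. Spark's `StandardScaler` uses the corrected sample standard deviation, which corresponds to `statistics.stdev`."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import statistics\n",
"\n",
"# First five AGE_18_29 values from the preview table above\n",
"values = [0.192527, 0.245792, 0.208537, 0.147476, 0.163153]\n",
"\n",
"mean = statistics.mean(values)\n",
"sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)\n",
"\n",
"# Standardize: subtract the mean, then divide by the standard deviation\n",
"scaled = [(v - mean) / sd for v in values]\n",
"print(scaled)"
],
"execution_count": null,
"outputs": []
},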
{
"cell_type": "markdown",
"metadata": {
"id": "aVSGjTYNgJ-W"
},
"source": [
"Fit the pipeline to data using `.fit`, then use `.transform` to make predictions (i.e. predict cluster labels)."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 202
},
"id": "YudzkC7AfiX5",
"outputId": "30a9f9ee-9e39-4f1c-d0f6-ba971a9bda7b"
},
"source": [
"# Fit the pipeline \n",
"model = pipeline.fit(cluster_df) \n",
"\n",
"# Make a prediction \n",
"prediction = model.transform(cluster_df)\n",
"\n",
"prediction.select('vecFeatures', 'features', 'prediction').limit(5).toPandas().head()"
],
"execution_count": 79,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>vecFeatures</th>\n",
" <th>features</th>\n",
" <th>prediction</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>[0.19252701080432172, 0.3083519121934488, 0.24...</td>\n",
" <td>[0.04420494748177839, 2.471247594748924, -0.53...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>[0.24579234084071877, 0.28774697129847954, 0.2...</td>\n",
" <td>[1.0502649452687496, 1.8607396843027733, -0.51...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>[0.20853744954863357, 0.24515512832362818, 0.2...</td>\n",
" <td>[0.34660543869556293, 0.5987774593601314, -0.0...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>[0.1474755537816803, 0.1718718818599082, 0.252...</td>\n",
" <td>[-0.8067138156522811, -1.5725464914015652, -0....</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>[0.16315253744365277, 0.20808492075032717, 0.2...</td>\n",
" <td>[-0.5106115266494129, -0.4995831271003042, 0.8...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" vecFeatures ... prediction\n",
"0 [0.19252701080432172, 0.3083519121934488, 0.24... ... 0\n",
"1 [0.24579234084071877, 0.28774697129847954, 0.2... ... 1\n",
"2 [0.20853744954863357, 0.24515512832362818, 0.2... ... 0\n",
"3 [0.1474755537816803, 0.1718718818599082, 0.252... ... 0\n",
"4 [0.16315253744365277, 0.20808492075032717, 0.2... ... 0\n",
"\n",
"[5 rows x 3 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 79
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sPRjlYkoTiSu"
},
"source": [
"## Silhouette score\n",
"\n",
"Evaluate the predictions using the Silhouette score."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "lwwIfUK_lwe3",
"outputId": "ad4b9b27-5164-43f1-fff8-53986de8e0ff"
},
"source": [
"# Evaluate clustering by computing Silhouette score\n",
"evaluator = ClusteringEvaluator()\n",
"silhouette = evaluator.evaluate(prediction)\n",
"print(\"Silhouette with squared euclidean distance = \" + str(silhouette))"
],
"execution_count": 22,
"outputs": [
{
"output_type": "stream",
"text": [
"Silhouette with squared euclidean distance = 0.35346767131585405\n"
],
"name": "stdout"
}
]
},
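{
"cell_type": "markdown",
"metadata": {},
"source": [
"`ClusteringEvaluator()` relies on its default parameters. The sketch below writes them out so the column names, metric, and distance measure are visible. It assumes the `prediction` DataFrame from the cells above and should give the same score as the default call; `explicit_evaluator` is a name introduced here for illustration."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from pyspark.ml.evaluation import ClusteringEvaluator\n",
"\n",
"# Same evaluator with the defaults written out (assumes `prediction` from above)\n",
"explicit_evaluator = ClusteringEvaluator(\n",
"    featuresCol=\"features\",\n",
"    predictionCol=\"prediction\",\n",
"    metricName=\"silhouette\",\n",
"    distanceMeasure=\"squaredEuclidean\",\n",
")\n",
"print(explicit_evaluator.evaluate(prediction))"
],
"execution_count": null,
"outputs": []
},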
{
"cell_type": "markdown",
"metadata": {
"id": "oXpQx2ZITlk_"
},
"source": [
"## Cluster centers\n",
"\n",
"To retreive the cluster centers, the K-Means model needs to be extracted from the fitted Pipeline. Use `model.stages` to view the Pipeline stages."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "hy1qmQ56gwEa",
"outputId": "d964d777-a73c-4558-ce15-313756badf91"
},
"source": [
"model.stages"
],
"execution_count": 61,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[VectorAssembler_0c8c65f82cf3,\n",
" StandardScalerModel: uid=StandardScaler_77347baa1268, numFeatures=45, withMean=true, withStd=true,\n",
" KMeansModel: uid=KMeans_455712dbad97, k=3, distanceMeasure=euclidean, numFeatures=45]"
]
},
"metadata": {
"tags": []
},
"execution_count": 61
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 161
},
"id": "rfnR8r8BTPJM",
"outputId": "30d1f645-20e2-4fd1-b183-bd11e288ab50"
},
"source": [
"centers = model.stages[2].clusterCenters()\n",
"centers_df = pd.DataFrame(centers)\n",
"centers_df.columns = cluster_df.columns\n",
"\n",
"centers_df"
],
"execution_count": 62,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>AGE_18_29</th>\n",
" <th>AGE_30_44</th>\n",
" <th>AGE_45_59</th>\n",
" <th>AGE_60_Plus</th>\n",
" <th>AGE_18_Plus</th>\n",
" <th>RACE_Total__Asian_alone</th>\n",
" <th>RACE_Total__Black_or_African_American_alone</th>\n",
" <th>RACE_Total__Hispanic_or_Latino</th>\n",
" <th>RACE_Total__White_alone</th>\n",
" <th>GINI_Gini_Index</th>\n",
" <th>INCOME_PER_CAPITA_INCOME_IN_THE_PAST_12_MONTHS__IN_2018_INFLATION_ADJUSTED_DOLLARS_</th>\n",
" <th>UNEMPLOY__16_YEARS_AND_OVER__ASIAN_ALONE_</th>\n",
" <th>UNEMPLOY__16_YEARS_AND_OVER__BLACK_OR_AFRICAN_AMERICAN_ALONE_</th>\n",
" <th>UNEMPLOY__16_YEARS_AND_OVER__HISPANIC_OR_LATINO_</th>\n",
" <th>UNEMPLOY__16_YEARS_AND_OVER__WHITE_ALONE_</th>\n",
" <th>UNEMPLOY_Total_16_YEARS_AND_OVER</th>\n",
" <th>EDU_ATTAIN_Total__Bachelor_s_degree_or_higher</th>\n",
" <th>EDU_ATTAIN_Total__High_school_graduate__includes_equivalency_</th>\n",
" <th>EDU_ATTAIN_Total__Less_than_high_school_diploma</th>\n",
" <th>EDU_ATTAIN_Total__Some_college_or_associate_s_degree</th>\n",
" <th>INDUSTRY_Total__Agriculture__forestry__fishing_and_hunting__and_mining__Agriculture__forestry__fishing_and_hunting</th>\n",
" <th>INDUSTRY_Total__Agriculture__forestry__fishing_and_hunting__and_mining__Mining__quarrying__and_oil_and_gas_extraction</th>\n",
" <th>INDUSTRY_Total__Arts__entertainment__and_recreation__and_accommodation_and_food_services__Accommodation_and_food_services</th>\n",
" <th>INDUSTRY_Total__Arts__entertainment__and_recreation__and_accommodation_and_food_services__Arts__entertainment__and_recreation</th>\n",
" <th>INDUSTRY_Total__Construction</th>\n",
" <th>INDUSTRY_Total__Educational_services__and_health_care_and_social_assistance__Educational_services</th>\n",
" <th>INDUSTRY_Total__Educational_services__and_health_care_and_social_assistance__Health_care_and_social_assistance</th>\n",
" <th>INDUSTRY_Total__Information</th>\n",
" <th>INDUSTRY_Total__Manufacturing</th>\n",
" <th>INDUSTRY_Total__Professional__scientific__and_management__and_administrative__and_waste_management_services__Administrative_and_support_and_waste_management_services</th>\n",
" <th>INDUSTRY_Total__Professional__scientific__and_management__and_administrative__and_waste_management_services__Management_of_companies_and_enterprises</th>\n",
" <th>INDUSTRY_Total__Professional__scientific__and_management__and_administrative__and_waste_management_services__Professional__scientific__and_technical_services</th>\n",
" <th>INDUSTRY_Total__Public_administration</th>\n",
" <th>INDUSTRY_Total__Retail_trade</th>\n",
" <th>CITIZEN_Estimate__Total__Not_a_U_S__citizen</th>\n",
" <th>CITIZEN_Estimate__Total__U_S__citizen_by_naturalization</th>\n",
" <th>CITIZEN_Estimate__Total__U_S__citizen__born_in_the_United_States</th>\n",
" <th>HEALTH_INSURANCE_No_health_insurance_coverage</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_direct_purchase_health_insurance_only</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_employer_based_health_insurance_only</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_Medicaid_means_tested_public_coverage_only</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_Medicare_coverage_only</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_TRICARE_military_health_coverage_only</th>\n",
" <th>HEALTH_INSURANCE_With_one_type_of_health_insurance_coverageWith_VA_Health_Care_only</th>\n",
" <th>VETERANS_Estimate__Total__Veteran</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-0.287159</td>\n",
" <td>-0.249382</td>\n",
" <td>0.125231</td>\n",
" <td>0.320757</td>\n",
" <td>-0.207120</td>\n",
" <td>-0.279015</td>\n",
" <td>-0.004489</td>\n",
" <td>-0.287263</td>\n",
" <td>0.097970</td>\n",
" <td>-0.048442</td>\n",
" <td>-0.233522</td>\n",
" <td>-0.245131</td>\n",
" <td>-0.009091</td>\n",
" <td>-0.244621</td>\n",
" <td>-0.018794</td>\n",
" <td>-0.143356</td>\n",
" <td>-0.338661</td>\n",
" <td>0.266309</td>\n",
" <td>-0.070275</td>\n",
" <td>0.007361</td>\n",
" <td>0.090156</td>\n",
" <td>-0.024996</td>\n",
" <td>-0.224085</td>\n",
" <td>-0.125167</td>\n",
" <td>-0.014100</td>\n",
" <td>-0.218072</td>\n",
" <td>-0.036337</td>\n",
" <td>-0.188347</td>\n",
" <td>0.102984</td>\n",
" <td>-0.226144</td>\n",
" <td>-0.138354</td>\n",
" <td>-0.304601</td>\n",
" <td>-0.068678</td>\n",
" <td>-0.120262</td>\n",
" <td>-0.345333</td>\n",
" <td>-0.346906</td>\n",
" <td>0.380162</td>\n",
" <td>-0.022233</td>\n",
" <td>-0.008200</td>\n",
" <td>-0.221717</td>\n",
" <td>0.060725</td>\n",
" <td>0.254616</td>\n",
" <td>-0.107198</td>\n",
" <td>0.064268</td>\n",
" <td>0.143876</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.601502</td>\n",
" <td>0.905257</td>\n",
" <td>-0.649011</td>\n",
" <td>-0.718091</td>\n",
" <td>0.035547</td>\n",
" <td>0.026191</td>\n",
" <td>-0.338923</td>\n",
" <td>2.970818</td>\n",
" <td>-0.141378</td>\n",
" <td>0.084737</td>\n",
" <td>-0.643557</td>\n",
" <td>0.182039</td>\n",
" <td>-0.294189</td>\n",
" <td>2.638494</td>\n",
" <td>0.603661</td>\n",
" <td>1.402759</td>\n",
" <td>-0.371628</td>\n",
" <td>0.588405</td>\n",
" <td>2.426231</td>\n",
" <td>0.650506</td>\n",
" <td>0.625147</td>\n",
" <td>1.074540</td>\n",
" <td>0.126774</td>\n",
" <td>-0.271156</td>\n",
" <td>0.138014</td>\n",
" <td>0.041850</td>\n",
" <td>-0.767634</td>\n",
" <td>-0.405155</td>\n",
" <td>-0.502686</td>\n",
" <td>-0.007139</td>\n",
" <td>-0.127766</td>\n",
" <td>-0.393236</td>\n",
" <td>0.122950</td>\n",
" <td>-0.338224</td>\n",
" <td>2.179047</td>\n",
" <td>1.098175</td>\n",
" <td>-1.790288</td>\n",
" <td>1.350968</td>\n",
" <td>-0.318381</td>\n",
" <td>-0.409738</td>\n",
" <td>0.120902</td>\n",
" <td>-0.309357</td>\n",
" <td>-0.177184</td>\n",
" <td>-0.132713</td>\n",
" <td>-0.847554</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.865168</td>\n",
" <td>0.627222</td>\n",
" <td>-0.251938</td>\n",
" <td>-0.951414</td>\n",
" <td>0.753148</td>\n",
" <td>1.021615</td>\n",
" <td>0.126450</td>\n",
" <td>0.097437</td>\n",
" <td>-0.315864</td>\n",
" <td>0.151373</td>\n",
" <td>1.070786</td>\n",
" <td>0.845992</td>\n",
" <td>0.128937</td>\n",
" <td>0.047743</td>\n",
" <td>-0.126316</td>\n",
" <td>0.074497</td>\n",
" <td>1.370794</td>\n",
" <td>-1.173952</td>\n",
" <td>-0.527116</td>\n",
" <td>-0.238068</td>\n",
" <td>-0.535520</td>\n",
" <td>-0.256075</td>\n",
" <td>0.786207</td>\n",
" <td>0.550017</td>\n",
" <td>0.007314</td>\n",
" <td>0.791539</td>\n",
" <td>0.383017</td>\n",
" <td>0.826713</td>\n",
" <td>-0.217245</td>\n",
" <td>0.837223</td>\n",
" <td>0.552214</td>\n",
" <td>1.252050</td>\n",
" <td>0.213695</td>\n",
" <td>0.553648</td>\n",
" <td>0.568514</td>\n",
" <td>0.924733</td>\n",
" <td>-0.823136</td>\n",
" <td>-0.355890</td>\n",
" <td>0.133491</td>\n",
" <td>0.951399</td>\n",
" <td>-0.263388</td>\n",
" <td>-0.839736</td>\n",
" <td>0.453209</td>\n",
" <td>-0.194250</td>\n",
" <td>-0.256408</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" AGE_18_29 ... VETERANS_Estimate__Total__Veteran\n",
"0 -0.287159 ... 0.143876\n",
"1 0.601502 ... -0.847554\n",
"2 0.865168 ... -0.256408\n",
"\n",
"[3 rows x 45 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 62
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9rL0ZGs3WYRZ"
},
"source": [
"## Cluster frequencies"
]
},
{
"cell_type": "code",
"metadata": {
"id": "F9aXDTRkU07y"
},
"source": [
"prediction.groupBy('prediction').count().show()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "d5Fxz3HbWVHM"
},
"source": [
"## Reusable function\n",
"\n",
"The following functions captures the previous steps (assemble Pipeline, fit model, make predictions, assess fit, obtain cluster centers and frequencies) for reusability."
]
},
{
"cell_type": "code",
"metadata": {
"id": "Q4kL8DxdWXDl"
},
"source": [
"def run_clustering(k):\n",
" # Define and assemble the Pipeline\n",
" vecAssembler = VectorAssembler(inputCols=cols, outputCol=\"vecfeatures\")\n",
" scaler = StandardScaler(inputCol=\"vecfeatures\", outputCol=\"features\",\n",
" withStd=True, withMean=True)\n",
" kmeans = KMeans(k=k, seed=1)\n",
" pipeline = Pipeline(stages=[vecAssembler, scaler, kmeans])\n",
" \n",
" # Fit the pipeline \n",
" model = pipeline.fit(cluster_df) \n",
" \n",
" # Make a prediction \n",
" prediction = model.transform(cluster_df)\n",
" \n",
" # Evaluate clustering by computing Silhouette score\n",
" evaluator = ClusteringEvaluator()\n",
" silhouette = evaluator.evaluate(prediction)\n",
" \n",
" # Cluster centers\n",
" centers = model.stages[2].clusterCenters()\n",
" centers_df = pd.DataFrame(centers)\n",
" centers_df.columns = cluster_df.columns\n",
" \n",
" # Cluster frequencies\n",
" cluster_freq = prediction.groupBy('prediction').count()\n",
" \n",
" out = {\n",
" \"silhouette\": silhouette,\n",
" \"centers\": centers_df,\n",
" \"freq\": cluster_freq\n",
" }\n",
" \n",
" return(out)"
],
"execution_count": 51,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "-jo5IOUCW9eD"
},
"source": [
"## Assess a range of K values\n",
"\n",
"Use the previously defined `run_clustering` function to assess the Silhouette score for a range of *K* values."
]
},
{
"cell_type": "code",
"metadata": {
"id": "kHD4HNybWcWv"
},
"source": [
"k_values = list(range(3,20))\n",
"\n",
"k_clustering = [run_clustering(i) for i in k_values]\n",
"\n",
"silhouette_results = [{\"k\": k, \"silhouette\": i['silhouette']} for k,i in zip(k_values, k_clustering)]\n",
"\n",
"silhouette_df = pd.DataFrame(silhouette_results)"
],
"execution_count": 67,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 314
},
"id": "KRGtwTxcil0p",
"outputId": "5720a281-67ab-4ae8-fcbc-059183f5fbf4"
},
"source": [
"silhouette_df.plot.line(x='k', y='silhouette', title = \"Silhouette score for a range of K values\")"
],
"execution_count": 71,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x7f2782b3f1d0>"
]
},
"metadata": {
"tags": []
},
"execution_count": 71
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
}
]
}