Skip to content

Instantly share code, notes, and snippets.

@dannymorris
Created May 18, 2021 19:22
Show Gist options
  • Save dannymorris/ba1788ec0ea141a1dac5c1c24507fcaa to your computer and use it in GitHub Desktop.
Save dannymorris/ba1788ec0ea141a1dac5c1c24507fcaa to your computer and use it in GitHub Desktop.
Spark_Apache_Sedona.ipynb
Display the source blob
Display the rendered blob
Raw
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Spark_Apache_Sedona.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true,
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/dannymorris/ba1788ec0ea141a1dac5c1c24507fcaa/spark_apache_sedona.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sq8U3BtmhtRx"
},
"source": [
"\n",
"# Install Spark"
]
},
{
"cell_type": "code",
"metadata": {
"id": "lh5NCoc8fsSO"
},
"source": [
"!apt-get install openjdk-11-jdk-headless -qq > /dev/null\n",
"!wget -q https://www-us.apache.org/dist/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz\n",
"!tar xf spark-3.0.2-bin-hadoop2.7.tgz\n",
"!pip install -q findspark"
],
"execution_count": 1,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "7rX3TtaOEsQa"
},
"source": [
"# Install Apache Sedona and dependencies"
]
},
{
"cell_type": "code",
"metadata": {
"id": "v0AE-6j7EvNO",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "2def66ab-d189-4d7f-d64a-3b3d9777674b"
},
"source": [
"!pip install shapely\n",
"!pip install attrs\n",
"!pip install apache-sedona"
],
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"text": [
"Requirement already satisfied: shapely in /usr/local/lib/python3.7/dist-packages (1.7.1)\n",
"Requirement already satisfied: attrs in /usr/local/lib/python3.7/dist-packages (21.2.0)\n",
"Collecting apache-sedona\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/f2/1d/0604627905d04f93ee0cf8e9b312d5b4590a6e80e3d0112902d332302365/apache_sedona-1.0.0-py3-none-any.whl (67kB)\n",
"\u001b[K |████████████████████████████████| 71kB 3.9MB/s \n",
"\u001b[?25hCollecting pyspark<3.1.0\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/27/67/5158f846202d7f012d1c9ca21c3549a58fd3c6707ae8ee823adcaca6473c/pyspark-3.0.2.tar.gz (204.8MB)\n",
"\u001b[K |████████████████████████████████| 204.8MB 51kB/s \n",
"\u001b[?25hRequirement already satisfied: shapely in /usr/local/lib/python3.7/dist-packages (from apache-sedona) (1.7.1)\n",
"Requirement already satisfied: attrs in /usr/local/lib/python3.7/dist-packages (from apache-sedona) (21.2.0)\n",
"Collecting py4j==0.10.9\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)\n",
"\u001b[K |████████████████████████████████| 204kB 42.0MB/s \n",
"\u001b[?25hBuilding wheels for collected packages: pyspark\n",
" Building wheel for pyspark (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for pyspark: filename=pyspark-3.0.2-py2.py3-none-any.whl size=205186687 sha256=4aadc827253c8a86653ab441f2e0afa0d94d57d6293bbbe425ad611c161eb877\n",
" Stored in directory: /root/.cache/pip/wheels/8b/09/da/c1f2859bcc86375dc972c5b6af4881b3603269bcc4c9be5d16\n",
"Successfully built pyspark\n",
"Installing collected packages: py4j, pyspark, apache-sedona\n",
"Successfully installed apache-sedona-1.0.0 py4j-0.10.9 pyspark-3.0.2\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ILheUROOhprv"
},
"source": [
"# Set location of Java and Spark\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "v1b8k_OVf2QF"
},
"source": [
"import os\n",
"os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-11-openjdk-amd64\"\n",
"os.environ[\"SPARK_HOME\"] = \"/content/spark-3.0.2-bin-hadoop2.7\""
],
"execution_count": 3,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "KwrqMk3HiMiE"
},
"source": [
"# Load packages"
]
},
{
"cell_type": "code",
"metadata": {
"id": "9_Uz1NL4gHFx"
},
"source": [
"from pyspark.sql import SparkSession\n",
"from pyspark import SparkFiles\n",
"from sedona.register import SedonaRegistrator\n",
"from sedona.utils import SedonaKryoRegistrator, KryoSerializer\n",
"\n",
"import findspark\n",
"findspark.init()"
],
"execution_count": 4,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "-8UmT-1QE6aQ"
},
"source": [
"# Create SparkSession"
]
},
{
"cell_type": "code",
"metadata": {
"id": "WPabOMqXBBmH",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "106db4e1-7fae-4e21-b27e-17f26e845c67"
},
"source": [
"spark = SparkSession. \\\n",
" builder. \\\n",
" appName('sedonaTest'). \\\n",
" config(\"spark.serializer\", KryoSerializer.getName). \\\n",
" config(\"spark.kryo.registrator\", SedonaKryoRegistrator.getName). \\\n",
" config('spark.jars.packages',\n",
" 'org.apache.sedona:sedona-python-adapter-3.0_2.12:1.0.0-incubating,'\n",
" 'org.datasyslab:geotools-wrapper:geotools-24.0'). \\\n",
" getOrCreate()\n",
"\n",
"SedonaRegistrator.registerAll(spark)"
],
"execution_count": 5,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"True"
]
},
"metadata": {
"tags": []
},
"execution_count": 5
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fSwJvXvqFFkM"
},
"source": [
"# Read data\n",
"\n",
"The raw data is a text file containing latitude/longitidue coordinates of cities worldwide."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 202
},
"id": "ZLSMXKkYB-j0",
"outputId": "ffe7d42e-706a-4390-f57e-d6bb96e25fb1"
},
"source": [
"url = \"https://gist.githubusercontent.com/dannymorris/d28665a8b5e58f7eb6d8e065e04b1231/raw/b78a3536850dfd9da06a717a3ff69deaa05b2edb/worldcities.csv\"\n",
"spark.sparkContext.addFile(url)\n",
"\n",
"world_cities = spark.read.csv(\"file://\"+SparkFiles.get(\"worldcities.csv\"), header=True, inferSchema= True)\n",
"world_cities.createOrReplaceTempView(\"world_cities\")\n",
"world_cities.limit(5).toPandas().head()"
],
"execution_count": 6,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>city</th>\n",
" <th>city_ascii</th>\n",
" <th>lat</th>\n",
" <th>lng</th>\n",
" <th>country</th>\n",
" <th>iso2</th>\n",
" <th>iso3</th>\n",
" <th>admin_name</th>\n",
" <th>capital</th>\n",
" <th>population</th>\n",
" <th>id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Tokyo</td>\n",
" <td>Tokyo</td>\n",
" <td>35.6897</td>\n",
" <td>139.6922</td>\n",
" <td>Japan</td>\n",
" <td>JP</td>\n",
" <td>JPN</td>\n",
" <td>Tōkyō</td>\n",
" <td>primary</td>\n",
" <td>37977000.0</td>\n",
" <td>1392685764</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jakarta</td>\n",
" <td>Jakarta</td>\n",
" <td>-6.2146</td>\n",
" <td>106.8451</td>\n",
" <td>Indonesia</td>\n",
" <td>ID</td>\n",
" <td>IDN</td>\n",
" <td>Jakarta</td>\n",
" <td>primary</td>\n",
" <td>34540000.0</td>\n",
" <td>1360771077</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Delhi</td>\n",
" <td>Delhi</td>\n",
" <td>28.6600</td>\n",
" <td>77.2300</td>\n",
" <td>India</td>\n",
" <td>IN</td>\n",
" <td>IND</td>\n",
" <td>Delhi</td>\n",
" <td>admin</td>\n",
" <td>29617000.0</td>\n",
" <td>1356872604</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Mumbai</td>\n",
" <td>Mumbai</td>\n",
" <td>18.9667</td>\n",
" <td>72.8333</td>\n",
" <td>India</td>\n",
" <td>IN</td>\n",
" <td>IND</td>\n",
" <td>Mahārāshtra</td>\n",
" <td>admin</td>\n",
" <td>23355000.0</td>\n",
" <td>1356226629</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Manila</td>\n",
" <td>Manila</td>\n",
" <td>14.5958</td>\n",
" <td>120.9772</td>\n",
" <td>Philippines</td>\n",
" <td>PH</td>\n",
" <td>PHL</td>\n",
" <td>Manila</td>\n",
" <td>primary</td>\n",
" <td>23088000.0</td>\n",
" <td>1608618140</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" city city_ascii lat ... capital population id\n",
"0 Tokyo Tokyo 35.6897 ... primary 37977000.0 1392685764\n",
"1 Jakarta Jakarta -6.2146 ... primary 34540000.0 1360771077\n",
"2 Delhi Delhi 28.6600 ... admin 29617000.0 1356872604\n",
"3 Mumbai Mumbai 18.9667 ... admin 23355000.0 1356226629\n",
"4 Manila Manila 14.5958 ... primary 23088000.0 1608618140\n",
"\n",
"[5 rows x 11 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 6
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "A4tCa2qeFMlk"
},
"source": [
"#Convert lat/long to geometric points\n",
"\n",
"The Apache Sedona constructor ST_Point takes Decimal typed values X,Y and constructs a geometric point."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 219
},
"id": "ZVERoPZWDKHk",
"outputId": "9ca6ec28-55f9-487c-969f-8648daf43764"
},
"source": [
"# create Spark DF containing geometric representation of lat/long coordinates (column name \"Geom\")\n",
"geom_df = spark.sql(\n",
" \"\"\"\n",
" SELECT *, ST_Point(CAST(world_cities.lng AS Decimal(24,20)), CAST(world_cities.lat AS Decimal(24,20))) As Geom\n",
" FROM world_cities\n",
" \"\"\"\n",
")\n",
" \n",
"# write a local, temporary table for use with Spark SQL\n",
"geom_df.createOrReplaceTempView(\"geom_df\")\n",
"geom_df.limit(5).toPandas().head()"
],
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>city</th>\n",
" <th>city_ascii</th>\n",
" <th>lat</th>\n",
" <th>lng</th>\n",
" <th>country</th>\n",
" <th>iso2</th>\n",
" <th>iso3</th>\n",
" <th>admin_name</th>\n",
" <th>capital</th>\n",
" <th>population</th>\n",
" <th>id</th>\n",
" <th>Geom</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Tokyo</td>\n",
" <td>Tokyo</td>\n",
" <td>35.6897</td>\n",
" <td>139.6922</td>\n",
" <td>Japan</td>\n",
" <td>JP</td>\n",
" <td>JPN</td>\n",
" <td>Tōkyō</td>\n",
" <td>primary</td>\n",
" <td>37977000.0</td>\n",
" <td>1392685764</td>\n",
" <td>POINT (139.6922 35.6897)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jakarta</td>\n",
" <td>Jakarta</td>\n",
" <td>-6.2146</td>\n",
" <td>106.8451</td>\n",
" <td>Indonesia</td>\n",
" <td>ID</td>\n",
" <td>IDN</td>\n",
" <td>Jakarta</td>\n",
" <td>primary</td>\n",
" <td>34540000.0</td>\n",
" <td>1360771077</td>\n",
" <td>POINT (106.8451 -6.2146)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Delhi</td>\n",
" <td>Delhi</td>\n",
" <td>28.6600</td>\n",
" <td>77.2300</td>\n",
" <td>India</td>\n",
" <td>IN</td>\n",
" <td>IND</td>\n",
" <td>Delhi</td>\n",
" <td>admin</td>\n",
" <td>29617000.0</td>\n",
" <td>1356872604</td>\n",
" <td>POINT (77.23 28.66)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Mumbai</td>\n",
" <td>Mumbai</td>\n",
" <td>18.9667</td>\n",
" <td>72.8333</td>\n",
" <td>India</td>\n",
" <td>IN</td>\n",
" <td>IND</td>\n",
" <td>Mahārāshtra</td>\n",
" <td>admin</td>\n",
" <td>23355000.0</td>\n",
" <td>1356226629</td>\n",
" <td>POINT (72.83329999999999 18.9667)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Manila</td>\n",
" <td>Manila</td>\n",
" <td>14.5958</td>\n",
" <td>120.9772</td>\n",
" <td>Philippines</td>\n",
" <td>PH</td>\n",
" <td>PHL</td>\n",
" <td>Manila</td>\n",
" <td>primary</td>\n",
" <td>23088000.0</td>\n",
" <td>1608618140</td>\n",
" <td>POINT (120.9772 14.5958)</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" city city_ascii ... id Geom\n",
"0 Tokyo Tokyo ... 1392685764 POINT (139.6922 35.6897)\n",
"1 Jakarta Jakarta ... 1360771077 POINT (106.8451 -6.2146)\n",
"2 Delhi Delhi ... 1356872604 POINT (77.23 28.66)\n",
"3 Mumbai Mumbai ... 1356226629 POINT (72.83329999999999 18.9667)\n",
"4 Manila Manila ... 1608618140 POINT (120.9772 14.5958)\n",
"\n",
"[5 rows x 12 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 7
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "L7Vt0UsJFR0i"
},
"source": [
"# Distance calculations\n",
"\n",
"## All points to a single query points\n",
"\n",
"Compute the Euclidean distances between all cities and a single query point. Results are ordered by smallest distance to query point.\n",
"\n",
"Notation: *N x 1*\n",
"\n",
"The ST_Distance function takes two geometric inputs and computes the Euclidean distance."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 202
},
"id": "iW5a7UxcDWaG",
"outputId": "a7cd1bbc-57fa-4871-acae-864c1007f249"
},
"source": [
"knn_query = spark.sql(\n",
" \"\"\"\n",
" SELECT Geom, city, admin_name, ST_Distance(ST_Point(-78.858060, 43.062770), Geom) AS distance \n",
" FROM geom_df \n",
" ORDER BY distance\n",
" \"\"\"\n",
")\n",
"knn_query.createOrReplaceTempView(\"knn_query\")\n",
"knn_query.limit(5).toPandas().head()"
],
"execution_count": 8,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Geom</th>\n",
" <th>city</th>\n",
" <th>admin_name</th>\n",
" <th>distance</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>POINT (-78.8659 43.0457)</td>\n",
" <td>North Tonawanda</td>\n",
" <td>New York</td>\n",
" <td>0.018784</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>POINT (-78.8831 43.0975)</td>\n",
" <td>Wheatfield</td>\n",
" <td>New York</td>\n",
" <td>0.042816</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>POINT (-78.8805 43.0105)</td>\n",
" <td>Tonawanda</td>\n",
" <td>New York</td>\n",
" <td>0.056883</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>POINT (-78.78870000000001 43.0025)</td>\n",
" <td>University at Buffalo</td>\n",
" <td>New York</td>\n",
" <td>0.091887</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>POINT (-78.87130000000001 42.9646)</td>\n",
" <td>Kenmore</td>\n",
" <td>New York</td>\n",
" <td>0.099059</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Geom ... distance\n",
"0 POINT (-78.8659 43.0457) ... 0.018784\n",
"1 POINT (-78.8831 43.0975) ... 0.042816\n",
"2 POINT (-78.8805 43.0105) ... 0.056883\n",
"3 POINT (-78.78870000000001 43.0025) ... 0.091887\n",
"4 POINT (-78.87130000000001 42.9646) ... 0.099059\n",
"\n",
"[5 rows x 4 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 8
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XpLpcAnjFcZm"
},
"source": [
"## All pairwise points\n",
"\n",
"Compute the distances between all pairs of cities. This is done by creating a second DataFrame and computing distances between geometric coordinates in the first and second DataFrame using the ST_Distance function. In this toy example, the original DataFrame is replicated. In a real scenario, the first DataFrame may contain member coordinates and the second provider coordinates.\n",
"\n",
"Notation: *N x N*"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 202
},
"id": "z5oR-kpCDZ_q",
"outputId": "e56b8c22-c03c-4c52-ae65-8e505fed83c4"
},
"source": [
"geom_df2 = spark.sql(\n",
" \"\"\"\n",
" SELECT *, Geom As Geom2\n",
" FROM geom_df \n",
" \"\"\"\n",
")\n",
"geom_df2.createOrReplaceTempView(\"geom_df2\")\n",
" \n",
"knn_all = spark.sql(\n",
" \"\"\"\n",
" SELECT CONCAT(geom_df.city, ',', geom_df.country) as Location1,\n",
" CONCAT(geom_df2.city, ',', geom_df2.country) as Location2,\n",
" ST_Distance(geom_df.Geom, geom_df2.Geom) As Distance\n",
" FROM geom_df, geom_df2\n",
" WHERE ST_Distance(geom_df.Geom, geom_df2.Geom) < 0.1\n",
" \"\"\"\n",
")\n",
"knn_all.createOrReplaceTempView(\"knn_all\")\n",
"knn_all.limit(5).toPandas().head()"
],
"execution_count": 9,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Location1</th>\n",
" <th>Location2</th>\n",
" <th>Distance</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Tokyo,Japan</td>\n",
" <td>Tokyo,Japan</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Tokyo,Japan</td>\n",
" <td>Shibuya,Japan</td>\n",
" <td>0.039903</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Jakarta,Indonesia</td>\n",
" <td>Jakarta,Indonesia</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Delhi,India</td>\n",
" <td>Delhi,India</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Delhi,India</td>\n",
" <td>New Delhi,India</td>\n",
" <td>0.050000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Location1 Location2 Distance\n",
"0 Tokyo,Japan Tokyo,Japan 0.000000\n",
"1 Tokyo,Japan Shibuya,Japan 0.039903\n",
"2 Jakarta,Indonesia Jakarta,Indonesia 0.000000\n",
"3 Delhi,India Delhi,India 0.000000\n",
"4 Delhi,India New Delhi,India 0.050000"
]
},
"metadata": {
"tags": []
},
"execution_count": 9
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "P6eEUrxOFsUH"
},
"source": [
"# Cities and their top 3 closest neighboring cities"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 202
},
"id": "GS8eZPcLDhpl",
"outputId": "fc79aaa4-7259-4cb2-96bd-6d09cc2d3e0d"
},
"source": [
"top_3 = spark.sql(\n",
" \"\"\"\n",
" SELECT *\n",
" FROM (\n",
" SELECT *, DENSE_RANK() OVER (PARTITION BY Location1 ORDER BY Distance) as rank\n",
" FROM knn_all\n",
" ) \n",
" WHERE rank <= 3;\n",
" \"\"\"\n",
")\n",
"top_3.createOrReplaceTempView(\"top_3\")\n",
"top_3.limit(5).toPandas().head()"
],
"execution_count": 12,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Location1</th>\n",
" <th>Location2</th>\n",
" <th>Distance</th>\n",
" <th>rank</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Apsheronsk,Russia</td>\n",
" <td>Apsheronsk,Russia</td>\n",
" <td>0.000000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Arth,Switzerland</td>\n",
" <td>Arth,Switzerland</td>\n",
" <td>0.000000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Ash Shaykh ‘Uthmān,Yemen</td>\n",
" <td>Ash Shaykh ‘Uthmān,Yemen</td>\n",
" <td>0.000000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Ash Shaykh ‘Uthmān,Yemen</td>\n",
" <td>Aden,Yemen</td>\n",
" <td>0.086768</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Austin,United States</td>\n",
" <td>Austin,United States</td>\n",
" <td>0.000000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Location1 Location2 Distance rank\n",
"0 Apsheronsk,Russia Apsheronsk,Russia 0.000000 1\n",
"1 Arth,Switzerland Arth,Switzerland 0.000000 1\n",
"2 Ash Shaykh ‘Uthmān,Yemen Ash Shaykh ‘Uthmān,Yemen 0.000000 1\n",
"3 Ash Shaykh ‘Uthmān,Yemen Aden,Yemen 0.086768 2\n",
"4 Austin,United States Austin,United States 0.000000 1"
]
},
"metadata": {
"tags": []
},
"execution_count": 12
}
]
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment