@pavlov99
Created December 19, 2016 07:52
Spherical distance calculation based on latitude and longitude with Apache Spark
// Based on following links:
// http://andrew.hedges.name/experiments/haversine/
// http://www.movable-type.co.uk/scripts/latlong.html
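import org.apache.spark.sql.functions._
// The $"..." column syntax needs spark.implicits._ (already imported in spark-shell).
// "a" is the haversine term; "distance" is in km (6371 = mean Earth radius in km).
df
.withColumn("a", pow(sin(toRadians($"destination_latitude" - $"origin_latitude") / 2), 2) + cos(toRadians($"origin_latitude")) * cos(toRadians($"destination_latitude")) * pow(sin(toRadians($"destination_longitude" - $"origin_longitude") / 2), 2))
.withColumn("distance", atan2(sqrt($"a"), sqrt(-$"a" + 1)) * 2 * 6371)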
>>>
+--------------+-------------------+-------------+----------------+---------------+----------------+--------------------+---------------------+--------------------+------------------+
|origin_airport|destination_airport|  origin_city|destination_city|origin_latitude|origin_longitude|destination_latitude|destination_longitude|                   a|          distance|
+--------------+-------------------+-------------+----------------+---------------+----------------+--------------------+---------------------+--------------------+------------------+
|           HKG|                SYD|    Hong Kong|          Sydney|      22.308919|      113.914603|          -33.946111|           151.177222|  0.3005838068886348|7393.8837884771565|
|           YYZ|                HKG|      Toronto|       Hong Kong|      43.677223|      -79.630556|           22.308919|           113.914603|  0.6941733892671567|12548.533187172497|
+--------------+-------------------+-------------+----------------+---------------+----------------+--------------------+---------------------+--------------------+------------------+
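
For reference, the two `withColumn` calls above appear to correspond to the standard haversine formula. The symbols are my own notation, not the gist's: $\varphi_1, \lambda_1$ are the origin latitude and longitude in radians, $\varphi_2, \lambda_2$ those of the destination, and $R = 6371$ km is the mean Earth radius used in the code.

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \, \cos\varphi_2 \,
      \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1 - a}\right)
```

The `a` column in the output is the first expression and `distance` is $d$ in kilometers.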
@nathanwalther

Thanks for sharing, this was a huge help!

@joekane3
joekane3 commented Oct 4, 2018

+1

@kennethlimjf

Thanks @pavlov99, I still use this!

@harpaj
harpaj commented May 31, 2019

Thanks a lot for this! I ported it to Pyspark, maybe it helps someone:

    import pyspark.sql.functions as F
    df = df.withColumn("a", (
        F.pow(F.sin(F.radians(F.col("destination_latitude") - F.col("origin_latitude")) / 2), 2) +
        F.cos(F.radians(F.col("origin_latitude"))) * F.cos(F.radians(F.col("destination_latitude"))) *
        F.pow(F.sin(F.radians(F.col("destination_longitude") - F.col("origin_longitude")) / 2), 2)
    )).withColumn("distance", F.atan2(F.sqrt(F.col("a")), F.sqrt(-F.col("a") + 1)) * 12742000)

@RobinL
RobinL commented Mar 30, 2020

Thanks @pavlov99 and @harpaj! Worth noting that harpaj's code gives the distance in meters.

And if you prefer SQL:

cast(atan2(sqrt(
(
pow(sin(radians(lat_r - lat_l)/2), 2) + 
cos(radians(lat_l)) * cos(radians(lat_r)) *
pow(sin(radians(long_r - long_l)/2),2)
)
), sqrt(-1*
(
pow(sin(radians(lat_r - lat_l)/2), 2) + 
cos(radians(lat_l)) * cos(radians(lat_r)) *
pow(sin(radians(long_r - long_l)/2),2)
)
 + 1)) * 12742 as float) as distance_km
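
The meters-versus-kilometers note above comes down to the constant that multiplies the `atan2` term: 12742 is 2 × 6371 km, and 12742000 is the same diameter in meters. Here is a minimal plain-Python sketch of that idea, using only the standard library. The `radius` parameter and the name `EARTH_RADIUS_KM` are my own additions for illustration and are not part of any snippet in this thread.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, the value used throughout this thread


def haversine(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance; the result is in the same unit as `radius`."""
    a = (
        math.sin(math.radians(lat2 - lat1) / 2) ** 2
        + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
        * math.sin(math.radians(lon2 - lon1) / 2) ** 2
    )
    return 2 * radius * math.atan2(math.sqrt(a), math.sqrt(1 - a))


# Hong Kong (HKG) to Sydney (SYD), coordinates taken from the gist's sample data
hkg = (22.308919, 113.914603)
syd = (-33.946111, 151.177222)

km = haversine(*hkg, *syd)                                # radius in km -> km
m = haversine(*hkg, *syd, radius=EARTH_RADIUS_KM * 1000)  # radius in m  -> m
print(km, m)
```

Making the radius an explicit argument keeps the unit choice visible at the call site instead of hiding it in a constant like 12742 or 12742000.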

@John-Cusack
John-Cusack commented Aug 8, 2020

I took @harpaj's code and put it into a function (distance in km):

import pyspark.sql.functions as F

def hav_dist(origin_lat, origin_long, dest_lat, dest_long):
    a = (
        F.pow(F.sin(F.radians(dest_lat - origin_lat) / 2), 2) +
        F.cos(F.radians(origin_lat)) * F.cos(F.radians(dest_lat)) *
        F.pow(F.sin(F.radians(dest_long - origin_long) / 2), 2))
    return ( F.atan2(F.sqrt(a), F.sqrt(-a + 1)) * 12742)

@wenmin-wu
wenmin-wu commented Aug 15, 2020

I took @harpaj's code and implemented it with NumPy; the returned distance is in km.

import numpy as np

def haversine(origin_lat,
              origin_long,
              dest_lat,
              dest_long) -> float:
    o_lat = np.asarray(origin_lat)
    o_long = np.asarray(origin_long)
    d_lat = np.asarray(dest_lat)
    d_long = np.asarray(dest_long)
    
    a = np.sin(np.radians(d_lat - o_lat) / 2) ** 2
    b = np.cos(np.radians(o_lat)) * np.cos(np.radians(d_lat))
    c = np.sin(np.radians(d_long - o_long) / 2) ** 2
    d = a + b * c
    return np.arctan2(np.sqrt(d), np.sqrt(-d + 1)) * 12742


assert abs(haversine(22.308919, 113.914603, -33.946111, 151.177222) - 7393.8837884771565) < 1e-6
assert abs(haversine(43.677223, -79.630556, 22.308919, 113.914603) - 12548.533187172497) < 1e-6

@minus34
minus34 commented Dec 9, 2020

Thanks, saved me a performance bottleneck I had!

@kangeugine

In case anyone wants a function that returns the distance as a Column:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

def haversineDistance(destination_latitude: Column, destination_longitude: Column, origin_latitude: Column, origin_longitude: Column): Column = {
    val a = pow(sin(toRadians(destination_latitude - origin_latitude) / 2), 2) + cos(toRadians(origin_latitude)) * cos(toRadians(destination_latitude)) * pow(sin(toRadians(destination_longitude - origin_longitude) / 2), 2)
    val distance = atan2(sqrt(a), sqrt(-a + 1)) * 2 * 6371
    return distance
}

val x = Seq(
    ("Hong Kong", "Sydney", 22.308919, 113.914603, -33.946111, 151.177222),
    ("Toronto", "Hong Kong", 43.677223, -79.630556, 22.308919, 113.914603)
    ).toDF("origin_city", "destination_city", "origin_latitude", "origin_longitude", "destination_latitude", "destination_longitude")
    .withColumn("distance", haversineDistance($"destination_latitude", $"destination_longitude",  $"origin_latitude", $"origin_longitude"))

x.show()
+-----------+----------------+---------------+----------------+--------------------+---------------------+------------------+
|origin_city|destination_city|origin_latitude|origin_longitude|destination_latitude|destination_longitude|          distance|
+-----------+----------------+---------------+----------------+--------------------+---------------------+------------------+
|  Hong Kong|          Sydney|      22.308919|      113.914603|          -33.946111|           151.177222|7393.8837884771565|
|    Toronto|       Hong Kong|      43.677223|      -79.630556|           22.308919|           113.914603|12548.533187172497|
+-----------+----------------+---------------+----------------+--------------------+---------------------+------------------+

@andelink
andelink commented Nov 5, 2021

Love this thread of people sharing different implementations for different needs ❤️

Also saved me some time, so thanks all!

Here's my contribution:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

AVG_EARTH_RADIUS = 6371.0

def haversine(lat1, lng1, lat2, lng2):
    """Cython fast-distance as Spark SQL"""
    lat1 = F.radians(lat1)
    lng1 = F.radians(lng1)
    lat2 = F.radians(lat2)
    lng2 = F.radians(lng2)
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = F.sin(lat * 0.5) ** 2 + F.cos(lat1) * F.cos(lat2) * F.sin(lng * 0.5) ** 2
    return 2 * AVG_EARTH_RADIUS * F.asin(F.sqrt(d))

>>>
+-------+---------+----------+----------+
|airport|     city|       lat|       lng|
+-------+---------+----------+----------+
|    HKG|Hong Kong| 22.308919|113.914603|
|    SYD|   Sydney|-33.946111|151.177222|
|    YYZ|  Toronto| 43.677223|-79.630556|
+-------+---------+----------+----------+

+---------------------------------------+-------------------------------------+------------------+
|a                                      |b                                    |distance          |
+---------------------------------------+-------------------------------------+------------------+
|{HKG, Hong Kong, 22.308919, 113.914603}|{SYD, Sydney, -33.946111, 151.177222}|7393.8837884771565|
|{HKG, Hong Kong, 22.308919, 113.914603}|{YYZ, Toronto, 43.677223, -79.630556}|12548.533187172497|
|{SYD, Sydney, -33.946111, 151.177222}  |{YYZ, Toronto, 43.677223, -79.630556}|15554.728375861841|
+---------------------------------------+-------------------------------------+------------------+
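
This last version uses `asin(sqrt(d))` where the earlier snippets use `atan2(sqrt(a), sqrt(1 - a))`. For values in [0, 1] the two give the same angle, which is why the distances match the original gist. Here is a small standard-library check of that, as a sketch only. The function names are hypothetical, and the sample values are the `a` column from the gist's output.

```python
import math


def central_angle_atan2(d):
    # Form used in the original gist and most ports in this thread
    return 2 * math.atan2(math.sqrt(d), math.sqrt(1 - d))


def central_angle_asin(d):
    # Form used in the asin-based version above
    return 2 * math.asin(math.sqrt(d))


# 0 and 1 are the edge cases; the middle two are the gist's "a" values
samples = [0.0, 0.3005838068886348, 0.6941733892671567, 1.0]
max_diff = max(
    abs(central_angle_atan2(d) - central_angle_asin(d)) for d in samples
)
print(max_diff)

# Multiplying the angle by the 6371 km radius gives the HKG -> SYD distance
hkg_syd_km = 6371.0 * central_angle_asin(0.3005838068886348)
print(hkg_syd_km)
```

Both forms fail if rounding pushes the haversine term slightly above 1, so clamping it to [0, 1] before the square root is a common precaution for near-antipodal points.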
