Skip to content

Instantly share code, notes, and snippets.

@pavlov99
pavlov99 / haversine.scala
Created December 19, 2016 07:52
Spherical distance calcualtion based on latitude and longitude with Apache Spark
// Based on following links:
// http://andrew.hedges.name/experiments/haversine/
// http://www.movable-type.co.uk/scripts/latlong.html
df
.withColumn("a", pow(sin(toRadians($"destination_latitude" - $"origin_latitude") / 2), 2) + cos(toRadians($"origin_latitude")) * cos(toRadians($"destination_latitude")) * pow(sin(toRadians($"destination_longitude" - $"origin_longitude") / 2), 2))
.withColumn("distance", atan2(sqrt($"a"), sqrt(-$"a" + 1)) * 2 * 6371)
>>>
+--------------+-------------------+-------------+----------------+---------------+----------------+--------------------+---------------------+--------------------+------------------+
|origin_airport|destination_airport| origin_city|destination_city|origin_latitude|origin_longitude|destination_latitude|destination_longitude| a| distance|
@tzachz
tzachz / CombineMaps.scala
Last active January 26, 2023 04:31
Apache Spark UserDefinedAggregateFunction combining maps
import org.apache.spark.SparkContext
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, Row, SQLContext}
/***
* UDAF combining maps, overriding any duplicate key with "latest" value
* @param keyType DataType of Map key
* @param valueType DataType of Value key
* @param merge function to merge values of identical keys

Latency numbers every programmer should know

L1 cache reference ......................... 0.5 ns
Branch mispredict ............................ 5 ns                     on recent CPU
L2 cache reference ........................... 7 ns                     14x L1 cache
Mutex lock/unlock ........................... 25 ns
Main memory reference ...................... 100 ns                     20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy ............. 3,000 ns  =   3 µs
Send 2K bytes over 1 Gbps network ....... 20,000 ns  =  20 µs
SSD random read ........................ 150,000 ns  = 150 µs

Read 1 MB sequentially from memory ..... 250,000 ns = 250 µs 4X memory

@azymnis
azymnis / ItemSimilarity.scala
Created December 13, 2013 05:17
Approximate item similarity using LSH in Scalding.
import com.twitter.scalding._
import com.twitter.algebird.{ MinHasher, MinHasher32, MinHashSignature }
/**
* Computes similar items (with a string itemId), based on approximate
* Jaccard similarity, using LSH.
*
* Assumes an input data TSV file of the following format:
*
* itemId userId
@fightingtheboss
fightingtheboss / configuration.yml
Last active September 15, 2023 07:29
An Amazon Elastic Beanstalk configuration file for a Meteor project. This file needs to be saved in the .ebconfiguration/ directory at the root of your project. Deployed with `git aws.push`. Replace MONGO_URL with the URL to your MongoDB instance (I use MongoHQ).
packages:
yum:
git: []
files:
/opt/elasticbeanstalk/hooks/appdeploy/pre/51install_meteor.sh:
mode: "000755"
user: root
group: root
encoding: plain
@jed
jed / LICENSE.txt
Created May 20, 2011 13:27 — forked from 140bytes/LICENSE.txt
generate random UUIDs
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
Version 2, December 2004
Copyright (C) 2011 Jed Schmidt <http://jed.is>
Everyone is permitted to copy and distribute verbatim or modified
copies of this license document, and changing it is allowed as long
as the name is changed.
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE