@sanp
sanp / currying_vs_partial_application.md
Last active February 3, 2017 19:57
Lightning Talk for Centro Tech Team on 2/3/17

Currying vs Partials

Some concepts:

Functions have arity: the number of arguments they take [source]

  • Nullary: 0 args
  • Unary: 1 arg
  • Polyadic: many args
    • Binary: 2 args
    • Ternary: 3 args
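
To make arity concrete, here is a minimal Python sketch. The `arity` helper and the sample functions are hypothetical names introduced for illustration; it assumes that counting positional parameters is a good enough definition of arity for plain functions.

```python
import inspect
from functools import partial


def arity(func):
    """Count the positional parameters a callable accepts."""
    positional_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    params = inspect.signature(func).parameters.values()
    return sum(1 for p in params if p.kind in positional_kinds)


def nullary():
    return 0


def unary(x):
    return x


def binary(x, y):
    return x + y


def ternary(x, y, z):
    return x + y + z


# Fixing the first argument of a binary function leaves a unary one.
add_one = partial(binary, 1)

for f in (nullary, unary, binary, ternary):
    print(f.__name__, arity(f))
print("add_one", arity(add_one))
```

Fixing one argument with `functools.partial` lowers the arity by one, which is the kind of arity change that both partial application and currying rely on.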
@sanp
sanp / ssh_tunnelling.md
Last active February 21, 2017 21:43
Lightning Talk for Centro Tech Team on 2/15/17

SSH Tunneling

Problem

You want to query a DB and get a result set, but you don't have access to that DB directly from your localhost.

  • Bad solution A: Cry
  • Bad solution B: Ask someone who does have access to run your query for you
  • Bad solution C: ssh into a box that has access, then psql into the DB
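
For contrast with the bad solutions above, here is a rough sketch of what a local SSH port forward could look like, built as a command line from Python. Every host name, user name, and port here is a hypothetical placeholder, and it assumes you can ssh into a box that can reach the DB.

```python
import shlex


def build_tunnel_command(local_port, db_host, db_port, jump_box):
    """Return the argv list for an SSH local port forward (ssh -L)."""
    return [
        "ssh",
        "-N",  # do not run a remote command, only forward ports
        "-L", f"{local_port}:{db_host}:{db_port}",
        jump_box,
    ]


# Hypothetical hosts and ports, for illustration only.
cmd = build_tunnel_command(5433, "db.internal.example", 5432, "me@box.example")
print(shlex.join(cmd))
```

With a tunnel like this running, a client on localhost would point at the local port (5433 in this sketch) instead of the DB host, so the query runs from your own machine.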
@sanp
sanp / parse_json_with_spark_lateral_view.py
Last active March 13, 2017 15:59
Lightning Talk for Centro Tech Team on 3/10/17
# Parse JSON data with this one weird trick!
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.sql import Row
# Set up basic spark session
conf = (SparkConf()
        .setAppName('My App')
@sanp
sanp / vacuuming.md
Last active September 27, 2017 07:49

Vacuuming

  • Postgres uses an MVCC (Multiversion concurrency control) model (as opposed to table locking)

    • When an update/transaction is happening, a new snapshot of the data is created
    • Whenever you query data, you're seeing a snapshot of the data as it was at a certain time in the past.
  • So: when you run an update, every updated row gets a new version, and the old version stays behind as a dead row. An update that touches every row can therefore roughly double the size of the table until the dead rows are vacuumed.
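
The bloat described above can be sketched with a toy model. This is not how Postgres stores data; `ToyTable` and its methods are hypothetical names that only mimic the idea that an update adds a new row version and leaves the old one behind until a vacuum runs.

```python
class ToyTable:
    """Toy MVCC-style table: updates append versions instead of overwriting."""

    def __init__(self, rows):
        # Each stored version is [value, is_live].
        self.versions = [[value, True] for value in rows]

    def update_all(self, fn):
        """Write a new version of every live row; old versions become dead."""
        new_versions = []
        for version in self.versions:
            if version[1]:
                version[1] = False  # the old version is now a dead row
                new_versions.append([fn(version[0]), True])
        self.versions.extend(new_versions)

    def vacuum(self):
        """Drop dead versions so their space can be reused."""
        self.versions = [v for v in self.versions if v[1]]

    def physical_size(self):
        return len(self.versions)

    def live_rows(self):
        return [v[0] for v in self.versions if v[1]]


table = ToyTable([1, 2, 3])
size_before = table.physical_size()
table.update_all(lambda value: value * 10)
size_after_update = table.physical_size()
table.vacuum()
size_after_vacuum = table.physical_size()
print(size_before, size_after_update, size_after_vacuum)
```

In this toy model the vacuum shrinks the table immediately. In real Postgres a plain `VACUUM` generally marks the dead space as reusable rather than handing it back to the operating system.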

Joins using WHERE clause vs ON clause

Hive and Postgres handle joins written with a WHERE clause and joins written with an ON clause differently. Postgres' query planner is smarter: it handles WHERE clause joins and ON clause joins the same way. In Hive, the WHERE clause version was more efficient than the ON clause version.

Stats:

Hive:

On clause: In stage 1, pulls in ~400MM records; takes ~13 minutes to execute

Where clause: In stage 1, pulls in ~60MM records; takes ~5 minutes to execute
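
As an illustration of the two query shapes, here is a small sketch using Python's built-in `sqlite3` module. It assumes that "WHERE clause join" means putting the join condition in the WHERE clause; the tables and data are hypothetical. It only shows that both shapes return the same rows for an inner join, and says nothing about how Hive or Postgres executes them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10), (2, 11), (3, 10);
    INSERT INTO customers VALUES (10, 'east'), (11, 'west');
""")

# Join condition in the ON clause.
on_query = """
    SELECT o.id, c.region
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    ORDER BY o.id
"""

# Same join, with the condition in the WHERE clause.
where_query = """
    SELECT o.id, c.region
    FROM orders o, customers c
    WHERE o.customer_id = c.id
    ORDER BY o.id
"""

on_rows = conn.execute(on_query).fetchall()
where_rows = conn.execute(where_query).fetchall()
print(on_rows == where_rows)
```

Because the result sets match, any difference between the two forms comes down to how the engine plans the query, which is what the stats above measure for Hive.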

@sanp
sanp / 2pa.md
Last active August 23, 2019 18:51

Second Price Auctions

Overview

  • Second price auctions (2PA) are a type of auction where the highest bidder pays the second highest bid
  • This contrasts with first price auctions (FPA), where the highest bidder pays her own bid
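
The two pricing rules can be sketched in a few lines of Python. The `run_auction` function and the bidder names are hypothetical, and the sketch assumes sealed bids with no ties for the top bid.

```python
def run_auction(bids, second_price=True):
    """Return (winner, price paid) for a sealed-bid auction.

    bids maps bidder name to bid amount. With second_price=True the
    winner pays the second highest bid; otherwise she pays her own bid.
    """
    if len(bids) < 2:
        raise ValueError("need at least two bids")
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner, top_bid = ranked[0]
    second_bid = ranked[1][1]
    return winner, second_bid if second_price else top_bid


bids = {"alice": 10, "bob": 7, "carol": 4}
print(run_auction(bids, second_price=True))
print(run_auction(bids, second_price=False))
```

The winner is the same under both rules; only the price she pays changes.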

This talk goes over:

  • Why 2PA work better than FPA