@aslotnick
aslotnick / point_in_polygon.sql
Last active February 6, 2018 17:28
Simple point-in-polygon UDF for Amazon Redshift based on http://geospatialpython.com/2011/08/point-in-polygon-2-on-line.html
CREATE FUNCTION point_in_polygon(point_x float, point_y float, polygon_wkt varchar(max))
RETURNS boolean IMMUTABLE AS
$$
### begin section copied from http://geospatialpython.com/2011/08/point-in-polygon-2-on-line.html (I modified it to return a boolean)
# Improved point in polygon test which includes edge
# and vertex points
def point_in_poly(x,y,poly):
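
The preview stops at the function signature, so the body is not shown here. As a rough illustration only, and not the gist's or the linked article's actual code, a ray-casting test that also counts edge and vertex points as inside could be sketched as below. The names `on_segment` and `point_in_poly_sketch` are hypothetical, the polygon is assumed to be a list of `(x, y)` tuples already parsed out of the WKT string, and the `eps` tolerance is an arbitrary choice.

```python
def on_segment(px, py, ax, ay, bx, by, eps=1e-12):
    """Return True if point (px, py) lies on the segment from (ax, ay) to (bx, by)."""
    # The cross product is zero when the three points are collinear.
    cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    if abs(cross) > eps:
        return False
    # Collinear: the point must also fall inside the segment's bounding box.
    return (min(ax, bx) - eps <= px <= max(ax, bx) + eps
            and min(ay, by) - eps <= py <= max(ay, by) + eps)


def point_in_poly_sketch(x, y, poly):
    """Ray-casting point-in-polygon test; edge and vertex points count as inside.

    poly is assumed to be a list of (x, y) tuples describing a simple polygon.
    """
    inside = False
    n = len(poly)
    for i in range(n):
        ax, ay = poly[i]
        bx, by = poly[(i + 1) % n]  # wrap around to close the ring
        if on_segment(x, y, ax, ay, bx, by):
            return True
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (ay > y) != (by > y):
            x_cross = ax + (y - ay) * float(bx - ax) / (by - ay)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_poly_sketch(2, 2, square))  # interior point
print(point_in_poly_sketch(4, 2, square))  # point on an edge
print(point_in_poly_sketch(5, 2, square))  # exterior point
```

Checking the boundary explicitly before counting crossings keeps the edge and vertex behaviour predictable, at the cost of one extra test per edge.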
aslotnick / Strata+Hadoop World 2016 Notes.md
Last active November 16, 2016 21:34
Strata+Hadoop World 2016 Notes

File format benchmark: Avro, JSON, ORC, and Parquet (slides: https://cdn.oreillystatic.com/en/assets/1/event/160/File%20format%20benchmark_%20Avro,%20JSON,%20ORC,%20and%20Parquet%20Presentation%201.pptx)

  • ORC has some built-in tuning for better performance with double and timestamp types
  • Both ORC and Parquet support predicate pushdown
  • Avro was a good choice for very wide tables with lots of text fields
  • For future investigation: look into “schema evolution” for both columnar formats
  • Snappy compresses and decompresses faster than Zlib, but its output takes more disk space
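
To make the predicate pushdown bullet concrete: both formats store min/max statistics per chunk of rows (stripes in ORC, row groups in Parquet), so a reader can skip chunks that cannot match a filter. The toy sketch below imitates that idea in plain Python. It does not use either format's real API, and `make_row_groups` and `scan_with_pushdown` are hypothetical names chosen for illustration.

```python
def make_row_groups(values, size):
    """Split values into fixed-size chunks, recording min/max statistics for each."""
    groups = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups


def scan_with_pushdown(row_groups, lo, hi):
    """Return values in [lo, hi] and the number of row groups actually read."""
    matched = []
    groups_read = 0
    for group in row_groups:
        # The statistics alone can rule out the whole group without reading it.
        if group["max"] < lo or group["min"] > hi:
            continue
        groups_read += 1
        matched.extend(v for v in group["values"] if lo <= v <= hi)
    return matched, groups_read


groups = make_row_groups(list(range(100)), 10)
matched, groups_read = scan_with_pushdown(groups, 42, 57)
print(len(groups), groups_read, matched)
```

In this toy data only the groups whose ranges overlap the filter are read. Real files benefit most when the filtered column is sorted or clustered, since that keeps each chunk's min/max range narrow.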

Data science at eHarmony: A generalized framework for personalization