Mike Sukmanowsky (msukmanowsky)
Building the future of how companies work with elvex!
@msukmanowsky
msukmanowsky / pyspark_cassandra.py
Last active August 29, 2015 14:08
Work in progress ideas for a PySpark binding to the DataStax Cassandra-Spark Connector.
from pyspark.context import SparkContext
from pyspark.serializers import BatchedSerializer, PickleSerializer
from pyspark.rdd import RDD
from py4j.java_gateway import java_import
class CassandraSparkContext(SparkContext):
    def _do_init(self, *args, **kwargs):
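The preview above stops at the method signature. A hedged sketch of the idea the gist is exploring (the connector package name and the super() call pattern are assumptions, not taken from the gist):

from py4j.java_gateway import java_import
from pyspark.context import SparkContext


class CassandraSparkContext(SparkContext):
    """A SparkContext that also exposes the Cassandra connector's JVM classes."""

    def _do_init(self, *args, **kwargs):
        # Let the stock SparkContext finish wiring up the Py4J gateway first.
        super(CassandraSparkContext, self)._do_init(*args, **kwargs)
        # Then make the connector's Scala/Java API visible to Py4J so Python
        # helpers (e.g. a cassandraTable() method) could call into it.
        java_import(self._jvm, "com.datastax.spark.connector.*")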
install.packages("jsonlite", dependencies = TRUE)
install.packages("RCurl", dependencies = TRUE)
library("jsonlite")
library("RCurl")
base_url <- "https://api.parsely.com/v2"
apikey <- "computerworld.com"
api_secret <- "YOUR SECRET KEY"
#!/usr/bin/env bash
# Hitting CTRL-C kills the Django server as well as all tunnels that were created.
TUNNEL_PIDS=()

function kill_tunnels() {
  for tunnel_pid in "${TUNNEL_PIDS[@]}"
  do
    kill "$tunnel_pid"
  done
}
# kill_tunnels is meant to be hooked up to CTRL-C, e.g. with: trap kill_tunnels SIGINT

Basics

Sort the output of a command

By 3rd column (1-indexed) in reverse order

sort -k3 -r

Spark / PySpark aggregateByKey Example

The existing examples for this are good, but they miss a pretty critical observation: the number of partitions and how it affects the results.

Assume we have the following script, aggregate_by_key.py:

import pprint
from pyspark.context import SparkContext
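The preview shows only the opening imports, so here is a minimal sketch (illustrative data and function names, not the original script) of the kind of experiment that exposes the partition count's role:

import pprint
from pyspark.context import SparkContext

sc = SparkContext(appName="aggregate_by_key_demo")

data = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5)]


def seq_op(acc, value):
    # Called once per value, always within a single partition.
    acc.append(value)
    return acc


def comb_op(left, right):
    # Called only when a key's values are spread across multiple partitions.
    left.extend(right)
    return left


for num_partitions in (1, 2, 5):
    rdd = sc.parallelize(data, num_partitions)
    result = rdd.aggregateByKey([], seq_op, comb_op).collectAsMap()
    print(num_partitions)
    pprint.pprint(result)

sc.stop()

With a single partition, comb_op never runs; with more partitions the zero value is applied once per key per partition and comb_op does the cross-partition merge, which is exactly where subtle bugs and ordering differences creep in.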
import datetime as dt
import pprint
import pytz
print(pytz.__version__)
# '2015.4'
timezone = pytz.timezone('Europe/London')
# 2015-03-29 01:00 UTC is the instant the UK switches from GMT to BST (clocks
# jump from 01:00 to 02:00 local time), which makes it a good edge case.
tmsp = dt.datetime(2015, 3, 29, 1, tzinfo=pytz.utc)
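The preview ends here; converting that instant shows the DST handling, e.g. (a usage sketch, not part of the gist):

local = tmsp.astimezone(timezone)
print(local)
# 2015-03-29 02:00:00+01:00 (already BST, not 01:00 GMT)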

Keep in mind, our use case is largely timeseries analytics, but the broad themes of the issues we encountered were:

  1. Realtime indexing + querying is tough. It required us to throw beefed-up, dedicated hardware at that problem while we served historical queries on nodes with a different config (the typical hot/warm/cold node configuration).
  2. As always, skewed data sets require special consideration in index and document schema modelling.
  3. JVM heap, aggregation-query and doc-mapping optimization is needed or you'll easily hit OOMs on nodes, which can lead to...
  4. Bad failure scenarios where an entire cluster is brought to a halt and no queries can be served. Literally one bad, greedy query can put your node and cluster in a very bad state.
  5. Depending on your document mapping, disk storage requirements can easily bite you, but are made better by https://www.elastic.co/blog/store-compression-in-lucene-and-elasticsearch (see the sketch after this list).
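For point 5, the mitigation referenced in that post is the best_compression stored-fields codec; a hedged sketch of turning it on at index-creation time (index name and client setup are made up for illustration):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# index.codec can only be set when the index is created (or while it is closed).
es.indices.create(
    index="pageviews-2015.06",
    body={
        "settings": {
            "index.codec": "best_compression"  # trade a little CPU for smaller stored fields
        }
    },
)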

+1 to the ES team though; they do listen to and fix issues quickly. Moving to doc values as the d

@msukmanowsky
msukmanowsky / wordpress-plugin-svn-to-git.md
Created November 26, 2015 15:17 — forked from kasparsd/wordpress-plugin-svn-to-git.md
Using Git with Subversion Mirroring for WordPress Plugin Development
from collections import defaultdict

try:
    import cStringIO as StringIO
except ImportError:
    import StringIO


class EscapedLineReader(object):
    """Custom reader for files where we could have escaped new lines."""
@msukmanowsky
msukmanowsky / install-forked-conda-env.sh
Created March 31, 2016 14:24
Install a forked version of conda-env which falls back to PyPI for requirements and supports -e editable requirements.
# Clone Dan's fork of conda-env
git clone https://github.com/dan-blanchard/conda-env.git
# Install the fork of conda-env
cd conda-env
git checkout feature/pip_requirements.txt
conda uninstall --yes conda-env
python setup.py develop