Mike Sukmanowsky (msukmanowsky)
Building the future of how companies work with elvex!
@msukmanowsky
msukmanowsky / pyspark_cassandra.py
Last active August 29, 2015 14:08
Work in progress ideas for a PySpark binding to the DataStax Cassandra-Spark Connector.
from pyspark.context import SparkContext
from pyspark.serializers import BatchedSerializer, PickleSerializer
from pyspark.rdd import RDD
from py4j.java_gateway import java_import
class CassandraSparkContext(SparkContext):
    def _do_init(self, *args, **kwargs):
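The preview above stops at the method signature. A hedged sketch of the idea the gist is exploring (the connector package name and the super() call pattern are assumptions, not taken from the gist):

from py4j.java_gateway import java_import
from pyspark.context import SparkContext


class CassandraSparkContext(SparkContext):
    """A SparkContext that also exposes the Cassandra connector's JVM classes."""

    def _do_init(self, *args, **kwargs):
        # Let the stock SparkContext finish wiring up the Py4J gateway first.
        super(CassandraSparkContext, self)._do_init(*args, **kwargs)
        # Then make the connector's Scala/Java API visible to Py4J so Python
        # helpers (e.g. a cassandraTable() method) could call into it.
        java_import(self._jvm, "com.datastax.spark.connector.*")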
install.packages("jsonlite", dependencies = TRUE)
install.packages("RCurl", dependencies = TRUE)
library("jsonlite")
library("RCurl")
base_url <- "https://api.parsely.com/v2"
apikey <- "computerworld.com"
api_secret <- "YOUR SECRET KEY"
#!/usr/bin/env bash
# Hitting CTRL-C kills the Django server as well as all tunnels that were created.
TUNNEL_PIDS=()

function kill_tunnels() {
  for tunnel_pid in "${TUNNEL_PIDS[@]}"
  do
    kill "$tunnel_pid"
  done
}
# kill_tunnels is meant to be hooked up to CTRL-C, e.g. with: trap kill_tunnels SIGINT

Basics

Sort the output of a command

By 3rd column (1-indexed) in reverse order

sort -k3 -r

Spark / PySpark aggregateByKey Example

The existing examples for this are good, but they miss a pretty critical observation: the number of partitions and how it affects the results.

Assume we have the following script, aggregate_by_key.py:

import pprint
from pyspark.context import SparkContext
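The preview shows only the opening imports, so here is a minimal sketch (illustrative data and function names, not the original script) of the kind of experiment that exposes the partition count's role:

import pprint
from pyspark.context import SparkContext

sc = SparkContext(appName="aggregate_by_key_demo")

data = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5)]


def seq_op(acc, value):
    # Called once per value, always within a single partition.
    acc.append(value)
    return acc


def comb_op(left, right):
    # Called only when a key's values are spread across multiple partitions.
    left.extend(right)
    return left


for num_partitions in (1, 2, 5):
    rdd = sc.parallelize(data, num_partitions)
    result = rdd.aggregateByKey([], seq_op, comb_op).collectAsMap()
    print(num_partitions)
    pprint.pprint(result)

sc.stop()

With a single partition, comb_op never runs; with more partitions the zero value is applied once per key per partition and comb_op does the cross-partition merge, which is exactly where subtle bugs and ordering differences creep in.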
import datetime as dt
import pprint
import pytz
print(pytz.__version__)
# '2015.4'
timezone = pytz.timezone('Europe/London')
# 2015-03-29 01:00 UTC is the instant the UK switches from GMT to BST (clocks
# jump from 01:00 to 02:00 local time), which makes it a good edge case.
tmsp = dt.datetime(2015, 3, 29, 1, tzinfo=pytz.utc)
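The preview ends here; converting that instant shows the DST handling, e.g. (a usage sketch, not part of the gist):

local = tmsp.astimezone(timezone)
print(local)
# 2015-03-29 02:00:00+01:00 (already BST, not 01:00 GMT)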

Keep in mind, our use case is largely timeseries analytics, but the broad themes of the issues we encountered were:

  1. Realtime indexing + querying is tough. It required us to throw beefed-up, dedicated hardware at that problem while we served historical queries on nodes with a different config (the typical hot/warm/cold node configuration).
  2. As always, skewed data sets require special consideration in index and document schema modelling.
  3. JVM heap, aggregation-query and doc-mapping optimization is needed or you'll easily hit OOMs on nodes, which can lead to...
  4. Bad failure scenarios where an entire cluster is brought to a halt and no queries can be served. Literally one bad, greedy query can put your node and cluster in a very bad state.
  5. Depending on your document mapping, disk storage requirements can easily bite you, but are made better by https://www.elastic.co/blog/store-compression-in-lucene-and-elasticsearch (see the sketch after this list).
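For point 5, the mitigation referenced in that post is the best_compression stored-fields codec; a hedged sketch of turning it on at index-creation time (index name and client setup are made up for illustration):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# index.codec can only be set when the index is created (or while it is closed).
es.indices.create(
    index="pageviews-2015.06",
    body={
        "settings": {
            "index.codec": "best_compression"  # trade a little CPU for smaller stored fields
        }
    },
)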

+1 to the ES team though; they do listen to and fix issues quickly. Moving to doc values as the d

@msukmanowsky
msukmanowsky / wordpress-plugin-svn-to-git.md
Created November 26, 2015 15:17 — forked from kasparsd/wordpress-plugin-svn-to-git.md
Using Git with Subversion Mirroring for WordPress Plugin Development
from collections import defaultdict

try:
    import cStringIO as StringIO
except ImportError:
    import StringIO


class EscapedLineReader(object):
    """Custom reader for files where we could have escaped new lines."""
@msukmanowsky
msukmanowsky / install-forked-conda-env.sh
Created March 31, 2016 14:24
Install a forked version of conda-env which falls back to PyPI for requirements and supports -e editable requirements.
# Clone Dan's fork of conda-env
git clone https://github.com/dan-blanchard/conda-env.git
# Install the fork of conda-env
cd conda-env
git checkout feature/pip_requirements.txt
conda uninstall --yes conda-env
python setup.py develop