sort -k3 -r
from pyspark.context import SparkContext
from pyspark.serializers import BatchedSerializer, PickleSerializer
from pyspark.rdd import RDD
from py4j.java_gateway import java_import

class CassandraSparkContext(SparkContext):
    def _do_init(self, *args, **kwargs):
        # The gist is truncated here; presumably the override runs the
        # standard init and then imports the Cassandra connector's Java
        # classes on the gateway (the package name below is an assumption).
        SparkContext._do_init(self, *args, **kwargs)
        java_import(self._jvm, "com.datastax.spark.connector.*")
install.packages("jsonlite", dependencies = TRUE) | |
install.packages("RCurl", dependencies = TRUE) | |
library("jsonlite") | |
library("RCurl") | |
base_url <- "https://api.parsely.com/v2" | |
apikey <- "computerworld.com" | |
api_secret <- "YOUR SECRET KEY" |
#!/usr/bin/env bash
# Hitting CTRL-C kills the Django server as well as all tunnels that were created

TUNNEL_PIDS=()

function kill_tunnels() {
    for tunnel_pid in "${TUNNEL_PIDS[@]}"
    do
        kill "$tunnel_pid"
    done
}

# The rest of the script is truncated; presumably it registers the handler
# and records each tunnel's PID along these lines:
#   trap kill_tunnels INT TERM EXIT
#   ssh -N -L "$LOCAL_PORT:$REMOTE_HOST:$REMOTE_PORT" "$SSH_HOST" &
#   TUNNEL_PIDS+=($!)
The existing examples for this are good, but they miss a pretty critical observation: the number of partitions and how it affects things.
Assume we have the following script, aggregate_by_key.py:
import pprint
from pyspark.context import SparkContext
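
Only the imports of aggregate_by_key.py survive above, so here is a minimal sketch of the point it makes; the sample pairs and the partition count of 3 are assumptions:

import pprint

from pyspark.context import SparkContext

sc = SparkContext()

# Assumed sample data, spread over an explicit number of partitions.
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2), ("b", 3), ("a", 4)],
                       numSlices=3)

# aggregateByKey(zero, seq_func, comb_func): seq_func folds values into an
# accumulator within each partition; comb_func then merges the per-partition
# accumulators. How records land across partitions therefore determines how
# many times each function runs.
totals = pairs.aggregateByKey(0,
                              lambda acc, value: acc + value,
                              lambda left, right: left + right)

pprint.pprint(totals.collect())

Re-running with a different numSlices shifts how much work happens in the seq function versus the comb function, even though the collected result is identical.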
import datetime as dt
import pprint
import pytz

print(pytz.__version__)
# '2015.4'

timezone = pytz.timezone('Europe/London')
tmsp = dt.datetime(2015, 3, 29, 1, tzinfo=pytz.utc)

# The snippet is truncated here; 01:00 UTC on 2015-03-29 is the instant
# Europe/London switches to BST, so the conversion crosses the DST gap:
print(tmsp.astimezone(timezone))  # 2015-03-29 02:00:00+01:00
Keep in mind that our use case is largely time-series analytics, but the broad themes of the issues we encountered were:
- Realtime indexing + querying is tough. It required us to throw beefed-up dedicated hardware at that problem while we served historical queries on nodes with a different config (the typical hot/warm/cold node configuration).
- As always, skewed data sets require special consideration in index and document schema modelling.
- JVM heap, aggregation-query, and doc-mapping optimization are needed, or you'll easily hit OOMs on nodes, which can lead to...
- Bad failure scenarios where an entire cluster is brought to a halt and no queries can be served. Literally one bad, greedy query can put your node, and your cluster, in a very bad state.
- Depending on your document mapping, disk storage requirements can easily bite you, but they are made better by https://www.elastic.co/blog/store-compression-in-lucene-and-elasticsearch (see the sketch below).
+1 to the ES team though; they do listen to and fix issues quickly. Moving to doc values as the default is a good example.
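
For reference, the store-compression option from that post boils down to a single index setting. A minimal sketch of enabling it at index-creation time, assuming the requests library, a local node, and a hypothetical index name:

import requests

# Create an index whose stored fields use DEFLATE ("best_compression",
# available since Elasticsearch 2.0) instead of the default LZ4.
resp = requests.put(
    "http://localhost:9200/metrics-2015.06",  # hypothetical index name
    json={"settings": {"index.codec": "best_compression"}},
)
print(resp.json())

The trade-off is the usual one: smaller segments on disk in exchange for a bit more CPU when stored fields are read.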
Update: please note that I have since switched to using a set of bash scripts instead of polluting the Git repository with git svn.
Author: Kaspars Dambis
kaspars.net / @konstruktors
from collections import defaultdict

try:
    import cStringIO as StringIO
except ImportError:
    import StringIO

class EscapedLineReader(object):
    """Custom reader for files where we could have escaped new lines.

    (The gist is truncated here; presumably lines that end with an escape
    character are joined with the line that follows.)
    """
# Clone Dan's fork of conda-env
git clone https://github.com/dan-blanchard/conda-env.git

# Install the fork of conda-env
cd conda-env
git checkout feature/pip_requirements.txt
conda uninstall --yes conda-env
python setup.py develop