# Standalone Spark 2.0.0 with s3
### Tested with:
- Spark 2.0.0 pre-built for Hadoop 2.7
- Mac OS X 10.11
- Python 3.5.2
### Goal
Use s3 within pyspark with minimal hassle.
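A minimal sketch of what this looks like in practice (the bucket, path, credentials, and the exact `hadoop-aws` version below are placeholders of mine, chosen to match Spark 2.0.0 pre-built for Hadoop 2.7, not part of the original setup):

```python
# Start pyspark with the S3A connector on the classpath, e.g.:
#   pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3
# (the hadoop-aws version should match the Hadoop build Spark was compiled against)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-example").getOrCreate()
sc = spark.sparkContext

# Credentials are placeholders; in practice they usually come from the
# environment or an IAM role rather than being hard-coded.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_AWS_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_AWS_SECRET_KEY")

# Read a text file from a hypothetical bucket and count its lines.
rdd = sc.textFile("s3a://my-bucket/some/path/data.txt")
print(rdd.count())
```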
#!/bin/bash
src='/usr/src'
# Get current version
cur_ver="linux-`uname -r`"
# Get new version
ver=`ls $src | grep linux- | sort -V | tail -1`
by Bjørn Friese
Beautiful is better than ugly. Explicit is better than implicit.
I frequently deal with collections of things in the programs I write. Collections of droids, Jedi, planets, lightsabers, starfighters, etc. When programming in Python, these collections of things are usually represented as lists, sets and dictionaries. Oftentimes, what I want to do with collections is to transform them in various ways. Comprehensions are a powerful syntax for doing just that. I use them extensively, and they're one of the things that keep me coming back to Python. Let me show you a few examples of the incredible usefulness of comprehensions.
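To make this concrete, here is a tiny illustration of my own (the data is made up):

```python
droids = ['R2-D2', 'C-3PO', 'BB-8', 'IG-88']

# List comprehension: transform every element.
lowercase = [d.lower() for d in droids]      # ['r2-d2', 'c-3po', 'bb-8', 'ig-88']

# Set comprehension: collect distinct derived values.
name_lengths = {len(d) for d in droids}      # {4, 5}

# Dict comprehension: build a mapping, with an optional filter.
short_names = {d: len(d) for d in droids if len(d) < 5}   # {'BB-8': 4}
```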
from __future__ import print_function
import ast

def recurse(node):
    # Walk the parsed expression, parenthesising * and / sub-expressions.
    if isinstance(node, ast.BinOp):
        if isinstance(node.op, ast.Mult) or isinstance(node.op, ast.Div):
            print('(', end='')
        recurse(node.left)
        recurse(node.op)
        recurse(node.right)
        if isinstance(node.op, ast.Mult) or isinstance(node.op, ast.Div):
            print(')', end='')
    # (the original snippet is truncated here; the handling of literals and
    # operator nodes such as ast.Num and ast.Add is not shown)
import logging
import uuid
import time

from mesos.interface import Scheduler
from mesos.native import MesosSchedulerDriver
from mesos.interface import mesos_pb2

logging.basicConfig(level=logging.INFO)
STARTUP_MSG: java = 1.8.0_25
************************************************************/
14/10/28 06:27:07 INFO mapred.JobTracker: registered UNIX signal handlers for [TERM, HUP, INT]
14/10/28 06:27:08 FATAL mapred.JobTracker: java.lang.IllegalArgumentException: Does not contain a valid host:port authority: local
	at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:211)
	at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:163)
	at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:152)
	at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:2165)
	at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1764)
	at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1757)
I've been using the Anaconda python package from continuum.io recently and found it to be a good way to get all the complex compiled libs you need for a scientific python environment. Even better, their conda tool lets you create environments much like virtualenv, but without having to re-compile stuff like numpy, which gets old very very quickly with virtualenv and can be a nightmare to get correctly set up on OSX.
The only thing missing was an easy way to switch environments - their docs suggest running python executables from the install folder, which I find a bit of a pain. Coincidentally I came across this article - Virtualenv's bin/activate is Doing It Wrong - which describes a simple way to launch a sub-shell with certain environment variables set. Now, simple was the key word for me since my bash-fu isn't very strong, but I managed to come up with the script below. Put this in a text file called conda-work:
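The script itself is not reproduced here, but a rough sketch of the sub-shell approach might look like the following (the `~/anaconda` install path and the `CONDA_DEFAULT_ENV` variable are my assumptions, not necessarily what the original script used):

```bash
#!/bin/bash
# Rough sketch: start a sub-shell whose PATH points at a chosen Anaconda
# environment. Usage: conda-work <env-name>

ENV_NAME="$1"
ENV_BIN="$HOME/anaconda/envs/$ENV_NAME/bin"

if [ ! -d "$ENV_BIN" ]; then
    echo "No such environment: $ENV_NAME" >&2
    exit 1
fi

# Prepend the environment's bin directory and replace this process with an
# interactive shell; exiting that shell drops you back into the old environment.
PATH="$ENV_BIN:$PATH" CONDA_DEFAULT_ENV="$ENV_NAME" exec "$SHELL" -i
```

Make it executable with `chmod +x conda-work` and keep it somewhere on your PATH.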