brew install apache-spark
# spark wants scala
brew install scala
# Optional: for notebooks
pip install --upgrade pip
pip install ipython
pip install jupyter
Add the following to .bash_profile:
alias sparkMasterStart="/usr/local/Cellar/apache-spark/1.6.1/libexec/sbin/start-master.sh"
alias sparkMasterStop="/usr/local/Cellar/apache-spark/1.6.1/libexec/sbin/stop-master.sh"
alias sparkSlaveStart="/usr/local/Cellar/apache-spark/1.6.1/libexec/sbin/start-slave.sh"
alias sparkSlaveStop="/usr/local/Cellar/apache-spark/1.6.1/libexec/sbin/stop-slave.sh"
brew install hadoop
Add the following to .bash_profile:
alias hadoopStart="/usr/local/Cellar/hadoop/2.7.2/sbin/start-dfs.sh;/usr/local/Cellar/hadoop/2.7.2/sbin/start-yarn.sh"
alias hadoopStop="/usr/local/Cellar/hadoop/2.7.2/sbin/stop-yarn.sh;/usr/local/Cellar/hadoop/2.7.2/sbin/stop-dfs.sh"
- NameNode: http://localhost:50070
- ResourceManager: http://localhost:8088
- Specific Node Information: http://localhost:8042
Basics:
# Start all the things
hadoopStart
sparkMasterStart
sparkSlaveStart
# Submit an application to spark
spark-submit --name test --master spark://Kyles-Mac-mini.local:7077 hello-world.py
# Check out what's going on
open http://kyles-mac-mini.local:8080/
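For reference, hello-world.py above can be as small as this (a sketch; the app name and the trivial job are made up, it just proves the cluster runs something):
from pyspark import SparkConf, SparkContext

# Trivial job: sum the numbers 0..99 on the cluster
conf = SparkConf().setAppName("hello-world")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).sum())
sc.stop()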
S3:
The following is just a generality. If you've configured hadoop-env.sh and core-site.xml to use S3, you should be able to do something like:
hdfs dfs -ls s3n://kyleparisi/
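For reference, the s3n part of core-site.xml looks something like this (a sketch using Hadoop 2.7's s3n property names; the values are obviously placeholders):
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>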
When you submit a job, tail these logs for good/bad news:
tail -f -n 0 /usr/local/Cellar/hadoop/2.7.2/libexec/logs/hadoop-kyleparisi-*.log
If you see:
ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
recover with:
hadoopStop
# Clear out hadoop's tmp dir (probably only advisable during first setup)
rm -rf /usr/local/Cellar/hadoop/hdfs/tmp/*
# Remake the tmp structure
hdfs namenode -format
hadoopStart
tail -f -n 0 /usr/local/Cellar/hadoop/2.7.2/libexec/logs/hadoop-kyleparisi-*.log
# Submit job again
- INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000
- Your tmp dir is screwed up; rm -rf /usr/local/Cellar/hadoop/hdfs/tmp/* and hdfs namenode -format again.
- WARN client.AppClient$ClientEndpoint: Failed to connect to master
- Make sure the master URL is exactly what http://localhost:8080 says it is.
- java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
- This one's terrible... brew's Spark ships with an old version of jackson-core/jackson-databind. Update them.
- py4j.protocol.Py4JJavaError: An error occurred while calling o12.textFile.
  : java.lang.NoSuchMethodError: scala.util.matching.Regex.unapplySeq(Ljava/lang/CharSequence;)Lscala/Option;
- This is where you give up. There is probably a version mismatch somewhere; in any case, it's been a waste of time.
Hive:
Same idea as mysql:
hive
show tables;
From pyspark:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("Test")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
print(sqlContext.tableNames())
From here you can just call sqlContext.sql(" ") with normal queries.
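For instance (the table name here is made up):
df = sqlContext.sql("SELECT * FROM some_table LIMIT 10")
df.show()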
What are DataFrames & DataSets?
- A Dataset is a distributed collection of data.
- A DataFrame is a Dataset organized into named columns.
In Scala (Spark 2.x style):
val df = spark.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
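The pyspark 1.6 equivalent goes through sqlContext instead (assuming the same people.json):
df = sqlContext.read.json("examples/src/main/resources/people.json")
df.show()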
What is RDD?
- Resilient Distributed Datasets
- An immutable distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
rdd = sc.textFile(file_name)  # file_name: a local path or an hdfs:// / s3n:// URI
print(rdd.take(1))
# [u'Campaign ID,Ad group ID,Day,Clicks,Cost,Impressions,Conversions,Total conv. value,Device']
What are partitions?
- The chunks an RDD is split into so it can be spread across nodes (see the sketch below)
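A quick way to look at them (a sketch; the partition count is arbitrary):
rdd = sc.parallelize(range(10), 4)  # explicitly ask for 4 partitions
print(rdd.getNumPartitions())       # 4
print(rdd.glom().collect())         # which elements landed in which partition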
What is pandas?
- A Python data science library for manipulating its own single-machine DataFrames (not the same thing as Spark's; see below)
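The two meet via toPandas(), which pulls a Spark DataFrame down to the driver as a pandas one, so it's only for small results (assumes pandas is installed and df is any Spark DataFrame):
pdf = df.toPandas()   # everything is collected to driver memory
print(pdf.describe())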
Is adding packages a possibility in my env?
- No
What is more desirable, RDDs or DataFrames?
- RDDs are flexible in that you can apply arbitrary Python functions to them (see the sketch after this list)
- DataFrames are limited to the pyspark DataFrame API
- RDDs : sentences :: DataFrames : table
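A sketch of the difference (the file names are made up):
# RDD: any Python function goes
lengths = sc.textFile("ads.csv").map(lambda line: len(line.split(",")))
# DataFrame: you stay inside the DataFrame API
df = sqlContext.read.json("people.json")
df.filter(df.age > 18).select("name").show()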
What is SerDe?
- Serializer and a Deserializer
- Provides a way for data to move between Java objects and its stored format in HDFS (Hive is the arbitrator)
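For example, a table backed by Hive's built-in CSV SerDe (a sketch; the table and column names are made up):
sqlContext.sql("""
    CREATE TABLE raw_clicks (campaign_id STRING, clicks STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
""")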
What is a Partition?
- Basically a subdirectory of the data to make queries on a table more efficient
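For example (a sketch; the names are made up), each value of day becomes its own subdirectory under the table's location, so a query filtering on day only scans that subdirectory:
sqlContext.sql("CREATE TABLE clicks (ad_group_id INT, clicks INT) PARTITIONED BY (day STRING)")
# .../warehouse/clicks/day=2016-05-01/ holds only that day's files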