brew install apache-spark
# spark wants scala
brew install scala
# Optional: for notebooks
pip install --upgrade pip
pip install ipython
pip install jupyter
Add the following to .bash_profile:
alias sparkMasterStart="/usr/local/Cellar/apache-spark/1.6.1/libexec/sbin/start-master.sh"
alias sparkMasterStop="/usr/local/Cellar/apache-spark/1.6.1/libexec/sbin/stop-master.sh"
alias sparkSlaveStart="/usr/local/Cellar/apache-spark/1.6.1/libexec/sbin/start-slave.sh"
alias sparkSlaveStop="/usr/local/Cellar/apache-spark/1.6.1/libexec/sbin/stop-slave.sh"
brew install hadoop
Add the following to .bash_profile:
alias hadoopStart="/usr/local/Cellar/hadoop/2.7.2/sbin/start-dfs.sh;/usr/local/Cellar/hadoop/2.7.2/sbin/start-yarn.sh"
alias hadoopStop="/usr/local/Cellar/hadoop/2.7.2/sbin/stop-yarn.sh;/usr/local/Cellar/hadoop/2.7.2/sbin/stop-dfs.sh"
- NameNode: http://localhost:50070
- ResourceManager: http://localhost:8088
- Specific Node Information: http://localhost:8042
Basics:
# Start all the things
hadoopStart
sparkMasterStart
sparkSlaveStart
# Submit an application to spark
spark-submit --name test --master spark://Kyles-Mac-mini.local:7077 hello-world.py
# Check out what's going on
open http://kyles-mac-mini.local:8080/
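For reference, hello-world.py above can be as small as this (a sketch; the app name and the trivial job are made up, it just proves the cluster runs something):
from pyspark import SparkConf, SparkContext

# Trivial job: sum the numbers 0..99 on the cluster
conf = SparkConf().setAppName("hello-world")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).sum())
sc.stop()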
S3:
The following is just a generality. If you've configured hadoop-env.sh and core-site.xml to use S3, you should be able to do something like:
hdfs dfs -ls s3n://kyleparisi/
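For reference, the s3n part of core-site.xml looks something like this (a sketch using Hadoop 2.7's s3n property names; the values are obviously placeholders):
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>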
When you submit a job, tail these logs for good/bad news:
tail -f -n 0 /usr/local/Cellar/hadoop/2.7.2/libexec/logs/hadoop-kyleparisi-*.log
If you see:
ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
recover with:
hadoopStop
# Clear out hadoop's tmp dir (probably only advisable during first setup)
rm -rf /usr/local/Cellar/hadoop/hdfs/tmp/*
# Remake the tmp structure
hdfs namenode -format
hadoopStart
tail -f -n 0 /usr/local/Cellar/hadoop/2.7.2/libexec/logs/hadoop-kyleparisi-*.log
# Submit job again
- INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000
- Your tmp dir is screwed up; rm -rf /usr/local/Cellar/hadoop/hdfs/tmp/* and hdfs namenode -format again.
- WARN client.AppClient$ClientEndpoint: Failed to connect to master
- Make sure the master URL is exactly what http://localhost:8080 says it is.
- java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
- This one's terrible... brew's Spark ships with an old version of jackson-core/jackson-databind. Update them.
- py4j.protocol.Py4JJavaError: An error occurred while calling o12.textFile.
  : java.lang.NoSuchMethodError: scala.util.matching.Regex.unapplySeq(Ljava/lang/CharSequence;)Lscala/Option;
- This is where you give up. There is probably a version mismatch somewhere; in any case, it's been a waste of time.
Hive:
Same idea as mysql:
hive
show tables;
From pyspark:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("Test")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
print(sqlContext.tableNames())
From here you can just call sqlContext.sql(" ") with normal queries.
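For instance (the table name here is made up):
df = sqlContext.sql("SELECT * FROM some_table LIMIT 10")
df.show()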
What are DataFrames & DataSets?
- A Dataset is a distributed collection of data.
- A DataFrame is a Dataset organized into named columns.
In Scala (Spark 2.x style):
val df = spark.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
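The pyspark 1.6 equivalent goes through sqlContext instead (assuming the same people.json):
df = sqlContext.read.json("examples/src/main/resources/people.json")
df.show()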
What is RDD?
- Resilient Distributed Datasets
- An immutable distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
rdd = sc.textFile(file_name)  # file_name: a local path or an hdfs:// / s3n:// URI
print(rdd.take(1))
# [u'Campaign ID,Ad group ID,Day,Clicks,Cost,Impressions,Conversions,Total conv. value,Device']
What are partitions?
- The chunks an RDD is split into so it can be spread across nodes (see the sketch below)
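A quick way to look at them (a sketch; the partition count is arbitrary):
rdd = sc.parallelize(range(10), 4)  # explicitly ask for 4 partitions
print(rdd.getNumPartitions())       # 4
print(rdd.glom().collect())         # which elements landed in which partition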
What is pandas?
- A Python data science library for manipulating its own single-machine DataFrames (not the same thing as Spark's; see below)
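The two meet via toPandas(), which pulls a Spark DataFrame down to the driver as a pandas one, so it's only for small results (assumes pandas is installed and df is any Spark DataFrame):
pdf = df.toPandas()   # everything is collected to driver memory
print(pdf.describe())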
Is adding packages a possibility in my env?
- No
What is more desirable, RDDs or DataFrames?
- RDDs are flexible in that you can apply arbitrary Python functions to them (see the sketch after this list)
- DataFrames are limited to the pyspark DataFrame API
- RDDs : sentences :: DataFrames : table
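A sketch of the difference (the file names are made up):
# RDD: any Python function goes
lengths = sc.textFile("ads.csv").map(lambda line: len(line.split(",")))
# DataFrame: you stay inside the DataFrame API
df = sqlContext.read.json("people.json")
df.filter(df.age > 18).select("name").show()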
What is SerDe?
- Serializer and a Deserializer
- Provides a way for data to move between Java objects and its stored format in HDFS (Hive is the arbitrator)
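For example, a table backed by Hive's built-in CSV SerDe (a sketch; the table and column names are made up):
sqlContext.sql("""
    CREATE TABLE raw_clicks (campaign_id STRING, clicks STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
""")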
What is a Partition?
- Basically a subdirectory of the data to make queries on a table more efficient
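For example (a sketch; the names are made up), each value of day becomes its own subdirectory under the table's location, so a query filtering on day only scans that subdirectory:
sqlContext.sql("CREATE TABLE clicks (ad_group_id INT, clicks INT) PARTITIONED BY (day STRING)")
# .../warehouse/clicks/day=2016-05-01/ holds only that day's files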