Self-exploration of data science

Kyle's Data Science

tl;dr: don't set up locally, skip to the Q/A...

Set up Spark locally (with IPython)

gist

brew install apache-spark
# spark wants scala
brew install scala

# Optional: for notebooks
pip install --upgrade pip
pip install ipython
pip install jupyter
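To drive pyspark from a notebook instead of the plain REPL, the PYSPARK_DRIVER_PYTHON variables work with Spark 1.x (a sketch; assumes jupyter is on your PATH):

# Launch pyspark inside a Jupyter notebook
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook pyspark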

Add the following to .bash_profile

alias sparkMasterStart="/usr/local/Cellar/apache-spark/1.6.1/libexec/sbin/start-master.sh"
alias sparkMasterStop="/usr/local/Cellar/apache-spark/1.6.1/libexec/sbin/stop-master.sh"
alias sparkSlaveStart="/usr/local/Cellar/apache-spark/1.6.1/libexec/sbin/start-slave.sh"
alias sparkSlaveStop="/usr/local/Cellar/apache-spark/1.6.1/libexec/sbin/stop-slave.sh"
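Note: in Spark 1.6, start-slave.sh expects the master URL as an argument, so the slave alias gets invoked like:

sparkSlaveStart spark://Kyles-Mac-mini.local:7077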

Set up Hadoop locally

tutorial

brew install hadoop

Add the following to .bash_profile

alias hadoopStart="/usr/local/Cellar/hadoop/2.7.2/sbin/start-dfs.sh;/usr/local/Cellar/hadoop/2.7.2/sbin/start-yarn.sh"
alias hadoopStop="/usr/local/Cellar/hadoop/2.7.2/sbin/stop-yarn.sh;/usr/local/Cellar/hadoop/2.7.2/sbin/stop-dfs.sh"
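After hadoopStart, a quick sanity check (jps ships with the JDK):

# NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager should all be running
jps

# HDFS should answer
hdfs dfs -ls /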

Ops

Basics:

# Start all the things
hadoopStart
sparkMasterStart
# start-slave.sh wants the master URL
sparkSlaveStart spark://Kyles-Mac-mini.local:7077

# Add an application to spark
spark-submit --name test --master spark://Kyles-Mac-mini.local:7077 hello-world.py

# Check out what's going on
open http://kyles-mac-mini.local:8080/
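The gist never shows hello-world.py; a minimal PySpark job might look like this (contents are hypothetical):

# hello-world.py - distribute a small dataset and sum it
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("test")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(100))
print rdd.sum()

sc.stop()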

S3:

The following is only a sketch. If you've configured hadoop-env.sh and core-site.xml for S3, you should be able to do something like:

hdfs dfs -ls s3n://kyleparisi/
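For reference, the s3n credentials usually live in core-site.xml as properties like these (values are placeholders):

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>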

Problems

When you submit a job, tail these logs for good/bad news:

tail -f -n 0 /usr/local/Cellar/hadoop/2.7.2/libexec/logs/hadoop-kyleparisi-*.log

  • ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint

If you see that checkpoint error, reset HDFS:

hadoopStop
# Clear out the tmp dir of hadoop (only advised on first setup?)
rm -rf /usr/local/Cellar/hadoop/hdfs/tmp/*

# Remake the tmp structure
hdfs namenode -format

hadoopStart
tail -f -n 0 /usr/local/Cellar/hadoop/2.7.2/libexec/logs/hadoop-kyleparisi-*.log

# Submit job again
  • INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000
    • Your tmp dir is screwed up: rm -rf /usr/local/Cellar/hadoop/hdfs/tmp/* and hdfs namenode -format, as above.
  • WARN client.AppClient$ClientEndpoint: Failed to connect to master
  • java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;

    py4j.protocol.Py4JJavaError: An error occurred while calling o12.textFile.
    : java.lang.NoSuchMethodError: scala.util.matching.Regex.unapplySeq(Ljava/lang/CharSequence;)Lscala/Option;

    • This is where you give up. There is probably a version mismatch somewhere; in any case it's been a waste of time.

Hive

Same idea as MySQL (note: Hive lists tables with show, not select).

hive
show tables;

From pyspark

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("Test")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

print sqlContext.tableNames()

From here you can just call sqlContext.sql("...") with normal HiveQL queries.
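For example (some_table is a placeholder for whatever show tables listed):

results = sqlContext.sql("SELECT count(*) FROM some_table")
results.show()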

Q/A

What are DataFrames & Datasets?

  • A Dataset is a distributed collection of data.
  • A DataFrame is a Dataset organized into named columns.
val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
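The same read works from pyspark 1.6, assuming the sqlContext from the Hive example above:

df = sqlContext.read.json("examples/src/main/resources/people.json")
df.show()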

What is an RDD?

  • Resilient Distributed Datasets
  • It is an immutable distributed collection of objects, divided into logical partitions that may be computed on different nodes of the cluster.
rdd = sc.textFile(file_name)
print rdd.take(1)

# [u'Campaign ID,Ad group ID,Day,Clicks,Cost,Impressions,Conversions,Total conv. value,Device']
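From there, ordinary RDD transformations parse the raw lines; a sketch against the CSV shown above:

# Drop the header row and split each line into fields
header = rdd.first()
rows = rdd.filter(lambda line: line != header).map(lambda line: line.split(","))
print rows.take(1)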

What are partitions?

  • The chunks an RDD is split into, spread across the nodes of the cluster
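You can inspect and change the partition count from pyspark (getNumPartitions and repartition are standard RDD methods):

print rdd.getNumPartitions()

# Redistribute into 8 partitions (this shuffles the data)
rdd = rdd.repartition(8)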

What is pandas?

  • A Python data-analysis library for manipulating DataFrames (local, in-memory ones, as opposed to Spark's distributed DataFrames)
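The bridge between the two: toPandas() collects a Spark DataFrame into a local pandas one, so it's only safe for small results:

# Pulls the whole DataFrame onto the driver
pdf = df.toPandas()
print pdf.head()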

Is adding packages a possibility in my env?

  • No

What is more desirable, RDDs or DataFrames?

  • RDDs are flexible in that you can apply arbitrary Python functions to them.
  • DataFrames are limited to the pyspark DataFrame API.
  • RDDs : sentences :: DataFrames : tables
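To make the trade-off concrete (a sketch; df is the people DataFrame read earlier):

# RDD: any Python function can be applied
words = sc.parallelize(["some sentence", "another longer sentence"])
print words.map(lambda s: len(s.split())).collect()

# DataFrame: restricted to the DataFrame API, but optimizable by Spark
df.select(df.age + 1).show()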

What is SerDe?

  • A Serializer and a Deserializer
  • Provides the way data moves between Java objects and HDFS (Hive is the arbiter)
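The SerDe is named per table in the DDL; for instance Hive ships an OpenCSVSerde (table and columns here are hypothetical):

CREATE TABLE raw_clicks (campaign_id STRING, clicks STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';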

What is a Partition?

  • Basically a subdirectory of a table's data, used to make queries on the table more efficient
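For example (hypothetical table mirroring the CSV columns above), partitioning by day keeps each day's rows in their own subdirectory, so a filter on dt only scans matching directories:

CREATE TABLE clicks (campaign_id STRING, clicks INT)
PARTITIONED BY (dt STRING);

-- Only the dt=2016-08-30 subdirectory is scanned
SELECT * FROM clicks WHERE dt = '2016-08-30';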