- This is Event #3 for this Meetup.
- 450+ members; Meetup started in May 2014.
- New talk format: Lightning Talks
- 30 seconds to 5 minutes maximum.
- Any topic relating to Spark.
- Any skill or experience level--beginners welcome!
- Speakers: We want you!
- Feature talks or Lightning talks
- Sponsors: You want us!
- Host an event, pay for food and drink, or cover other event costs.
- Take a few minutes at the start of the event to pitch your company, job, or product.
- Thank you to Turbine for buying food and drink for this event!
-
Spark set a new Daytona GraySort record.
- 100TB sorted in 23 minutes; no use of Spark's in-memory cache.
- Previous record of 72 minutes was set by Hadoop MapReduce on a cluster 10x larger.
-
Spark 1.2.0 and 1.2.1 were released.
- Many new features; a selection:
- Python API for Spark Streaming
- Elastic scaling on YARN
-
Spark Packages was launched. Community-contributed packages include:
- Redshift integration
- Avro integration
- IndexedRDD, for efficient point operations
- Tool to launch Spark on Google Compute Engine
- Spork, Pig on Spark
-
Upcoming in Spark 1.3.0:
- DataFrame API (inspired by data frames in R and Python Pandas)
-
"In Spark, a DataFrame is a distributed collection of data organized into named columns."
-
Data science from a few KB locally all the way up to several PB on a large cluster.
-
Advanced optimization and integration with a variety of data sources.
-
Example from the Databricks blog post:
# Create a new DataFrame that contains "young users" only
young = users.filter(users.age < 21)
# Alternatively, using Pandas-like syntax
young = users[users.age < 21]
# Increment everybody's age by 1
young.select(young.name, young.age + 1)
# Count the number of young users by gender
young.groupBy("gender").count()
# Join young users with another DataFrame called logs
young.join(logs, logs.userId == users.userId, "left_outer")
Incorporating SQL while working with DataFrames:
young.registerTempTable("young")
context.sql("SELECT count(*) FROM young")
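For a feel of the same operations without a Spark cluster, here is a rough Pandas sketch of the example above; Pandas inspired the DataFrame API, and the users/logs sample data is made up for illustration:

```python
import pandas as pd

# Toy stand-ins for the users and logs DataFrames from the example above.
users = pd.DataFrame({
    "userId": [1, 2, 3],
    "name": ["ann", "bo", "cy"],
    "age": [18, 25, 20],
    "gender": ["f", "m", "f"],
})
logs = pd.DataFrame({"userId": [1, 1, 3], "event": ["click", "view", "click"]})

young = users[users.age < 21]                        # filter to "young users"
older = young.assign(age=young.age + 1)              # increment everybody's age by 1
counts = young.groupby("gender").size()              # count young users by gender
joined = young.merge(logs, on="userId", how="left")  # left outer join with logs
```

The Spark version runs the same logical plan distributed across a cluster instead of on one machine.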
-
-
Targeted for Spark 1.4.0:
-
It looks innocent enough:
# PySpark
daytah = sc.textFile("/my/daytah.gz", minPartitions=5)
# minPartitions is useless here!
daytah.getNumPartitions()  # 1
-
You want 5 partitions so 5 tasks can process your data in parallel.
daytah.gz
-------------------------------------------------------------
|          |          |          |          |          |
|   task   |   task   |   task   |   task   |   task   |
|          |          |          |          |          |
-------------------------------------------------------------
-
Problem: You cannot decompress a random chunk of a gzipped file without first decompressing everything that precedes it.
- Only 1 task processes the file.
- Only 1 task can work on the resulting RDD and its children.
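You can see this non-splittability with nothing but the Python standard library. In this sketch (file contents made up), a chunk from the middle of a gzip stream is undecodable on its own:

```python
import gzip
import zlib

# A "file" of many records, compressed as a single gzip stream.
data = b"\n".join(b"record %d" % i for i in range(10000))
compressed = gzip.compress(data)

# Decompressing from the very beginning works...
assert gzip.decompress(compressed) == data

# ...but a chunk from the middle of the stream cannot be decompressed
# on its own: DEFLATE has no markers a task could use to start mid-stream.
middle = compressed[len(compressed) // 2:]
try:
    zlib.decompressobj(wbits=31).decompress(middle)  # wbits=31: expect gzip format
    splittable = True
except zlib.error:
    splittable = False

assert not splittable  # this is why only 1 task can read the file
```

This is exactly the situation Spark faces: with no way to start reading mid-stream, one task must process the whole file.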
-
Solution 1: Partition data before compressing.
daytah = sc.textFile("/my/daytah.*.gz")  # daytah.1.gz, daytah.2.gz, ...
daytah.getNumPartitions()  # 5
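Why pre-partitioning works can also be sketched with the standard library (file names and contents hypothetical): each .gz part is a complete, independently decompressible stream.

```python
import gzip

records = [b"record %d" % i for i in range(100)]

# Split the data into 5 parts and compress each part separately,
# as if writing daytah.1.gz ... daytah.5.gz.
parts = [b"\n".join(records[i::5]) for i in range(5)]
gz_parts = [gzip.compress(p) for p in parts]

# Each part is a self-contained gzip stream, so 5 tasks could each
# decompress one part without touching the others.
recovered = [gzip.decompress(g) for g in gz_parts]
assert sum(r.count(b"record") for r in recovered) == 100
```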
-
Solution 2: Repartition the RDD after loading the compressed data. (The initial read is still 1 task, but downstream operations get 5 partitions.)
daytah = sc.textFile("/my/daytah.gz").repartition(5)
daytah.getNumPartitions()  # 5
-
Solution 3: Use uncompressed data; minPartitions will work then.
-
Solution 4: Use a splittable compression codec like bzip2, or a container format like SequenceFile or Avro DataFile.
Personal goal: Help grow a community of Spark contributors in the Boston area.
- The Spark codebase is on GitHub.
- Code contributions are accepted via GitHub Pull Requests.
- Issues (bugs, feature requests, etc.) are tracked on JIRA.
- Official contributing guide: Contributing to Spark
- Mailing Lists:
- Lots of activity on Stack Overflow under the apache-spark tag.
Contributions come in all shapes and sizes. It's easy to get started, especially if you are already using Spark, whether for toying around or in production.
- There's an explosion of interest in Spark.
- Number of people interested in Spark >>> number of people who know something about Spark.
- Helping someone on the user list or on Stack Overflow is great!
- Are you running into the same issue? Or can you reproduce it?
- How badly does this bug affect you?
- Reporting these things on JIRA escalates the issue and makes it more likely to be fixed quickly.
- Example on Hacker News.
- Voting happens publicly on the dev list.
- Example: Voting to release Spark 1.3.0
- It's your last chance to raise major issues before a release goes out.
- Test the release candidate against your workload and share your findings.
- This helps the team release with confidence.
- This generally works best when you are already interested in someone else's work that hasn't been merged yet because it is pending review.
- Test their work and report your findings.
- Example: Testing someone else's fix for spark-ec2
- Again, it gives committers more confidence to merge in a change.
There are so many places to contribute. In general, incremental changes are easier to review (and thus merge) than big changes.
Here are some examples:
- Spark itself: Core, SQL, MLlib, GraphX, Streaming, etc.
- SPARK-3533: Add saveAsTextFileByKey() method to RDDs
- SPARK-5865: Add doc warnings for methods that return local data structures
- Project infrastructure: Improvements here make development generally smoother and save time in little ways all over the place.
- SPARK-3849: Automate remaining Spark Code Style Guide rules
- SPARK-3431: Parallelize Scala/Java test execution
- spark-ec2: The easiest way to get a working Spark cluster at scale.
- Launch Spark clusters on EC2, on YARN or Mesos.
- Spark PR dashboard: How Spark reviewers manage the deluge of contributions. They will love you for improving this!
- Spark Packages: A community index of Spark packages. (Think PyPI for Python: pip install ...) Anything goes!
- Connectors to new data sources
- Fancy notebook environments
- Tools for launching Spark clusters on other clouds (e.g. Azure)
- New algorithms or abstractions not ready for Spark proper yet
- Your crazy idea here!
Abbreviated from the real contributing guide.
-
Fork the Spark repo and create a new branch from master.
- It's good practice to make any changes on a new branch so it's easy to pull upstream updates.
-
Build Spark and test your changes locally. Or, leave it to Jenkins (a.k.a. SparkQA).
- Building Spark:
-
Both Maven and sbt are supported.
-
Example build:
./build/mvn -DskipTests clean package
-
Additional build flags to build against Hadoop, Hive, etc.
-
- Testing locally:
- Handy scripts live under dev/.
- Examples:
  - dev/lint-python: Checks your Python style
  - dev/run-tests: Checks style, builds Spark, and runs unit tests (can take ~1hr altogether)
-
Commit your changes.
- Your commit messages are not that important. If your PR gets merged, your commits will be squashed into one, and your PR title and body will become the commit title and message that goes into the repo.
-
Open your PR with a descriptive title and body, as well as a reference to the JIRA issue it addresses.
Example PR title:
[SPARK-1805] [EC2] Validate instance types
The overarching theme here is: Think of the people who will review your code.
- Keep your PR small and focused.
- Fix 1 problem or introduce 1 new feature per PR.
- Smaller PRs are easier to review, so committers can merge with confidence.
- If you are introducing something major, discuss it via JIRA or on the Spark developer list before diving in to development.
- Basically, get some buy-in for your idea.
- Respond to feedback from SparkQA, other contributors, or committers.
- A typical PR goes through several rounds of updates before being accepted. It's normal!
- Be persistent!
- People are busy. Don't be discouraged if your PR doesn't get attention!
- This is especially true close to releases, when there is a lot of activity.
- Ping people via JIRA or on the dev list.
Finally, you will contribute best when you work on something you are personally interested in, like fixing a problem you ran into or adding a feature you want to use.