- This is Event #3 for this Meetup.
- 450+ members; Meetup started in May 2014.
- New talk format: Lightning Talks
- 30 seconds to 5 minutes maximum.
- Any topic relating to Spark.
- Any skill or experience level--beginners welcome!
- Speakers: We want you!
- Feature talks or Lightning talks
- Sponsors: You want us!
- Host an event, pay for food and drink, or cover other event costs.
- Take a few minutes at the start of the event to pitch your company, job, or product.
- Thank you to Turbine for buying food and drink for this event!
-
Spark set a new Daytona GraySort record.
- 100TB sorted in 23 minutes; no use of Spark's in-memory cache.
- Previous record of 72 minutes was set by Hadoop MapReduce on a cluster 10x larger.
-
Spark 1.2.0 and 1.2.1 were released.
- Many new features; a selection:
- Python API for Spark Streaming
- Elastic scaling on YARN
-
Spark Packages was launched. Community-contributed packages include:
- Redshift integration
- Avro integration
- IndexedRDD, for efficient point operations
- Tool to launch Spark on Google Compute Engine
- Spork, Pig on Spark
-
Upcoming in Spark 1.3.0:
- DataFrame API (inspired by data frames in R and Python Pandas)
-
"In Spark, a DataFrame is a distributed collection of data organized into named columns."
-
Data science from a few KB locally all the way up to several PB on a large cluster.
-
Advanced optimization and integration with a variety of data sources.
-
Example from the Databricks blog post:
# Create a new DataFrame that contains "young users" only
young = users.filter(users.age < 21)
# Alternatively, using Pandas-like syntax
young = users[users.age < 21]
# Increment everybody's age by 1
young.select(young.name, young.age + 1)
# Count the number of young users by gender
young.groupBy("gender").count()
# Join young users with another DataFrame called logs
young.join(logs, logs.userId == users.userId, "left_outer")
Incorporating SQL while working with DataFrames:
young.registerTempTable("young")
context.sql("SELECT count(*) FROM young")
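For a feel of the same operations without a Spark cluster, here is a rough Pandas sketch of the example above; Pandas inspired the DataFrame API, and the users/logs sample data is made up for illustration:

```python
import pandas as pd

# Toy stand-ins for the users and logs DataFrames from the example above.
users = pd.DataFrame({
    "userId": [1, 2, 3],
    "name": ["ann", "bo", "cy"],
    "age": [18, 25, 20],
    "gender": ["f", "m", "f"],
})
logs = pd.DataFrame({"userId": [1, 1, 3], "event": ["click", "view", "click"]})

young = users[users.age < 21]                        # filter to "young users"
older = young.assign(age=young.age + 1)              # increment everybody's age by 1
counts = young.groupby("gender").size()              # count young users by gender
joined = young.merge(logs, on="userId", how="left")  # left outer join with logs
```

The Spark version runs the same logical plan distributed across a cluster instead of on one machine.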
-
-
Targeted for Spark 1.4.0:
-
It looks innocent enough:
# PySpark
daytah = sc.textFile("/my/daytah.gz", minPartitions=5)
# minPartitions is useless here!
daytah.getNumPartitions()  # 1
-
You want 5 partitions so 5 tasks can process your data in parallel.
daytah.gz
-------------------------------------------------------------
|          |          |          |          |          |
|   task   |   task   |   task   |   task   |   task   |
|          |          |          |          |          |
-------------------------------------------------------------
-
Problem: You cannot decompress a random chunk of a gzipped file without first decompressing everything that precedes it.
- Only 1 task processes the file.
- Only 1 task can work on the resulting RDD and its children.
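You can see this non-splittability with nothing but the Python standard library. In this sketch (file contents made up), a chunk from the middle of a gzip stream is undecodable on its own:

```python
import gzip
import zlib

# A "file" of many records, compressed as a single gzip stream.
data = b"\n".join(b"record %d" % i for i in range(10000))
compressed = gzip.compress(data)

# Decompressing from the very beginning works...
assert gzip.decompress(compressed) == data

# ...but a chunk from the middle of the stream cannot be decompressed
# on its own: DEFLATE has no markers a task could use to start mid-stream.
middle = compressed[len(compressed) // 2:]
try:
    zlib.decompressobj(wbits=31).decompress(middle)  # wbits=31: expect gzip format
    splittable = True
except zlib.error:
    splittable = False

assert not splittable  # this is why only 1 task can read the file
```

This is exactly the situation Spark faces: with no way to start reading mid-stream, one task must process the whole file.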
-
Solution 1: Partition data before compressing.
daytah = sc.textFile("/my/daytah.*.gz")  # daytah.1.gz, daytah.2.gz, ...
daytah.getNumPartitions()  # 5
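Why pre-partitioning works can also be sketched with the standard library (file names and contents hypothetical): each .gz part is a complete, independently decompressible stream.

```python
import gzip

records = [b"record %d" % i for i in range(100)]

# Split the data into 5 parts and compress each part separately,
# as if writing daytah.1.gz ... daytah.5.gz.
parts = [b"\n".join(records[i::5]) for i in range(5)]
gz_parts = [gzip.compress(p) for p in parts]

# Each part is a self-contained gzip stream, so 5 tasks could each
# decompress one part without touching the others.
recovered = [gzip.decompress(g) for g in gz_parts]
assert sum(r.count(b"record") for r in recovered) == 100
```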
-
Solution 2: Repartition the RDD after loading the compressed data. (The initial read is still 1 task, but downstream operations get 5 partitions.)
daytah = sc.textFile("/my/daytah.gz").repartition(5)
daytah.getNumPartitions()  # 5
-
Solution 3: Use uncompressed data; minPartitions will work then.
-
Solution 4: Use a splittable compression codec like bzip2, or a container format like SequenceFile or Avro DataFile.
Personal goal: Help grow a community of Spark contributors in the Boston area.
- The Spark codebase is on GitHub.
- Code contributions are accepted via GitHub Pull Requests.
- Issues (bugs, feature requests, etc.) are tracked on JIRA.
- Official contributing guide: Contributing to Spark
- Mailing Lists:
- Lots of activity on Stack Overflow under the apache-spark tag.
Contributions come in all shapes and sizes. It's easy to get started, especially if you are already using Spark, whether for toying around or in production.
- There's an explosion of interest in Spark.
- Number of people interested in Spark >>> number of people who know something about Spark.
- Helping someone on the user list or on Stack Overflow is great!
- Are you running into the same issue? Or can you reproduce it?
- How badly does this bug affect you?
- Reporting these things on JIRA escalates the issue and makes it more likely to be fixed quickly.
- Example on Hacker News.
- Voting happens publicly on the dev list.
- Example: Voting to release Spark 1.3.0
- It's your last chance to raise major issues before a release goes out.
- Test the release candidate against your workload and share your findings.
- This helps the team release with confidence.
- This generally works best when you are already interested in someone else's work that hasn't been merged yet because it is pending review.
- Test their work and report your findings.
- Example: Testing someone else's fix for spark-ec2
- Again, it gives committers more confidence to merge in a change.
There are so many places to contribute. In general, incremental changes are easier to review (and thus merge) than big changes.
Here are some examples:
- Spark itself: Core, SQL, MLlib, GraphX, Streaming, etc.
- SPARK-3533: Add saveAsTextFileByKey() method to RDDs
- SPARK-5865: Add doc warnings for methods that return local data structures
- Project infrastructure: Improvements here make development generally smoother and save time in little ways all over the place.
- SPARK-3849: Automate remaining Spark Code Style Guide rules
- SPARK-3431: Parallelize Scala/Java test execution
- spark-ec2: The easiest way to get a working Spark cluster at scale.
- Launch Spark clusters on EC2, on YARN or Mesos.
- Spark PR dashboard: How Spark reviewers manage the deluge of contributions. They will love you for improving this!
- Spark Packages: A community index of Spark packages. (Think PyPI for Python: pip install ...) Anything goes!
- Connectors to new data sources
- Fancy notebook environments
- Tools for launching Spark clusters on other clouds (e.g. Azure)
- New algorithms or abstractions not ready for Spark proper yet
- Your crazy idea here!
Abbreviated from the real contributing guide.
-
Fork the Spark repo and create a new branch from master.
- It's good practice to make any changes on a new branch so it's easy to pull upstream updates.
-
Build Spark and test your changes locally. Or, leave it to Jenkins (a.k.a. SparkQA).
- Building Spark:
-
Both Maven and sbt are supported.
-
Example build:
./build/mvn -DskipTests clean package
-
Additional build flags to build against Hadoop, Hive, etc.
-
- Testing locally:
- Handy scripts live under dev/.
- Examples:
  - dev/lint-python: Checks your Python style
  - dev/run-tests: Checks style, builds Spark, and runs unit tests (can take ~1hr altogether)
-
Commit your changes.
- Your commit messages are not that important. If your PR gets merged, your commits will be squashed into one, and your PR title and body will become the commit title and message that goes into the repo.
-
Open your PR with a descriptive title and body, as well as a reference to the JIRA issue it addresses.
Example PR title:
[SPARK-1805] [EC2] Validate instance types
The overarching theme here is: Think of the people who will review your code.
- Keep your PR small and focused.
- Fix 1 problem or introduce 1 new feature per PR.
- Smaller PRs are easier to review, so committers can merge with confidence.
- If you are introducing something major, discuss it via JIRA or on the Spark developer list before diving in to development.
- Basically, get some buy-in for your idea.
- Respond to feedback from SparkQA, other contributors, or committers.
- A typical PR goes through several rounds of updates before being accepted. It's normal!
- Be persistent!
- People are busy. Don't be discouraged if your PR doesn't get attention!
- This is especially true close to releases, when there is a lot of activity.
- Ping people via JIRA or on the dev list.
Finally, you will contribute best when you work on something you are personally interested in, like fixing a problem you ran into or adding a feature you want to use.