Nick Chammas, February 24, 2015

Boston Spark Community Updates

  • This is Event #3 for this Meetup.
  • 450+ members; Meetup started in May 2014.
  • New talk format: Lightning Talks
    • 30 seconds to 5 minutes maximum.
    • Any topic relating to Spark.
    • Any skill or experience level--beginners welcome!
  • Speakers: We want you!
    • Feature talks or Lightning talks
  • Sponsors: You want us!
    • Host an event, pay for food and drink, or cover other event costs.
    • Take a few minutes at the start of the event to pitch your company, job, or product.
    • Thank you to Turbine for buying food and drink for this event!

Spark Project Updates

  • Spark set a new Daytona GraySort record.

    • 100 TB sorted in 23 minutes, with no use of the in-memory cache.
    • The previous record of 72 minutes was set by Hadoop MapReduce on a cluster roughly 10x larger.
  • Spark 1.2.0 and 1.2.1 were released.

    • Many new features.
  • Spark Packages was launched, an index of community-contributed packages.

  • Upcoming in Spark 1.3.0:

    • DataFrame API (inspired by data frames in R and Python Pandas)
      • "In Spark, a DataFrame is a distributed collection of data organized into named columns."

      • Data science from a few KB locally all the way up to several PB on a large cluster.

      • Advanced optimization and integration with a variety of data sources (see the sketch below).

      • Example from the Databricks blog post:

        # Create a new DataFrame that contains "young users" only
        young = users.filter(users.age < 21)
        
        # Alternatively, using Pandas-like syntax
        young = users[users.age < 21]
        
        # Increment everybody's age by 1
        young.select(young.name, young.age + 1)
        
        # Count the number of young users by gender
        young.groupBy("gender").count()
        
        # Join young users with another DataFrame called logs
        young.join(logs, logs.userId == users.userId, "left_outer")

        Incorporating SQL while working with DataFrames:

        young.registerTempTable("young")
        context.sql("SELECT count(*) FROM young")
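
        To illustrate the "variety of data sources" point above, here is a minimal sketch of loading an external JSON file as a DataFrame (the path and columns are hypothetical; sqlContext is assumed to be a SQLContext built on sc):

        from pyspark.sql import SQLContext

        sqlContext = SQLContext(sc)

        # Load a JSON file as a DataFrame; the schema is inferred from the data
        people = sqlContext.jsonFile("/my/people.json")
        people.printSchema()

        # The same DataFrame operations work regardless of the underlying source
        people.filter(people.age > 30).select(people.name).show()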
  • More features are targeted for Spark 1.4.0.

A Common Pitfall When Handling Gzipped Files

  • "I have a huge cluster. Why is only 1 core being used?"

  • It looks innocent enough:

    # PySpark
    daytah = sc.textFile("/my/daytah.gz", minPartitions=5)  # minPartitions is useless here!
    daytah.getNumPartitions()  # 1
  • You want 5 partitions so 5 tasks can process your data in parallel.

    daytah.gz
     ------------------------------------------------------------------------------------
    |                |                |                |                |                |
    |      task      |      task      |      task      |      task      |      task      |
    |                |                |                |                |                |
     ------------------------------------------------------------------------------------
    
  • Problem: You cannot decompress a random chunk of a gzipped file without decompressing the whole file.

    • Only 1 task processes the file.
    • Only 1 task can work on the resulting RDD and its children.
  • Solution 1: Partition data before compressing.

    daytah = sc.textFile("/my/daytah.*.gz")  # daytah.1.gz, daytah.2.gz,...
    daytah.getNumPartitions()  # 5
  • Solution 2: Repartition the RDD after loading the compressed data.

    daytah = sc.textFile("/my/daytah.gz").repartition(5)
    daytah.getNumPartitions()  # 5

    (The file is still read and decompressed by a single task; the repartition, which shuffles the data, only parallelizes the work downstream.)
  • Solution 3: Use uncompressed data; minPartitions will work then (see the sketch below).

  • Solution 4: Use a splittable compression codec like bzip2, or a container format like SequenceFile or Avro data files.
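
    A minimal sketch of Solutions 3 and 4 (the file names are hypothetical): both uncompressed text and bzip2-compressed files are splittable, so minPartitions is honored, assuming the files are large enough to yield that many splits.

    # Solution 4: bzip2 is a splittable codec
    daytah = sc.textFile("/my/daytah.bz2", minPartitions=5)
    daytah.getNumPartitions()  # 5 (or more)

    # Solution 3: plain, uncompressed text splits freely
    daytah = sc.textFile("/my/daytah.txt", minPartitions=5)
    daytah.getNumPartitions()  # 5 (or more)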

Contributing to Spark

Personal goal: Help grow a community of Spark contributors in the Boston area.

Basics

Different Ways to Contribute

Contributions come in all shapes and sizes. It's easy to get started, especially if you are already using Spark, whether you're just toying around or running it in production.

Helping Other Users

  • There's an explosion of interest in Spark.
  • Number of people interested in Spark >>> Number of people who know something about Spark
  • Helping someone on the user list or on Stack Overflow is great!

Triaging Bug Reports

  • Are you running into the same issue? Or can you reproduce it?
  • How badly does this bug affect you?
  • Reporting these things on JIRA escalates the issue and makes it more likely to be fixed quickly.

Voting on Spark Releases

  • Voting happens publicly on the dev list.
  • It's your last chance to raise major issues before a release goes out.
  • Test the release candidate against your workload and share your findings.
  • This helps the team release with confidence.

Reviewing Pull Requests

  • This generally works best when you are already interested in someone else's work that hasn't been merged yet because it is pending review.
  • Test their work and report your findings.
  • Again, it gives committers more confidence to merge in a change.

Implementing Fixes, Improvements, or New Features

There are so many places to contribute. In general, incremental changes are easier to review (and thus merge) than big changes.

Here are some examples:

  • Spark itself: Core, SQL, MLlib, GraphX, Streaming, etc.
    • SPARK-3533: Add saveAsTextFileByKey() method to RDDs
    • SPARK-5865: Add doc warnings for methods that return local data structures
  • Project infrastructure: Improvements here make development generally smoother and save time in little ways all over the place.
    • SPARK-3849: Automate remaining Spark Code Style Guide rules
    • SPARK-3431: Parallelize Scala/Java test execution
  • spark-ec2: The easiest way to get a working Spark cluster at scale.
    • Launch Spark clusters on EC2 running YARN or Mesos.
  • Spark PR dashboard: How Spark reviewers manage the deluge of contributions. They will love you for improving this!
    • #2: Create PR activity report
    • #34: Show whether PR can be cleanly merged into maintenance branches
    • #42: Add a favicon
  • Spark Packages: A community index of Spark packages. (Think of PyPI for Python: pip install ...) Anything goes!
    • Connectors to new data sources
    • Fancy notebook environments
    • Tools for launching Spark clusters on other clouds (e.g. Azure)
    • New algorithms or abstractions not ready for Spark proper yet
    • Your crazy idea here!

Appendix: More on Pull Requests

Mechanics of Opening a Pull Request

Abbreviated from the real contributing guide.

  1. Fork the Spark repo and create a new branch from master.

    • It's good practice to make any changes on a new branch so it's easy to pull upstream updates.
  2. Build Spark and test your changes locally. Or, leave it to Jenkins (a.k.a. SparkQA).

    • Building Spark:
      • Both Maven and sbt are supported.

      • Example build:

        ./build/mvn -DskipTests clean package
        
      • Additional build flags to build against Hadoop, Hive, etc.

    • Testing locally:
      • Handy scripts live under dev/.
      • For example, ./dev/run-tests runs the full build and test suite (essentially what Jenkins runs).
  3. Commit your changes.

    • Your commit messages are not that important. If your PR gets merged, your commits will be squashed into one, and your PR title and body will become the commit title and message that goes into the repo.
  4. Open your PR with a descriptive title and body, as well as a reference to the JIRA issue it addresses.

    Example PR title:

    [SPARK-1805] [EC2] Validate instance types 
    

Authoring a PR that will be Accepted

The overarching theme here is: Think of the people who will review your code.

  1. Keep your PR small and focused.
    • Fix 1 problem or introduce 1 new feature per PR.
    • Smaller PRs are easier to review, so committers can merge with confidence.
  2. If you are introducing something major, discuss it via JIRA or on the Spark developer list before diving into development.
    • Basically, get some buy-in for your idea.
  3. Respond to feedback from SparkQA, other contributors, or committers.
    • A typical PR goes through several rounds of updates before being accepted. It's normal!
  4. Be persistent!
    • People are busy. Don't be discouraged if your PR doesn't get attention!
    • This is especially true close to releases, when there is a lot of activity.
    • Ping people via JIRA or on the dev list.

Finally, you will contribute best when you work on something you are personally interested in, like fixing a problem you ran into or adding a feature you want to use.
