@davegurnell
Created March 31, 2015 16:44
Anonymised notes from the session on Spark at Scale Summit.

Spark

Group leader:

  • Building Spark app.
  • Not in production.
  • Wants to know more about how people are building apps.
  • Currently runs jobs by hand using Spark EC2 scripts. What else is available?

Group member in a similar situation, working on recommendations:

  • A person at their company has written code in Spark.
  • Unsure whether Spark will run the algorithms in the way they want.

Two group members:

  • Existing Hadoop infrastructure in production.
  • Looking at Spark as a possible replacement.
  • Have written Spark scripts that interact with existing HDFS back-end.
  • Looking at Elasticsearch as another possible back-end:
    • Operations they want to perform are small.
    • Most time is taken finding data and parsing JSON.

Another user:

  • Running Hadoop on EMR at the moment.
  • Difficult getting Spark up and running on EMR:
    • Getting it to run on YARN introduces overhead.
    • Still have to manually submit jobs. No API interfaces.
    • Not sure what it gives you over spinning up cluster yourself.
  • EMR sales pitch seems to be around using spot instances for processing.
    • Possible to automate this with Spark itself without EMR.

Group leader:

  • Working in Scala.
  • Scala interface works well.
  • Not always obvious what you're serializing over the wire.
  • It can take 15-20 minutes of running on a distributed system to discover that something isn't serializable.
  • Ran into problems allocating too much RAM to the Spark process. Finding the sweet spot between allocating RAM to Spark and OS is tricky.
  • Getting log files out can be tricky... lots of log files:
    • Master logs
    • Per-node local logs
    • Not sure how to get log files out
    • Another group member has had similar problems with Hive... general issue with distributed systems
  • Once these issues are addressed, performance is good
  • Potential issues closing over things that are distributed... again, it can take half an hour to find out.
  • NOTE: Are there Scala tools to statically check for closing over non-serializable/mutable state?
  • Using MLlib.
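
On the serialization point above: this is not the static check the note asks about, but a rough local pre-flight test may shorten the feedback loop. The sketch below assumes closures are shipped with plain Java serialization (which, as far as I know, is what Spark does for task closures) and assumes Scala 2.12 or later, where function literals compile to serializable lambdas. `SerializationCheck`, `Lookup`, `SafeLookup` and `ClosureExamples` are hypothetical names made up for illustration; none of them come from Spark.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializationCheck {
  /** Java-serializes `value` in memory.
    * Returns Right(size in bytes) on success, or Left(message) if some
    * object reachable from `value` is not serializable.
    */
  def check(value: AnyRef): Either[String, Int] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try {
      out.writeObject(value)
      out.flush()
      Right(buffer.size())
    } catch {
      case e: NotSerializableException =>
        Left(s"not serializable: ${e.getMessage}")
    } finally {
      out.close()
    }
  }
}

// Hypothetical classes, for illustration only.
class Lookup(val table: Map[String, Int])       // plain class: not Serializable
case class SafeLookup(table: Map[String, Int])  // case classes are Serializable

object ClosureExamples {
  // Captures a local, non-serializable Lookup instance.
  def badClosure(): String => Int = {
    val lookup = new Lookup(Map("a" -> 1))
    key => lookup.table.getOrElse(key, 0)
  }

  // Captures a local, serializable SafeLookup instance.
  def goodClosure(): String => Int = {
    val lookup = SafeLookup(Map("a" -> 1))
    key => lookup.table.getOrElse(key, 0)
  }

  def main(args: Array[String]): Unit = {
    println(SerializationCheck.check(badClosure()))  // should be a Left naming Lookup
    println(SerializationCheck.check(goodClosure())) // should be a Right with a byte count
  }
}
```

Running a check like this in a unit test, on the function about to be passed to a `map` or `filter`, would catch the captured `Lookup` in seconds rather than after a cluster run. It only sees what the closure actually captures at runtime, so it does not help with mutable state that serializes fine but behaves differently once copied to each worker.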

Another user:

  • Interested in Spark for streaming!
    • 9x people in room
    • 3x people more interested in streaming
    • 8x people interested in batch

Comparisons between Spark Streaming and Storm:

  • Analytics use cases from one user (in publishing):
    • Trending content on web site
    • 25th/50th/90th/95th percentiles of performance, broken down by user browser
    • Breakdown of referrers for a piece of content
    • Top tweets that drive traffic to content on the site
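
To make the percentile use case concrete, here is a minimal sketch using plain Scala collections in place of a cluster, so it runs without Spark. The notes do not say which percentile definition was used, so the nearest-rank method is an assumption. The record shape (`Timing`, `loadMillis`) and the function names are hypothetical.

```scala
object BrowserPercentiles {
  // Hypothetical record shape: one page-load timing per event.
  final case class Timing(browser: String, loadMillis: Double)

  /** Nearest-rank percentile of a non-empty sequence sorted in ascending order. */
  def percentile(sorted: IndexedSeq[Double], p: Int): Double = {
    require(sorted.nonEmpty, "need at least one sample")
    require(p > 0 && p <= 100, "p must be in (0, 100]")
    // Integer form of ceil(p * n / 100), avoiding floating-point rounding.
    val rank = ((p.toLong * sorted.size + 99) / 100).toInt
    sorted(rank - 1)
  }

  /** Groups timings by browser and computes the requested percentiles for each. */
  def summarise(
      timings: Seq[Timing],
      ps: Seq[Int] = Seq(25, 50, 90, 95)
  ): Map[String, Map[Int, Double]] =
    timings.groupBy(_.browser).map { case (browser, events) =>
      val sorted = events.map(_.loadMillis).sorted.toIndexedSeq
      browser -> ps.map(p => p -> percentile(sorted, p)).toMap
    }

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      Timing("firefox", 120.0), Timing("firefox", 300.0),
      Timing("firefox", 180.0), Timing("firefox", 240.0),
      Timing("chrome", 90.0), Timing("chrome", 110.0)
    )
    summarise(sample).foreach(println)
  }
}
```

Sorting every group in full is fine for a sketch but would likely be too expensive on a real stream of events. A streaming version would probably need an approximate quantile summary instead, which is outside what these notes cover.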