davegurnell/sparknotes.md

## sparknotes.md

      
    Spark

Group leader:

Building Spark app.
Not in production.
Wants to know more about how people are building apps.
Currently runs jobs by hand using Spark EC2 scripts. What else is available?

Group member in a similar situation, working on recommendations:

A person at their company has written code in Spark.
Unsure whether Spark will run the algorithms the way they want.

Two group members:

Existing Hadoop infrastructure in production.
Looking at Spark as a possible replacement.
Have written Spark scripts that interact with existing HDFS back-end.
Looking at Elasticsearch as another possible back-end:

Operations they want to perform are small.
Most time is taken finding data and parsing JSON.


Another user:

Running Hadoop on EMR at the moment.
Difficult getting Spark up and running on EMR:

Getting it to run on YARN introduces overhead.
Still have to submit jobs manually. No API.
Not sure what it gives you over spinning up cluster yourself.


EMR sales pitch seems to be about using spot instances for processing.

Possible to automate this with Spark itself without EMR.


Group leader:

Working in Scala.
Scala interface works well.
Not always obvious what you're serializing over the wire.
Can take 15-20 minutes of running on a distributed system
to discover that something isn't serializable.
Ran into problems allocating too much RAM to the Spark process.
Finding the sweet spot between allocating RAM to Spark and OS is tricky.
Getting log files out can be tricky... lots of log files:

Master logs
Per-node local logs
Not sure how to get log files out
Another group member has had similar problems with Hive...
general issue with distributed systems


Once these issues are addressed, performance is good
Potential issues closing over things that are distributed...
again, can take half an hour to discover
NOTE: Are there Scala tools to statically check for closing over
non-serializable/mutable state?
Using MLlib.
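
On the serialization point: one way to shorten the feedback loop might be to try serializing whatever a function closes over locally, before submitting the job. A rough sketch in Python, not from the session. It assumes `pickle` as a stand-in serializer (the group's Scala jobs would go through the JVM's serialization instead), and `unpicklable_captures` and `make_scorer` are made-up names for illustration.

```python
import pickle
import threading


def unpicklable_captures(fn):
    """Return the names of closed-over variables that pickle cannot serialize."""
    names = fn.__code__.co_freevars   # names of captured variables
    cells = fn.__closure__ or ()      # matching cells, in the same order
    bad = []
    for name, cell in zip(names, cells):
        try:
            pickle.dumps(cell.cell_contents)
        except Exception:             # pickle raises several error types
            bad.append(name)
    return bad


def make_scorer():
    weights = {"clicks": 0.7, "shares": 0.3}  # plain data: serializable
    lock = threading.Lock()                   # OS-level handle: not serializable

    def score(event):
        with lock:
            return sum(weights[k] * event.get(k, 0) for k in weights)

    return score


scorer = make_scorer()
problems = unpicklable_captures(scorer)
print(problems)
```

This is a runtime check rather than the static check asked about in the NOTE, and it only looks one level deep. It fails in seconds on a laptop rather than after a cluster run.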

Another user:

Interested in Spark for streaming!

9x people in room
3x people more interested in streaming
8x people interested in batch


Comparisons between Spark Streaming and Storm:

Analytics use cases from one user (in publishing):

Trending content on web site
25/50/90/95 percentiles of performance according to user browsers
Breakdown of referrers for a piece of content
Top tweets that drive content to site
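
Two of these use cases (percentiles by browser, referrer breakdown) are group-then-aggregate jobs. A small sketch of the shape in plain Python, not Spark. The records, field layout and function names are invented for illustration, and the percentile uses the nearest-rank definition, which is an assumption (the session didn't say which one is used).

```python
from collections import Counter, defaultdict
import math

# Hypothetical page views for one piece of content:
# (browser, referrer, load_time_ms)
views = [
    ("chrome", "twitter.com", 120),
    ("chrome", "google.com", 180),
    ("chrome", "twitter.com", 240),
    ("chrome", "direct", 900),
    ("firefox", "google.com", 150),
    ("firefox", "twitter.com", 300),
    ("firefox", "google.com", 450),
    ("firefox", "direct", 600),
]


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def percentiles_by_browser(records, points=(25, 50, 90, 95)):
    """Group load times by browser, then compute each requested percentile."""
    by_browser = defaultdict(list)
    for browser, _referrer, load_ms in records:
        by_browser[browser].append(load_ms)
    return {
        browser: {p: percentile(sorted(times), p) for p in points}
        for browser, times in by_browser.items()
    }


def referrer_breakdown(records):
    """Count page views per referrer."""
    return Counter(referrer for _browser, referrer, _load_ms in records)


perf = percentiles_by_browser(views)
referrers = referrer_breakdown(views)
print(perf)
print(referrers.most_common())
```

Exact percentiles need all values for a key in one place, which is fine for a sketch but may need an approximate method on large or streaming data.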