Group leader:
- Building Spark app.
- Not in production.
- Wants to know more about how people are building apps.
- Currently runs jobs by hand using Spark EC2 scripts. What else is available?
Group member in a similar situation, working on recommendations:
- Someone at their company has already written code in Spark.
- Unsure whether Spark will run the algorithms the way they want.
Two group members:
- Existing Hadoop infrastructure in production.
- Looking at Spark as a possible replacement.
- Have written Spark scripts that interact with existing HDFS back-end.
- Looking at Elasticsearch as another possible back-end:
  - Operations they want to perform are small.
  - Most time is taken finding data and parsing JSON.
Another user:
- Running Hadoop on EMR at the moment.
- Difficult getting Spark up and running on EMR:
  - Getting it to run on YARN introduces overhead.
  - Still have to submit jobs manually. No API.
- Not sure what it gives you over spinning up a cluster yourself.
- EMR sales pitch seems to be around using spot instances for processing.
- Possible to automate this with Spark itself, without EMR.
Group leader:
- Working in Scala.
- Scala interface works well.
- Not always obvious what you're serializing over the wire.
- On a distributed system it can take 15-20 minutes of running to discover that something isn't serializable.
- Ran into problems allocating too much RAM to the Spark process. Finding the sweet spot between RAM for Spark and RAM for the OS is tricky.
- Getting log files out can be tricky... lots of log files:
  - Master logs
  - Per-node local logs
  - Not sure how to get log files out
- Another group member has had similar problems with Hive... general issue with distributed systems
- Once these issues are addressed, performance is good
- Potential issues closing over things that are distributed... again, can take half an hour to find out
- NOTE: Are there Scala tools to statically check for closing over non-serializable/mutable state?
- Using MLlib
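One way to shorten the feedback loop on the serialization problem above is to try serializing captured values locally before submitting a job. The sketch below illustrates the idea in Python using `pickle`; it is not Spark code and not what the group leader did. The helper `find_unpicklable` and the `captured` values are hypothetical names made up for illustration. In Scala the analogous check would be to attempt Java serialization of the captured objects in a local test.

```python
import pickle
import threading


def find_unpicklable(obj_by_name):
    """Return {name: error description} for every object that fails to pickle."""
    failures = {}
    for name, obj in obj_by_name.items():
        try:
            pickle.dumps(obj)
        except Exception as exc:  # pickle raises several different exception types
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Hypothetical values a job function might close over.
captured = {
    "threshold": 0.75,
    "lookup": {"a": 1, "b": 2},
    "lock": threading.Lock(),  # locks cannot be pickled
}

failures = find_unpicklable(captured)
for name, reason in failures.items():
    print(f"{name} is not serializable ({reason})")
```

Running a check like this in a unit test catches the problem in seconds on one machine, rather than after the job has been shipped to a cluster. It only covers the values you list explicitly, so it does not answer the static-checking question in the NOTE.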
Another user:
- Interested in Spark for streaming!
- 9x people in room
- 3x people more interested in streaming
- 8x people interested in batch
Comparisons between Spark Streaming and Storm:
- Analytics use cases from one user (in publishing):
  - Trending content on web site
  - 25th/50th/90th/95th percentiles of performance, broken down by user browser
  - Breakdown of referrers for a piece of content
  - Top tweets that drive traffic to content on the site
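As a rough illustration of the per-browser percentile use case above, here is a single-machine Python sketch using only the standard library. It is not Spark Streaming or Storm code; in either system the grouping would be distributed and windowed. The function name `percentiles_by_browser`, the `(browser, milliseconds)` sample shape, and the choice of the "inclusive" interpolation method are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import quantiles


def percentiles_by_browser(samples, points=(25, 50, 90, 95)):
    """Group (browser, load_time_ms) samples and compute percentiles per browser.

    Each browser needs at least two samples for statistics.quantiles to work
    on older Python versions.
    """
    grouped = defaultdict(list)
    for browser, load_time_ms in samples:
        grouped[browser].append(load_time_ms)

    result = {}
    for browser, values in grouped.items():
        # n=100 yields 99 cut points; cut point p-1 is the p-th percentile.
        cuts = quantiles(values, n=100, method="inclusive")
        result[browser] = {p: cuts[p - 1] for p in points}
    return result


# Made-up sample data: load times of 0..100 ms for one hypothetical browser.
samples = [("firefox", float(ms)) for ms in range(101)]
for browser, stats in percentiles_by_browser(samples).items():
    print(browser, stats)
```

Exact percentiles need all values for a group in one place, which is why streaming systems often use approximate sketches instead; that trade-off is one point on which the two frameworks could be compared.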