
When to (and not to) use Spark

Notes from a very good talk: https://spark-summit.org/east-2016/events/not-your-fathers-database-how-to-use-apache-spark-properly-in-your-big-data-architecture/

Problems that Apache Spark solves well:

  1. Analyzing a large set of data files.
  2. Doing ETL of a large amount of data (see the sketch after this list).
  3. Applying Machine Learning & Data Science to a large dataset.
  4. Connecting BI/Visualization tools to Apache Spark to analyze large datasets internally.
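
A minimal PySpark sketch of points 1 and 2: reading a large set of raw data files and ETL-ing them into a columnar format. The paths and column names here are hypothetical placeholders, not from the talk.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# 1. Analyze a large set of data files: read raw JSON event logs.
events = spark.read.json("s3://my-bucket/raw/events/*.json")  # hypothetical path

# 2. ETL: drop bad records, derive a partition column, write Parquet.
cleaned = (
    events
    .filter(F.col("user_id").isNotNull())
    .withColumn("event_date", F.to_date(F.col("timestamp")))
)
cleaned.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://my-bucket/curated/events/"  # hypothetical path
)
```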

In summary, bulk scanning and processing of large datasets is often a good sign that Spark fits. I should also add:

  1. Traditional Hadoop (MapReduce) jobs.

Examples of problems that Apache Spark is not optimized for:

  1. Random access, frequent inserts, and updates of rows in SQL tables. Traditional databases perform better for these use cases.
  2. Supporting incremental updates of databases into Spark. It is not performant to update Spark SQL tables backed by files in place. Instead, use message queues with Spark Streaming, or do an incremental select (see the first sketch after this list), to keep your Spark SQL tables in sync with your production databases.
  3. External reporting with many concurrent requests. While Spark's in-memory caching enables fast interactive querying, Spark is not optimized for serving many concurrent requests. If you have many concurrent users to support, it is better to use Spark to ETL your data into summary tables, or some other format, in a traditional database, and serve your reports from there (see the second sketch after this list).
  4. Searching content. A Spark job can certainly be written to filter or search for any content you'd like, but Elasticsearch is a specialized engine designed to return search results more quickly than Spark.
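
A sketch of the incremental-select approach from point 2: instead of updating rows in a file-backed table in place, periodically pull only the rows changed since the last load and append them. The JDBC URL, table, watermark column, and credentials are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# In practice, persist the watermark between runs (e.g. in a metadata table).
last_watermark = "2017-06-01 00:00:00"

# Push the watermark filter down to the production database as a subquery.
query = "(SELECT * FROM orders WHERE updated_at > '{}') AS t".format(last_watermark)

new_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://prod-db:5432/app")  # hypothetical
    .option("dbtable", query)
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# Append only the new and changed rows to the file-backed Spark SQL table.
new_rows.write.mode("append").parquet("s3://my-bucket/tables/orders/")
```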
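
And a sketch of point 3: pre-aggregate with Spark, then push the small summary into a traditional database that can comfortably serve many concurrent report queries. Again, all names and the JDBC URL are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reporting-etl").getOrCreate()

events = spark.read.parquet("s3://my-bucket/curated/events/")  # hypothetical path

# Aggregate a large event table down to a small per-day summary.
daily_summary = (
    events.groupBy("event_date", "country")
    .agg(
        F.countDistinct("user_id").alias("daily_active_users"),
        F.count(F.lit(1)).alias("event_count"),
    )
)

# Serve reports from a relational database, not from Spark itself.
(
    daily_summary.write.format("jdbc")
    .option("url", "jdbc:postgresql://reports-db:5432/bi")  # hypothetical
    .option("dbtable", "daily_summary")
    .option("user", "writer")
    .option("password", "secret")
    .mode("overwrite")
    .save()
)
```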