How to tune Spark for memory issues?
If you just want to know how to properly set parameter values, skip to this section.
Some notes about caching:
Spark provides its own native caching mechanisms, available through methods such as .persist() and .cache() and the SQL command CACHE TABLE. Native caching is effective with small data sets and in ETL pipelines where intermediate results need to be reused. However, Spark's native caching currently does not work well with partitioning, since a cached table does not retain the partitioning metadata. A more generic and reliable technique is caching at the storage layer.
(Figure: how memory is used on each executor)
yarn.nodemanager.resource.memory-mb
: controls the maximum total amount of memory that YARN may allocate to all containers on a single node.
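As a hedged illustration, this property is set in yarn-site.xml; the value below assumes a hypothetical 64 GB node that reserves roughly 8 GB for the OS and Hadoop daemons:

```xml
<!-- yarn-site.xml: example only; size this to your own nodes -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <!-- 56 GB in MB, leaving ~8 GB of a 64 GB node for OS/daemons -->
  <value>57344</value>
</property>
```

Executor memory requests (spark.executor.memory plus overhead) must fit within this per-node budget, or YARN will not schedule the containers.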