Spark Learning Notes

Challenges of the cluster environment

  1. Parallelization
  2. Handling partial failures (node crashes, individual slow nodes)
  3. Multi-user sharing, with dynamic allocation of compute resources

MapReduce: a simple, general-purpose, and automatically fault-tolerant batch computation model, but poorly suited to interactive, iterative, and streaming computation.

RDD: the Resilient Distributed Dataset abstraction

An RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either data in stable storage or other RDDs. Users can control two aspects of RDDs: persistence and partitioning.

RDDs are motivated by iterative algorithms and interactive data mining tools: both benefit from an abstraction for leveraging distributed memory.
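
A minimal sketch of those two user-controlled aspects, assuming a local Spark 1.x setup (the app name and sample data are illustrative):

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("rdd-notes").setMaster("local[2]"))

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Partitioning: hash-partition the pair RDD into 4 partitions.
val partitioned = pairs.partitionBy(new HashPartitioner(4))

// Persistence: keep the partitioned RDD in memory across actions.
partitioned.persist(StorageLevel.MEMORY_ONLY)

println(partitioned.reduceByKey(_ + _).collect().mkString(", "))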

Creating RDDs

  • From in-memory collections or external storage systems
  • Through transformation operations on other RDDs (Spark RDD operation diagram; both paths are sketched below)
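
A minimal sketch of both creation paths, reusing the sc from the sketch above (the HDFS path is hypothetical):

val fromMemory = sc.parallelize(1 to 100)               // from an in-memory collection
val fromStorage = sc.textFile("hdfs:///data/input.txt") // from an external storage system
val fromOther = fromMemory.map(_ * 2)                   // from another RDD via a transformation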

Some transformations are more complex: they consist of multiple sub-transformations and therefore generate multiple RDDs.

Transformation   Generated RDD    compute()
map(func)        MappedRDD        iterator(split).map(f)
filter(func)     FilteredRDD      iterator(split).filter(f)
flatMap(func)    FlatMappedRDD    iterator(split).flatMap(f)
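
As a rough illustration (the internal class names in the table come from early Spark 1.x sources and changed in later releases), each transformation wraps the previous RDD, and toDebugString prints the resulting lineage:

val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)          // a MappedRDD in early Spark 1.x
val evens = doubled.filter(_ % 4 == 0) // a FilteredRDD wrapping the MappedRDD
println(evens.toDebugString)           // prints the lineage chain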
  • a single RDD has one or more partitions scattered across multiple nodes
  • a single partition is processed on a single node
  • a single node can handle multiple partitions (with an optimum of 2-4 partitions per CPU according to the official documentation); a short sketch follows this list
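
A minimal sketch of controlling the partition count, reusing sc from above (the numbers are illustrative):

val rdd = sc.parallelize(1 to 1000, 8)   // ask for 8 partitions explicitly
println(rdd.partitions.length)           // 8

val reshaped = rdd.repartition(16)       // reshuffle the data into 16 partitions
println(reshaped.partitions.length)      // 16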

Since Spark supports pluggable resource management, the details of the distribution depend on the cluster manager you use (Standalone, Yarn, Mesos).

Spark on Yarn

There are two modes:

  • yarn-client: the driver runs on the machine that submits the Spark job
  • yarn-cluster: the driver runs inside a Yarn container; before submitting the job you do not know which node the driver will run on (a spark-submit sketch of both modes follows)
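
A hedged spark-submit sketch of the two modes (the jar and main class names are hypothetical; Spark 1.x also accepted --master yarn-client and --master yarn-cluster directly):

# driver runs on the submitting machine
spark-submit --master yarn --deploy-mode client --class com.example.MyApp myapp.jar

# driver runs inside a Yarn container on some cluster node
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar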

Yarn resources

If no parameters are specified, only 2 executors are allocated by default. On some clusters the maximum memory per executor is 8 GB.
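
A sketch of requesting resources explicitly at submit time (the numbers are illustrative and must stay within the cluster's limits, e.g. the 8 GB executor cap mentioned above; jar and class names are hypothetical):

spark-submit --master yarn --num-executors 4 --executor-memory 4g --executor-cores 2 --class com.example.MyApp myapp.jar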

Things to watch out for

  • When adding new nodes, watch out for conflicting environment variables.
  • When formatting the namenode, the datanode directories on the slave nodes must be emptied first.
  • Configured Capacity = Total Disk Space - Reserved Space
  • Non DFS Used = (Total Disk Space - Reserved Space) - Remaining Space - DFS Used
  • If the HDFS replication factor for the whole cluster is 1, a file uploaded to HDFS from a node outside the cluster gets the replication factor configured in that node's own hdfs-site.xml.
  • When disk usage exceeds 90%, Yarn by default marks the node as unhealthy. You can override this by adding the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage to yarn-site.xml, with a value between 0.0 and 100.0 (see the yarn-site.xml sketch after this list).
  • Disabling the firewall on SUSE: /sbin/SuSEfirewall2 off (permanent), /etc/init.d/SuSEfirewall2_setup stop (temporary)
  • spark.speculation (Spark configuration value; default false). Setting it to true enables speculative execution of tasks: a task that is running slowly gets a second copy launched on another node. Enabling this can help cut down on straggler tasks in large clusters.
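
Back to the disk health checker item above: a minimal yarn-site.xml sketch, assuming you want to raise the threshold to 95% (the value is illustrative):

<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>95.0</value>
</property>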

Returning to spark.speculation: we can configure it in three ways.

In code:

val conf = new SparkConf().set("spark.speculation", "false")

Or we can add the configuration dynamically at deploy time with the --conf flag, as follows:

--conf spark.speculation=false

Or we can add it to the properties file, space-delimited, as follows:

spark.speculation false

Spark Conf

Enabling Dynamic Executor Allocation

Spark can dynamically increase and decrease the number of executors for an application if its resource requirements change over time. To enable dynamic allocation, set spark.dynamicAllocation.enabled to true. Specify the minimum number of executors that should be allocated to an application with the spark.dynamicAllocation.minExecutors parameter, and the maximum with the spark.dynamicAllocation.maxExecutors parameter. Set the initial number of executors with the spark.dynamicAllocation.initialExecutors parameter. Do not use the --num-executors command-line argument or the spark.executor.instances parameter; they are incompatible with dynamic allocation.
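
A minimal spark-defaults.conf sketch (the numbers are illustrative; note that dynamic allocation on Yarn also requires the external shuffle service to be enabled):

spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 20
spark.dynamicAllocation.initialExecutors 4
spark.shuffle.service.enabled true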
