- debug
- evaluation of elapsed time; http://localhost:4040/jobs/ shows how long each cell takes, so debug the most time-consuming cells first
- etc
- be careful to distinguish methods whose return type is an RDD from those whose is not (the former can keep chaining; the latter end the chain)
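The chaining distinction above can be sketched with a toy, non-Spark class; `ToyRDD` is invented purely for illustration and is not part of any Spark API:

```python
# Toy illustration (NOT real Spark): transformations return a new
# RDD-like object, so calls chain; actions return a plain value,
# which ends the chain.
class ToyRDD:
    def __init__(self, data):
        self.data = list(data)

    # transformation: returns another ToyRDD -> chainable
    def map(self, f):
        return ToyRDD(f(x) for x in self.data)

    # transformation: returns another ToyRDD -> chainable
    def filter(self, pred):
        return ToyRDD(x for x in self.data if pred(x))

    # action: returns an ordinary int -> ends the chain
    def count(self):
        return len(self.data)

rdd = ToyRDD(range(10))
result = rdd.map(lambda x: x * 2).filter(lambda x: x > 10).count()
print(result)  # -> 4 (elements 12, 14, 16, 18 survive the filter)
```

Real `RDD.map`/`RDD.filter` behave the same way at the type level, which is why a pipeline can be written as one fluent expression that terminates in an action.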
- execution
-
CDH 5.3 + Pre-built for hadoop 2.4 and later
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --executor-memory 4G --num-executors 20 ./lib/spark-examples-1.4.0-hadoop2.4.0.jar 1000
./bin/spark-submit --class SimpleApp --master yarn-cluster --executor-memory 4G --num-executors 20 /path/to/spark-simple-app/target/scala-2.11/simple-project_2.11-1.0.jar documents_news/part-m-01949
[failed] ./bin/spark-submit --class SimpleApp --master local[4] /path/to/spark-simple-app/target/scala-2.11/simple-project_2.11-1.0.jar documents_news/part-m-01949
-
CDH 4 + Pre-built for CDH4
./bin/spark-submit --class SimpleApp --master local[8] /path/to/spark-simple-app/target/scala-2.11/simple-project_2.11-1.0.jar /path/to/spark-simple-app/build.sbt
export HADOOP_CONF_DIR=/etc/hadoop/conf; ./bin/spark-submit --class SimpleApp --master local[8] /path/to/spark-simple-app/target/scala-2.11/simple-project_2.11-1.0.jar documents/part-m-04618
[failed] ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://10.27.22.37:7077 --executor-memory 4G --total-executor-cores 100 /tmp/spark-1.4.0-bin-cdh4/lib/spark-examples-1.4.0-hadoop2.0.0-mr1-cdh4.2.0.jar 1000
-
failed to access hbase from pyspark
build jackson-module-scala like https://github.com/FasterXML/jackson-module-scala/issues/212
jackson-module-scala$ sbt package
export HADOOP_CONF_DIR=`hbase classpath`:/etc/hadoop/conf
export SPARK_CLASSPATH=$SPARK_CLASSPATH:`hbase classpath`:./lib/*:/data1/program/spark-1.4.0-bin-hadoop2.4/jackson-module-scala/target/scala-2.10/jackson-module-scala_2.10-2.6.0-rc3-SNAPSHOT.jar
./bin/pyspark
>>> conf = {"hbase.zookeeper.quorum": "search-fqa-test12", "hbase.mapreduce.inputtable": "SS_CATEGORY"}
>>> rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat", "org.apache.hadoop.hbase.io.ImmutableBytesWritable", "org.apache.hadoop.hbase.client.Result", conf=conf)
-
- ref
- https://github.com/apache/spark/tree/master/python/pyspark
- https://github.com/GenTang/spark_hbase
- https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_inputformat.py
- https://github.com/apache/hbase/blob/0.96/hbase-examples/src/main/python/thrift1/DemoClient.py
- http://www.slideshare.net/BenjaminBengfort/fast-data-analytics-with-spark-and-python
- https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
- https://gist.github.com/MLnick/6ec916b646c3004d7523
-
-
spark-hbase
-
execution
$ export HADOOP_CONF_DIR=`hbase classpath`:/etc/hadoop/conf
$ ./bin/spark-submit --class org.apache.spark.examples.HBaseTest --master local[4] ./examples/target/spark-examples_2.10-1.5.0-SNAPSHOT.jar [hbase table name]
$ ./bin/spark-submit --driver-class-path ./examples/target/spark-examples_2.10-1.5.0-SNAPSHOT.jar ./examples/src/main/python/hbase_inputformat.py [hbase master] [hbase table name]
-
configuration
$ hadoop version
Hadoop 2.5.0-cdh5.[2|3].0 ...
$ hbase version
HBase 0.98.6-cdh5.[2|3].0 ...
$ java -version  (java >= 1.7)
java version ["1.7.0_55"|"1.8.0_25"] ...
$ mvn -version  (3.0.4 <= maven < 3.3.x)
Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-15T02:29:23+09:00) ...
$ git clone https://github.com/apache/spark.git
$ git log
commit d9838196ff48faeac19756852a7f695129c08047
Author: Josh Rosen <joshrosen@databricks.com>
Date: Thu Jul 2 18:07:09 2015 -0700 ...
$ vi pom.xml
... <hadoop.version>2.5.0-cdh5.[2|3].0</hadoop.version>
... <hbase.version>0.98.6-cdh5.[2|3].0</hbase.version> ...
$ mvn -DskipTests clean package
-
troubleshooting
-
Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former
-> unset SPARK_CLASSPATH
-
Total size of serialized results of tasks is bigger than spark.driver.maxResultSize
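One possible mitigation for the maxResultSize error (the value below is illustrative, not a recommendation): raise `spark.driver.maxResultSize` via `--conf`, or avoid `collect()`ing large RDDs to the driver in the first place.

```
$ ./bin/spark-submit --conf spark.driver.maxResultSize=2g [other args]
```

Setting the value to 0 removes the limit entirely, but then a large `collect()` can OOM the driver instead.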
-
pyspark OutOfMemoryError Java heap space
-> allocate appropriate executor memory
conf = (SparkConf().setAppName("app name")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)
-
-
failed with
- CDH 4.5
- compiled with
<hadoop.version>2.0.0-cdh4.5.0</hadoop.version>
- failed with
<hbase.version>0.94.6-cdh4.5.0</hbase.version>
- hbase-*-0.94.6-cdh4.5.0.jar is not available at https://repository.cloudera.com/artifactory/repo/org/apache/hbase/ either
- compiled with
- maven 3.3.x
- https://www.mail-archive.com/issues@spark.apache.org/msg53357.html
- the bagel subproject appeared to fall into an infinite loop around dependency-reduced-pom.xml
- CDH 4.5
- pyspark supports yarn-client mode, but NOT yarn-cluster mode yet
- prerequisite
-
distribute configuration files in cloudera manager
-
prepare yarn-site.xml included configuration directory
# cd /data4/etc/hadoop/conf.cloudera.hdfs
# cp /data4/etc/hive/conf.cloudera.hive/yarn-site.xml .
-
mvn package for yarn
$ cd /path/to/spark
$ MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m" mvn -Pyarn -DskipTests package
$ hadoop fs -put assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop2.5.0-cdh5.2.0.jar /tmp
-
check python syntax appropriate for 2.6
print('{}'.format(var)) -> print('%s' % var)
lambda x: {c: x[i] for i, c in enumerate(columns)} -> lambda x: dict([(c, x[i]) for i, c in enumerate(columns)])
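The 2.6-safe rewrites above can be verified in any modern Python as well; the variable values below are made up purely for illustration:

```python
# Python 2.6 lacks auto-numbered '{}' fields in str.format and
# dict comprehensions; these rewrites behave identically.
var = 'spark'
columns = ['id', 'title']
row = (7, 'hello')

# '{}'.format(var)  ->  '%s' % var
s = '%s' % var

# {c: x[i] for i, c in enumerate(columns)}  (2.7+ only)
# -> dict() over a list of (key, value) pairs (works on 2.6)
d = dict([(c, row[i]) for i, c in enumerate(columns)])

print(s)  # -> spark
```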
-
execute
$ export PYSPARK_PYTHON=/usr/bin/python2.6
$ export PYSPARK_SUBMIT_ARGS='--master yarn-client --conf spark.executor.memory=2g'
$ export HADOOP_CONF_DIR=`hbase classpath`:/data4/etc/hadoop/conf.cloudera.hdfs/
$ /path/to/spark/bin/spark-submit --conf spark.yarn.jar=hdfs:///tmp/spark-assembly-1.5.0-SNAPSHOT-hadoop2.5.0-cdh5.2.0.jar --master yarn-client --driver-class-path /path/to/spark/examples/spark-examples_2.10-1.5.0-SNAPSHOT.jar [python file] [args...]
- ref
- troubleshooting
org.apache.hadoop.hbase.UnknownScannerException: org.apache.hadoop.hbase.UnknownScannerException
- hbase.regionserver.lease.period; default 60000 ms (1 minute) -> 900000 (15 minutes); if necessary, try increasing it as far as 20000 minutes
- hbase.rpc.timeout; set it to the same value as the lease period
- http://stackoverflow.com/questions/11463458/hbase-mapreduce-error
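The two timeout settings above can be placed in hbase-site.xml; this is a hypothetical fragment using the 15-minute value mentioned above (property names as in HBase 0.98):

```
<property>
  <name>hbase.regionserver.lease.period</name>
  <value>900000</value> <!-- 15 min, up from the 60000 ms default -->
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>900000</value> <!-- keep equal to the lease period -->
</property>
```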
java.lang.ClassNotFoundException: org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter
Exception: Python in worker has different version 2.6 than that in driver 2.7, PySpark cannot run with different minor versions
-> use python 2.6
org.apache.hadoop.hbase.RegionTooBusyException
-> as a workaround, try executing on yarn