spark-hbase
  • debug
    • evaluation of elapsed time: http://localhost:4040/jobs/ shows the elapsed time per cell, so debug the most time-consuming cells first
  • etc
    • distinguish carefully whether a call's return type is an RDD or not (an RDD lets you keep chaining; anything else ends the chain); see the sketch below
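
      A minimal PySpark sketch of the distinction (the input path is reused from the commands below, purely for illustration):

      lines = sc.textFile("documents/part-m-04618")     # transformation: returns an RDD
      words = lines.flatMap(lambda line: line.split())  # still an RDD, so the chain continues
      n = words.count()                                 # action: returns an int, the chain ends
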
  • execution
    • CDH 5.3 + Pre-built for hadoop 2.4 and later

      ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --executor-memory 4G --num-executors 20 ./lib/spark-examples-1.4.0-hadoop2.4.0.jar 1000
      ./bin/spark-submit --class SimpleApp --master yarn-cluster --executor-memory 4G --num-executors 20 /path/to/spark-simple-app/target/scala-2.11/simple-project_2.11-1.0.jar documents_news/part-m-01949
      [failed] ./bin/spark-submit --class SimpleApp --master local[4] /path/to/spark-simple-app/target/scala-2.11/simple-project_2.11-1.0.jar documents_news/part-m-01949
      
    • CDH 4 + Pre-built for CDH4

      ./bin/spark-submit --class SimpleApp --master local[8] /path/to/spark-simple-app/target/scala-2.11/simple-project_2.11-1.0.jar /path/to/spark-simple-app/build.sbt
      export HADOOP_CONF_DIR=/etc/hadoop/conf; ./bin/spark-submit --class SimpleApp --master local[8] /path/to/spark-simple-app/target/scala-2.11/simple-project_2.11-1.0.jar documents/part-m-04618
      [failed] ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://10.27.22.37:7077 --executor-memory 4G --total-executor-cores 100 /tmp/spark-1.4.0-bin-cdh4/lib/spark-examples-1.4.0-hadoop2.0.0-mr1-cdh4.2.0.jar 1000
      
    • failed to access hbase from pyspark

      build jackson-module-scala as described in https://github.com/FasterXML/jackson-module-scala/issues/212
      jackson-module-scala$ sbt package
      export HADOOP_CONF_DIR=`hbase classpath`:/etc/hadoop/conf
      export SPARK_CLASSPATH=$SPARK_CLASSPATH:`hbase classpath`:./lib/*:/data1/program/spark-1.4.0-bin-hadoop2.4/jackson-module-scala/target/scala-2.10/jackson-module-scala_2.10-2.6.0-rc3-SNAPSHOT.jar
      ./bin/pyspark
      >>> conf = {"hbase.zookeeper.quorum": "search-fqa-test12","hbase.mapreduce.inputtable": "SS_CATEGORY"}
      >>> rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat", "org.apache.hadoop.hbase.io.ImmutableBytesWritable", "org.apache.hadoop.hbase.client.Result", conf=conf)
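
      If the spark-examples jar is also on the classpath, the converter classes bundled with the Spark 1.x examples (an assumption here, not verified in these notes) should render the key/value types as strings:

      >>> rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat",
      ...                          "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
      ...                          "org.apache.hadoop.hbase.client.Result",
      ...                          keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
      ...                          valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
      ...                          conf=conf)
      >>> rdd.take(1)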
      
  • execution

    $ export HADOOP_CONF_DIR=`hbase classpath`:/etc/hadoop/conf
    $ ./bin/spark-submit --class org.apache.spark.examples.HBaseTest --master local[4] ./examples/target/spark-examples_2.10-1.5.0-SNAPSHOT.jar [hbase table name]
    $ ./bin/spark-submit --driver-class-path ./examples/target/spark-examples_2.10-1.5.0-SNAPSHOT.jar ./examples/src/main/python/hbase_inputformat.py [hbase master] [hbase table name]
    
  • configuration

    $ hadoop version
    Hadoop 2.5.0-cdh5.[2|3].0
    ...
    $ hbase version
    HBase 0.98.6-cdh5.[2|3].0
    ...
    $ java -version (java >= 1.7)
    java version ["1.7.0_55"|"1.8.0_25"]
    ...
    $ mvn -version (3.0.4 <= maven < 3.3.x)
    Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-15T02:29:23+09:00)
    ...
    
    $ git clone https://github.com/apache/spark.git
    $ git log
    commit d9838196ff48faeac19756852a7f695129c08047
    Author: Josh Rosen <joshrosen@databricks.com>
    Date:   Thu Jul 2 18:07:09 2015 -0700
    ...
    $ vi pom.xml
    ...
      <hadoop.version>2.5.0-cdh5.[2|3].0</hadoop.version>
    ...
      <hbase.version>0.98.6-cdh5.[2|3].0</hbase.version>
    ...
    
    $ mvn -DskipTests clean package
    
  • troubleshooting

    • Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former -> unset SPARK_CLASSPATH

    • Total size of serialized results of tasks is bigger than spark.driver.maxResultSize
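
      A hedged workaround sketch (the 2g value is illustrative): raise the limit when building the context, or avoid collect()-ing large results to the driver in the first place

      from pyspark import SparkConf, SparkContext

      conf = (SparkConf().setAppName("app name")
                         .set("spark.driver.maxResultSize", "2g"))
      sc = SparkContext(conf=conf)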

    • pyspark OutOfMemoryError Java heap space -> allocate appropriate executor memory

      from pyspark import SparkConf, SparkContext

      conf = (SparkConf().setAppName("app name")
                         .set("spark.executor.memory", "1g"))
      sc = SparkContext(conf=conf)
      
  • failed with yarn-cluster
  • pyspark supports yarn-client mode, NOT yarn-cluster yet
  • prerequisite
    • ref. https://gist.github.com/laserson/1d1185b412b41057810b

    • distribute configuration files via cloudera manager

    • prepare a configuration directory that includes yarn-site.xml

      # cd /data4/etc/hadoop/conf.cloudera.hdfs
      # cp /data4/etc/hive/conf.cloudera.hive/yarn-site.xml .
      
    • mvn package for yarn

      $ cd /path/to/spark
      $ MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m" mvn -Pyarn -DskipTests package
      $ hadoop fs -put assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop2.5.0-cdh5.2.0.jar /tmp
      
    • check that the python syntax works on 2.6 (auto-numbered {} format fields and dict comprehensions require python 2.7+)

      print('{}'.format(var)) -> print('%s' % var)
      lambda x: {c:x[i] for i, c in enumerate(columns)} -> lambda x: dict([(c, x[i]) for i, c in enumerate(columns)])
      
    • execute

      $ export PYSPARK_PYTHON=/usr/bin/python2.6
      $ export PYSPARK_SUBMIT_ARGS='--master yarn-client --conf spark.executor.memory=2g'
      $ export HADOOP_CONF_DIR=`hbase classpath`:/data4/etc/hadoop/conf.cloudera.hdfs/
      $ /path/to/spark/bin/spark-submit --conf spark.yarn.jar=hdfs:///tmp/spark-assembly-1.5.0-SNAPSHOT-hadoop2.5.0-cdh5.2.0.jar --master yarn-client --driver-class-path /path/to/spark/examples/spark-examples_2.10-1.5.0-SNAPSHOT.jar [python file] [args...]
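
      For reference, a minimal sketch of such a [python file], kept python 2.6 compatible (the app name and input path are illustrative assumptions):

      from pyspark import SparkConf, SparkContext

      conf = SparkConf().setAppName("yarn-client example")
      sc = SparkContext(conf=conf)

      lines = sc.textFile("documents/part-m-04618")
      print('%d lines' % lines.count())  # % formatting rather than str.format, for python 2.6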
      