zedar/ApacheHadoopSpark_LZO.adoc

## ApacheHadoopSpark_LZO.adoc

      
            
          
Add LZO compression codecs to Apache Hadoop and Spark


LZO is a compression format for files stored in Hadoop’s HDFS that offers a valuable combination of speed and compression size. Plain .lzo files are not splittable on their own; hadoop-lzo makes them splittable by building an index for each file.


Install the lzo library and the lzop tool [OSX].


$ brew install lzo lzop


Find where the headers and libraries are installed


$ brew list lzo


The output should look like the following:


/usr/local/Cellar/lzo/2.06/include/lzo/ (13 files)
/usr/local/Cellar/lzo/2.06/lib/liblzo2.2.dylib
/usr/local/Cellar/lzo/2.06/lib/ (2 other files)
/usr/local/Cellar/lzo/2.06/share/doc/ (7 files)


Clone hadoop-lzo repository.


$ git clone https://github.com/twitter/hadoop-lzo
$ cd hadoop-lzo


Build the project (maven required)


$ C_INCLUDE_PATH=/usr/local/Cellar/lzo/2.06/include/lzo/ LIBRARY_PATH=/usr/local/Cellar/lzo/2.06/lib/ mvn clean install
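
The version number in these paths changes with every lzo release. As a minimal sketch, assuming Homebrew's `brew --prefix lzo` resolves to the installed formula, a small helper (the name `lzo_build_env` is hypothetical) can assemble the same two variables from any prefix:

```shell
# Hypothetical helper: print the two build variables for a given lzo prefix.
# It mirrors the directory layout used in the build command above.
lzo_build_env() {
  prefix="${1%/}"   # drop a trailing slash, if any
  printf 'C_INCLUDE_PATH=%s/include/lzo/ LIBRARY_PATH=%s/lib/\n' "$prefix" "$prefix"
}

# Example (not executed here; requires Homebrew and Maven, and a prefix
# without spaces because the output is word-split by the shell):
#   env $(lzo_build_env "$(brew --prefix lzo)") mvn clean install
```

Printing the assignments instead of exporting them keeps the helper free of side effects, so the values can be inspected before the build starts.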


Copy the libraries into the Hadoop installation directory. We assume that HADOOP_INSTALL points to the Hadoop installation folder (for example /usr/local/hadoop).


$ cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar $HADOOP_INSTALL/lib
$ mkdir -p $HADOOP_INSTALL/lib/native/lzo
$ cp -r target/native/* $HADOOP_INSTALL/lib/native/lzo


Add the hadoop-lzo jar and native libraries to Hadoop’s classpath and library path, either in ~/.bash_profile or in $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh


export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_INSTALL/lib/native/osx:$HADOOP_INSTALL/lib/native/lzo"
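
Before relying on these exports, it can help to confirm that the files they point to actually exist. The following is a minimal sketch, not part of hadoop-lzo; the function name `check_lzo_install` and its two arguments (the jar directory and the native library directory) are hypothetical:

```shell
# Hypothetical helper: verify that a hadoop-lzo jar and the native
# libraries are where the exports expect them.
# $1: directory holding the jar, $2: directory holding the native libraries
check_lzo_install() {
  jar_dir="$1"
  native_dir="$2"
  # At least one hadoop-lzo jar must exist in the jar directory.
  set -- "$jar_dir"/hadoop-lzo-*.jar
  [ -e "$1" ] || { echo "missing hadoop-lzo jar in $jar_dir" >&2; return 1; }
  # The native directory must exist and must not be empty.
  [ -d "$native_dir" ] && [ -n "$(ls -A "$native_dir")" ] || {
    echo "no native libraries in $native_dir" >&2; return 1; }
  echo "ok"
}

# Example (not executed here):
#   check_lzo_install "$HADOOP_INSTALL/lib" "$HADOOP_INSTALL/lib/native/lzo"
```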


Add the LZO compression codecs to Hadoop’s $HADOOP_INSTALL/etc/hadoop/core-site.xml


<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
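
To confirm that Hadoop picks up the new setting, the effective value can be read back with `hdfs getconf -confKey`. The sketch below assumes a configured Hadoop client on the PATH; the helper name `has_codec` is hypothetical. It checks whether a codec class appears in a comma-separated list and tolerates whitespace around the entries:

```shell
# Hypothetical helper: succeed if codec class $2 appears in the
# comma-separated list $1, ignoring whitespace around each entry.
has_codec() {
  printf '%s\n' "$1" \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -qxF "$2"
}

# Example (not executed here; requires a configured Hadoop client):
#   has_codec "$(hdfs getconf -confKey io.compression.codecs)" \
#     com.hadoop.compression.lzo.LzopCodec && echo "LzopCodec registered"
```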


Add the LZO dependencies to the Apache Spark configuration in $SPARK_INSTALL/conf/spark-env.sh


export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_INSTALL/lib/native/osx:$HADOOP_INSTALL/lib/native/lzo
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar


Add the LZO compression codec to the Hadoop Configuration instance that you pass to the SparkContext (driver) instance


conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec");


Convert a file (for example bz2) to the lzo format by decompressing it and piping the result into lzop, then import the new file into Hadoop’s HDFS


$ bzip2 --decompress --stdout file.bz2 | lzop -o file.lzo
$ hdfs dfs -put file.lzo input
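
When several archives need converting, the same pipeline can be wrapped in a loop. This is a minimal sketch that assumes `bzip2` and `lzop` are on the PATH; the helper names `lzo_target` and `convert_all` are hypothetical:

```shell
# Hypothetical helper: map "name.bz2" to "name.lzo".
lzo_target() {
  printf '%s\n' "${1%.bz2}.lzo"
}

# Hypothetical helper: convert every .bz2 file in directory $1 to .lzo,
# leaving the original .bz2 files in place.
convert_all() {
  for f in "$1"/*.bz2; do
    [ -e "$f" ] || continue   # the glob matched nothing
    bzip2 --decompress --stdout "$f" | lzop -o "$(lzo_target "$f")"
  done
}

# Example (not executed here):
#   convert_all ./data && hdfs dfs -put ./data/*.lzo input
```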


Index lzo compressed files directly in HDFS


$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer input/file.lzo


or index all lzo files in the input folder


$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer input


or index the lzo files with a MapReduce job


$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer input
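
The indexers write a `.index` file next to each `.lzo` file, so a directory listing shows which files were indexed. The sketch below reads paths from standard input and prints every `.lzo` path that has no matching index. The helper name `missing_indexes` is hypothetical, and the usage line assumes a Hadoop version whose `hdfs dfs -ls` supports the `-C` (paths only) option:

```shell
# Hypothetical helper: read one path per line on stdin and print each
# .lzo path for which no "<path>.index" entry is present in the listing.
missing_indexes() {
  paths=$(cat)
  printf '%s\n' "$paths" | grep '\.lzo$' | while IFS= read -r p; do
    printf '%s\n' "$paths" | grep -qxF "$p.index" || printf '%s\n' "$p"
  done
}

# Example (not executed here; requires a running HDFS):
#   hdfs dfs -ls -C input | missing_indexes
```

An empty result means every `.lzo` file in the listing has an index.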


REFERENCES


[[[1]]] http://xiaming.me/posts/2014/05/03/enable-lzo-compression-on-hadoop-pig-and-spark/


[[[2]]] https://github.com/twitter/hadoop-lzo


[[[3]]] https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/reading-lzo-files.md

