Add LZO compression codecs to Apache Hadoop and Spark

LZO is a splittable compression format for files stored in Hadoop's HDFS. It offers a valuable combination of speed and compression ratio. Thanks to hadoop-lzo, .lzo files can also be split, once an index file is created alongside them (see the indexing steps below).

  • Install the lzo library and the lzop tool (macOS, via Homebrew).

$ brew install lzo lzop
  • Find where the headers and libraries are installed

$ brew list lzo

The output should look like the following:

/usr/local/Cellar/lzo/2.06/include/lzo/ (13 files)
/usr/local/Cellar/lzo/2.06/lib/liblzo2.2.dylib
/usr/local/Cellar/lzo/2.06/lib/ (2 other files)
/usr/local/Cellar/lzo/2.06/share/doc/ (7 files)
  • Clone the hadoop-lzo repository.

$ git clone https://github.com/twitter/hadoop-lzo
$ cd hadoop-lzo
  • Build the project (Maven required).

$ C_INCLUDE_PATH=/usr/local/Cellar/lzo/2.06/include/lzo/ LIBRARY_PATH=/usr/local/Cellar/lzo/2.06/lib/ mvn clean install
  • Copy the libraries into the Hadoop installation directory. We assume that HADOOP_INSTALL points to the Hadoop installation folder (for example /usr/local/hadoop).

$ cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar $HADOOP_INSTALL/lib
$ mkdir -p $HADOOP_INSTALL/lib/lzo
$ cp -r target/native/* $HADOOP_INSTALL/lib/lzo
  • Add the hadoop-lzo jar and the native libraries to Hadoop's classpath and library path. Do it either in ~/.bash_profile or in $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh.

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_INSTALL/lib/native/osx:$HADOOP_INSTALL/lib/native/lzo"
  • Add the LZO compression codecs to Hadoop's $HADOOP_INSTALL/etc/hadoop/core-site.xml (a quick verification sketch follows the XML below).

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
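
To double-check that the codecs above are picked up, here is a minimal Scala sketch (an assumption of this guide's setup: core-site.xml and the hadoop-lzo jar must be on the classpath, for example in a spark-shell or scala REPL started with the Hadoop jars) that resolves a codec by file extension:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.CompressionCodecFactory

// new Configuration() loads core-site.xml from the classpath, including io.compression.codecs.
val conf = new Configuration()

// Resolve a codec by file extension; expect com.hadoop.compression.lzo.LzopCodec for .lzo files.
val codec = new CompressionCodecFactory(conf).getCodec(new Path("file.lzo"))
println(if (codec == null) "no codec registered for .lzo" else codec.getClass.getName)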
  • Add the LZO dependencies to the Apache Spark configuration in $SPARK_INSTALL/conf/spark-env.sh.

export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_INSTALL/lib/native/osx:$HADOOP_INSTALL/lib/native/lzo
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
  • Add the LZO compression codec to the Hadoop Configuration instance that is passed to the SparkContext (driver); a fuller driver sketch follows the snippet.

conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec");
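
For context, a minimal Scala sketch of where this setting lives in a Spark driver (the application name and master URL are placeholders, not part of the original steps):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder app name and master; adjust for your cluster.
val sc = new SparkContext(new SparkConf().setAppName("lzo-example").setMaster("local[*]"))

// Register the LZOP codec on the Hadoop Configuration that Spark passes to its Hadoop input/output formats.
sc.hadoopConfiguration.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")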
  • Convert a file (for example a .bz2 file) to the LZO format and import the new file into HDFS.

$ bzip2 --decompress --stdout file.bz2 | lzop -o file.lzo
$ hdfs dfs -put file.lzo input
  • Index lzo compressed files directly in HDFS

$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer input/file.lzo

or index all lzo files in the input folder

$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer input

or index lzo files with a MapReduce job

$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer input
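
Once the file is indexed, reading and writing it from Spark can look like the following Scala sketch (a non-authoritative example, continuing with the sc from the driver sketch above; input/file.lzo is the path used in the hdfs dfs -put step, and the output path is a placeholder):

import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.compression.lzo.LzopCodec
import com.hadoop.mapreduce.LzoTextInputFormat

// Read the indexed .lzo file; with the .index file present, input splits follow the index blocks.
val lines = sc.newAPIHadoopFile(
  "input/file.lzo",
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text]
).map { case (_, text) => text.toString }

println(s"line count: ${lines.count()}")

// Write the result back compressed with LZOP; the output directory is a placeholder.
lines.saveAsTextFile("output/lzo-example", classOf[LzopCodec])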

