Add LZO compression codecs to Apache Hadoop and Spark

LZO is a compression format that offers a valuable combination of speed and compression ratio. Thanks to hadoop-lzo, .lzo files stored in Hadoop’s HDFS can also be made splittable once they are indexed.

  • Install the lzo and lzop codecs [OSX].

$ brew install lzo lzop
  • Find where the headers and libraries are installed

$ brew list lzo

The output should look similar to the following:

/usr/local/Cellar/lzo/2.06/include/lzo/ (13 files)
/usr/local/Cellar/lzo/2.06/lib/liblzo2.2.dylib
/usr/local/Cellar/lzo/2.06/lib/ (2 other files)
/usr/local/Cellar/lzo/2.06/share/doc/ (7 files)
  • Clone the hadoop-lzo repository.

$ git clone https://github.com/twitter/hadoop-lzo
$ cd hadoop-lzo
  • Build the project (Maven required).

$ C_INCLUDE_PATH=/usr/local/Cellar/lzo/2.06/include/lzo/ LIBRARY_PATH=/usr/local/Cellar/lzo/2.06/lib/ mvn clean install
  • Copy the libraries into the Hadoop installation directory. We assume that HADOOP_INSTALL points to the Hadoop installation directory (for example, /usr/local/hadoop).

$ cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar $HADOOP_INSTALL/lib
$ mkdir -p $HADOOP_INSTALL/lib/lzo
$ cp -r target/native/* $HADOOP_INSTALL/lib/lzo
  • Add the hadoop-lzo jar and the native libraries to Hadoop’s classpath and library path. Do this either in ~/.bash_profile or in $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_INSTALL/lib/native/osx:$HADOOP_INSTALL/lib/native/lzo"
  • Add the LZO compression codecs to Hadoop’s $HADOOP_INSTALL/etc/hadoop/core-site.xml

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
  • Add the LZO dependencies to the Apache Spark configuration in $SPARK_INSTALL/conf/spark-env.sh

export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_INSTALL/lib/native/osx:$HADOOP_INSTALL/lib/native/lzo
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
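Note that newer Spark releases deprecate SPARK_CLASSPATH and SPARK_LIBRARY_PATH; on those versions the equivalent settings can go into $SPARK_INSTALL/conf/spark-defaults.conf instead. A sketch, with literal paths because spark-defaults.conf does not expand environment variables (adjust them to your installation):

```
spark.driver.extraClassPath      /usr/local/hadoop/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
spark.executor.extraClassPath    /usr/local/hadoop/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
spark.driver.extraLibraryPath    /usr/local/hadoop/lib/native/osx:/usr/local/hadoop/lib/native/lzo
spark.executor.extraLibraryPath  /usr/local/hadoop/lib/native/osx:/usr/local/hadoop/lib/native/lzo
```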
  • Add the LZO compression codec to the Hadoop Configuration instance that you pass to the SparkContext on the driver

conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec");
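In context, the driver-side setup might look like the following minimal Java sketch (the class name, app name, and input path are illustrative; JavaSparkContext.hadoopConfiguration() returns the Configuration instance that Spark hands to Hadoop input formats):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Illustrative sketch: register the LZOP codec on the driver's
// Hadoop Configuration so that .lzo files are decompressed on read.
public class LzoExample {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("lzo-example");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // The Configuration used for subsequent Hadoop reads.
        Configuration conf = sc.hadoopConfiguration();
        conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec");

        // .lzo files are now transparently decompressed:
        long count = sc.textFile("input/file.lzo").count();
        System.out.println(count);

        sc.stop();
    }
}
```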
  • Convert a file (for example, a .bz2 file) to the LZO format and upload it to HDFS

$ bzip2 --decompress --stdout file.bz2 | lzop -o file.lzo
$ hdfs dfs -put file.lzo input
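As a quick sanity check, hdfs dfs -text decompresses through the configured codecs, so it should print the original plain-text contents (assuming the core-site.xml changes above are in place):

```shell
$ hdfs dfs -text input/file.lzo | head
```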
  • Index LZO-compressed files directly in HDFS

$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer input/file.lzo

or index all .lzo files in the input folder

$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer input

or index the files with a MapReduce job

$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer input
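Indexing is what makes the .lzo files splittable: with an index present, hadoop-lzo's LzoTextInputFormat can divide a file into multiple input splits instead of feeding it to a single mapper. A hedged sketch of reading such a file from Spark, where sc is an existing JavaSparkContext (newAPIHadoopFile is the standard Spark API; LzoTextInputFormat ships with hadoop-lzo):

```java
import com.hadoop.mapreduce.LzoTextInputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;

// Read an indexed .lzo file through hadoop-lzo's splittable input format,
// so the file is processed by several tasks in parallel.
JavaPairRDD<LongWritable, Text> lines = sc.newAPIHadoopFile(
        "input/file.lzo",
        LzoTextInputFormat.class,
        LongWritable.class,
        Text.class,
        sc.hadoopConfiguration());
```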
