Last active September 3, 2019 03:46

Add LZO compression codecs to Apache Hadoop and Spark

LZO is a compression format well suited to files stored in Hadoop’s HDFS. It offers a valuable combination of compression speed and compressed size. A plain .lzo file is not splittable on its own, but once hadoop-lzo has indexed it, it can be split across tasks.

  • Install the lzo library and the lzop tool [OSX].

$ brew install lzo lzop
  • Find where the headers and libraries are installed

$ brew list lzo

The output should look like the following:

/usr/local/Cellar/lzo/2.06/include/lzo/ (13 files)
/usr/local/Cellar/lzo/2.06/lib/ (2 other files)
/usr/local/Cellar/lzo/2.06/share/doc/ (7 files)
  • Clone hadoop-lzo repository.

$ git clone
$ cd hadoop-lzo
  • Build the project (maven required)

$ C_INCLUDE_PATH=/usr/local/Cellar/lzo/2.06/include/lzo/ LIBRARY_PATH=/usr/local/Cellar/lzo/2.06/lib/ mvn clean install
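
Because the Cellar path embeds the lzo version, it can help to derive both variables from the install prefix rather than typing them by hand. The following is a minimal sketch, not part of hadoop-lzo: `lzo_build_env` is a hypothetical helper name, and the `include/lzo/` and `lib/` suffixes simply mirror the paths used in the command above.

```shell
#!/bin/sh
# Hypothetical helper: print the two environment assignments the build
# needs for a given lzo install prefix (for example the output of
# `brew --prefix lzo`). The suffixes mirror the paths used in this guide.
lzo_build_env() {
    prefix="${1%/}"    # drop one trailing slash, if present
    printf 'C_INCLUDE_PATH=%s/include/lzo/ LIBRARY_PATH=%s/lib/\n' "$prefix" "$prefix"
}
```

Assuming Homebrew is installed and the prefix contains no spaces, it could be used as `env $(lzo_build_env "$(brew --prefix lzo)") mvn clean install`.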
  • Copy the libraries into the Hadoop installation directory. This assumes that HADOOP_INSTALL points to the Hadoop installation folder (for example /usr/local/hadoop).

$ cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar $HADOOP_INSTALL/lib
$ mkdir -p $HADOOP_INSTALL/lib/native/lzo
$ cp -r target/native/* $HADOOP_INSTALL/lib/native/lzo
  • Add the hadoop-lzo jar and native libraries to Hadoop’s classpath and library path, either in ~/.bash_profile or in $HADOOP_INSTALL/etc/hadoop/

export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_INSTALL/lib/native/osx:$HADOOP_INSTALL/lib/native/lzo"
  • Add the lzo compression codecs to the io.compression.codecs property in Hadoop’s $HADOOP_INSTALL/etc/hadoop/core-site.xml

  <value>,,, com.hadoop.compression.lzo.LzoCodec, com.hadoop.compression.lzo.LzopCodec
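
To confirm that Hadoop actually picks up the codec list, one option is to read the effective value back and look for the LZO classes. This is a minimal sketch under assumptions: `has_codec` is a hypothetical helper, and it only inspects a comma-separated string, so it says nothing about whether the native library loads.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the comma-separated codec list in $1
# contains exactly the class name in $2. Whitespace around each entry
# is ignored, and empty entries are harmless.
has_codec() {
    printf '%s\n' "$1" \
        | tr ',' '\n' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -Fqx "$2"
}
```

With a working Hadoop installation, it could be called as `has_codec "$(hdfs getconf -confKey io.compression.codecs)" com.hadoop.compression.lzo.LzopCodec && echo found`.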
  • Add lzo dependencies to the Apache Spark configuration $SPARK_INSTALL/conf/

  • Add the lzo compression codec to the Hadoop Configuration instance that you pass to the SparkContext (driver) instance

conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec");
  • Convert a file (for example bz2) to the lzo format and import the new file into Hadoop’s HDFS

$ bzip2 --decompress --stdout file.bz2 | lzop -o file.lzo
$ hdfs dfs -put file.lzo input
  • Index lzo compressed files directly in HDFS

$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer input/file.lzo

or index all lzo files in the input folder

$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer input

or index lzo files with a MapReduce job

$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer input
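
Each indexer run writes a `.index` file next to the `.lzo` file it processed (for example `file.lzo.index`). To spot files that still lack an index, a small filter over a path listing may help. This is a sketch only: `unindexed_lzo` is a hypothetical helper, and it assumes one path per line on stdin.

```shell
#!/bin/sh
# Hypothetical helper: read one path per line on stdin and print every
# .lzo file that has no matching .lzo.index entry in the same listing.
unindexed_lzo() {
    awk '
        # Remember the data file each index belongs to (strip ".index").
        /\.lzo\.index$/ { indexed[substr($0, 1, length($0) - 6)] = 1; next }
        # Collect .lzo files in the order they appear.
        /\.lzo$/        { lzo[++n] = $0 }
        END {
            for (i = 1; i <= n; i++)
                if (!(lzo[i] in indexed)) print lzo[i]
        }
    '
}
```

A possible invocation is `hdfs dfs -ls -C input | unindexed_lzo`, assuming a Hadoop release whose `-ls` supports the `-C` (paths only) option.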

