@zedar
Last active April 24, 2024 14:33
Add LZO compression codecs to the Apache Hadoop and Spark

LZO is a compression format that offers a useful balance of speed and compression ratio for files stored in Hadoop’s HDFS. A plain .lzo file is not splittable on its own; the hadoop-lzo project makes .lzo files splittable by building an index of their compressed blocks.

  • Install the lzo library and the lzop command-line tool [OSX].

$ brew install lzo lzop
  • Find where the headers and libraries are installed

$ brew list lzo

The output should look like this:

/usr/local/Cellar/lzo/2.06/include/lzo/ (13 files)
/usr/local/Cellar/lzo/2.06/lib/liblzo2.2.dylib
/usr/local/Cellar/lzo/2.06/lib/ (2 other files)
/usr/local/Cellar/lzo/2.06/share/doc/ (7 files)
  • Clone the hadoop-lzo repository.

$ git clone https://github.com/twitter/hadoop-lzo
$ cd hadoop-lzo
  • Build the project (Maven required)

$ C_INCLUDE_PATH=/usr/local/Cellar/lzo/2.06/include/lzo/ LIBRARY_PATH=/usr/local/Cellar/lzo/2.06/lib/ mvn clean install
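
The build command above hard-codes the lzo version (2.06) in both paths. As a minimal sketch, assuming Homebrew keeps the same `include/lzo/` and `lib/` layout under the prefix printed by `brew --prefix lzo`, the paths can be derived instead of typed by hand. The helper names `lzo_include_dir`, `lzo_lib_dir` and `build_hadoop_lzo` are hypothetical and not part of hadoop-lzo or Homebrew.

```shell
# Hypothetical helpers (not part of hadoop-lzo or Homebrew).

# Print the header directory for an lzo install prefix.
lzo_include_dir() {
  printf '%s/include/lzo/\n' "${1%/}"   # ${1%/} drops one trailing slash
}

# Print the library directory for an lzo install prefix.
lzo_lib_dir() {
  printf '%s/lib/\n' "${1%/}"
}

# Build hadoop-lzo against the lzo version Homebrew currently has installed.
# Run from inside the hadoop-lzo checkout; needs brew and mvn on PATH.
build_hadoop_lzo() {
  prefix="$(brew --prefix lzo)" || return 1
  C_INCLUDE_PATH="$(lzo_include_dir "$prefix")" \
  LIBRARY_PATH="$(lzo_lib_dir "$prefix")" \
    mvn clean install
}
```

Nothing runs until `build_hadoop_lzo` is called, so the snippet can be sourced safely on a machine without Homebrew or Maven.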
  • Copy the libraries into the Hadoop installation directory. We assume that HADOOP_INSTALL points to the Hadoop installation folder (for example /usr/local/hadoop).

$ cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar $HADOOP_INSTALL/lib
$ mkdir -p $HADOOP_INSTALL/lib/native/lzo
$ cp -r target/native/* $HADOOP_INSTALL/lib/native/lzo
  • Add hadoop-lzo jar and native libraries to hadoop’s classpath and library path. Do it either in ~/.bash_profile or $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_INSTALL/lib/native/osx:$HADOOP_INSTALL/lib/native/lzo"
  • Add the lzo compression codecs to Hadoop’s $HADOOP_INSTALL/etc/hadoop/core-site.xml

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
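
To confirm that Hadoop actually picks up the edited core-site.xml, the effective value can be read back with `hdfs getconf -confKey`. The sketch below assumes the `hdfs` command is on the PATH; `has_codec` is a hypothetical helper that only checks whether a class name appears as a whole entry in the comma-separated list.

```shell
# has_codec LIST CLASS
# Hypothetical helper: succeeds when CLASS is one of the entries in the
# comma-separated LIST (whitespace around entries is ignored).
has_codec() {
  # Strip all whitespace, then wrap in commas so only whole names match.
  list=",$(printf '%s' "$1" | tr -d '[:space:]'),"
  case "$list" in
    *",$2,"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Usage against a live installation (requires the hdfs CLI):
#   has_codec "$(hdfs getconf -confKey io.compression.codecs)" \
#     com.hadoop.compression.lzo.LzopCodec && echo "LzopCodec registered"
```

Matching on whole entries avoids a false positive where `LzoCodec` would otherwise be found inside a longer class name.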
  • Add lzo dependencies to the Apache Spark configuration $SPARK_INSTALL/conf/spark-env.sh

export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_INSTALL/lib/native/osx:$HADOOP_INSTALL/lib/native/lzo
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
  • Add the lzo compression codec to the Hadoop Configuration instance that you pass to the SparkContext (driver)

conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec");
  • Convert a file (for example bz2) to the lzo format and upload the new file to HDFS

$ bzip2 --decompress --stdout file.bz2 | lzop -o file.lzo
$ hdfs dfs -put file.lzo input
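
When several .bz2 files need converting, the same pipeline can be wrapped in a small function. This is a sketch under the assumption that `bzip2` and `lzop` are installed locally; `lzo_name` and `bz2_to_lzo` are hypothetical names. It also runs `lzop -t` on the result to test the integrity of the new file before it is uploaded.

```shell
# Hypothetical helper: map "name.bz2" to "name.lzo".
lzo_name() {
  printf '%s.lzo\n' "${1%.bz2}"
}

# Hypothetical helper: recompress one .bz2 file as .lzo and test the result.
bz2_to_lzo() {
  out="$(lzo_name "$1")"
  # Decompress to stdout, recompress with lzop, then check integrity.
  bzip2 --decompress --stdout "$1" | lzop -o "$out" && lzop -t "$out"
}

# Usage:
#   for f in *.bz2; do bz2_to_lzo "$f"; done
```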
  • Index lzo compressed files directly in HDFS

$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer input/file.lzo

or index all lzo files in the input folder

$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer input

or index lzo files with a MapReduce job

$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer input
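
The indexers write an index file next to each compressed file, named after it with an `.index` suffix (for example `input/file.lzo.index`). A quick way to check whether a file has been indexed, assuming the `hdfs` command is on the PATH, is sketched below; `lzo_index_path` and `is_indexed` are hypothetical helper names.

```shell
# Hypothetical helper: print the index path that belongs to an .lzo file.
lzo_index_path() {
  printf '%s.index\n' "$1"
}

# Hypothetical helper: succeeds when the index file exists in HDFS.
# `hdfs dfs -test -e PATH` exits with 0 if PATH exists.
is_indexed() {
  hdfs dfs -test -e "$(lzo_index_path "$1")"
}

# Usage:
#   is_indexed input/file.lzo && echo "indexed" || echo "not indexed yet"
```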
