@msukmanowsky
Created November 14, 2014 01:32
Example of how to save Spark RDDs to disk using GZip compression in response to https://twitter.com/rjurney/status/533061960128929793.
from pyspark import SparkContext


def main():
    sc = SparkContext(appName="Test Compression")
    # RDD has to be key, value pairs
    data = sc.parallelize([
        ("key1", "value1"),
        ("key2", "value2"),
        ("key3", "value3"),
    ])
    data.saveAsHadoopFile("/tmp/spark_compressed",
                          "org.apache.hadoop.mapred.TextOutputFormat",
                          compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
    sc.stop()


if __name__ == "__main__":
    main()
@msukmanowsky
Author

You can use any of the Hadoop-supported compression codecs:

  • gzip: org.apache.hadoop.io.compress.GzipCodec
  • bzip2: org.apache.hadoop.io.compress.BZip2Codec
  • LZO: com.hadoop.compression.lzo.LzopCodec
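If you switch codecs often, a small lookup table saves retyping the fully qualified class names. This is only a sketch: the dictionary and the `codec_class_name` helper are hypothetical names, not part of Spark or Hadoop, and the class names are copied from the list above. The LZO codec is not bundled with Hadoop and is assumed to be installed separately on the cluster.

```python
# Hypothetical lookup table; class names are taken from the list above.
CODECS = {
    "gzip": "org.apache.hadoop.io.compress.GzipCodec",
    "bzip2": "org.apache.hadoop.io.compress.BZip2Codec",
    "lzo": "com.hadoop.compression.lzo.LzopCodec",  # assumes hadoop-lzo is installed
}


def codec_class_name(short_name):
    """Return the codec class name for a short, case-insensitive name."""
    try:
        return CODECS[short_name.lower()]
    except KeyError:
        raise ValueError("Unknown codec: %r (expected one of %s)"
                         % (short_name, ", ".join(sorted(CODECS))))
```

The returned string can be passed as `compressionCodecClass` to `saveAsHadoopFile`, for example `compressionCodecClass=codec_class_name("bzip2")`.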

@rjurney

rjurney commented Nov 18, 2014

Looks like this will work for me: saveAsTextFile(String path, Class<? extends org.apache.hadoop.io.compress.CompressionCodec> codec)
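The signature quoted above is from the Java/Scala API. In PySpark the corresponding method is `RDD.saveAsTextFile(path, compressionCodecClass=None)`, which takes the codec class name as a string and does not need key-value pairs. A minimal sketch, where `save_text_compressed`, `demo`, and the output path are illustrative names rather than anything from the gist:

```python
GZIP_CODEC = "org.apache.hadoop.io.compress.GzipCodec"


def save_text_compressed(rdd, path, codec=GZIP_CODEC):
    """Write each element of `rdd` as one line of text, compressed with `codec`."""
    rdd.saveAsTextFile(path, compressionCodecClass=codec)


def demo():
    # Imported here so the helper above can be reused without starting Spark.
    from pyspark import SparkContext

    sc = SparkContext(appName="Compressed Text Demo")
    try:
        # A plain RDD of strings; no (key, value) tuples are needed.
        lines = sc.parallelize(["line one", "line two", "line three"])
        save_text_compressed(lines, "/tmp/spark_compressed_text")
    finally:
        sc.stop()
```

Calling `demo()` on a machine with Spark available should write gzip-compressed part files under the given directory. The directory must not already exist.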

@JaysonSunshine

The parameter types of saveAsHadoopFile require the RDD to be a pair RDD, and you explicitly made data a collection of key-value pairs. Is it possible to compress Spark outputs that are not in key-value form? My research suggests not, short of writing your own method; i.e., the Spark API doesn't support it, which seems strange.

@gshen

gshen commented May 4, 2015

Jayson, you can use

rdd.map(line=>(line, "")) 

to turn it into pairRDD.

@dsfarrar

Jayson,
Building on what gshen commented, you might be able to use:

rdd.map(lambda line: (line, None))

before calling saveAsHadoopFile(...). It's not obvious from the documentation, but it looks like None in Python gets mapped to NullWritable when saveAsHadoopFile creates the underlying TextOutputFormat<K,V>. This causes the TextOutputFormat to effectively skip writing the value, leaving just the key text -- no extra whitespace tacked onto the end. You might want to try it and see if it works for you.
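Putting that suggestion together with the original gist gives roughly the following. It is a sketch under the assumption described in the comment above (that a `None` value ends up as a `NullWritable` and is therefore skipped by `TextOutputFormat`). The helper names `as_key_only_pair` and `save_lines_via_hadoop_file` are made up for illustration.

```python
GZIP_CODEC = "org.apache.hadoop.io.compress.GzipCodec"
TEXT_OUTPUT_FORMAT = "org.apache.hadoop.mapred.TextOutputFormat"


def as_key_only_pair(line):
    """Wrap a line as (key, value) with no value, so only the key is written."""
    return (line, None)


def save_lines_via_hadoop_file(rdd, path, codec=GZIP_CODEC):
    """Save a non-pair RDD of lines through saveAsHadoopFile with compression."""
    pairs = rdd.map(as_key_only_pair)
    pairs.saveAsHadoopFile(path, TEXT_OUTPUT_FORMAT,
                           compressionCodecClass=codec)
```

With a live `SparkContext` named `sc`, usage would look like `save_lines_via_hadoop_file(sc.parallelize(["a", "b"]), "/tmp/spark_lines")`. It is worth decompressing one part file to confirm that no trailing tab or whitespace appears in your Spark version.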
