@msukmanowsky
Created November 14, 2014 01:32
Example of how to save Spark RDDs to disk using GZip compression in response to https://twitter.com/rjurney/status/533061960128929793.
from pyspark import SparkContext


def main():
    sc = SparkContext(appName="Test Compression")
    # RDD has to be key, value pairs
    data = sc.parallelize([
        ("key1", "value1"),
        ("key2", "value2"),
        ("key3", "value3"),
    ])
    data.saveAsHadoopFile("/tmp/spark_compressed",
                          "org.apache.hadoop.mapred.TextOutputFormat",
                          compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
    sc.stop()


if __name__ == "__main__":
    main()
@gshen
gshen commented May 4, 2015

Jayson, you can use

rdd.map(lambda line: (line, ""))

to turn it into a pair RDD.

@dsfarrar
Jayson,
Building on what gshen commented, you might be able to use:

rdd.map(lambda line: (line, None))

before calling saveAsHadoopFile(...). It's not obvious from the documentation, but Python's None appears to be converted to NullWritable when saveAsHadoopFile builds the underlying TextOutputFormat&lt;K,V&gt;. TextOutputFormat skips a NullWritable value entirely, writing just the key text with no trailing separator or whitespace. You might want to try it and see if it works for you.
