from pyspark import SparkContext


def main():
    sc = SparkContext(appName="Test Compression")
    # RDD has to be key, value pairs
    data = sc.parallelize([
        ("key1", "value1"),
        ("key2", "value2"),
        ("key3", "value3"),
    ])
    data.saveAsHadoopFile("/tmp/spark_compressed",
                          "org.apache.hadoop.mapred.TextOutputFormat",
                          compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
    sc.stop()


if __name__ == "__main__":
    main()
Looks like this will work for me: saveAsTextFile(String path, Class<? extends org.apache.hadoop.io.compress.CompressionCodec> codec)
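In PySpark, saveAsTextFile likewise accepts the codec as a keyword argument and works on a plain RDD, with no key/value pairing required. A minimal sketch (the path, app name, and helper name are made up for illustration):

```python
# Sketch of the saveAsTextFile route; unlike saveAsHadoopFile, this works
# on a plain RDD of strings. The import guard just lets the helper be
# exercised in environments where Spark isn't installed.
try:
    from pyspark import SparkContext
except ImportError:
    SparkContext = None

GZIP_CODEC = "org.apache.hadoop.io.compress.GzipCodec"


def save_gzipped_text(rdd, path, codec=GZIP_CODEC):
    # PySpark's saveAsTextFile takes the codec as a class-name string.
    rdd.saveAsTextFile(path, compressionCodecClass=codec)


if __name__ == "__main__" and SparkContext is not None:
    sc = SparkContext(appName="Text Compression")
    save_gzipped_text(sc.parallelize(["line1", "line2"]), "/tmp/gzipped_text")
    sc.stop()
```

The output directory will contain gzip-compressed part files (part-00000.gz and so on), same as with the saveAsHadoopFile version above.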
The parameter types of saveAsHadoopFile require the RDD to be a pair RDD, and you explicitly made data a key-value object. Is it possible to compress Spark outputs that are not in key-value form? My research indicates it isn't without writing your own method; the Spark API doesn't seem to support it, which is strange.
Jayson, you can use
rdd.map(lambda line: (line, ""))
to turn it into a pair RDD.
Jayson,
Building on what gshen commented, you might be able to use:
rdd.map(lambda line: (line, None))
before calling saveAsHadoopFile(...). It's not obvious from the documentation, but it looks like None in Python gets mapped to NullWritable when saveAsHadoopFile creates the underlying TextOutputFormat<K,V>. This causes the TextOutputFormat to effectively skip writing the value, leaving just the key text -- no extra whitespace tacked onto the end. You might want to try it and see if it works for you.
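To make that concrete, a small sketch of the (line, None) trick in PySpark syntax; the RDD name lines is assumed, and the Spark call itself is shown as a comment since it needs a running SparkContext:

```python
def to_null_pair(line):
    # None on the value side appears to map to NullWritable on the Hadoop
    # side, so TextOutputFormat writes only the key text, with no
    # separator or value appended.
    return (line, None)

# Applied before saving (assumes `lines` is a plain RDD of strings):
#   lines.map(to_null_pair).saveAsHadoopFile(
#       "/tmp/spark_compressed",
#       "org.apache.hadoop.mapred.TextOutputFormat",
#       compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
```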
You can use any of the Hadoop-supported compression codecs:
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.BZip2Codec
com.hadoop.compression.lzo.LzopCodec