Spark - File formats and storage options

In this document, I'm using a data file containing 40 million records. The file is a text file with one record per line.

The following Scala code is run in a spark-shell:

val filename = "<path to the file>"
// Read the file as an RDD of lines, then count the lines
val file = sc.textFile(filename)
file.count()

This simply reads the file and counts the number of lines. The number of partitions and the time taken to read the file can be seen in the Spark UI.
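
The number of partitions can also be read directly from the shell instead of the Spark UI; getNumPartitions is available on RDDs since Spark 1.6:

// Number of partitions, without going through the Spark UI
file.getNumPartitions     // Spark 1.6+
file.partitions.length    // equivalent, works on any version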

In the tables below, when the test says "Read from ... + repartition", the file is repartitioned before counting the lines. Repartitioning is required in real-life applications when the initial number of partitions is too low; it ensures that all the cores available on the cluster are used. The code used in this case is the following:

val filename = "<path to the file>"
// Repartition into 460 partitions before counting the lines
val file = sc.textFile(filename).repartition(460)
file.count()

Tests are run on a Spark cluster with 3 c4.4xlarge workers (16 vCPUs each, i.e. 48 cores in total).

Measurements

In the following tables, we test every combination of the following parameters:

  • File is either uncompressed or compressed. We test 2 compression formats: GZ (very common, fast, but not splittable) and BZ2 (splittable, but very CPU-expensive).
  • File is read from an EBS drive or from S3. S3 is what will be used in real life, but the EBS drive serves as a baseline to assess the performance of S3.
  • With or without repartitioning. In a real use case, repartitioning is mandatory to achieve good parallelism when the initial partitioning is not adequate. This does not apply to uncompressed files as they already generate enough partitions. When repartitioning, I asked for 460 partitions as this is the number of partitions created when reading the uncompressed file.
  • Spark 2.0.1 vs Spark 1.6.0. I tested the latest version available with this client's Chef scripts (1.6.0) and the latest version available from Apache (2.0.1).
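
As a reference, here is a minimal sketch of how each measurement can be reproduced in a spark-shell. The time helper is a hypothetical wrapper around System.nanoTime, not a Spark API:

// Hypothetical helper: run a block and print the elapsed time
def time[T](block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"Elapsed: ${(System.nanoTime() - start) / 1e9} s")
  result
}

time { sc.textFile(filename).count() }                   // without repartitioning
time { sc.textFile(filename).repartition(460).count() }  // with repartitioning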

Spark 2.0.1:

| #  | Format       | File size | Test                        | Time    | # partitions |
|----|--------------|-----------|-----------------------------|---------|--------------|
| 1  | Uncompressed | 14.4 GB   | Read from EBS               | 3 s     | 460          |
| 2  | Uncompressed | 14.4 GB   | Read from S3                | 13 s    | 460          |
| 3  | GZ           | 419.3 MB  | Read from EBS               | 47 s    | 1            |
| 4  | GZ           | 419.3 MB  | Read from EBS + repartition | 1.5 min | 460          |
| 5  | GZ           | 419.3 MB  | Read from S3                | 44 s    | 1            |
| 6  | GZ           | 419.3 MB  | Read from S3 + repartition  | 1.4 min | 460          |
| 7  | BZ2          | 236.3 MB  | Read from EBS               | 55 s    | 8            |
| 8  | BZ2          | 236.3 MB  | Read from EBS + repartition | 1.2 min | 460          |
| 9  | BZ2          | 236.3 MB  | Read from S3                | 1.1 min | 8            |
| 10 | BZ2          | 236.3 MB  | Read from S3 + repartition  | 1.5 min | 460          |

Spark 1.6.0:

| #  | Format       | File size | Test                        | Time    | # partitions |
|----|--------------|-----------|-----------------------------|---------|--------------|
| 1  | Uncompressed | 14.4 GB   | Read from EBS               | 3 s     | 460          |
| 2  | Uncompressed | 14.4 GB   | Read from S3                | 19 s    | 460          |
| 3  | GZ           | 419.3 MB  | Read from EBS               | 44 s    | 1            |
| 4  | GZ           | 419.3 MB  | Read from EBS + repartition | 1.7 min | 460          |
| 5  | GZ           | 419.3 MB  | Read from S3                | 46 s    | 1            |
| 6  | GZ           | 419.3 MB  | Read from S3 + repartition  | 1.8 min | 460          |
| 7  | BZ2          | 236.3 MB  | Read from EBS               | 53 s    | 8            |
| 8  | BZ2          | 236.3 MB  | Read from EBS + repartition | 1.2 min | 460          |
| 9  | BZ2          | 236.3 MB  | Read from S3                | 57 s    | 8            |
| 10 | BZ2          | 236.3 MB  | Read from S3 + repartition  | 1.1 min | 460          |

Conclusions

Spark version - Measurements are very similar between Spark 1.6 and Spark 2.0. This makes sense as these tests use RDDs: neither Catalyst nor Tungsten can perform any optimization on them.
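
To illustrate, these tests go through the RDD API; a Dataset-based equivalent (Spark 2.0 only), where Catalyst would plan the count, would look as follows. The Dataset version was not benchmarked here:

// RDD API, as used in these tests: no Catalyst/Tungsten optimization
sc.textFile(filename).count()

// Dataset API (Spark 2.0+): the count is planned by Catalyst
spark.read.textFile(filename).count()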

EBS vs S3 - S3 is slower than the EBS drive (#1 vs #2). Performance of S3 is still very good, though: reading 14.4 GB in 13 seconds amounts to a combined throughput of about 1.1 GB/s. Also, keep in mind that EBS drives have drawbacks: data is not shared between servers (it has to be replicated manually) and IOPS can be throttled.

Compression - GZ files are not ideal because they are not splittable, and therefore require repartitioning. BZ2 files suffer from a similar problem: although they are splittable, they are so heavily compressed that you get very few partitions (8, in this case, on a cluster with 48 cores). The other problem is that reading BZ2 files is CPU-intensive and therefore slow compared to reading uncompressed files. In the end, uncompressed files clearly outperform compressed files: reading uncompressed files is I/O bound while reading compressed files is CPU bound, and I/O throughput is very good here.
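
For reference, here is how such compressed files can be produced from Spark, using the standard Hadoop codecs (output paths are placeholders):

import org.apache.hadoop.io.compress.{BZip2Codec, GzipCodec}

// Uncompressed output: one plain text file per partition
file.saveAsTextFile("<output path>")

// GZ output: fast, but the resulting files are not splittable
file.saveAsTextFile("<output path>", classOf[GzipCodec])

// BZ2 output: splittable, but CPU-expensive to read back
file.saveAsTextFile("<output path>", classOf[BZip2Codec])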

Recommendation

Given this, I recommend storing files on S3 as uncompressed files. This achieves great performance while providing safe storage.
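
In practice, this simply means writing the text files to S3 without a compression codec. A sketch with a hypothetical bucket name, assuming the S3A connector is configured on the cluster:

// Hypothetical bucket and path; no codec means uncompressed output
file.saveAsTextFile("s3a://my-bucket/data/records/")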
