Spark - File formats and storage options

In this document, I'm using a data file containing 40 million records. The file is a text file with one record per line.

The following Scala code is run in a spark-shell:

val filename = "<path to the file>"
// Read the file as an RDD of lines, then count the lines
val file = sc.textFile(filename)
file.count()

This simply reads the file and counts the number of lines. The number of partitions and the time taken to read the file can be seen in the Spark UI.
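
The number of partitions can also be read directly from the shell instead of the Spark UI; getNumPartitions is available on RDDs since Spark 1.6:

// Number of partitions, without going through the Spark UI
file.getNumPartitions     // Spark 1.6+
file.partitions.length    // equivalent, works on any version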

In the tables below, when the test says "Read from ... + repartition", the file is repartitioned before counting the lines. Repartitioning is required in real-life applications when the initial number of partitions is too low; it ensures that all the cores available on the cluster are used. The code used in this case is the following:

val filename = "<path to the file>"
// Repartition into 460 partitions before counting the lines
val file = sc.textFile(filename).repartition(460)
file.count()

Tests are run on a Spark cluster with 3 c4.4xlarge workers (16 vCPUs each, i.e. 48 cores in total).

Measurements

In the following tables, we test every combination of the following parameters:

  • File is either uncompressed or compressed. We test 2 compression formats: GZ (very common, fast, but not splittable) and BZ2 (splittable, but very CPU-expensive).
  • File is read from an EBS drive or from S3. S3 is what will be used in real life, but the EBS drive serves as a baseline to assess the performance of S3.
  • With or without repartitioning. In a real use case, repartitioning is mandatory to achieve good parallelism when the initial partitioning is not adequate. This does not apply to uncompressed files as they already generate enough partitions. When repartitioning, I asked for 460 partitions as this is the number of partitions created when reading the uncompressed file.
  • Spark 2.0.1 vs Spark 1.6.0. I tested the latest version available with this client's Chef scripts (1.6.0) and the latest version available from Apache (2.0.1).
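
As a reference, here is a minimal sketch of how each measurement can be reproduced in a spark-shell. The time helper is a hypothetical wrapper around System.nanoTime, not a Spark API:

// Hypothetical helper: run a block and print the elapsed time
def time[T](block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"Elapsed: ${(System.nanoTime() - start) / 1e9} s")
  result
}

time { sc.textFile(filename).count() }                   // without repartitioning
time { sc.textFile(filename).repartition(460).count() }  // with repartitioning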

Spark 2.0.1:

| #  | Format       | File size | Test                        | Time    | # partitions |
|----|--------------|-----------|-----------------------------|---------|--------------|
| 1  | Uncompressed | 14.4 GB   | Read from EBS               | 3 s     | 460          |
| 2  | Uncompressed | 14.4 GB   | Read from S3                | 13 s    | 460          |
| 3  | GZ           | 419.3 MB  | Read from EBS               | 47 s    | 1            |
| 4  | GZ           | 419.3 MB  | Read from EBS + repartition | 1.5 min | 460          |
| 5  | GZ           | 419.3 MB  | Read from S3                | 44 s    | 1            |
| 6  | GZ           | 419.3 MB  | Read from S3 + repartition  | 1.4 min | 460          |
| 7  | BZ2          | 236.3 MB  | Read from EBS               | 55 s    | 8            |
| 8  | BZ2          | 236.3 MB  | Read from EBS + repartition | 1.2 min | 460          |
| 9  | BZ2          | 236.3 MB  | Read from S3                | 1.1 min | 8            |
| 10 | BZ2          | 236.3 MB  | Read from S3 + repartition  | 1.5 min | 460          |

Spark 1.6.0:

| #  | Format       | File size | Test                        | Time    | # partitions |
|----|--------------|-----------|-----------------------------|---------|--------------|
| 1  | Uncompressed | 14.4 GB   | Read from EBS               | 3 s     | 460          |
| 2  | Uncompressed | 14.4 GB   | Read from S3                | 19 s    | 460          |
| 3  | GZ           | 419.3 MB  | Read from EBS               | 44 s    | 1            |
| 4  | GZ           | 419.3 MB  | Read from EBS + repartition | 1.7 min | 460          |
| 5  | GZ           | 419.3 MB  | Read from S3                | 46 s    | 1            |
| 6  | GZ           | 419.3 MB  | Read from S3 + repartition  | 1.8 min | 460          |
| 7  | BZ2          | 236.3 MB  | Read from EBS               | 53 s    | 8            |
| 8  | BZ2          | 236.3 MB  | Read from EBS + repartition | 1.2 min | 460          |
| 9  | BZ2          | 236.3 MB  | Read from S3                | 57 s    | 8            |
| 10 | BZ2          | 236.3 MB  | Read from S3 + repartition  | 1.1 min | 460          |

Conclusions

Spark version - Measurements are very similar between Spark 1.6 and Spark 2.0. This makes sense as these tests use RDDs: neither Catalyst nor Tungsten can perform any optimization on them.
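
To illustrate, these tests go through the RDD API; a Dataset-based equivalent (Spark 2.0 only), where Catalyst would plan the count, would look as follows. The Dataset version was not benchmarked here:

// RDD API, as used in these tests: no Catalyst/Tungsten optimization
sc.textFile(filename).count()

// Dataset API (Spark 2.0+): the count is planned by Catalyst
spark.read.textFile(filename).count()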

EBS vs S3 - S3 is slower than the EBS drive (#1 vs #2). Performance of S3 is still very good, though: reading 14.4 GB in 13 seconds amounts to a combined throughput of about 1.1 GB/s. Also, keep in mind that EBS drives have drawbacks: data is not shared between servers (it has to be replicated manually) and IOPS can be throttled.

Compression - GZ files are not ideal because they are not splittable, and therefore require repartitioning. BZ2 files suffer from a similar problem: although they are splittable, they are so heavily compressed that you get very few partitions (8, in this case, on a cluster with 48 cores). The other problem is that reading BZ2 files is CPU-intensive and therefore slow compared to reading uncompressed files. In the end, uncompressed files clearly outperform compressed files: reading uncompressed files is I/O bound while reading compressed files is CPU bound, and I/O throughput is very good here.
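
For reference, here is how such compressed files can be produced from Spark, using the standard Hadoop codecs (output paths are placeholders):

import org.apache.hadoop.io.compress.{BZip2Codec, GzipCodec}

// Uncompressed output: one plain text file per partition
file.saveAsTextFile("<output path>")

// GZ output: fast, but the resulting files are not splittable
file.saveAsTextFile("<output path>", classOf[GzipCodec])

// BZ2 output: splittable, but CPU-expensive to read back
file.saveAsTextFile("<output path>", classOf[BZip2Codec])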

Recommendation

Given this, I recommend storing files on S3 as uncompressed files. This achieves great performance while providing safe storage.
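
In practice, this simply means writing the text files to S3 without a compression codec. A sketch with a hypothetical bucket name, assuming the S3A connector is configured on the cluster:

// Hypothetical bucket and path; no codec means uncompressed output
file.saveAsTextFile("s3a://my-bucket/data/records/")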
