@clairemcginty
Created September 14, 2023 18:46
Parquet dictionary tip-over point

Setup

Parquet-mr 1.14.0-SNAPSHOT.

Usage: Parquet-Avro with the following schema:

{
    "name": "TestRecord",
    "type": "record",
    "namespace": "testdata",
    "fields": [
        {
            "name": "stringField",
            "type": "string"
        }
    ]
}

Case 1 - Small # of Unique Values

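import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import testdata.TestRecord

val NumRecords = 5_000_000

// `0 to NumRecords` is inclusive, so this yields 5,000,001 records
val records = (0 to NumRecords).map(i =>
  TestRecord
    .newBuilder()
    .setStringField((i % 5_000).toString) // stringField has 5,000 distinct values
    .build()
)

val writer = AvroParquetWriter.builder[TestRecord](new Path("testdata-case1.parquet"))
  .withSchema(testdata.TestRecord.SCHEMA$)
  .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
  .withRowGroupSize(256 * 1024 * 1024L)
  .withPageSize(1024 * 1024)
  .withBloomFilterEnabled(false)
  .withDictionaryEncoding(true)
  .build()

// Write all records, then close to flush the row group and footer
records.foreach(r => writer.write(r))
writer.close()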

With this configuration, the column is dictionary-encoded (ENC includes PLAIN_DICTIONARY):

% parquet-tools meta testdata-case1.parquet

file schema:  testdata.TestRecord 
--------------------------------------------------------------------------------
stringField:  REQUIRED BINARY L:STRING R:0 D:0

row group 1:  RC:5000001 TS:18262874 OFFSET:4 
--------------------------------------------------------------------------------
stringField:   BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00 VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls: 0]
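
As a rough sanity check on the numbers above, the column chunk size can be estimated by hand. The sketch below is illustrative only and was not part of the original test. It assumes that a PLAIN-encoded BINARY value costs a 4-byte length prefix plus its UTF-8 bytes, that dictionary ids are bit-packed at the minimum width for 5,000 entries, and it ignores page headers and RLE run headers. The object and helper names are made up for this example.

```scala
object Case1SizeEstimate {
  // Assumed cost of one PLAIN-encoded BINARY value: 4-byte length prefix + UTF-8 bytes
  def plainBytes(value: String): Long = 4L + value.getBytes("UTF-8").length

  // Minimum number of bits needed to represent dictionary ids 0 until distinct
  def bitWidth(distinct: Int): Int =
    if (distinct <= 1) 0 else 32 - Integer.numberOfLeadingZeros(distinct - 1)

  def main(args: Array[String]): Unit = {
    val numRecords = 5000001L // RC from the metadata above
    val distinct = 5000       // i % 5_000 yields the values "0" to "4999"

    // Dictionary page payload: every distinct value stored once, PLAIN-encoded
    val dictionaryBytes = (0 until distinct).map(i => plainBytes(i.toString)).sum
    val bits = bitWidth(distinct)
    // Data pages: one bit-packed dictionary id per record (headers ignored)
    val indexBytes = (numRecords * bits + 7) / 8

    println(s"dictionary payload: $dictionaryBytes bytes")
    println(s"index bit width:    $bits")
    println(s"packed indices:     $indexBytes bytes")
    println(s"estimated chunk:    ${dictionaryBytes + indexBytes} bytes")
  }
}
```

Under these assumptions the estimate comes to roughly 8.16 MB (38,890 bytes of dictionary payload plus about 8.13 MB of 13-bit indices), in the same range as the reported SZ:8181452. The dictionary payload estimate is also close to the 38,914-byte gap between DO:4 and FPO:38918.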

Case 2 - Larger # of Unique Values

val NumRecords = 5_000_000

// `0 to NumRecords` is inclusive, so this yields 5,000,001 records
val records = (0 to NumRecords).map(i =>
  TestRecord
    .newBuilder()
    .setStringField((i % 50_000).toString) // stringField has 50,000 distinct values
    .build()
)

val writer = AvroParquetWriter.builder[TestRecord](new Path("testdata-case2.parquet"))
  .withSchema(testdata.TestRecord.SCHEMA$)
  .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
  .withRowGroupSize(256 * 1024 * 1024L)
  .withPageSize(1024 * 1024)
  .withBloomFilterEnabled(false)
  .withDictionaryEncoding(true)
  .build()

// Write all records, then close to flush the row group and footer
records.foreach(r => writer.write(r))
writer.close()

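In this case, the resulting file has no dictionary encoding. ENC lists PLAIN instead of PLAIN_DICTIONARY, and the column chunk is more than five times larger than in Case 1 (43,896,278 vs 8,181,452 bytes):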

% parquet-tools meta testdata-case2.parquet

file schema:  testdata.TestRecord 
--------------------------------------------------------------------------------
stringField:  REQUIRED BINARY L:STRING R:0 D:0

row group 1:  RC:5000001 TS:18262874 OFFSET:4 
--------------------------------------------------------------------------------
stringField:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00 VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]
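
A possible explanation for the tip-over, stated as an assumption rather than a verified finding: parquet-mr appears to judge dictionary encoding on the first data page and to fall back to PLAIN when the encoded page plus the dictionary is not smaller than the raw values. The sketch below models that check for both cases. It assumes the first page holds 20,000 rows (which I believe matches the default page row count limit, but this was not confirmed in the runs above), the same 4-byte-prefix cost model for BINARY values, and minimum-width bit-packing for dictionary ids. All names are hypothetical and nothing here calls into parquet-mr.

```scala
object FirstPageFallbackSketch {
  // Assumed cost of one PLAIN-encoded BINARY value: 4-byte length prefix + UTF-8 bytes
  def plainBytes(value: String): Long = 4L + value.getBytes("UTF-8").length

  // Minimum number of bits needed to represent dictionary ids 0 until distinct
  def bitWidth(distinct: Int): Int =
    if (distinct <= 1) 0 else 32 - Integer.numberOfLeadingZeros(distinct - 1)

  // (rawBytes, dictionaryBytes, encodedBytes) for the first `rowsInPage` rows,
  // where row i carries the value (i % cardinality).toString as in the test data
  def firstPage(cardinality: Int, rowsInPage: Int): (Long, Long, Long) = {
    val values = (0 until rowsInPage).map(i => (i % cardinality).toString)
    val distinct = values.distinct
    val raw = values.map(plainBytes).sum
    val dictionary = distinct.map(plainBytes).sum
    val encoded = (rowsInPage.toLong * bitWidth(distinct.size) + 7) / 8
    (raw, dictionary, encoded)
  }

  // Modeled rule: keep the dictionary only if it actually saves space on the first page
  def keepsDictionary(cardinality: Int, rowsInPage: Int): Boolean = {
    val (raw, dictionary, encoded) = firstPage(cardinality, rowsInPage)
    encoded + dictionary < raw
  }

  def main(args: Array[String]): Unit = {
    val assumedRowsInFirstPage = 20000 // assumption, see text above
    for (cardinality <- Seq(5000, 50000)) {
      val (raw, dictionary, encoded) = firstPage(cardinality, assumedRowsInFirstPage)
      val keep = keepsDictionary(cardinality, assumedRowsInFirstPage)
      println(s"cardinality=$cardinality raw=$raw dictionary=$dictionary " +
        s"encoded=$encoded keepsDictionary=$keep")
    }
  }
}
```

Under this model, Case 1 repeats each value four times within the first page, so the dictionary plus packed ids is well under the raw size and the dictionary is kept. In Case 2 the first 20,000 rows are all distinct, so the dictionary alone is already as large as the raw values and the check fails regardless of how compactly the ids are packed. If this is the mechanism, the tip-over depends on how many distinct values land in the first page, not on total cardinality as such.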