clairemcginty/Parquet experiment.md Secret

## Parquet experiment.md

      
Setup

Parquet-mr 1.14.0-SNAPSHOT.
Usage: Parquet-Avro with the following schema:
{
    "name": "TestRecord",
    "type": "record",
    "namespace": "testdata",
    "fields": [
        {
            "name": "stringField",
            "type": "string"
        }
    ]
}
Case 1 - Small # of Unique Values

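import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import testdata.TestRecord

val NumRecords = 5_000_000

// `0 to NumRecords` is inclusive, so this builds 5,000,001 records (see RC:5000001 in the output)
val records = (0 to NumRecords).map(i =>
  TestRecord
    .newBuilder()
    .setStringField((i % 5_000).toString) // StringField has 5,000 distinct values
    .build()
)

val writer = AvroParquetWriter.builder[TestRecord](new Path("testdata-case1.parquet"))
  .withSchema(testdata.TestRecord.SCHEMA$)
  .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
  .withRowGroupSize(256 * 1024 * 1024L)
  .withPageSize(1024 * 1024)
  .withBloomFilterEnabled(false)
  .withDictionaryEncoding(true)
  .build()

// Write every record, then close the writer so the footer is flushed
records.foreach(r => writer.write(r))
writer.close()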
In this configuration, the column is written with dictionary encoding (ENC includes PLAIN_DICTIONARY):
% parquet-tools meta testdata-case1.parquet

file schema:  testdata.TestRecord 
--------------------------------------------------------------------------------
stringField:  REQUIRED BINARY L:STRING R:0 D:0

row group 1:  RC:5000001 TS:18262874 OFFSET:4 
--------------------------------------------------------------------------------
stringField:   BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00 VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls: 0]
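
The figures in this output can be sanity-checked with plain Scala, without any Parquet dependency. The sketch below regenerates the Case 1 values and checks the record count (the range `0 to NumRecords` is inclusive), the min/max statistics (strings compare lexicographically, so "999" sorts after "4999"), and a rough size estimate. The object name `Case1Check` and its members are made up for illustration. The estimate assumes each dictionary entry is stored as a 4-byte length prefix plus its bytes, and that each record then costs one bit-packed dictionary index; page headers and run headers are ignored, so it should land slightly below the reported SZ rather than reproduce it exactly.

```scala
object Case1Check {
  val NumRecords = 5_000_000
  val Distinct = 5_000

  // `0 to NumRecords` is inclusive: one more element than NumRecords.
  val recordCount: Int = (0 to NumRecords).size

  // The distinct string values produced by (i % Distinct).toString.
  val distinctValues: Seq[String] = (0 until Distinct).map(_.toString)

  // Lexicographic min/max, comparable to the ST:[min, max] statistics.
  val minValue: String = distinctValues.min
  val maxValue: String = distinctValues.max

  // Assumed dictionary page layout: 4-byte length prefix + bytes per entry.
  val dictionaryBytes: Long = distinctValues.map(v => 4L + v.length).sum

  // Bits needed to address `Distinct` dictionary entries.
  val indexBits: Int = 32 - Integer.numberOfLeadingZeros(Distinct - 1)

  // One bit-packed index per record, rounded to whole bytes (headers ignored).
  val indexBytes: Long = (recordCount.toLong * indexBits + 7) / 8

  def main(args: Array[String]): Unit = {
    println(s"records=$recordCount min=$minValue max=$maxValue")
    println(s"dictionary=$dictionaryBytes B, indices=$indexBytes B at $indexBits bits each")
    println(s"estimated total=${dictionaryBytes + indexBytes} B")
  }
}
```

Under these assumptions the dictionary comes to 38,890 bytes, which is within 24 bytes of the gap between DO:4 and FPO:38918, and the total estimate of 8,163,892 bytes is within about 0.2% of the reported SZ of 8,181,452.
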
Case 2 - Larger # of Unique Values

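import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import testdata.TestRecord

val NumRecords = 5_000_000

// `0 to NumRecords` is inclusive, so this builds 5,000,001 records (see RC:5000001 in the output)
val records = (0 to NumRecords).map(i =>
  TestRecord
    .newBuilder()
    .setStringField((i % 50_000).toString) // StringField has 50,000 distinct values
    .build()
)

val writer = AvroParquetWriter.builder[TestRecord](new Path("testdata-case2.parquet"))
  .withSchema(testdata.TestRecord.SCHEMA$)
  .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
  .withRowGroupSize(256 * 1024 * 1024L)
  .withPageSize(1024 * 1024)
  .withBloomFilterEnabled(false)
  .withDictionaryEncoding(true)
  .build()

// Write every record, then close the writer so the footer is flushed
records.foreach(r => writer.write(r))
writer.close()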
In this case, the column is written without dictionary encoding, even though it was enabled on the writer (ENC lists only PLAIN and BIT_PACKED):
% parquet-tools meta testdata-case2.parquet

file schema:  testdata.TestRecord 
--------------------------------------------------------------------------------
stringField:  REQUIRED BINARY L:STRING R:0 D:0

row group 1:  RC:5000001 TS:18262874 OFFSET:4 
--------------------------------------------------------------------------------
stringField:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00 VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]
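
The reported size is consistent with every value being stored in PLAIN form. The sketch below is a dependency-free estimate, assuming PLAIN stores each BINARY value as a 4-byte length prefix followed by its bytes and ignoring page headers. For comparison only, it also computes what a dictionary-encoded layout would cost under the same simplifications (one dictionary entry per distinct value plus one bit-packed index per record). The object name `Case2Check` and its members are hypothetical.

```scala
object Case2Check {
  val NumRecords = 5_000_000
  val Distinct = 50_000

  // Same value generation as the Case 2 writer code, without Avro.
  def valueAt(i: Int): String = (i % Distinct).toString

  // Assumed PLAIN layout for a BINARY value: 4-byte length prefix + bytes.
  def plainSize(value: String): Long = 4L + value.length

  // `0 to NumRecords` is inclusive; iterate lazily to avoid holding 5M strings.
  val recordCount: Int = (0 to NumRecords).size
  val plainBytes: Long =
    (0 to NumRecords).iterator.map(i => plainSize(valueAt(i))).sum

  // Hypothetical dictionary layout, for comparison only.
  val dictionaryBytes: Long =
    (0 until Distinct).iterator.map(i => plainSize(i.toString)).sum
  val indexBits: Int = 32 - Integer.numberOfLeadingZeros(Distinct - 1)
  val indexBytes: Long = (recordCount.toLong * indexBits + 7) / 8

  // Lexicographic max of the distinct values, comparable to ST:[max].
  val maxValue: String = (0 until Distinct).map(_.toString).max

  def main(args: Array[String]): Unit = {
    println(s"plain estimate=$plainBytes B")
    println(s"dictionary estimate=${dictionaryBytes + indexBytes} B")
    println(s"max=$maxValue")
  }
}
```

The PLAIN estimate is 43,889,005 bytes, about 7 KB below the reported SZ of 43,896,278, while the hypothetical dictionary layout would need roughly 10.4 MB.
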