Parquet-mr 0.14.0-SNAPSHOT.
Usage: Parquet-Avro with the following schema:
{
  "name": "TestRecord",
  "type": "record",
  "namespace": "testdata",
  "fields": [
    {
      "name": "stringField",
      "type": "string"
    }
  ]
}
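Case 1 (5,000 distinct values):
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import testdata.TestRecord

val NumRecords = 5_000_000
// Note: `0 to NumRecords` is inclusive, so this produces 5,000,001 records
val records = (0 to NumRecords).map(i =>
  TestRecord
    .newBuilder()
    .setStringField((i % 5_000).toString) // stringField has 5,000 distinct values
    .build()
)
val writer = AvroParquetWriter.builder[TestRecord](new Path("testdata-case1.parquet"))
  .withSchema(testdata.TestRecord.SCHEMA$)
  .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
  .withRowGroupSize(256 * 1024 * 1024L)
  .withPageSize(1024 * 1024)
  .withBloomFilterEnabled(false)
  .withDictionaryEncoding(true)
  .build()
try records.foreach(writer.write) // write every record to the file
finally writer.close() // closing flushes the row group and writes the footer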
With this configuration, the column is dictionary-encoded (ENC includes PLAIN_DICTIONARY):
% parquet-tools meta testdata-case1.parquet
file schema: testdata.TestRecord
--------------------------------------------------------------------------------
stringField: REQUIRED BINARY L:STRING R:0 D:0
row group 1: RC:5000001 TS:18262874 OFFSET:4
--------------------------------------------------------------------------------
stringField: BINARY UNCOMPRESSED DO:4 FPO:38918 SZ:8181452/8181452/1.00 VC:5000001 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0, max: 999, num_nulls: 0]
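As a rough sanity check on the SZ value above, the size of a dictionary-encoded column chunk can be estimated from the test data alone. The sketch below assumes the dictionary page stores each entry in PLAIN form (4-byte length prefix plus UTF-8 bytes) and that the indices are bit-packed at the smallest width that can address every entry. It ignores page headers and run headers, so it is only a lower bound. The object and value names are illustrative and not part of parquet-mr.

```scala
object DictionarySizeEstimate {
  val NumRecords = 5000000
  val DistinctValues = 5000

  // PLAIN-encoded dictionary entry: 4-byte length prefix plus UTF-8 bytes.
  val dictionaryBytes: Long =
    (0 until DistinctValues).map(i => 4L + i.toString.getBytes("UTF-8").length).sum

  // Smallest bit width that can index every dictionary entry.
  val bitWidth: Int = 32 - Integer.numberOfLeadingZeros(DistinctValues - 1)

  // Lower bound for the bit-packed indices of all 5,000,001 values.
  val indexBytes: Long = ((NumRecords + 1).toLong * bitWidth + 7) / 8

  def main(args: Array[String]): Unit =
    println(s"dictionary=$dictionaryBytes bitWidth=$bitWidth indices>=$indexBytes")
}
```

Under these assumptions the dictionary is 38,890 bytes, which is close to the gap between DO:4 and FPO:38918 in the output above. The indices take at least 8,125,002 bytes at 13 bits each. Together that is about 8.16 MB, slightly below the reported SZ of 8,181,452.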
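Case 2 (50,000 distinct values, otherwise identical writer configuration):
val NumRecords = 5_000_000
// Note: `0 to NumRecords` is inclusive, so this produces 5,000,001 records
val records = (0 to NumRecords).map(i =>
  TestRecord
    .newBuilder()
    .setStringField((i % 50_000).toString) // stringField has 50,000 distinct values
    .build()
)
val writer = AvroParquetWriter.builder[TestRecord](new Path("testdata-case2.parquet"))
  .withSchema(testdata.TestRecord.SCHEMA$)
  .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
  .withRowGroupSize(256 * 1024 * 1024L)
  .withPageSize(1024 * 1024)
  .withBloomFilterEnabled(false)
  .withDictionaryEncoding(true)
  .build()
try records.foreach(writer.write) // write every record to the file
finally writer.close() // closing flushes the row group and writes the footer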
In this case, the column is not dictionary-encoded (ENC lists only PLAIN and BIT_PACKED), and the column chunk is about 5.4 times larger than in case 1:
% parquet-tools meta testdata-case2.parquet
file schema: testdata.TestRecord
--------------------------------------------------------------------------------
stringField: REQUIRED BINARY L:STRING R:0 D:0
row group 1: RC:5000001 TS:18262874 OFFSET:4
--------------------------------------------------------------------------------
stringField: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:43896278/43896278/1.00 VC:5000001 ENC:PLAIN,BIT_PACKED ST:[min: 0, max: 9999, num_nulls: 0]
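To confirm that the SZ value above really corresponds to plain encoding of every value, the expected PLAIN size can be computed directly from the test data. This is a minimal sketch that assumes the Parquet PLAIN layout for BINARY values (a 4-byte length prefix followed by the UTF-8 bytes) and ignores page headers. The object and value names are illustrative.

```scala
object PlainSizeEstimate {
  val NumRecords = 5000000
  val DistinctValues = 50000

  // PLAIN encoding of a BINARY value: 4-byte length prefix plus UTF-8 bytes.
  def plainSize(s: String): Long = 4L + s.getBytes("UTF-8").length

  // Sum over all 5,000,001 values written in case 2 (`0 to NumRecords` is inclusive).
  val estimatedPlainBytes: Long =
    (0 to NumRecords).iterator.map(i => plainSize((i % DistinctValues).toString)).sum

  // SZ reported by parquet-tools for testdata-case2.parquet.
  val reportedBytes: Long = 43896278L

  def main(args: Array[String]): Unit =
    println(s"estimated=$estimatedPlainBytes reported=$reportedBytes")
}
```

The estimate is 43,889,005 bytes, about 7 KB below the reported 43,896,278. The small remainder is consistent with page header overhead, so the whole column chunk appears to be plain-encoded rather than only part of it.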