@qqibrow
Last active March 28, 2024 20:16
Test velox parquet reader using parquet unit tests in presto

Test Velox Parquet Reader Using Presto

Background

Velox is a C++ database acceleration library that can be integrated with Spark or Presto to enhance query performance and reduce infrastructure costs. It includes a custom C++ Parquet reader for better performance and integration. During testing, errors in the native Parquet reader were discovered, highlighting the need for better testing infrastructure that catches such issues and broadens test coverage before production.

Proposal

The proposal is to leverage the existing unit tests in the Presto project to test the Velox Parquet reader as an interim solution. The reasons for choosing this approach are as follows:

  1. Comprehensive coverage: The unit tests in Presto cover all data types, including complex types like struct, array, and map. Each unit test covers forward ordering, backward ordering, and randomly inserted nulls. To exercise complex types, a custom Hive Parquet writer (e.g., SingleLevelArraySchemaConverter) is used to surface issues. (This is why #7002 was blocked at first: it was hard to reproduce.)
  2. Extensibility: Each unit test verifies a writer and reader pair, such as native Parquet writer and native Parquet reader, or Hive Parquet writer and native Parquet reader. The unit tests are easy to extend once a native Parquet writer is implemented in Velox.
  3. Better Debuggability: The unit test inputs can be examined to understand failures, and the native reader in Presto can be consulted to understand the correct logic.
  4. Simple Implementation: Implementing this solution is estimated to take approximately one week. By contrast, backporting all of the testing infrastructure from Presto in Java to Velox in C++ could be time-consuming.
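
To make the three input variants from point 1 concrete, here is a minimal sketch of how one column of test values could be expanded into forward, backward, and null-injected sequences. This is not the Presto test code: the function names, the 20% null rate, and the fixed seed are assumptions chosen only for illustration.

```python
import random


def forward(values):
    """Return the values in their original order."""
    return list(values)


def backward(values):
    """Return the values in reverse order."""
    return list(reversed(values))


def with_random_nulls(values, null_rate=0.2, seed=0):
    """Replace each value with None with probability null_rate.

    null_rate and seed are illustrative assumptions; a fixed seed keeps
    a failing input reproducible.
    """
    rng = random.Random(seed)
    return [None if rng.random() < null_rate else v for v in values]


def variants(values):
    """Build the three input variants that each reader test would consume."""
    return {
        "forward": forward(values),
        "backward": backward(values),
        "random_nulls": with_random_nulls(values),
    }
```

Each variant would then be written by one writer and read back by the reader under test, so a single logical test exercises three different data layouts.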

High level implementation

  1. Implement a Parquet reader in the Velox project that takes a Parquet file as input and outputs a binary file with data in SerializedPage format. Draft PR
  2. In the Presto project, within the unit tests in AbstractTestParquetReader, create a new process and call the Velox Parquet reader with a Parquet file as input. Read the generated output file, decode it back into a Page, and compare the data against the expected results. Draft PR
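
The process-based flow in step 2 can be sketched as follows. The real harness lives in the Java tests and decodes SerializedPage data; this Python sketch only shows the spawn, exit-code check, and read-back steps. The function name and the calling convention (input path and output path as two positional arguments) are assumptions, since the command line of velox_scan_parquet is not described here.

```python
import subprocess
from pathlib import Path


def run_external_reader(reader_cmd, parquet_path, output_path, timeout_s=60):
    """Run an external reader on parquet_path and return the bytes it wrote.

    reader_cmd is the command prefix (a list of strings). Passing the input
    and output paths as positional arguments is an assumed convention, not
    the documented interface of velox_scan_parquet.
    """
    result = subprocess.run(
        [*reader_cmd, str(parquet_path), str(output_path)],
        capture_output=True,
        timeout=timeout_s,
    )
    if result.returncode != 0:
        # Mirrors the "exitCode should be 0" assertion in the test output.
        stderr = result.stderr.decode(errors="replace")
        raise RuntimeError(
            f"reader exited with code {result.returncode}: {stderr}"
        )
    return Path(output_path).read_bytes()
```

The caller would then decode the returned bytes and compare them with the expected values. Keeping the reader in a separate process means a crash in the C++ code shows up as a non-zero exit code instead of taking down the test JVM.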

Run the test

  1. Check out https://github.com/qqibrow/velox/tree/new_test_base and build Velox. The expected output binary is {velox_base_dir}/_build/debug/velox/dwio/parquet/tests/reader/velox_scan_parquet
  2. Check out https://github.com/qqibrow/presto/tree/velox_parquet_test_base
  3. Run all tests:

./mvnw -Dtest=TestParquetReader test -B -Dair.check.skip-all -Dmaven.javadoc.skip=true -DLogTestDurationListener.enabled=true -Dvelox_parquet_reader_path={velox_base_dir}/_build/debug/velox/dwio/parquet/tests/reader/velox_scan_parquet -Dfailed_parquet_files_dir=/tmp/velox_test_data -pl :presto-hive

  4. Run one test:

./mvnw -Dtest=TestParquetReader#testStruct test -B -Dair.check.skip-all -Dmaven.javadoc.skip=true -DLogTestDurationListener.enabled=true -Dvelox_parquet_reader_path={velox_base_dir}/_build/debug/velox/dwio/parquet/tests/reader/velox_scan_parquet -Dfailed_parquet_files_dir=/tmp/velox_test_data -pl :presto-hive

-Dfailed_parquet_files_dir=/tmp/velox_test_data sets the directory where the Parquet files from failing tests are stored.
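
Since the two commands differ only in the -Dtest selector, and {velox_base_dir} has to be substituted by hand, a small helper can assemble the argument list. This is a convenience sketch: the function name and its parameters are hypothetical, while the flags themselves are copied from the commands above.

```python
from pathlib import PurePosixPath

# Location of the reader binary relative to the Velox checkout, as given above.
READER_RELATIVE_PATH = "_build/debug/velox/dwio/parquet/tests/reader/velox_scan_parquet"


def build_test_command(velox_base_dir,
                       failed_files_dir="/tmp/velox_test_data",
                       test_method=None):
    """Return the mvnw argument list for all tests or for one test method."""
    reader = PurePosixPath(velox_base_dir) / READER_RELATIVE_PATH
    selector = "TestParquetReader"
    if test_method is not None:
        selector += f"#{test_method}"
    return [
        "./mvnw",
        f"-Dtest={selector}",
        "test",
        "-B",
        "-Dair.check.skip-all",
        "-Dmaven.javadoc.skip=true",
        "-DLogTestDurationListener.enabled=true",
        f"-Dvelox_parquet_reader_path={reader}",
        f"-Dfailed_parquet_files_dir={failed_files_dir}",
        "-pl",
        ":presto-hive",
    ]
```

For example, build_test_command("/home/me/velox", test_method="testStruct") yields the argument list for the single-test command, ready to be joined into a shell line or passed to a process launcher from the Presto checkout.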

Testing Result

[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR]   TestParquetReader.testArray:34->AbstractTestParquetReader.testArray:157 expected [[1]] but found [[null]]
[ERROR]   TestParquetReader.testArrayOfArrayOfStructOfArray:83->AbstractTestParquetReader.testArrayOfArrayOfStructOfArray:267 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testArrayOfMapOfArray:167->AbstractTestParquetReader.testArrayOfMapOfArray:455 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testArrayOfMapOfStruct:139->AbstractTestParquetReader.testArrayOfMapOfStruct:385 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testArrayOfMaps:125->AbstractTestParquetReader.testArrayOfMaps:361 expected [[{0=0, 1=1}]] but found [[null]]
[ERROR]   TestParquetReader.testArrayOfStructOfArray:97->AbstractTestParquetReader.testArrayOfStructOfArray:304 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testArrayOfStructs:62->AbstractTestParquetReader.testArrayOfStructs:210 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testArraySchemas:468->AbstractTestParquetReader.testArraySchemas:1252 expected [[19, 20, 21, 22]] but found [[null, 19, 20, 21]]
[ERROR]   TestParquetReader.testComplexNestedStructs:230->AbstractTestParquetReader.testComplexNestedStructs:650 » IllegalState Map key is null at position: 22
[ERROR]   TestParquetReader.testCustomSchemaArrayOfStructs:69->AbstractTestParquetReader.testCustomSchemaArrayOfStructs:236 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testDoubleSequence:503->AbstractTestParquetReader.testDoubleSequence:1456 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testLongDirect:335->AbstractTestParquetReader.testLongDirect:847->AbstractTestParquetReader.testRoundTripNumeric:1408 » UnsupportedOperation com.facebook.presto.common.block.ByteArrayBlock
[ERROR]   TestParquetReader.testLongDirect2:342->AbstractTestParquetReader.testLongDirect2:859->AbstractTestParquetReader.testRoundTripNumeric:1408 » UnsupportedOperation com.facebook.presto.common.block.ByteArrayBlock
[ERROR]   TestParquetReader.testLongPatchedBase:356->AbstractTestParquetReader.testLongPatchedBase:873->AbstractTestParquetReader.testRoundTripNumeric:1408 » UnsupportedOperation com.facebook.presto.common.block.ByteArrayBlock
[ERROR]   TestParquetReader.testLongSequence:321->AbstractTestParquetReader.testLongSequence:833->AbstractTestParquetReader.testRoundTripNumeric:1408 » UnsupportedOperation com.facebook.presto.common.block.ByteArrayBlock
[ERROR]   TestParquetReader.testLongSequenceWithHoles:328->AbstractTestParquetReader.testLongSequenceWithHoles:840->AbstractTestParquetReader.testRoundTripNumeric:1408 » UnsupportedOperation com.facebook.presto.common.block.ByteArrayBlock
[ERROR]   TestParquetReader.testLongShortRepeat:349->AbstractTestParquetReader.testLongShortRepeat:866->AbstractTestParquetReader.testRoundTripNumeric:1408 » UnsupportedOperation com.facebook.presto.common.block.ByteArrayBlock
[ERROR]   TestParquetReader.testLongStrideDictionary:482->AbstractTestParquetReader.testLongStrideDictionary:1402->AbstractTestParquetReader.testRoundTripNumeric:1408 » UnsupportedOperation com.facebook.presto.common.block.ByteArrayBlock
[ERROR]   TestParquetReader.testMap:111->AbstractTestParquetReader.testMap:334 » IllegalState Map key is null at position: 0
[ERROR]   TestParquetReader.testMapOfArrayKeys:188->AbstractTestParquetReader.testMapOfArrayKeys:497 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testMapOfArrayValues:181->AbstractTestParquetReader.testMapOfArrayValues:484 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testMapOfSingleLevelArray:195->AbstractTestParquetReader.testMapOfSingleLevelArray:513 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testMapOfStruct:202->AbstractTestParquetReader.testMapOfStruct:528 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testMapSchemas:475->AbstractTestParquetReader.testMapSchemas:1303 » IllegalState Map key is null at position: 0
[ERROR]   TestParquetReader.testMapWithNullValues:209->AbstractTestParquetReader.testMapWithNullValues:541 » IllegalState Map key is null at position: 0
[ERROR]   TestParquetReader.testNestedArrays:48->AbstractTestParquetReader.testNestedArrays:182 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testNestedMaps:118->AbstractTestParquetReader.testNestedMaps:352 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testNewAvroArray:461->AbstractTestParquetReader.testNewAvroArray:1233 expected [[1]] but found [[null]]
[ERROR]   TestParquetReader.testOldAvroArray:454->AbstractTestParquetReader.testOldAvroArray:1218 expected [[10]] but found [[null]]
[ERROR]   TestParquetReader.testSchemaWithRepeatedOptionalRequiredFields:398->AbstractTestParquetReader.testSchemaWithRepeatedOptionalRequiredFields:995 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testSingleLevelArrayOfMapOfArray:174->AbstractTestParquetReader.testSingleLevelArrayOfMapOfArray:470 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testSingleLevelArrayOfMapOfStruct:146->AbstractTestParquetReader.testSingleLevelArrayOfMapOfStruct:402 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testSingleLevelArrayOfStructOfSingleElement:153->AbstractTestParquetReader.testSingleLevelArrayOfStructOfSingleElement:417 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testSingleLevelArrayOfStructOfStructOfSingleElement:160->AbstractTestParquetReader.testSingleLevelArrayOfStructOfStructOfSingleElement:437 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testSingleLevelSchemaArrayOfArrayOfStructOfArray:90->AbstractTestParquetReader.testSingleLevelSchemaArrayOfArrayOfStructOfArray:286 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testSingleLevelSchemaArrayOfMaps:132->AbstractTestParquetReader.testSingleLevelSchemaArrayOfMaps:372 » IllegalState Map key is null at position: 0
[ERROR]   TestParquetReader.testSingleLevelSchemaArrayOfStructOfArray:104->AbstractTestParquetReader.testSingleLevelSchemaArrayOfStructOfArray:321 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testSingleLevelSchemaArrayOfStructs:76->AbstractTestParquetReader.testSingleLevelSchemaArrayOfStructs:254 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testSingleLevelSchemaNestedArrays:55->AbstractTestParquetReader.testSingleLevelSchemaNestedArrays:199 expected [[[]]] but found [[null]]
[ERROR]   TestParquetReader.testSmallIntSequence:314->AbstractTestParquetReader.testSmallIntSequence:826 » UnsupportedOperation com.facebook.presto.common.block.IntArrayBlock
[ERROR]   TestParquetReader.testStruct:216->AbstractTestParquetReader.testStruct:551 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testStructOfArrayAndPrimitive:258->AbstractTestParquetReader.testStructOfArrayAndPrimitive:721 expected [[[13, 14, 15, 16, 17, 18], 13]] but found [[[null, null, null, 13, 14, 15], 13]]
[ERROR]   TestParquetReader.testStructOfMaps:237->AbstractTestParquetReader.testStructOfMaps:666 » IllegalState Map key is null at position: 10
[ERROR]   TestParquetReader.testStructOfNullableArrayBetweenNonNullFields:251->AbstractTestParquetReader.testStructOfNullableArrayBetweenNonNullFields:704 expected [[1, [null, value2, value3], 1]] but found [[1, [null, null, value2], 1]]
[ERROR]   TestParquetReader.testStructOfNullableMapBetweenNonNullFields:244->AbstractTestParquetReader.testStructOfNullableMapBetweenNonNullFields:685 » IllegalState Map key is null at position: 20
[ERROR]   TestParquetReader.testStructOfPrimitiveAndArray:272->AbstractTestParquetReader.testStructOfPrimitiveAndArray:748 expected [[11, [2, 3]]] but found [[11, [null, 2]]]
[ERROR]   TestParquetReader.testStructOfPrimitiveAndSingleLevelArray:279->AbstractTestParquetReader.testStructOfPrimitiveAndSingleLevelArray:762 expected [[3, [0]]] but found [[3, [null]]]
[ERROR]   TestParquetReader.testStructOfSingleLevelArrayAndPrimitive:265->AbstractTestParquetReader.testStructOfSingleLevelArrayAndPrimitive:734 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testStructOfTwoArrays:286->AbstractTestParquetReader.testStructOfTwoArrays:776 expected [[[2], [1, 3, 5, 7]]] but found [[[2], [null, 1, 3, 5]]]
[ERROR]   TestParquetReader.testStructOfTwoNestedArrays:293->AbstractTestParquetReader.testStructOfTwoNestedArrays:789 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testStructOfTwoNestedSingleLevelSchemaArrays:300->AbstractTestParquetReader.testStructOfTwoNestedSingleLevelSchemaArrays:810 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testTimestampMicrosBackedByINT64:370->AbstractTestParquetReader.testTimestampMicrosBackedByINT64:909 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testTimestampMillisBackedByINT64:377->AbstractTestParquetReader.testTimestampMillisBackedByINT64:922 exitCode should be 0 expected [0] but found [134]
[INFO]
[ERROR] Tests run: 82, Failures: 53, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  30.736 s
[INFO] Finished at: 2023-11-21T00:31:52Z
[INFO] ------------------------------------------------------------------------

We are triaging the failures and will open issues in the Velox community.
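
As a triage aid, the failure lines above fall into a few visible patterns: a non-zero reader exit code, an UnsupportedOperation on a block type, a null map key, and a plain value mismatch. The sketch below groups Maven failure lines by those patterns. The category names are my own labels, not anything emitted by Presto or Velox. An exit code of 134 is conventionally 128 + 6, which would suggest the reader process aborted (SIGABRT), although the output above does not confirm the cause.

```python
import re
from collections import Counter

# Matches the per-test failure lines, not the "[ERROR] Failures:" header
# or the "[ERROR] Tests run: ..." summary.
FAILURE_LINE = re.compile(r"^\[ERROR\]\s+TestParquetReader\.")


def classify(line):
    """Map one failure line to a coarse category (labels are illustrative)."""
    if "exitCode should be 0" in line:
        # Checked first: these lines also contain "expected [".
        match = re.search(r"but found \[(\d+)\]", line)
        code = match.group(1) if match else "unknown"
        return f"reader exit code {code}"
    if "UnsupportedOperation" in line:
        # Keep only the simple class name, e.g. ByteArrayBlock.
        return "unsupported block type: " + line.rsplit(".", 1)[-1].strip()
    if "Map key is null" in line:
        return "null map key"
    if "expected [" in line:
        return "value mismatch"
    return "other"


def summarize(log_text):
    """Count failure categories across a Maven test log."""
    failures = [l for l in log_text.splitlines() if FAILURE_LINE.match(l)]
    return Counter(classify(l) for l in failures)
```

Running summarize over the saved Maven output gives a per-category count, which makes it easier to decide which group of failures to file as one issue.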

qqibrow commented Mar 28, 2024

Latest Update:

The number of failing tests has dropped from 53 to 13. Current result:

[ERROR] Failures:
[ERROR]   TestParquetReader.testLongDirect:335->AbstractTestParquetReader.testLongDirect:847->AbstractTestParquetReader.testRoundTripNumeric:1408 » UnsupportedOperation com.facebook.presto.common.block.ByteArrayBlock
[ERROR]   TestParquetReader.testLongDirect2:342->AbstractTestParquetReader.testLongDirect2:859->AbstractTestParquetReader.testRoundTripNumeric:1408 » UnsupportedOperation com.facebook.presto.common.block.ByteArrayBlock
[ERROR]   TestParquetReader.testLongPatchedBase:356->AbstractTestParquetReader.testLongPatchedBase:873->AbstractTestParquetReader.testRoundTripNumeric:1408 » UnsupportedOperation com.facebook.presto.common.block.ByteArrayBlock
[ERROR]   TestParquetReader.testLongSequence:321->AbstractTestParquetReader.testLongSequence:833->AbstractTestParquetReader.testRoundTripNumeric:1408 » UnsupportedOperation com.facebook.presto.common.block.ByteArrayBlock
[ERROR]   TestParquetReader.testLongSequenceWithHoles:328->AbstractTestParquetReader.testLongSequenceWithHoles:840->AbstractTestParquetReader.testRoundTripNumeric:1408 » UnsupportedOperation com.facebook.presto.common.block.ByteArrayBlock
[ERROR]   TestParquetReader.testLongShortRepeat:349->AbstractTestParquetReader.testLongShortRepeat:866->AbstractTestParquetReader.testRoundTripNumeric:1408 » UnsupportedOperation com.facebook.presto.common.block.ByteArrayBlock
[ERROR]   TestParquetReader.testLongStrideDictionary:482->AbstractTestParquetReader.testLongStrideDictionary:1402->AbstractTestParquetReader.testRoundTripNumeric:1408 » UnsupportedOperation com.facebook.presto.common.block.ByteArrayBlock
[ERROR]   TestParquetReader.testSingleLevelSchemaArrayOfArrayOfStructOfArray:90->AbstractTestParquetReader.testSingleLevelSchemaArrayOfArrayOfStructOfArray:286 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testSingleLevelSchemaNestedArrays:55->AbstractTestParquetReader.testSingleLevelSchemaNestedArrays:199 » UnsupportedOperation com.facebook.presto.common.block.IntArrayBlock
[ERROR]   TestParquetReader.testSmallIntSequence:314->AbstractTestParquetReader.testSmallIntSequence:826 » UnsupportedOperation com.facebook.presto.common.block.IntArrayBlock
[ERROR]   TestParquetReader.testStructOfTwoNestedSingleLevelSchemaArrays:300->AbstractTestParquetReader.testStructOfTwoNestedSingleLevelSchemaArrays:810 » UnsupportedOperation com.facebook.presto.common.block.IntArrayBlock
[ERROR]   TestParquetReader.testTimestampMicrosBackedByINT64:370->AbstractTestParquetReader.testTimestampMicrosBackedByINT64:909 exitCode should be 0 expected [0] but found [134]
[ERROR]   TestParquetReader.testTimestampMillisBackedByINT64:377->AbstractTestParquetReader.testTimestampMillisBackedByINT64:922 exitCode should be 0 expected [0] but found [134]
[INFO]
[ERROR] Tests run: 82, Failures: 13, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  44.617 s
[INFO] Finished at: 2024-03-28T19:16:57Z
[INFO] ------------------------------------------------------------------------

Findings:

  1. The testLongXXX and testSmallIntXXX failures might be because schema conversion from Parquet types is implemented differently on the two sides. For the Presto side, see https://github.com/prestodb/presto/blob/fd4c51758ba8c120c3decb8e5c4850c98b20c9cf/presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetPageSourceFactory.java#L389
  2. The testTimestampMillisBackedByINT64 failure is expected to be resolved by https://github.com/facebookincubator/velox/pull/4680/files
  3. The remaining failures are testSingleLevelSchemaArrayOfArrayOfStructOfArray, testSingleLevelSchemaNestedArrays, and testStructOfTwoNestedSingleLevelSchemaArrays. The root cause is the support for the Parquet backward-compatibility rules in Velox, which leads to an incorrect schema.
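
For context on finding 3, the sketch below encodes the LIST backward-compatibility rules as I understand them from the Parquet format's logical types documentation. It is not Velox or Presto code: the Node class and list_element function are hypothetical, and the rule wording should be checked against the current Parquet specification before relying on it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Minimal stand-in for a Parquet schema node (hypothetical type)."""
    name: str
    repetition: str  # "required", "optional", or "repeated"
    children: List["Node"] = field(default_factory=list)  # empty = primitive

    @property
    def is_group(self):
        return bool(self.children)


def list_element(list_group):
    """Return the node that describes the element of a LIST-annotated group."""
    if len(list_group.children) != 1:
        raise ValueError("a LIST group must have exactly one child")
    repeated = list_group.children[0]
    if repeated.repetition != "repeated":
        raise ValueError("the child of a LIST group must be repeated")
    if not repeated.is_group:
        return repeated  # rule 1: repeated primitive is the element
    if len(repeated.children) > 1:
        return repeated  # rule 2: multi-field group is the element
    if repeated.name in ("array", list_group.name + "_tuple"):
        return repeated  # rule 3: legacy single-field group names
    return repeated.children[0]  # rule 4: standard three-level layout
```

A reader that always assumes the three-level layout would treat a two-level list such as Node("my_list", "optional", [Node("element", "repeated")]) incorrectly, which is one way an incorrect schema like the one in finding 3 could arise.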
