The idea is to analyze the cost of the metafields that Hudi stores at the record level.
Choose narrow to wide tables with varying column counts: 10, 30, 100, 1000 columns, etc. Use the auto key generator with the bulk_insert operation for Hudi. Generate vanilla parquet data via Spark and a non-partitioned Hudi COW table for comparison. For the Spark parquet write, we can reduce the number of partitions to 1 to make it comparable to the non-partitioned Hudi table. We will assume the input JSON data size is roughly the same for all three tables; here we take an input JSON file of roughly 350MB.
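For reference, here is a minimal PySpark sketch of the two writes. The paths, table name, and Hudi bundle version are placeholders, not the exact benchmark setup.

```python
from pyspark.sql import SparkSession

# Launch with the Hudi Spark bundle on the classpath, e.g.
#   spark-submit --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0 ...
spark = (
    SparkSession.builder
    .appName("hudi-metafields-benchmark")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Input JSON generated from the schema (path is a placeholder).
df = spark.read.json("/tmp/hudi-metafields-benchmark/input/ten_columns.json")

# 1) Vanilla parquet, coalesced to a single partition so it is comparable
#    to the non-partitioned Hudi table.
df.coalesce(1).write.mode("overwrite") \
    .parquet("/tmp/hudi-metafields-benchmark/plain_parquet/ten_columns")

# 2) Non-partitioned Hudi COW table written with bulk_insert. No record key
#    or partition path field is configured, so Hudi (0.14+) auto-generates
#    record keys and creates a non-partitioned table.
hudi_options = {
    "hoodie.table.name": "ten_columns_cow",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "bulk_insert",
}
df.write.format("hudi").options(**hudi_options).mode("overwrite") \
    .save("/tmp/hudi-metafields-benchmark/hudi_cow/ten_columns")
```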
I used ChatGPT to generate a random JSON schema with the desired number of columns, using primitive data types and built-in formats. One such schema for a 10-column table is shown below. Built-in formats in JSON Schema allow for more realistic data, though the column names are boring.
/tmp/hudi-metafields-benchmark/ten-columns-schema.json