Bhavani Sudha Saktheeswaran bhasudha

@bhasudha
bhasudha / Hudi record level meta field benchmark.md
Created May 19, 2023 00:15
Hudi record level meta field benchmark

Goal:

The goal is to analyze the cost of the meta fields that Hudi stores at the record level.
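One way to put a number on that cost is to compare the on-disk size of the Hudi table's data files against the vanilla Parquet output. The sketch below is only an illustration, not the method used in this write-up: it assumes both outputs sit on a local filesystem, that only files ending in `.parquet` count as data, and the helper names `total_parquet_bytes` and `overhead_percent` are made up for this example.

```python
import os


def total_parquet_bytes(root):
    """Sum the sizes of all files ending in .parquet under root (recursive)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".parquet"):
                total += os.path.getsize(os.path.join(dirpath, name))
    return total


def overhead_percent(baseline_bytes, hudi_bytes):
    """Relative size increase of the Hudi table over the baseline, in percent."""
    if baseline_bytes <= 0:
        raise ValueError("baseline_bytes must be positive")
    return 100.0 * (hudi_bytes - baseline_bytes) / baseline_bytes
```

Calling `overhead_percent(total_parquet_bytes(parquet_dir), total_parquet_bytes(hudi_dir))` with two hypothetical output directories would give the relative overhead for one table width.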

Setup:

Choose tables ranging from narrow to wide, with varying column counts: 10, 30, 100, 1000 columns, and so on. Use auto keygen with the bulk_insert operation for Hudi. Generate vanilla Parquet data via Spark and a non-partitioned Hudi COW table for comparison. For Spark, we can reduce the number of partitions to 1 so that the output is comparable to the non-partitioned Hudi table. We assume the input JSON data size is roughly the same for all three tables; here the input JSON file is about 350 MB.
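As a rough sketch of this setup, the Hudi write options might be assembled as below. The option keys are standard Hudi datasource configs; leaving out the record key and partition path fields is an assumption about how auto keygen and a non-partitioned table are requested, and it depends on the Hudi version in use. The helper name `hudi_bulk_insert_options`, the table name, and all paths are hypothetical.

```python
def hudi_bulk_insert_options(table_name):
    """Build write options for a non-partitioned COW table using bulk_insert.

    Assumption: no record key field and no partition path field are set,
    so that key generation is left to Hudi and the table is non-partitioned.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "bulk_insert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    }


opts = hudi_bulk_insert_options("ten_columns")

# Usage with an existing SparkSession named `spark` (paths are placeholders):
# df = spark.read.json("/path/to/input.json")
# df.coalesce(1).write.mode("overwrite").parquet("/path/to/parquet_out")
# df.write.format("hudi").options(**opts).mode("overwrite").save("/path/to/hudi_out")
```

The `coalesce(1)` call reflects the single-partition Spark write described above, so the vanilla Parquet output is not split across many small files.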

Schema generation:

I used ChatGPT to generate a random JSON schema with the desired number of columns, using primitive data types and built-in formats. One such schema for a 10-column table looks like the one below. Built-in formats in JSON Schema allow for more realistic data. The column names are boring, though.

/tmp/hudi-metafields-benchmark/ten-columns-schema.json
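A schema of this general shape can also be produced programmatically. The following is a minimal sketch and not the schema used in this benchmark: the column naming (`col_1`, `col_2`, ...), the particular mix of types and formats, and the function name `make_schema` are all assumptions for illustration.

```python
import json

# (type, format) pairs cycled across columns; None means no "format" keyword.
# The formats listed are built-in formats in JSON Schema draft-07.
COLUMN_KINDS = [
    ("string", None),
    ("integer", None),
    ("number", None),
    ("boolean", None),
    ("string", "date"),
    ("string", "date-time"),
    ("string", "email"),
    ("string", "ipv4"),
    ("string", "uri"),
]


def make_schema(num_columns):
    """Return a JSON Schema dict describing an object with num_columns fields."""
    properties = {}
    for i in range(num_columns):
        json_type, fmt = COLUMN_KINDS[i % len(COLUMN_KINDS)]
        spec = {"type": json_type}
        if fmt is not None:
            spec["format"] = fmt
        properties[f"col_{i + 1}"] = spec
    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": properties,
        "required": list(properties),
    }


if __name__ == "__main__":
    print(json.dumps(make_schema(10), indent=2))
```

Changing the argument to 30, 100, or 1000 would give the wider schemas for the other table sizes.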

root@adhoc-1:/opt# jstack 1025
2019-10-25 12:16:52
Full thread dump OpenJDK 64-Bit Server VM (25.212-b01 mixed mode):
"Attach Listener" #76 daemon prio=9 os_prio=0 tid=0x00007f5620001000 nid=0x49b waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"DestroyJavaVM" #73 prio=5 os_prio=0 tid=0x00007f565c00f800 nid=0x417 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
@bhasudha
bhasudha / gist:5710154
Created June 4, 2013 22:24
Netty 3.x unbounded DynamicChannelBuffer growth. See http://stackoverflow.com/questions/12134212/trim-dynamicbuffer-to-maintain-size for more information.
@Override
public void ensureWritableBytes(int minWritableBytes) {
    if (minWritableBytes <= writableBytes()) {
        return;
    }
    int newCapacity;
    if (capacity() == 0) {
        newCapacity = 1;
    } else {