@vinothchandar
vinothchandar / gist:45f4209cc01daaba99324de390a97406
Last active November 19, 2021 02:12
hudi table config update/delete
hudi:hoodie_benchmark->desc
╔═════════════════════════════════════════════════╤══════════════════════════════════════════════════════════════════════════════╗
║ Property │ Value ║
╠═════════════════════════════════════════════════╪══════════════════════════════════════════════════════════════════════════════╣
║ basePath │ file:/Users/vs/Cache/hudi-test-data/output-mor-smoke/org.apache.hudi ║
╟─────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────╢
║ metaPath │ file:/Users/vs/Cache/hudi-test-data/output-mor-smoke/org.apache.hudi/.hoodie ║
╟─────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────╢
║ fileSystem │ file
>>>> TestBootstrap :
files:
[file:/tmp/junit4111271187227693299/data/datestr=2020%252F04%252F03/part-00000-6d89c75e-d3e2-49ee-9b89-db88bb4dbb36.c000.snappy.parquet,
file:/tmp/junit4111271187227693299/data/datestr=2020%252F04%252F03/part-00001-6d89c75e-d3e2-49ee-9b89-db88bb4dbb36.c000.snappy.parquet,
file:/tmp/junit4111271187227693299/data/datestr=2020%252F04%252F01/part-00000-6d89c75e-d3e2-49ee-9b89-db88bb4dbb36.c000.snappy.parquet,
file:/tmp/junit4111271187227693299/data/datestr=2020%252F04%252F01/part-00001-6d89c75e-d3e2-49ee-9b89-db88bb4dbb36.c000.snappy.parquet,
file:/tmp/junit4111271187227693299/data/datestr=2020%252F04%252F02/part-00000-6d89c75e-d3e2-49ee-9b89-db88bb4dbb36.c000.snappy.parquet,
file:/tmp/junit4111271187227693299/data/datestr=2020%252F04%252F02/part-00001-6d89c75e-d3e2-49ee-9b89-db88bb4dbb36.c000.snappy.parquet]
numVersions:2
numFiles:6
vinothchandar / hyperspace.md
Last active August 31, 2021 17:11
hyperspace - demo

TL;DR:

  • Was exploring whether hyperspace can be used as an alternative to our record/bloom indexes
  • For the needle-in-a-haystack search, i.e. a single id out of all the records, hyperspace also does not seem very effective at the moment (perhaps not surprising, given the covering-index recommendations so far).
  • Our old workhorse BLOOM_INDEX still significantly outperforms it. But we should really step on the gas on RFC-15-like efforts and RFC-08 to make this much faster

https://microsoft.github.io/hyperspace/docs/ug-quick-start-guide/

vinothchandar / small_file_size_impact.md
Last active January 27, 2021 08:51
Spark SQL Amazon Reviews Dataset - Small file size impact

https://s3.amazonaws.com/amazon-reviews-pds/readme.html

vmacs:amazon-reviews vs$ find . -type f | cut -d/ -f2 | sort | uniq -c
  10 product_category=Apparel
  10 product_category=Automotive
  10 product_category=Baby
  10 product_category=Beauty
  10 product_category=Books
  10 product_category=Camera
vinothchandar / spark-sql-amazon-reviews.md
Last active July 20, 2020 03:12
Spark SQL Plans on Amazon Reviews Dataset
vinothchandar / rc_check.sh
Last active August 17, 2021 20:52
Apache Hudi RC Check
RC_NUM=rc1
RC_VERSION=0.9.0
# Checksums and Signatures OK
shasum -a 512 hudi-${RC_VERSION}-${RC_NUM}.src.tgz > sha512
diff sha512 hudi-${RC_VERSION}-${RC_NUM}.src.tgz.sha512 | wc -l
0
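
The checksum step above passes when `diff` reports zero differing lines. A minimal sketch of the same check as a reusable function follows, comparing only the hex digests so that a differing file name in the checksum file does not cause a false mismatch. `verify_sha512` is a hypothetical helper name, not part of the Hudi release tooling, and the sketch assumes the published checksum file begins with the hex digest (the format `shasum` itself writes).

```shell
#!/usr/bin/env bash
# Sketch: compare an artifact's SHA-512 digest with its published checksum file.
# verify_sha512 is a hypothetical helper, not part of any Hudi script.
verify_sha512() {
  local artifact="$1" checksum_file="$2" actual expected
  # Prefer sha512sum (common on Linux); fall back to shasum (common on macOS).
  if command -v sha512sum >/dev/null 2>&1; then
    actual=$(sha512sum "$artifact" | awk '{print $1}')
  else
    actual=$(shasum -a 512 "$artifact" | awk '{print $1}')
  fi
  # Assumption: the first field of the first line is the hex digest.
  expected=$(awk 'NR==1 {print $1}' "$checksum_file")
  # Succeed only if a digest was computed and it matches the published one.
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}
```

Under those assumptions, it could be called as `verify_sha512 hudi-${RC_VERSION}-${RC_NUM}.src.tgz hudi-${RC_VERSION}-${RC_NUM}.src.tgz.sha512 && echo "checksum OK"`.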
#user nobody;
worker_processes 1;
error_log /tmp/ngnix-error.log;
#error_log logs/error.log notice;
#error_log logs/error.log info;
pid /tmp/nginx.pid;
vinothchandar / check.sh
Last active October 16, 2019 18:26
Apache Hudi Release Check
07:26:38 [Cache]$ RC_NUM=rc6
# Checksums and Signatures OK
07:26:42 [Cache]$ shasum -a 512 hudi-0.5.0-incubating-${RC_NUM}.src.tgz > sha512
07:26:58 [Cache]$ diff sha512 hudi-0.5.0-incubating-${RC_NUM}.src.tgz.sha512.txt | wc -l
0
07:27:19 [Cache]$ gpg --verify hudi-0.5.0-incubating-${RC_NUM}.src.tgz.asc.txt hudi-0.5.0-incubating-${RC_NUM}.src.tgz
gpg: Signature made Wed Oct 16 03:34:37 2019 PDT
gpg: using RSA key AF9BAF79D311A3D3288E583F24A499037262AAA4
vinothchandar / HoodieMicroBench.java
Created June 4, 2018 14:07
Hoodie MicroBenchmark
import com.uber.hoodie.common.model.HoodieLogFile;
import com.uber.hoodie.common.table.log.HoodieLogFileReader;
import com.uber.hoodie.common.table.log.block.HoodieAvroDataBlock;
import com.uber.hoodie.common.table.log.block.HoodieLogBlock;
import com.uber.hoodie.common.table.log.block.HoodieLogBlock.HoodieLogBlockType;
import com.uber.hoodie.common.util.FSUtils;
import com.uber.hoodie.common.util.ParquetUtils;
import com.uber.hoodie.exception.HoodieIOException;
import java.io.IOException;
import java.util.ArrayList;