Bankim Bhavsar bbhavsar

bbhavsar / Kudu Bloom filter performance test results.txt
Last active July 2, 2021 01:45
Performance test with Bloom filter support in Apache Kudu
Environment:
CDP 7.1.5
6 nodes (Dell PowerEdge R430, 20c/40t Xeon E5-2630 v4 @ 2.2 GHz, 128 GB RAM, 4 x 2 TB disks)
1) generate big table (260M) with all random data
2) copy big table to parquet
3) generate small table with top 1000 and bottom 1000 keys off big one
4) generate small table with top 1000 and bottom 1000 of non-key field off big one
5) compute stats for all tables
6) select from big kudu table, joining on key with half of the small table (filtered by some int field mod 2)
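
The idea behind step 6 can be sketched in a few lines of Python: a Bloom filter is built from the keys of the filtered small table and applied while scanning the big table, so most non-matching rows are dropped before the join. This is only an illustrative sketch, not the Kudu or Impala implementation. The BloomFilter class, its sizing (65536 bits, 4 hashes), the SHA-256 based hashing and the synthetic key ranges are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over integer keys (illustrative only)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


# Hypothetical stand-ins for the tables described in the steps above.
big_keys = range(100_000)                                       # big table keys
small_keys = list(range(1000)) + list(range(99_000, 100_000))   # top/bottom 1000 keys
build_side = [k for k in small_keys if k % 2 == 0]              # half of small (mod 2)

bf = BloomFilter()
for k in build_side:
    bf.add(k)

# Scan side: rows that fail the filter are dropped before the join.
candidates = [k for k in big_keys if bf.might_contain(k)]
joined = sorted(set(candidates) & set(build_side))
```

Because a Bloom filter never produces false negatives, the join result is unchanged; the filter only reduces how many big-table rows reach the join, at the cost of a small number of false positives.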
bbhavsar / TicketCache.java
Created October 30, 2020 22:19
Kudu Kerberos Ticket Cache
static KuduClient GetTicketCacheClient() {
  Subject subject = SecurityUtil.getSubjectFromTicketCacheOrNull();
  if (subject == null) {
    System.out.println("Subject not available from ticket cache");
    System.exit(1);
  }
  KuduClient client = null;
  try {
    client = Subject.doAs(subject, new PrivilegedExceptionAction<KuduClient>() {
      public KuduClient run() throws Exception {
bbhavsar / KerberosUGI.java
Created October 30, 2020 21:33
Kerberos with UGI
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduClient.KuduClientBuilder;
import org.apache.kudu.ColumnSchema.ColumnSchemaBuilder;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.ListTablesResponse;
+----------+--------------------+---------+------------+------------+----------------+
| Workload | File Format        | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+--------------------+---------+------------+------------+----------------+
| TPCH(30) | kudu / none / none | 13.85   | -28.89%    | 8.79       | -34.71%        |
+----------+--------------------+---------+------------+------------+----------------+
+----------+----------+--------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+---------+
| Workload | Query    | File Format        | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval    |
+----------+----------+--------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+---------+
| TPCH(30) | TPCH-Q9  | kudu / none / none | 41.61  | 40.81       | +1.95%     | * 16.94% * | * 14.71% *     | 5     |
block_cache_capacity_mb = 256 (default)
+----------+--------------------+---------+------------+------------+----------------+
| Workload | File Format        | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+--------------------+---------+------------+------------+----------------+
| TPCH(30) | kudu / none / none | 15.84   | -5.70%     | 9.93       | -13.69%        |
+----------+--------------------+---------+------------+------------+----------------+
+----------+----------+--------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+---------+
| Workload | Query    | File Format        | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval    |
+----------+----------+--------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+---------+
| TPCH(30) | TPCH-Q9  | kudu / none / none | 62.87  | 31.97       | R +96.64%  |
+----------+--------------------+---------+------------+------------+----------------+
| Workload | File Format        | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+--------------------+---------+------------+------------+----------------+
| TPCH(30) | kudu / none / none | 12.53   | -21.67%    | 8.44       | -23.04%        |
+----------+--------------------+---------+------------+------------+----------------+
+----------+----------+--------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+---------+
| Workload | Query    | File Format        | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval    |
+----------+----------+--------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+---------+
| TPCH(30) | TPCH-Q15 | kudu / none / none | 5.98   | 4.37        | +37.03%    | * 51.70% * | * 45.08% *     | 5     |
bbhavsar / TPCHQ9.txt
Last active June 2, 2020 18:42
Regression observed with TPCH-Q9 query with Impala when pushing down Bloom filter predicate
4.2.1 TPCH-Q9 SQL Statement:
select nation, o_year, sum(amount) as sum_profit
from (select n_name as nation, extract(year from o_orderdate) as o_year,
l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
from part, supplier, lineitem, partsupp, orders, nation
where s_suppkey = l_suppkey and ps_suppkey = l_suppkey and ps_partkey = l_partkey
and p_partkey = l_partkey and o_orderkey = l_orderkey and s_nationkey = n_nationkey
and p_name like '%:1%')
as profit group by nation, o_year order by nation, o_year desc LIMIT 1;
# Modification of codec-test.py from Todd Lipcon to be Python 3 compatible, with some formatting of the output.
# https://github.infra.cloudera.com/raw/todd/experiments/master/kudu/codec-test.py
import pyfastpfor
import numpy as np
import pandas as pd
from timeit import Timer
import bitshuffle
import sys
from prettytable import PrettyTable
import random

# 10M values to be generated
count = 10 * 1024 * 1024

def gen_repeat_in_small_range():
    for i in range(0, int(count/256/256)):
        for j in range(0, 256):
            for k in range(0, 256):
The _for128 and _for256 variants use blocks of 128/256 input integers to calculate the diff
and min across each block, simulating the mechanism used in Kudu's encoding implementation.
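
As a rough illustration of the block-wise min/diff idea described above, the sketch below estimates bits per integer when each block stores its minimum once and every value as an offset from that minimum. It is not the code from codec-test.py or from Kudu. The function name for_bits_per_int, the assumed per-block header (32-bit minimum plus 8-bit width) and the synthetic data are all assumptions for the example.

```python
import random


def for_bits_per_int(values, block_size):
    """Estimate bits per integer for block-wise min/diff packing."""
    total_bits = 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        base = min(block)
        # Each value is stored as its offset (diff) from the block minimum,
        # using the bit width of the largest offset in the block.
        width = max(v - base for v in block).bit_length()
        # Assumed header per block: 32-bit minimum plus 8-bit bit width.
        total_bits += 32 + 8 + width * len(block)
    return total_bits / len(values)


random.seed(0)
# Hypothetical data: values clustered in a small range around a large offset.
data = [1_000_000 + random.randrange(256) for _ in range(128 * 1024)]
bits_128 = for_bits_per_int(data, 128)
bits_256 = for_bits_per_int(data, 256)
```

With values confined to a range of 256, each offset fits in at most 8 bits, so both estimates stay close to 8 bits per integer plus the amortized header; larger blocks amortize the header over more values but are more exposed to outliers widening the block.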
$ python codec-test.py repeat_small_range.csv
+--------------------------+----------------------+------------------------+--------------+
| codec                    | comp_time(millisecs) | decomp_time(millisecs) | bits_per_int |
+--------------------------+----------------------+------------------------+--------------+
| bitshuffle               | 42.511               | 165.078                | 0.4868       |
| simdbinarypacking        | 33.127               | 34.271                 | 7.0821       |