Zhenxiao Luo zhenxiao

zhenxiao / VectorizedParquet
Last active August 29, 2015 14:08
Supporting Vectorized APIs in Parquet
Motivation
Vectorized query execution can significantly improve performance for SQL engines such as Hive, Drill, and Presto. Instead of processing one row at a time, it processes a batch of rows at a time. Within a batch, each column is represented as a vector of a primitive data type. SQL engines can apply predicates to these vectors very efficiently, rather than pushing a single row through every operator before the next row can be processed.
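To make the batch-at-a-time idea concrete, here is a minimal sketch in Java. It is not Parquet's or Hive's API: the `RowBatch` class, its `price` and `quantity` columns, and the filter methods are all hypothetical names chosen for illustration. It assumes a batch holds one primitive array per column plus a selection vector listing the rows that still pass, and that each predicate runs as one loop over a single column.

```java
import java.util.Arrays;

class VectorizedFilterSketch {

    /** Hypothetical batch: one primitive array per column, plus a selection vector. */
    static final class RowBatch {
        final long[] price;    // column vector (hypothetical column)
        final int[] quantity;  // column vector (hypothetical column)
        final int[] selected;  // indices of rows that still pass every predicate so far
        int size;              // number of valid entries in `selected`

        RowBatch(long[] price, int[] quantity) {
            if (price.length != quantity.length) {
                throw new IllegalArgumentException("column vectors must have equal length");
            }
            this.price = price;
            this.quantity = quantity;
            this.size = price.length;
            this.selected = new int[size];
            for (int i = 0; i < size; i++) {
                selected[i] = i; // initially every row is selected
            }
        }
    }

    /** Applies `price > threshold` to the whole batch in one pass over one column. */
    static void filterPriceGreaterThan(RowBatch batch, long threshold) {
        int kept = 0;
        for (int i = 0; i < batch.size; i++) {
            int row = batch.selected[i];
            if (batch.price[row] > threshold) {
                batch.selected[kept++] = row; // compact in place; kept <= i always holds
            }
        }
        batch.size = kept;
    }

    /** Applies `quantity >= minimum` to the rows that survived earlier predicates. */
    static void filterQuantityAtLeast(RowBatch batch, int minimum) {
        int kept = 0;
        for (int i = 0; i < batch.size; i++) {
            int row = batch.selected[i];
            if (batch.quantity[row] >= minimum) {
                batch.selected[kept++] = row;
            }
        }
        batch.size = kept;
    }

    /** Returns a copy of the row indices that are still selected. */
    static int[] selectedRows(RowBatch batch) {
        return Arrays.copyOf(batch.selected, batch.size);
    }

    public static void main(String[] args) {
        RowBatch batch = new RowBatch(new long[] {5, 20, 30, 8}, new int[] {1, 0, 3, 9});
        filterPriceGreaterThan(batch, 10); // one operator over the whole batch
        filterQuantityAtLeast(batch, 1);   // next operator over the survivors
        System.out.println(Arrays.toString(selectedRows(batch))); // prints [2]
    }
}
```

Each operator finishes its pass over the batch before the next one starts, so the inner loops touch only primitive arrays. A row-at-a-time engine would instead evaluate both predicates for row 0, then row 1, and so on.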
Parquet is an efficient columnar data representation, so it would be valuable for it to support vectorized APIs. All SQL engines could then read vectors directly from Parquet files and run vectorized execution over the Parquet file format.
Requirement
Support Vectorized APIs in Parquet