@zhenxiao
Last active August 29, 2015 14:08
Supporting Vectorized APIs in Parquet
Motivation
Vectorized Query Execution can deliver a large performance improvement for SQL engines like Hive, Drill, and Presto. Instead of processing one row at a time, Vectorized Query Execution streamlines operations by processing a batch of rows at a time. Within one batch, each column is represented as a vector of a primitive data type. SQL engines can apply predicates very efficiently on these vectors, rather than pushing a single row through all the operators before the next row can be processed.
Since Parquet is an efficient columnar data representation, it would be valuable for Parquet to support Vectorized APIs, so that all SQL engines could read vectors from Parquet files and perform vectorized execution over the Parquet file format.
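As a concrete illustration of the batch model, here is a minimal, self-contained Java sketch (not Parquet code; the class and method names are invented for illustration) that evaluates the predicate value > 100 over one column of a four-row batch. The column is a primitive long[] and the predicate runs in a tight loop, with no per-row operator dispatch:

```java
// Illustrative sketch of vectorized predicate evaluation: one column of a
// 4-row batch is a primitive long[], and the predicate produces a selection
// bitmap in a single tight loop.
public class VectorizedPredicateSketch {
    // Returns a selection bitmap: selected[i] is true if row i passes.
    static boolean[] evalGreaterThan(long[] columnVector, long threshold) {
        boolean[] selected = new boolean[columnVector.length];
        for (int i = 0; i < columnVector.length; i++) {
            selected[i] = columnVector[i] > threshold;
        }
        return selected;
    }

    public static void main(String[] args) {
        long[] price = {50, 150, 99, 200};  // one column of a 4-row batch
        boolean[] sel = evalGreaterThan(price, 100);
        System.out.println(java.util.Arrays.toString(sel));
        // prints [false, true, false, true]
    }
}
```

A row-at-a-time engine would instead materialize each row and pass it through every operator before touching the next row; the tight loop above is what makes the vectorized form amenable to JIT optimization.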
Requirement
Support Vectorized APIs in Parquet
A generalized Vectorized API that can be used by any SQL engine, e.g. Hive, Drill, Presto
Backward compatible with the existing row-by-row APIs
Proposed APIs
Readers
Add new VectorizedRecordReader:
public abstract class VectorizedRecordReader<KEYIN, VALUEIN> implements Closeable {
  /**
   * Called once at initialization.
   * @param split the split that defines the range of records to read
   * @param context the information about the task
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract void initialize(InputSplit split,
                                  TaskAttemptContext context
                                  ) throws IOException, InterruptedException;

  /**
   * Read a batch of values in one column and fill the vector.
   * @param descriptor the column to read
   * @param vector the vector to fill
   */
  public abstract void readVector(ColumnDescriptor descriptor, ColumnVector vector);

  /**
   * The current progress of the record reader through its data.
   * @return a number between 0.0 and 1.0 that is the fraction of the data read
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract float getProgress() throws IOException, InterruptedException;

  /**
   * Close the record reader.
   */
  public abstract void close() throws IOException;
}
Add new ParquetVectorizedRecordReader, which extends the VectorizedRecordReader.
Add new VectorizedColumnReader:
public interface VectorizedColumnReader {
  /**
   * Read a batch of values in this column and fill the vector.
   * @param vector the vector to fill
   */
  void readVector(ColumnVector vector);

  /**
   * @return descriptor of the column
   */
  ColumnDescriptor getDescriptor();
}
In parquet-column/src/main/java/parquet/column/page/PageReader.java:
+  /**
+   * Read a batch of values in this column and fill the vector.
+   * @param vector the vector to fill
+   */
+  void readVector(ColumnVector vector);
Add new VectorReader:
public abstract class VectorReader {
  /**
   * Called to initialize the column reader from a part of a page.
   *
   * The underlying implementation knows how much data to read, so a length
   * is not provided.
   *
   * Each page may contain several sections:
   * <ul>
   *   <li> repetition levels column
   *   <li> definition levels column
   *   <li> data column
   * </ul>
   *
   * This method is called with 'offset' pointing to the beginning of one of
   * these sections; the reader consumes the section it is responsible for.
   *
   * @param valueCount count of values in this page
   * @param page the array to read from containing the page data (repetition levels, definition levels, data)
   * @param offset where to start reading from in the page
   *
   * @throws IOException
   */
  public abstract void initFromPage(int valueCount, ByteBuffer page, int offset) throws IOException;

  /**
   * Read a batch of values in this column and fill the vector.
   * @param vector the vector to fill
   */
  public abstract void readVector(ColumnVector vector);
}
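To make the intended call sequence concrete, here is a self-contained sketch of what a PLAIN-encoded INT64 VectorReader might do internally: initFromPage remembers where the data section starts, and readVector copies a batch of fixed-width little-endian longs into a primitive array. The class name is invented, and a plain long[] stands in for a LongColumnVector; the real reader would also handle definition levels and nulls.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of a PLAIN-encoding vector reader for INT64 values.
// initFromPage records the start of the data section; readVector bulk-copies
// fixed-width little-endian longs out of the page buffer.
public class PlainLongVectorReaderSketch {
    private ByteBuffer page;
    private int offset;
    private int valueCount;

    public void initFromPage(int valueCount, ByteBuffer page, int offset) {
        this.valueCount = valueCount;
        // duplicate() so absolute reads never disturb the caller's position
        this.page = page.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        this.offset = offset;
    }

    // Fills 'vector' with up to vector.length values; returns how many were read.
    public int readVector(long[] vector) {
        int n = Math.min(vector.length, valueCount);
        for (int i = 0; i < n; i++) {
            vector[i] = page.getLong(offset + i * 8);  // absolute read
        }
        offset += n * 8;
        valueCount -= n;
        return n;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(7).putLong(8).putLong(9);  // a 3-value PLAIN data section
        PlainLongVectorReaderSketch r = new PlainLongVectorReaderSketch();
        r.initFromPage(3, buf, 0);
        long[] v = new long[3];
        int n = r.readVector(v);
        System.out.println(n + " values: " + java.util.Arrays.toString(v));
        // prints 3 values: [7, 8, 9]
    }
}
```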
Add the following specific VectorReaders:
BitPackingVectorReader
ByteBitPackingVectorReader
BoundedIntVectorReader
ZeroIntegerVectorReader
DeltaBinaryPackingVectorReader
DeltaLengthByteArrayVectorReader
DeltaByteArrayVectorReader
DictionaryVectorReader
BinaryPlainVectorReader
BooleanPlainVectorReader
FixedLenByteArrayPlainVectorReader
RunLengthBitPackingHybridVectorReader
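For intuition on what one of these readers does, here is a simplified, self-contained sketch of the run-length half of the RLE/bit-packing hybrid that a RunLengthBitPackingHybridVectorReader would implement (e.g. for definition levels). It is illustrative only: the real Parquet encoding also interleaves bit-packed runs and encodes run headers as varints, which are omitted here.

```java
// Simplified sketch of run-length decoding: expand (value, runLength) pairs
// into a flat vector, the way an RLE reader fills definition levels.
// (Illustrative; the real hybrid encoding also has bit-packed runs and
// varint-encoded run headers.)
public class RleDecodeSketch {
    // Returns the number of values written into 'vector'.
    static int decodeRuns(int[] values, int[] runLengths, int[] vector) {
        int out = 0;
        for (int r = 0; r < values.length; r++) {
            for (int i = 0; i < runLengths[r]; i++) {
                vector[out++] = values[r];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] vector = new int[6];
        // run of three 1s, run of two 0s, run of one 1
        int n = decodeRuns(new int[]{1, 0, 1}, new int[]{3, 2, 1}, vector);
        System.out.println(n + ": " + java.util.Arrays.toString(vector));
        // prints 6: [1, 1, 1, 0, 0, 1]
    }
}
```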
In parquet-column/src/main/java/parquet/column/Encoding.java:
Add getVectorReader for each encoding:
/**
 * Get the vector reader for the specified column, with the specified values type.
 * @param descriptor the column to read
 * @param valuesType the type of values to read
 * @return the vector reader
 */
public VectorReader getVectorReader(ColumnDescriptor descriptor, ValuesType valuesType);
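One plausible way to wire this dispatch, sketched with invented names: each encoding constant maps to a factory for its specific reader. The real Encoding in parquet-column is an enum, but the factory wiring below is an assumption, and the readers are reduced to name-returning stubs so the sketch stays self-contained.

```java
import java.util.function.Supplier;

// Hypothetical dispatch sketch: each encoding constant carries a factory for
// its specific VectorReader. Stubs stand in for the real reader classes.
public class EncodingDispatchSketch {
    interface VectorReader { String name(); }

    enum Encoding {
        PLAIN(() -> () -> "BinaryPlainVectorReader"),
        RLE(() -> () -> "RunLengthBitPackingHybridVectorReader");

        private final Supplier<VectorReader> factory;
        Encoding(Supplier<VectorReader> factory) { this.factory = factory; }
        public VectorReader getVectorReader() { return factory.get(); }
    }

    public static void main(String[] args) {
        System.out.println(Encoding.PLAIN.getVectorReader().name());
        // prints BinaryPlainVectorReader
    }
}
```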
Vectors
public interface ColumnVector {
  /**
   * @return a bitmap indicating which values are null
   */
  boolean[] isNull();

  /**
   * @return the vector type: long, double, etc.
   */
  VectorType getValueType();

  /**
   * @return a ByteBuffer wrapping the primitive value array
   */
  ByteBuffer getValues();

  /**
   * Fill the column vector with the provided values.
   * @param values the values to fill with
   */
  void fill(ByteBuffer values);

  /**
   * @return the number of values in this column vector
   */
  long getSize();
}
Each primitive data type will have its own implementation of the ColumnVector interface, e.g. LongColumnVector, DoubleColumnVector, etc.
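As a sketch of one such implementation, here is a minimal, self-contained LongColumnVector backed by a ByteBuffer with a parallel null bitmap. The nested interface is a trimmed copy of the proposed ColumnVector, and the storage details (little-endian layout, the getLong accessor) are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of a long-typed column vector: values live in a ByteBuffer-wrapped
// primitive array, with a parallel boolean null bitmap.
public class LongColumnVectorSketch {
    // Trimmed copy of the proposed ColumnVector interface.
    interface ColumnVector {
        boolean[] isNull();
        ByteBuffer getValues();
        long getSize();
    }

    static class LongColumnVector implements ColumnVector {
        private final ByteBuffer values;
        private final boolean[] isNull;
        private final int size;

        LongColumnVector(long[] data, boolean[] isNull) {
            this.size = data.length;
            this.isNull = isNull;
            this.values = ByteBuffer.allocate(data.length * 8)
                                    .order(ByteOrder.LITTLE_ENDIAN);
            for (long v : data) values.putLong(v);
            values.flip();
        }

        public boolean[] isNull() { return isNull; }
        public ByteBuffer getValues() { return values.asReadOnlyBuffer(); }
        public long getSize() { return size; }

        // Convenience accessor (an assumption): read value i from the buffer.
        long getLong(int i) { return values.getLong(i * 8); }
    }

    public static void main(String[] args) {
        LongColumnVector v = new LongColumnVector(
                new long[]{10, 20, 30}, new boolean[]{false, false, true});
        System.out.println(v.getSize() + " " + v.getLong(1) + " " + v.isNull()[2]);
        // prints 3 20 true
    }
}
```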