@zhenxiao
Last active August 29, 2015 14:08
Supporting Vectorized APIs in Parquet
Motivation
Vectorized Query Execution can deliver a large performance improvement for SQL engines like Hive, Drill, and Presto. Instead of processing one row at a time, Vectorized Query Execution streamlines operations by processing a batch of rows at a time. Within one batch, each column is represented as a vector of a primitive data type. SQL engines can apply predicates very efficiently on these vectors, rather than pushing a single row through all the operators before the next row can be processed.
Since Parquet is an efficient columnar data representation, it would be valuable for Parquet to support Vectorized APIs, so that all SQL engines could read vectors from Parquet files and perform vectorized execution over the Parquet file format.
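As a concrete illustration of the batch model, here is a minimal, self-contained Java sketch (not Parquet code; the class and method names are invented for illustration) that evaluates the predicate value > 100 over one column of a four-row batch. The column is a primitive long[] and the predicate runs in a tight loop, with no per-row operator dispatch:

```java
// Illustrative sketch of vectorized predicate evaluation: one column of a
// 4-row batch is a primitive long[], and the predicate produces a selection
// bitmap in a single tight loop.
public class VectorizedPredicateSketch {
    // Returns a selection bitmap: selected[i] is true if row i passes.
    static boolean[] evalGreaterThan(long[] columnVector, long threshold) {
        boolean[] selected = new boolean[columnVector.length];
        for (int i = 0; i < columnVector.length; i++) {
            selected[i] = columnVector[i] > threshold;
        }
        return selected;
    }

    public static void main(String[] args) {
        long[] price = {50, 150, 99, 200};  // one column of a 4-row batch
        boolean[] sel = evalGreaterThan(price, 100);
        System.out.println(java.util.Arrays.toString(sel));
        // prints [false, true, false, true]
    }
}
```

A row-at-a-time engine would instead materialize each row and pass it through every operator before touching the next row; the tight loop above is what makes the vectorized form amenable to JIT optimization.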
Requirement
Support Vectorized APIs in Parquet
A generalized Vectorized API that can be used by any SQL engine, e.g. Hive, Drill, Presto
Backward compatible with the existing row-by-row APIs
Proposed APIs
Readers
Add new VectorizedRecordReader:
public abstract class VectorizedRecordReader<KEYIN, VALUEIN> implements Closeable {
  /**
   * Called once at initialization.
   * @param split the split that defines the range of records to read
   * @param context the information about the task
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract void initialize(InputSplit split,
                                  TaskAttemptContext context
                                  ) throws IOException, InterruptedException;

  /**
   * Read a batch of values in one column and fill the vector.
   * @param descriptor the column to read
   * @param vector the vector to fill
   */
  public abstract void readVector(ColumnDescriptor descriptor, ColumnVector vector);

  /**
   * The current progress of the record reader through its data.
   * @return a number between 0.0 and 1.0 that is the fraction of the data read
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract float getProgress() throws IOException, InterruptedException;

  /**
   * Close the record reader.
   */
  public abstract void close() throws IOException;
}
Add new ParquetVectorizedRecordReader, which extends the VectorizedRecordReader.
Add new VectorizedColumnReader:
public interface VectorizedColumnReader {
  /**
   * Read a batch of values in this column and fill the vector.
   * @param vector the vector to fill
   */
  void readVector(ColumnVector vector);

  /**
   * @return descriptor of the column
   */
  ColumnDescriptor getDescriptor();
}
In parquet-column/src/main/java/parquet/column/page/PageReader.java:
+  /**
+   * Read a batch of values in this column and fill the vector.
+   * @param vector the vector to fill
+   */
+  void readVector(ColumnVector vector);
Add new VectorReader:
public abstract class VectorReader {
  /**
   * Called to initialize the column reader from a part of a page.
   *
   * The underlying implementation knows how much data to read, so a length
   * is not provided.
   *
   * Each page may contain several sections:
   * <ul>
   *   <li> repetition levels column
   *   <li> definition levels column
   *   <li> data column
   * </ul>
   *
   * This method is called with 'offset' pointing to the beginning of one of
   * these sections; the reader consumes the section it is responsible for.
   *
   * @param valueCount count of values in this page
   * @param page the array to read from containing the page data (repetition levels, definition levels, data)
   * @param offset where to start reading from in the page
   *
   * @throws IOException
   */
  public abstract void initFromPage(int valueCount, ByteBuffer page, int offset) throws IOException;

  /**
   * Read a batch of values in this column and fill the vector.
   * @param vector the vector to fill
   */
  public abstract void readVector(ColumnVector vector);
}
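To make the intended call sequence concrete, here is a self-contained sketch of what a PLAIN-encoded INT64 VectorReader might do internally: initFromPage remembers where the data section starts, and readVector copies a batch of fixed-width little-endian longs into a primitive array. The class name is invented, and a plain long[] stands in for a LongColumnVector; the real reader would also handle definition levels and nulls.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of a PLAIN-encoding vector reader for INT64 values.
// initFromPage records the start of the data section; readVector bulk-copies
// fixed-width little-endian longs out of the page buffer.
public class PlainLongVectorReaderSketch {
    private ByteBuffer page;
    private int offset;
    private int valueCount;

    public void initFromPage(int valueCount, ByteBuffer page, int offset) {
        this.valueCount = valueCount;
        // duplicate() so absolute reads never disturb the caller's position
        this.page = page.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        this.offset = offset;
    }

    // Fills 'vector' with up to vector.length values; returns how many were read.
    public int readVector(long[] vector) {
        int n = Math.min(vector.length, valueCount);
        for (int i = 0; i < n; i++) {
            vector[i] = page.getLong(offset + i * 8);  // absolute read
        }
        offset += n * 8;
        valueCount -= n;
        return n;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(7).putLong(8).putLong(9);  // a 3-value PLAIN data section
        PlainLongVectorReaderSketch r = new PlainLongVectorReaderSketch();
        r.initFromPage(3, buf, 0);
        long[] v = new long[3];
        int n = r.readVector(v);
        System.out.println(n + " values: " + java.util.Arrays.toString(v));
        // prints 3 values: [7, 8, 9]
    }
}
```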
Add the following specific VectorReaders:
BitPackingVectorReader
ByteBitPackingVectorReader
BoundedIntVectorReader
ZeroIntegerVectorReader
DeltaBinaryPackingVectorReader
DeltaLengthByteArrayVectorReader
DeltaByteArrayVectorReader
DictionaryVectorReader
BinaryPlainVectorReader
BooleanPlainVectorReader
FixedLenByteArrayPlainVectorReader
RunLengthBitPackingHybridVectorReader
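For intuition on what one of these readers does, here is a simplified, self-contained sketch of the run-length half of the RLE/bit-packing hybrid that a RunLengthBitPackingHybridVectorReader would implement (e.g. for definition levels). It is illustrative only: the real Parquet encoding also interleaves bit-packed runs and encodes run headers as varints, which are omitted here.

```java
// Simplified sketch of run-length decoding: expand (value, runLength) pairs
// into a flat vector, the way an RLE reader fills definition levels.
// (Illustrative; the real hybrid encoding also has bit-packed runs and
// varint-encoded run headers.)
public class RleDecodeSketch {
    // Returns the number of values written into 'vector'.
    static int decodeRuns(int[] values, int[] runLengths, int[] vector) {
        int out = 0;
        for (int r = 0; r < values.length; r++) {
            for (int i = 0; i < runLengths[r]; i++) {
                vector[out++] = values[r];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] vector = new int[6];
        // run of three 1s, run of two 0s, run of one 1
        int n = decodeRuns(new int[]{1, 0, 1}, new int[]{3, 2, 1}, vector);
        System.out.println(n + ": " + java.util.Arrays.toString(vector));
        // prints 6: [1, 1, 1, 0, 0, 1]
    }
}
```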
In parquet-column/src/main/java/parquet/column/Encoding.java:
Add getVectorReader for each encoding:
/**
 * Get the vector reader for the specified column, with the specified values type.
 * @param descriptor the column to read
 * @param valuesType the type of values to read
 * @return the vector reader
 */
public VectorReader getVectorReader(ColumnDescriptor descriptor, ValuesType valuesType);
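One plausible way to wire this dispatch, sketched with invented names: each encoding constant maps to a factory for its specific reader. The real Encoding in parquet-column is an enum, but the factory wiring below is an assumption, and the readers are reduced to name-returning stubs so the sketch stays self-contained.

```java
import java.util.function.Supplier;

// Hypothetical dispatch sketch: each encoding constant carries a factory for
// its specific VectorReader. Stubs stand in for the real reader classes.
public class EncodingDispatchSketch {
    interface VectorReader { String name(); }

    enum Encoding {
        PLAIN(() -> () -> "BinaryPlainVectorReader"),
        RLE(() -> () -> "RunLengthBitPackingHybridVectorReader");

        private final Supplier<VectorReader> factory;
        Encoding(Supplier<VectorReader> factory) { this.factory = factory; }
        public VectorReader getVectorReader() { return factory.get(); }
    }

    public static void main(String[] args) {
        System.out.println(Encoding.PLAIN.getVectorReader().name());
        // prints BinaryPlainVectorReader
    }
}
```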
Vectors
public interface ColumnVector {
  /**
   * @return a bitmap indicating which values are null
   */
  boolean[] isNull();

  /**
   * @return the vector type: long, double, etc.
   */
  VectorType getValueType();

  /**
   * @return a ByteBuffer wrapping the primitive value array
   */
  ByteBuffer getValues();

  /**
   * Fill the column vector with the provided values.
   * @param values the values to fill with
   */
  void fill(ByteBuffer values);

  /**
   * @return the number of values in this column vector
   */
  long getSize();
}
Each primitive data type will have its own implementation of the ColumnVector interface, e.g. LongColumnVector, DoubleColumnVector, etc.
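As a sketch of one such implementation, here is a minimal, self-contained LongColumnVector backed by a ByteBuffer with a parallel null bitmap. The nested interface is a trimmed copy of the proposed ColumnVector, and the storage details (little-endian layout, the getLong accessor) are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of a long-typed column vector: values live in a ByteBuffer-wrapped
// primitive array, with a parallel boolean null bitmap.
public class LongColumnVectorSketch {
    // Trimmed copy of the proposed ColumnVector interface.
    interface ColumnVector {
        boolean[] isNull();
        ByteBuffer getValues();
        long getSize();
    }

    static class LongColumnVector implements ColumnVector {
        private final ByteBuffer values;
        private final boolean[] isNull;
        private final int size;

        LongColumnVector(long[] data, boolean[] isNull) {
            this.size = data.length;
            this.isNull = isNull;
            this.values = ByteBuffer.allocate(data.length * 8)
                                    .order(ByteOrder.LITTLE_ENDIAN);
            for (long v : data) values.putLong(v);
            values.flip();
        }

        public boolean[] isNull() { return isNull; }
        public ByteBuffer getValues() { return values.asReadOnlyBuffer(); }
        public long getSize() { return size; }

        // Convenience accessor (an assumption): read value i from the buffer.
        long getLong(int i) { return values.getLong(i * 8); }
    }

    public static void main(String[] args) {
        LongColumnVector v = new LongColumnVector(
                new long[]{10, 20, 30}, new boolean[]{false, false, true});
        System.out.println(v.getSize() + " " + v.getLong(1) + " " + v.isNull()[2]);
        // prints 3 20 true
    }
}
```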