Decimal Values in SQL-on-Hadoop

This document lays out the ways in which a few prominent SQL-on-Hadoop systems read and write decimal values from and to parquet files, and their respective in-memory formats.

Parquet's logical DECIMAL type can be represented by the following physical types:

  1. A 32-bit integer (INT32), for values whose unscaled form fits in 4 bytes (precision 1 through 9)
  2. A 64-bit integer (INT64), for values whose unscaled form fits in 8 bytes (precision 1 through 18)
  3. A fixed size array of bytes (FIXED_LEN_BYTE_ARRAY), whose length is chosen to fit the column's precision
  4. A variable size array of bytes (BINARY), with no limit on precision

Note some possible ambiguity around how many bytes to write for different precisions:

  1. For option 4 (variable size byte array), the specification says:

    The minimum number of bytes to store the unscaled value should be used.

  2. For option 3 (fixed size byte array), there is no such requirement. For example, a system may write decimal values as 16-byte arrays even though each individual value would fit into a 32-bit integer.

In practice, most systems write the minimum number of bytes needed to store a decimal value of a given precision.
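
As a concrete illustration of the "minimum number of bytes" reading of the spec (the class name below is arbitrary and not from any of the systems discussed): in Java, BigInteger.toByteArray() already returns the minimal big-endian two's complement encoding of a value.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public final class MinimalUnscaledBytes {
    public static void main(String[] args) {
        // 1234.56 at scale 2 has the unscaled value 123456.
        BigDecimal value = new BigDecimal("1234.56");
        BigInteger unscaled = value.unscaledValue();

        // toByteArray() yields the minimal big-endian two's complement encoding:
        // 123456 = 0x01E240, so three bytes rather than four or sixteen.
        byte[] minimal = unscaled.toByteArray();
        System.out.println(minimal.length);  // prints 3
    }
}
```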

Existing Systems

Each system supports reading and writing parquet files. If you're just interested in the TL;DR, then see the compatibility matrix below.

Compatibility Matrix

             Parquet-MR   Hive   Impala   Spark
Parquet-MR   R/W          R/W    R/W      R/W
Hive         R/W          R/W    R/W      R/W
Impala       R/W          R/W    R/W      R/W
Spark        R/W          R/W    R/W      R/W

Parquet-MR

This is the original implementation of the Parquet format in Java. All of the systems below except Impala use parquet-mr to implement read and write support for Parquet.

Reading

parquet-mr requires client libraries to extend a set of abstract classes, e.g., PrimitiveConverter, that convert Parquet physical types into the client's preferred logical type.
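
For illustration, here is a minimal sketch of such a converter for decimal columns. The class name, constructor, and scale handling are hypothetical; only the PrimitiveConverter and Binary calls are actual parquet-mr API.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

import org.apache.parquet.io.api.Binary;
import org.apache.parquet.io.api.PrimitiveConverter;

// Hypothetical converter: turns any of the three decimal physical encodings
// into a java.math.BigDecimal using the scale from the column metadata.
class DecimalConverter extends PrimitiveConverter {
    private final int scale;
    private BigDecimal current;  // last value seen

    DecimalConverter(int scale) {
        this.scale = scale;
    }

    @Override
    public void addInt(int unscaled) {        // DECIMAL stored as INT32
        current = BigDecimal.valueOf(unscaled, scale);
    }

    @Override
    public void addLong(long unscaled) {      // DECIMAL stored as INT64
        current = BigDecimal.valueOf(unscaled, scale);
    }

    @Override
    public void addBinary(Binary value) {     // DECIMAL stored as (FIXED_LEN_)BYTE_ARRAY
        // The bytes are the big-endian two's complement unscaled value.
        current = new BigDecimal(new BigInteger(value.getBytes()), scale);
    }

    BigDecimal current() {
        return current;
    }
}
```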

Writing

parquet-mr requires client libraries to extend a set of abstract classes, e.g., RecordConsumer, that convert the client's values of a particular logical type into the appropriate Parquet physical type.
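
A hedged sketch of the corresponding write path. The field name and index are illustrative, and the surrounding record bookkeeping (startMessage()/endMessage(), schema setup) is elided; the RecordConsumer and Binary calls are real parquet-mr API.

```java
import java.math.BigDecimal;

import org.apache.parquet.io.api.Binary;
import org.apache.parquet.io.api.RecordConsumer;

// Hypothetical writer fragment: emits one BigDecimal as a BINARY/FIXED_LEN_BYTE_ARRAY
// decimal field.
final class DecimalFieldWriter {
    static void writeDecimalField(RecordConsumer consumer, BigDecimal value) {
        consumer.startField("price", 0);
        // unscaledValue().toByteArray() is the minimal big-endian two's complement
        // encoding of the unscaled value, which is what the Parquet spec expects.
        byte[] unscaled = value.unscaledValue().toByteArray();
        consumer.addBinary(Binary.fromConstantByteArray(unscaled));
        consumer.endField("price", 0);
    }
}
```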

Hive

Scalar in-memory format

Hive uses custom FastHiveDecimal/FastHiveDecimalImpl classes which are backed by 3 Java long values, each of which is 8 bytes in memory.
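
Loosely, each long word holds a slice of the decimal digits of the unscaled value. The sketch below shows one plausible three-word packing (16 decimal digits per word); it is illustrative only and does not reproduce Hive's actual field names or its sign and scale handling.

```java
import java.math.BigInteger;

// Illustrative only: splits a non-negative unscaled decimal value of up to
// 38 digits into three longs of 16 decimal digits each. Names are hypothetical.
final class ThreeWordDecimal {
    private static final BigInteger WORD = BigInteger.TEN.pow(16);

    final long word0;  // least significant 16 decimal digits
    final long word1;  // next 16 decimal digits
    final long word2;  // most significant digits (up to 6 for precision 38)

    ThreeWordDecimal(BigInteger unscaled) {
        BigInteger[] q0 = unscaled.divideAndRemainder(WORD);
        BigInteger[] q1 = q0[0].divideAndRemainder(WORD);
        word0 = q0[1].longValueExact();
        word1 = q1[1].longValueExact();
        word2 = q1[0].longValueExact();
    }
}
```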

Vector in-memory format

Hive uses DecimalColumnVectors, which contain an array of HiveDecimalWritable objects. HiveDecimalWritable is a mutable subclass of FastHiveDecimal.

Reading

Hive calls Binary.getBytes() from the parquet-mr API to read an array of bytes into a HiveDecimalWritable.

Questions:

  1. Does Hive support reading Parquet decimals written in 32- or 64-bit integer format?

Writing

Hive writes the minimum number of bytes necessary to represent decimal values.

Impala

Scalar in-memory format

Impala has three scalar in-memory decimal types:

  1. Decimal4, backed by int32_t
  2. Decimal8, backed by int64_t
  3. Decimal16, backed by __int128_t

Vector in-memory format

Reading

Writing

Impala writes every Decimal value in the FIXED_LEN_BYTE_ARRAY format. It writes the minimum number of bytes necessary to represent decimal values.

Spark

Scalar in-memory format

Spark uses a custom Decimal class to represent scalar decimals. This class stores either a BigDecimal or Long depending on the precision. It also stores the precision and scale of the decimal.
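
A rough Java rendering of that design (Spark's class is Scala and its member names differ; the 18-digit threshold is the largest precision whose unscaled value always fits in a signed 64-bit integer):

```java
import java.math.BigDecimal;

// Sketch: small-precision values live in an unscaled long, larger ones fall
// back to a BigDecimal. Names are hypothetical.
final class CompactDecimal {
    private static final int MAX_LONG_PRECISION = 18;

    private final long unscaledLong;     // used when precision <= MAX_LONG_PRECISION
    private final BigDecimal bigValue;   // used otherwise (null when the long form is used)
    private final int precision;
    private final int scale;

    CompactDecimal(BigDecimal value, int precision, int scale) {
        this.precision = precision;
        this.scale = scale;
        if (precision <= MAX_LONG_PRECISION) {
            // Assumes the value already fits the declared scale without rounding.
            this.unscaledLong = value.setScale(scale).unscaledValue().longValueExact();
            this.bigValue = null;
        } else {
            this.unscaledLong = 0L;
            this.bigValue = value;
        }
    }

    BigDecimal toBigDecimal() {
        return bigValue != null ? bigValue : BigDecimal.valueOf(unscaledLong, scale);
    }
}
```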

Vector in-memory format

Spark uses an Array[Decimal] collection to hold vectors of decimals.

Reading

Spark reads bytes given to it by the parquet-mr API into its Decimal object mentioned above.

Writing

Spark writes decimal values to Parquet files in two modes: legacy and non-legacy.

  • Legacy mode writes all decimal values as fixed size byte arrays.
  • Non-legacy mode writes 32-bit integers, 64-bit integers, or fixed size byte arrays depending on the decimal precision.

In both modes, Spark writes only the minimum number of bytes needed for the declared precision of the decimal type.
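
The choice of physical type in the two modes amounts to something like the following sketch (names are illustrative; the 9 and 18 digit thresholds are the Parquet spec's limits for INT32 and INT64 decimals):

```java
// Physical type selection for decimal columns, mirroring the two modes above.
enum PhysicalType { INT32, INT64, FIXED_LEN_BYTE_ARRAY }

final class DecimalPhysicalTypeChooser {
    static PhysicalType physicalTypeFor(int precision, boolean legacyMode) {
        if (legacyMode) {
            return PhysicalType.FIXED_LEN_BYTE_ARRAY;  // legacy: always a fixed size byte array
        } else if (precision <= 9) {
            return PhysicalType.INT32;                 // unscaled value fits in 32 bits
        } else if (precision <= 18) {
            return PhysicalType.INT64;                 // unscaled value fits in 64 bits
        } else {
            return PhysicalType.FIXED_LEN_BYTE_ARRAY;
        }
    }
}
```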

What should Arrow + parquet-cpp do?

Scalar in-memory format

Arrow uses a custom, two's complement based, 16-byte Decimal128 class backed by one int64_t (the high 64 bits) and one uint64_t (the low 64 bits).
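
Arrow's class is C++; for illustration, here is the same two-word decomposition expressed in Java (names hypothetical), showing how the signed high word and unsigned low word combine into the full 128-bit value.

```java
import java.math.BigInteger;

// Illustrative two-word 128-bit integer: a signed high word and an
// unsigned low word, combined as high * 2^64 + low.
final class Int128 {
    private static final BigInteger LOW_MASK = new BigInteger("FFFFFFFFFFFFFFFF", 16);

    final long high;  // most significant 64 bits (carries the sign)
    final long low;   // least significant 64 bits, treated as unsigned

    Int128(long high, long low) {
        this.high = high;
        this.low = low;
    }

    BigInteger toBigInteger() {
        BigInteger unsignedLow = BigInteger.valueOf(low).and(LOW_MASK);
        return BigInteger.valueOf(high).shiftLeft(64).add(unsignedLow);
    }
}
```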

Vector in-memory format

Arrow converts the scalar in-memory values to their bytes in big-endian byte order. The vector format is an array whose elements are 16-byte fixed size byte arrays.
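
A hedged sketch of producing such a 16-byte element from an unscaled value: take the minimal big-endian two's complement bytes and sign-extend them on the left to the fixed width. It assumes the value fits in 16 bytes, i.e. precision at most 38; the class name is arbitrary.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Widen the minimal two's complement encoding of an unscaled value to a
// fixed 16-byte, big-endian slot by sign-extending on the left.
final class FixedSixteen {
    static byte[] toFixed16(BigInteger unscaled) {
        byte[] minimal = unscaled.toByteArray();            // minimal big-endian two's complement
        byte[] out = new byte[16];
        byte pad = (byte) (unscaled.signum() < 0 ? 0xFF : 0x00);
        Arrays.fill(out, 0, 16 - minimal.length, pad);      // sign-extend the leading bytes
        System.arraycopy(minimal, 0, out, 16 - minimal.length, minimal.length);
        return out;
    }
}
```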

Reading

Every system in this document supports writing decimal values as FIXED_LEN_BYTE_ARRAYs, and some also write decimals as 32- or 64-bit integers. Therefore, for maximum compatibility, parquet-cpp should support reading decimal values backed by INT32, INT64, and FIXED_LEN_BYTE_ARRAY.

Writing

Every system in this document supports reading Parquet files containing decimal values represented as FIXED_LEN_BYTE_ARRAYs. Therefore, for maximum compatibility, parquet-cpp should write decimals in the FIXED_LEN_BYTE_ARRAY physical format, using the smallest number of bytes possible for the declared precision.
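
For a fixed size byte array, the "smallest number of bytes possible" is determined by the column's declared precision rather than by individual values. A sketch of that computation (names are arbitrary): precision 9 needs 4 bytes, 18 needs 8, and 38 needs 16.

```java
import java.math.BigInteger;

// Smallest FIXED_LEN_BYTE_ARRAY length (in bytes) that can hold any unscaled
// value of the given decimal precision using signed two's complement.
final class DecimalWidth {
    static int minBytesForPrecision(int precision) {
        BigInteger maxUnscaled = BigInteger.TEN.pow(precision).subtract(BigInteger.ONE);
        int bytes = 1;
        // Largest signed value representable in `bytes` bytes is 2^(8 * bytes - 1) - 1.
        while (BigInteger.ONE.shiftLeft(8 * bytes - 1).subtract(BigInteger.ONE)
                .compareTo(maxUnscaled) < 0) {
            bytes++;
        }
        return bytes;
    }
}
```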
