Decimal Values in SQL-on-Hadoop

This document lays out the ways in which a few prominent SQL-on-Hadoop systems read and write decimal values from and to parquet files, and their respective in-memory formats.

Parquet's logical DECIMAL type can be represented by the following physical types:

  1. A 32-bit integer (INT32), for values whose unscaled form fits in 4 bytes (precision 1 through 9)
  2. A 64-bit integer (INT64), for values whose unscaled form fits in 8 bytes (precision 1 through 18)
  3. A fixed size array of bytes (FIXED_LEN_BYTE_ARRAY), whose length is chosen to fit the column's precision
  4. A variable size array of bytes (BINARY), with no limit on precision

Note some possible ambiguity around how many bytes to write for different precisions:

  1. For option 4 (variable size byte array), the specification says:

    The minimum number of bytes to store the unscaled value should be used.

  2. For option 3 (fixed size byte array), there is no such requirement. For example, a system may write decimal values as 16-byte arrays even though each individual value would fit into a 32-bit integer.

In practice, most systems write the minimum number of bytes needed to store a decimal value of a given precision.
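
As a concrete illustration of the "minimum number of bytes" reading of the spec (the class name below is arbitrary and not from any of the systems discussed): in Java, BigInteger.toByteArray() already returns the minimal big-endian two's complement encoding of a value.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public final class MinimalUnscaledBytes {
    public static void main(String[] args) {
        // 1234.56 at scale 2 has the unscaled value 123456.
        BigDecimal value = new BigDecimal("1234.56");
        BigInteger unscaled = value.unscaledValue();

        // toByteArray() yields the minimal big-endian two's complement encoding:
        // 123456 = 0x01E240, so three bytes rather than four or sixteen.
        byte[] minimal = unscaled.toByteArray();
        System.out.println(minimal.length);  // prints 3
    }
}
```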

Existing Systems

Each system supports reading and writing parquet files. If you're just interested in the TL;DR, then see the compatibility matrix below.

Compatibility Matrix

             Parquet-MR   Hive   Impala   Spark
Parquet-MR   R/W          R/W    R/W      R/W
Hive         R/W          R/W    R/W      R/W
Impala       R/W          R/W    R/W      R/W
Spark        R/W          R/W    R/W      R/W

Parquet-MR

This is the original implementation of the Parquet format in Java. All of the systems below except Impala use parquet-mr to implement read and write support for Parquet.

Reading

parquet-mr requires client libraries to extend a set of abstract classes, e.g., PrimitiveConverter, that convert Parquet physical types into the client's preferred logical type.
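
For illustration, here is a minimal sketch of such a converter for decimal columns. The class name, constructor, and scale handling are hypothetical; only the PrimitiveConverter and Binary calls are actual parquet-mr API.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

import org.apache.parquet.io.api.Binary;
import org.apache.parquet.io.api.PrimitiveConverter;

// Hypothetical converter: turns any of the three decimal physical encodings
// into a java.math.BigDecimal using the scale from the column metadata.
class DecimalConverter extends PrimitiveConverter {
    private final int scale;
    private BigDecimal current;  // last value seen

    DecimalConverter(int scale) {
        this.scale = scale;
    }

    @Override
    public void addInt(int unscaled) {        // DECIMAL stored as INT32
        current = BigDecimal.valueOf(unscaled, scale);
    }

    @Override
    public void addLong(long unscaled) {      // DECIMAL stored as INT64
        current = BigDecimal.valueOf(unscaled, scale);
    }

    @Override
    public void addBinary(Binary value) {     // DECIMAL stored as (FIXED_LEN_)BYTE_ARRAY
        // The bytes are the big-endian two's complement unscaled value.
        current = new BigDecimal(new BigInteger(value.getBytes()), scale);
    }

    BigDecimal current() {
        return current;
    }
}
```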

Writing

parquet-mr requires client libraries to extend a set of abstract classes, e.g., RecordConsumer, that convert the client's values of a particular logical type into the appropriate Parquet physical type.
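
A hedged sketch of the corresponding write path. The field name and index are illustrative, and the surrounding record bookkeeping (startMessage()/endMessage(), schema setup) is elided; the RecordConsumer and Binary calls are real parquet-mr API.

```java
import java.math.BigDecimal;

import org.apache.parquet.io.api.Binary;
import org.apache.parquet.io.api.RecordConsumer;

// Hypothetical writer fragment: emits one BigDecimal as a BINARY/FIXED_LEN_BYTE_ARRAY
// decimal field.
final class DecimalFieldWriter {
    static void writeDecimalField(RecordConsumer consumer, BigDecimal value) {
        consumer.startField("price", 0);
        // unscaledValue().toByteArray() is the minimal big-endian two's complement
        // encoding of the unscaled value, which is what the Parquet spec expects.
        byte[] unscaled = value.unscaledValue().toByteArray();
        consumer.addBinary(Binary.fromConstantByteArray(unscaled));
        consumer.endField("price", 0);
    }
}
```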

Hive

Scalar in-memory format

Hive uses custom FastHiveDecimal/FastHiveDecimalImpl classes which are backed by 3 Java long values, each of which is 8 bytes in memory.
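
Loosely, each long word holds a slice of the decimal digits of the unscaled value. The sketch below shows one plausible three-word packing (16 decimal digits per word); it is illustrative only and does not reproduce Hive's actual field names or its sign and scale handling.

```java
import java.math.BigInteger;

// Illustrative only: splits a non-negative unscaled decimal value of up to
// 38 digits into three longs of 16 decimal digits each. Names are hypothetical.
final class ThreeWordDecimal {
    private static final BigInteger WORD = BigInteger.TEN.pow(16);

    final long word0;  // least significant 16 decimal digits
    final long word1;  // next 16 decimal digits
    final long word2;  // most significant digits (up to 6 for precision 38)

    ThreeWordDecimal(BigInteger unscaled) {
        BigInteger[] q0 = unscaled.divideAndRemainder(WORD);
        BigInteger[] q1 = q0[0].divideAndRemainder(WORD);
        word0 = q0[1].longValueExact();
        word1 = q1[1].longValueExact();
        word2 = q1[0].longValueExact();
    }
}
```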

Vector in-memory format

Hive uses DecimalColumnVectors, which contain an array of HiveDecimalWritable objects. HiveDecimalWritable is a mutable subclass of FastHiveDecimal.

Reading

Hive calls Binary.getBytes() from the parquet-mr API to read an array of bytes into a HiveDecimalWritable.

Questions:

  1. Does Hive support reading Parquet decimals written in 32- or 64-bit integer format?

Writing

Hive writes the minimum number of bytes necessary to represent decimal values.

Impala

Scalar in-memory format

Impala has three scalar in-memory decimal types:

  1. Decimal4, backed by int32_t
  2. Decimal8, backed by int64_t
  3. Decimal16, backed by __int128_t

Vector in-memory format

Reading

Writing

Impala writes every Decimal value in the FIXED_LEN_BYTE_ARRAY format. It writes the minimum number of bytes necessary to represent decimal values.

Spark

Scalar in-memory format

Spark uses a custom Decimal class to represent scalar decimals. This class stores either a BigDecimal or Long depending on the precision. It also stores the precision and scale of the decimal.
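
A rough Java rendering of that design (Spark's class is Scala and its member names differ; the 18-digit threshold is the largest precision whose unscaled value always fits in a signed 64-bit integer):

```java
import java.math.BigDecimal;

// Sketch: small-precision values live in an unscaled long, larger ones fall
// back to a BigDecimal. Names are hypothetical.
final class CompactDecimal {
    private static final int MAX_LONG_PRECISION = 18;

    private final long unscaledLong;     // used when precision <= MAX_LONG_PRECISION
    private final BigDecimal bigValue;   // used otherwise (null when the long form is used)
    private final int precision;
    private final int scale;

    CompactDecimal(BigDecimal value, int precision, int scale) {
        this.precision = precision;
        this.scale = scale;
        if (precision <= MAX_LONG_PRECISION) {
            // Assumes the value already fits the declared scale without rounding.
            this.unscaledLong = value.setScale(scale).unscaledValue().longValueExact();
            this.bigValue = null;
        } else {
            this.unscaledLong = 0L;
            this.bigValue = value;
        }
    }

    BigDecimal toBigDecimal() {
        return bigValue != null ? bigValue : BigDecimal.valueOf(unscaledLong, scale);
    }
}
```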

Vector in-memory format

Spark uses an Array[Decimal] collection to hold vectors of decimals.

Reading

Spark reads bytes given to it by the parquet-mr API into its Decimal object mentioned above.

Writing

Spark writes decimal values to Parquet files in two modes: legacy and non-legacy.

  • Legacy mode writes all decimal values as fixed size byte arrays.
  • Non-legacy mode writes 32-bit integers, 64-bit integers, or fixed size byte arrays depending on the decimal precision.

In both modes, Spark writes only the minimum number of bytes needed for the declared precision of the decimal type.
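
The choice of physical type in the two modes amounts to something like the following sketch (names are illustrative; the 9 and 18 digit thresholds are the Parquet spec's limits for INT32 and INT64 decimals):

```java
// Physical type selection for decimal columns, mirroring the two modes above.
enum PhysicalType { INT32, INT64, FIXED_LEN_BYTE_ARRAY }

final class DecimalPhysicalTypeChooser {
    static PhysicalType physicalTypeFor(int precision, boolean legacyMode) {
        if (legacyMode) {
            return PhysicalType.FIXED_LEN_BYTE_ARRAY;  // legacy: always a fixed size byte array
        } else if (precision <= 9) {
            return PhysicalType.INT32;                 // unscaled value fits in 32 bits
        } else if (precision <= 18) {
            return PhysicalType.INT64;                 // unscaled value fits in 64 bits
        } else {
            return PhysicalType.FIXED_LEN_BYTE_ARRAY;
        }
    }
}
```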

What should Arrow + parquet-cpp do?

Scalar in-memory format

Arrow uses a custom, two's complement based, 16-byte Decimal128 class backed by one int64_t (the high 64 bits) and one uint64_t (the low 64 bits).
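
Arrow's class is C++; for illustration, here is the same two-word decomposition expressed in Java (names hypothetical), showing how the signed high word and unsigned low word combine into the full 128-bit value.

```java
import java.math.BigInteger;

// Illustrative two-word 128-bit integer: a signed high word and an
// unsigned low word, combined as high * 2^64 + low.
final class Int128 {
    private static final BigInteger LOW_MASK = new BigInteger("FFFFFFFFFFFFFFFF", 16);

    final long high;  // most significant 64 bits (carries the sign)
    final long low;   // least significant 64 bits, treated as unsigned

    Int128(long high, long low) {
        this.high = high;
        this.low = low;
    }

    BigInteger toBigInteger() {
        BigInteger unsignedLow = BigInteger.valueOf(low).and(LOW_MASK);
        return BigInteger.valueOf(high).shiftLeft(64).add(unsignedLow);
    }
}
```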

Vector in-memory format

Arrow converts the scalar in-memory values to their bytes in big-endian byte order. The vector format is an array whose elements are 16-byte fixed size byte arrays.
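
A hedged sketch of producing such a 16-byte element from an unscaled value: take the minimal big-endian two's complement bytes and sign-extend them on the left to the fixed width. It assumes the value fits in 16 bytes, i.e. precision at most 38; the class name is arbitrary.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Widen the minimal two's complement encoding of an unscaled value to a
// fixed 16-byte, big-endian slot by sign-extending on the left.
final class FixedSixteen {
    static byte[] toFixed16(BigInteger unscaled) {
        byte[] minimal = unscaled.toByteArray();            // minimal big-endian two's complement
        byte[] out = new byte[16];
        byte pad = (byte) (unscaled.signum() < 0 ? 0xFF : 0x00);
        Arrays.fill(out, 0, 16 - minimal.length, pad);      // sign-extend the leading bytes
        System.arraycopy(minimal, 0, out, 16 - minimal.length, minimal.length);
        return out;
    }
}
```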

Reading

Every system in this document supports writing decimal values as FIXED_LEN_BYTE_ARRAYs, and some also write decimals as 32- or 64-bit integers. Therefore, for maximum compatibility, parquet-cpp should support reading decimal values backed by INT32, INT64, and FIXED_LEN_BYTE_ARRAY.

Writing

Every system in this document supports reading Parquet files containing decimal values represented as FIXED_LEN_BYTE_ARRAYs. Therefore, for maximum compatibility, parquet-cpp should write decimals in the FIXED_LEN_BYTE_ARRAY physical format, using the smallest number of bytes possible for the declared precision.
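
For a fixed size byte array, the "smallest number of bytes possible" is determined by the column's declared precision rather than by individual values. A sketch of that computation (names are arbitrary): precision 9 needs 4 bytes, 18 needs 8, and 38 needs 16.

```java
import java.math.BigInteger;

// Smallest FIXED_LEN_BYTE_ARRAY length (in bytes) that can hold any unscaled
// value of the given decimal precision using signed two's complement.
final class DecimalWidth {
    static int minBytesForPrecision(int precision) {
        BigInteger maxUnscaled = BigInteger.TEN.pow(precision).subtract(BigInteger.ONE);
        int bytes = 1;
        // Largest signed value representable in `bytes` bytes is 2^(8 * bytes - 1) - 1.
        while (BigInteger.ONE.shiftLeft(8 * bytes - 1).subtract(BigInteger.ONE)
                .compareTo(maxUnscaled) < 0) {
            bytes++;
        }
        return bytes;
    }
}
```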
