{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parquet Format data types support across versions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Changes in Parquet format releases - only related to data types (and not encodings, compression, pages, etc):\n",
"\n",
"* Version 1.0.0 - Jul 30, 2013 ([parquet.thrift](https://github.com/apache/parquet-format/blob/parquet-format-1.0.0/src/thrift/parquet.thrift))\n",
" * ConvertedTypes that already exist at this points: UTF8, MAP, MAP_KEY_VALUE, LIST\n",
"* Version 2.0.0 - Dec 14, 2013 ([parquet.thrift](https://github.com/apache/parquet-format/blob/parquet-format-2.0.0/src/thrift/parquet.thrift), [changes](https://github.com/apache/parquet-format/compare/parquet-format-1.0.0..parquet-format-2.0.0))\n",
" * ENUM ConvertedType\n",
"* Version 2.1.0 - May 6, 2014 ([parquet.thrift](https://github.com/apache/parquet-format/blob/parquet-format-2.1.0/src/thrift/parquet.thrift), [changes](https://github.com/apache/parquet-format/compare/parquet-format-2.0.0..parquet-format-2.1.0))\n",
" * DECIMAL ConvertedType\n",
"* Version 2.2.0 - Feb 6, 2015 ([parquet.thrift](https://github.com/apache/parquet-format/blob/apache-parquet-format-2.2.0/src/thrift/parquet.thrift), [changes](https://github.com/apache/parquet-format/compare/parquet-format-2.1.0..apache-parquet-format-2.2.0))\n",
" * Additional ConvertedTypes (https://github.com/apache/parquet-format/pull/3, [PARQUET-12](https://issues.apache.org/jira/browse/PARQUET-12)) for DATE, TIME_MILLIS, TIMESTAMP_MILLIS UINT_8/UINT_16/UINT_32/UINT_64, INT_8/INT_16/INT_32/INT_64, JSON, BSON, INTERVAL\n",
"* Version 2.3.0 (identical to 2.2.0, small fixup)\n",
"* Version 2.4.0 - Oct 17, 2017 ([changes](https://github.com/apache/parquet-format/compare/apache-parquet-format-2.3.0..apache-parquet-format-2.4.0))\n",
" * Additional ConvertedTypes: TIME_MICROS, TIMESTAMP_MICROS\n",
" * Introduction of LogicalType (that for now covers the existing converted types, AFAIK) + UUIDType, NullType logical types\n",
"* Version 2.5.0 - Mar 29, 2018 ([changes](https://github.com/apache/parquet-format/compare/apache-parquet-format-2.4.0..apache-parquet-format-2.5.0))\n",
"* Version 2.6.0 - Sep 27, 2018 ([changes](https://github.com/apache/parquet-format/compare/apache-parquet-format-2.5.0..apache-parquet-format-2.6.0))\n",
" * Addition of NanoSeconds time unit for Time and Timestamp logical types\n",
"* Version 2.7.0 - Sep 25, 2019 ([changes](https://github.com/apache/parquet-format/compare/apache-parquet-format-2.6.0..apache-parquet-format-2.7.0))\n",
" * (bloom filter, encryption)\n",
"* Version 2.8.0 - Jan 13, 2020 ([changes](https://github.com/apache/parquet-format/compare/apache-parquet-format-2.7.0..apache-parquet-format-2.8.0))\n",
" * (Byte Stream Split encoding)\n",
"\n",
"\n",
"\n",
"Currently, when writing parquet files with Arrow (parquet-cpp), we default to parquet format \"1.0\".\n",
"This would in theory mean that we don't use ConvertedTypes or LogicalTypes introduced in format \"2.0\"+. However, in practice, we already write most of those annotations with the default of `version=\"1.0\"` as well. \n",
"\n",
"For example, we already write many of the Converted/LogicalTypes from 2.0+ (e.g. decimal, date, timestamp with millis/micros) by default. But we don't write converted/logical types for:\n",
"\n",
"- Integer types other than int32/int64 (eg unsigned integers). The ConvertedType annotations (UINT_8, INT_8, etc) for those were introduced in version 2.2.0 (Feb 6, 2015), the LogicalType equivalents in version 2.4.0 (Oct 17, 2017).\n",
"- Nanosecond-resolution timestamps. The LogicalType time unit for this was introduced in version 2.6.0 (Sep 27, 2018)\n",
"\n",
"When specifying `version=\"2.0\"`, then the above 2 cases also use the appropriate logical type annotations.\n",
"\n",
"All other Converted/LogicalTypes are already written, even with `version=\"1.0\"`. I assume this is because for those types there is no risk on misinterpreting the physical data type when not understanding the type annotations (in contrast, eg UINT64 is stored as INT64 physical type, and thus would lead to wrong data when interpreted as its physical type)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"Illustration with code example:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pyarrow as pa\n",
"import pyarrow.parquet as pq\n",
"import decimal"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'4.0.0.dev60+gab5fc979c'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pa.__version__"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"table = pa.table({\n",
" \"int64\": [1, 2, 3],\n",
" \"int32\": pa.array([1, 2, 3], type=\"int32\"),\n",
" \"uint32\": pa.array([1, 2, 3], type=\"uint32\"),\n",
" \"date\": pa.array([1, 2, 3], type=pa.date32()),\n",
" \"timestamp\": pa.array([1, 2, 3], type=pa.timestamp(\"ms\")),\n",
" \"timestamp_tz\": pa.array([1, 2, 3], type=pa.timestamp(\"ms\", tz=\"UTC\")),\n",
" \"timestamp_ns\": pa.array([1000, 2000, 3000], type=pa.timestamp(\"ns\")),\n",
" \"timestamp_ns_tz\": pa.array([1000, 2000, 3000], type=pa.timestamp(\"ns\", tz=\"UTC\")),\n",
" \"string\": pa.array([\"a\", \"b\", \"c\"]),\n",
" \"list\": pa.array([[1, 2], [3, 4], [5, 6]]),\n",
" \"decimal\": pa.array([decimal.Decimal(\"1.0\"), decimal.Decimal(\"2.0\"), decimal.Decimal(\"3.0\")]),\n",
"})"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"pq.write_table(table, \"test_v1.parquet\", version=\"1.0\")\n",
"pq.write_table(table, \"test_v2.parquet\", version=\"2.0\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<pyarrow._parquet.ParquetSchema object at 0x7eff2e9d4d40>\n",
"required group field_id=0 schema {\n",
" optional int64 field_id=1 int64;\n",
" optional int32 field_id=2 int32;\n",
" optional int64 field_id=3 uint32;\n",
" optional int32 field_id=4 date (Date);\n",
" optional int64 field_id=5 timestamp (Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));\n",
" optional int64 field_id=6 timestamp_tz (Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));\n",
" optional int64 field_id=7 timestamp_ns (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));\n",
" optional int64 field_id=8 timestamp_ns_tz (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));\n",
" optional binary field_id=9 string (String);\n",
" optional group field_id=10 list (List) {\n",
" repeated group field_id=11 list {\n",
" optional int64 field_id=12 item;\n",
" }\n",
" }\n",
" optional fixed_len_byte_array(1) field_id=13 decimal (Decimal(precision=2, scale=1));\n",
"}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"schema1 = pq.read_metadata(\"test_v1.parquet\").schema\n",
"schema1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The corresponding Arrow schema:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"int64: int64\n",
"int32: int32\n",
"uint32: int64\n",
"date: date32[day]\n",
"timestamp: timestamp[ms]\n",
"timestamp_tz: timestamp[ms, tz=UTC]\n",
"timestamp_ns: timestamp[us]\n",
"timestamp_ns_tz: timestamp[us, tz=UTC]\n",
"string: string\n",
"list: list<item: int64>\n",
" child 0, item: int64\n",
"decimal: decimal(2, 1)\n"
]
}
],
"source": [
"print(schema1.to_arrow_schema().to_string(show_field_metadata=False))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note:\n",
"\n",
"* Integers other than int32/int64 (eg unsigned ints) are lost\n",
"* Nanoseconds have become microseconds"
]
},
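{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check at the data level (an extra illustration, reading the full file back): since the UINT_32 annotation was not written with `version=\"1.0\"`, the uint32 column comes back as plain int64:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Without the type annotation, the column round-trips\n",
"# as its int64 physical type.\n",
"pq.read_table(\"test_v1.parquet\").column(\"uint32\").type"
]
},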
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<pyarrow._parquet.ParquetSchema object at 0x7eff2e9d2080>\n",
"required group field_id=0 schema {\n",
" optional int64 field_id=1 int64;\n",
" optional int32 field_id=2 int32;\n",
" optional int32 field_id=3 uint32 (Int(bitWidth=32, isSigned=false));\n",
" optional int32 field_id=4 date (Date);\n",
" optional int64 field_id=5 timestamp (Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));\n",
" optional int64 field_id=6 timestamp_tz (Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));\n",
" optional int64 field_id=7 timestamp_ns (Timestamp(isAdjustedToUTC=false, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false));\n",
" optional int64 field_id=8 timestamp_ns_tz (Timestamp(isAdjustedToUTC=true, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false));\n",
" optional binary field_id=9 string (String);\n",
" optional group field_id=10 list (List) {\n",
" repeated group field_id=11 list {\n",
" optional int64 field_id=12 item;\n",
" }\n",
" }\n",
" optional fixed_len_byte_array(1) field_id=13 decimal (Decimal(precision=2, scale=1));\n",
"}"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"schema2 = pq.read_metadata(\"test_v2.parquet\").schema\n",
"schema2"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"int64: int64\n",
"int32: int32\n",
"uint32: uint32\n",
"date: date32[day]\n",
"timestamp: timestamp[ms]\n",
"timestamp_tz: timestamp[ms, tz=UTC]\n",
"timestamp_ns: timestamp[ns]\n",
"timestamp_ns_tz: timestamp[ns, tz=UTC]\n",
"string: string\n",
"list: list<item: int64>\n",
" child 0, item: int64\n",
"decimal: decimal(2, 1)\n"
]
}
],
"source": [
"print(schema2.to_arrow_schema().to_string(show_field_metadata=False))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now all types are preserved."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking more closely to the columns with nanosecond timestamps:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<ParquetColumnSchema>\n",
" name: timestamp_ns\n",
" path: timestamp_ns\n",
" max_definition_level: 1\n",
" max_repetition_level: 0\n",
" physical_type: INT64\n",
" logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)\n",
" converted_type (legacy): NONE"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"schema1.column(6) # this actually also has the TIMESTAMP_MICROS set, but is incorrectly not shown, see https://issues.apache.org/jira/browse/ARROW-11399"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<ParquetColumnSchema>\n",
" name: timestamp_ns\n",
" path: timestamp_ns\n",
" max_definition_level: 1\n",
" max_repetition_level: 0\n",
" physical_type: INT64\n",
" logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false)\n",
" converted_type (legacy): NONE"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"schema2.column(6)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<ParquetColumnSchema>\n",
" name: timestamp_ns_tz\n",
" path: timestamp_ns_tz\n",
" max_definition_level: 1\n",
" max_repetition_level: 0\n",
" physical_type: INT64\n",
" logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)\n",
" converted_type (legacy): TIMESTAMP_MICROS"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"schema1.column(7)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<ParquetColumnSchema>\n",
" name: timestamp_ns_tz\n",
" path: timestamp_ns_tz\n",
" max_definition_level: 1\n",
" max_repetition_level: 0\n",
" physical_type: INT64\n",
" logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false)\n",
" converted_type (legacy): NONE"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"schema2.column(7)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the `version=\"2.0\"`, a legacy converted type is no longer set (since this is not possible, there is no ConvertedType for nanoseconds). But this also means that such files won't correctly read by older parquet readers that do not yet support the nanosecond LogicalType (introduced in Parquet format 2.6.0, Sep 2018). Such readers will read those columns as int64 instead of timestamp."
]
},
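{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible workaround (a sketch using pyarrow's `coerce_timestamps` option; the output filename is made up for this example): coerce nanosecond timestamps to microseconds at write time, so older readers that only understand the microsecond types can still interpret the columns as timestamps:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Coerce ns -> us on write; allow_truncated_timestamps avoids raising\n",
"# when sub-microsecond precision would be dropped.\n",
"pq.write_table(\n",
"    table, \"test_v2_us.parquet\", version=\"2.0\",\n",
"    coerce_timestamps=\"us\", allow_truncated_timestamps=True,\n",
")\n",
"pq.read_metadata(\"test_v2_us.parquet\").schema.column(6)"
]
},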
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (arrow-dev)",
"language": "python",
"name": "arrow-dev"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}