@jorisvandenbossche
Last active August 7, 2019 15:12
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Test Arrow ExtensionType in Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[ARROW-840](https://issues.apache.org/jira/browse/ARROW-840) added the ability to create extension types from Python. This is done by subclassing `pyarrow.ExtensionType`.\n",
"\n",
"An example:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pyarrow as pa"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'0.14.1.dev224+g0a0423a31.d20190807'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pa.__version__"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
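"class UuidType(pa.ExtensionType):\n",
"\n",
"    def __init__(self):\n",
"        # Each value is stored as 16 raw bytes\n",
"        pa.ExtensionType.__init__(self, pa.binary(16))\n",
"\n",
"    def __reduce__(self):\n",
"        # Pickle hook: rebuild the type by calling UuidType() with no arguments\n",
"        return UuidType, ()"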
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"ty = UuidType()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"UuidType(extension<arrow.py_extension_type>)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ty"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"FixedSizeBinaryType(fixed_size_binary[16])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ty.storage_type"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Every extension type defined in Python has the same fixed `extension_name`:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'arrow.py_extension_type'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ty.extension_name"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can create an array with this extension type by creating the raw \"storage\" array and using `ExtensionArray.from_storage`:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"storage = pa.array([b\"0123456789abcdef\"], type=pa.binary(16))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"a = pa.ExtensionArray.from_storage(UuidType(), storage)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<pyarrow.lib.ExtensionArray object at 0x7fb72037ea98>\n",
"[\n",
" 30313233343536373839616263646566\n",
"]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"UuidType(extension<arrow.py_extension_type>)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a.type"
]
},
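{
"cell_type": "markdown",
"metadata": {},
"source": [
"The hex digits printed for the array element are simply the 16 raw bytes of the storage value. As a minimal sketch (not part of the original example, and using only the standard-library `uuid` module), the same bytes can be read as a UUID:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import uuid\n",
"\n",
"# Same 16 bytes as the single element of the storage array above\n",
"raw = b'0123456789abcdef'\n",
"u = uuid.UUID(bytes=raw)\n",
"\n",
"# The UUID's hex form matches the digits in the array repr\n",
"assert u.hex == '30313233343536373839616263646566'\n",
"u"
]
},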
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An array with this extension type survives serialization through Arrow IPC.\n",
"\n",
"For example, here we put the array in a RecordBatch and write it to a file:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"batch = pa.RecordBatch.from_arrays([a], [\"exts\"])"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"with pa.OSFile('test_extension_type.arrow', 'wb') as sink:\n",
"    writer = pa.RecordBatchFileWriter(sink, batch.schema)\n",
"    writer.write_batch(batch)\n",
"    writer.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After reading the file back in, the column still has the `UuidType`:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"reader = pa.ipc.open_file('test_extension_type.arrow')\n",
"result = reader.read_all()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"exts: extension<arrow.py_extension_type>"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result.schema"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"UuidType(extension<arrow.py_extension_type>)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result.column(0).type"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, all of the above ran in a single session (the same Python process). If we restart the session, as if reading the file from a different Python process:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pyarrow as pa"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# See the note below: had the class been defined in a module and imported from there,\n",
"# it would be enough for that module to be importable (on the Python path).\n",
"# Redefining it here is only needed because the class was defined inside the notebook.\n",
"\n",
"class UuidType(pa.ExtensionType):\n",
"\n",
"    def __init__(self):\n",
"        pa.ExtensionType.__init__(self, pa.binary(16))\n",
"\n",
"    def __reduce__(self):\n",
"        return UuidType, ()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"reader = pa.ipc.open_file('test_extension_type.arrow')\n",
"result2 = reader.read_all()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pyarrow.Table\n",
"exts: extension<arrow.py_extension_type>"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result2"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"UuidType(extension<arrow.py_extension_type>)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result2.column(0).type"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The column still has the correct `UuidType`. **Note:** here I redefined the `UuidType` class in the notebook, but in practice it is enough that the class definition is available, for example importable from the module where it is defined and from which you imported it when creating the data."
]
},
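{
"cell_type": "markdown",
"metadata": {},
"source": [
"`__reduce__` is the hook that Python's `pickle` module calls to learn how to rebuild an object. The sketch below assumes that this is how the type is restored on read (the notebook does not spell out the mechanism) and uses a hypothetical plain-Python stand-in, `UuidTypeSketch`, rather than a real pyarrow type. It illustrates that a pickled instance only refers to its class by name, which is why the class definition has to be available when loading:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"\n",
"# Hypothetical stand-in for UuidType; not a pyarrow extension type\n",
"class UuidTypeSketch:\n",
"\n",
"    def __reduce__(self):\n",
"        # Same shape as UuidType.__reduce__: a callable plus its arguments\n",
"        return UuidTypeSketch, ()\n",
"\n",
"payload = pickle.dumps(UuidTypeSketch())\n",
"\n",
"# The payload names the class but does not contain its code\n",
"assert b'UuidTypeSketch' in payload\n",
"\n",
"restored = pickle.loads(payload)\n",
"assert isinstance(restored, UuidTypeSketch)"
]
},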
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If, however, the definition of the Python extension type is not known or available (mimicked here by restarting the session without redefining the subclass), Arrow falls back to an \"unknown\" extension type:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pyarrow as pa"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"reader = pa.ipc.open_file('test_extension_type.arrow')\n",
"result3 = reader.read_all()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pyarrow.Table\n",
"exts: extension<arrow.py_extension_type>"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result3"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"UnknownExtensionType(extension<arrow.py_extension_type>)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result3.column(0).type"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This preserves the fact that it was an extension type, but any information about the original type (a name indicating it was \"uuid\", any metadata) is lost."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (arrow-dev)",
"language": "python",
"name": "arrow-dev"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}