{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# `pyarrow.dataset` handling of duplicate partition field / data column\n",
"\n",
"Related to a discussion on the user@arrow.apache.org mailing list on having partition information both in the partition fields (file paths) as in a column in the actual data files."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pyarrow as pa\n",
"import pyarrow.parquet as pq\n",
"import pyarrow.dataset as ds\n",
"\n",
"import pathlib"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"basedir = pathlib.Path(\".\") / \"dataset_experiments\"\n",
"basedir.mkdir(exist_ok=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic example"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A basic example seems to work fine. First creating a small toy dataset with partition field `\"part\"` (with values \"a\" and \"b\"), where this is also included as a column in the parquet files:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"case = basedir / \"duplicated_column_partition_field\"\n",
"case.mkdir(exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"table1 = pa.table({\"part\": [\"a\", \"a\"], \"col1\": [1, 2]})\n",
"subdir1 = case / \"part=a\"\n",
"subdir1.mkdir(exist_ok=True)\n",
"pq.write_table(table1, subdir1 / \"data.parquet\")\n",
"\n",
"table2 = pa.table({\"part\": [\"b\", \"b\"], \"col1\": [3, 4]})\n",
"subdir2 = case / \"part=b\"\n",
"subdir2.mkdir(exist_ok=True)\n",
"pq.write_table(table2, subdir2 / \"data.parquet\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Opening this dataset with `pyarrow.dataset` works fine:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"dataset = ds.dataset(case, partitioning=\"hive\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The schema of the dataset indicates it gets the type information for the partition columns from the actual file (the parquet \"field_id\" metadata is included):"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"part: string\n",
" -- field metadata --\n",
" PARQUET:field_id: '1'\n",
"col1: int64\n",
" -- field metadata --\n",
" PARQUET:field_id: '2'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.schema"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And reading the full dataset also works as expected:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>part</th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>a</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>b</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>b</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" part col1\n",
"0 a 1\n",
"1 a 2\n",
"2 b 3\n",
"3 b 4"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.to_table().to_pandas()"
]
},
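{
"cell_type": "markdown",
"metadata": {},
"source": [
"For completeness, filtering on the duplicated `\"part\"` field could be tried here as well (a sketch only, output not shown); in this consistent case the partition field and the column values should give the same result:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch (not executed here): filter on the duplicated \"part\" field.\n",
"# With consistent values, the partition information in the file paths and\n",
"# the column stored in the parquet files should select the same subset.\n",
"dataset.to_table(filter=ds.field(\"part\") == \"a\").to_pandas()"
]
},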
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Which values are used for filtering?\n",
"\n",
"Using a variant of the above, but creating a dataset with inconsistent values in the partition fields vs the actual data columns, to experiment with which is being used to actually filter datasets:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"case = basedir / \"duplicated_column_partition_field_mismatch_values\"\n",
"case.mkdir(exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"table1 = pa.table({\"part\": [\"a\", \"b\"], \"col1\": [1, 2]})\n",
"subdir1 = case / \"part=a\"\n",
"subdir1.mkdir(exist_ok=True)\n",
"pq.write_table(table1, subdir1 / \"data.parquet\")\n",
"\n",
"table2 = pa.table({\"part\": [\"c\", \"d\"], \"col1\": [3, 4]})\n",
"subdir2 = case / \"part=b\"\n",
"subdir2.mkdir(exist_ok=True)\n",
"pq.write_table(table2, subdir2 / \"data.parquet\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"dataset = ds.dataset(case, partitioning=\"hive\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"part: string\n",
" -- field metadata --\n",
" PARQUET:field_id: '1'\n",
"col1: int64\n",
" -- field metadata --\n",
" PARQUET:field_id: '2'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.schema"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading the data gives the values from the actual parquet file:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>part</th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>c</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>d</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" part col1\n",
"0 a 1\n",
"1 b 2\n",
"2 c 3\n",
"3 d 4"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.to_table().to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Filtering on the colum/partition fields, seems to use the partition field (i.e. the file path information) and doesn't do further filtering using the actual physical parquet data:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>part</th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" part col1\n",
"0 a 1\n",
"1 b 2"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.to_table(filter=ds.field(\"part\") == \"a\").to_pandas()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>part</th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>c</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>d</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" part col1\n",
"0 c 3\n",
"1 d 4"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.to_table(filter=ds.field(\"part\") == \"b\").to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How are incompatible types handled?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What if the partition field in the file path and the actual column have contradicting type information?\n",
"\n",
"Using incompatible data types to see if this is checked when discovering the dataset (the file paths use strings (`/part=a/`) and the parquet file integers):"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"case = basedir / \"duplicated_column_partition_field_mismatch_type\"\n",
"case.mkdir(exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"table1 = pa.table({\"part\": [1, 1], \"col1\": [1, 2]})\n",
"subdir1 = case / \"part=a\"\n",
"subdir1.mkdir(exist_ok=True)\n",
"pq.write_table(table1, subdir1 / \"data.parquet\")\n",
"\n",
"table2 = pa.table({\"part\": [2, 2], \"col1\": [3, 4]})\n",
"subdir2 = case / \"part=b\"\n",
"subdir2.mkdir(exist_ok=True)\n",
"pq.write_table(table2, subdir2 / \"data.parquet\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This already fails when reading with an appropriate error message:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"ename": "ArrowInvalid",
"evalue": "Unable to merge: Field part has incompatible types: int64 vs string\nIn ../src/arrow/type.cc, line 1501, code: (_error_or_value7).status()\nIn ../src/arrow/type.cc, line 1564, code: AddField(field)\nIn ../src/arrow/type.cc, line 1629, code: builder.AddSchema(schema)\nIn ../src/arrow/dataset/discovery.cc, line 235, code: (_error_or_value14).status()",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mArrowInvalid\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-17-60d351911622>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdataset\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mds\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcase\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpartitioning\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"hive\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/dataset.py\u001b[0m in \u001b[0;36mdataset\u001b[0;34m(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)\u001b[0m\n\u001b[1;32m 669\u001b[0m \u001b[0;31m# TODO(kszucs): support InMemoryDataset for a table input\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 670\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0m_is_path_like\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 671\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0m_filesystem_dataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 672\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mtuple\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 673\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mall\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_is_path_like\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0melem\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0melem\u001b[0m \u001b[0;32min\u001b[0m \u001b[0msource\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/dataset.py\u001b[0m in \u001b[0;36m_filesystem_dataset\u001b[0;34m(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)\u001b[0m\n\u001b[1;32m 436\u001b[0m \u001b[0mfactory\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mFileSystemDatasetFactory\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpaths_or_selector\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mformat\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0moptions\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 437\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 438\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfactory\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfinish\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 439\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 440\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/_dataset.pyx\u001b[0m in \u001b[0;36mpyarrow._dataset.DatasetFactory.finish\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/error.pxi\u001b[0m in \u001b[0;36mpyarrow.lib.pyarrow_internal_check_status\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/error.pxi\u001b[0m in \u001b[0;36mpyarrow.lib.check_status\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mArrowInvalid\u001b[0m: Unable to merge: Field part has incompatible types: int64 vs string\nIn ../src/arrow/type.cc, line 1501, code: (_error_or_value7).status()\nIn ../src/arrow/type.cc, line 1564, code: AddField(field)\nIn ../src/arrow/type.cc, line 1629, code: builder.AddSchema(schema)\nIn ../src/arrow/dataset/discovery.cc, line 235, code: (_error_or_value14).status()"
]
}
],
"source": [
"dataset = ds.dataset(case, partitioning=\"hive\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But what with types that could be compatible? For example for integer partition fields, we use int32, but the actual data could use int64:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"case = basedir / \"duplicated_column_partition_field_mismatch_type_int\"\n",
"case.mkdir(exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"table1 = pa.table({\"part\": [1, 1], \"col1\": [1, 2]})\n",
"subdir1 = case / \"part=1\"\n",
"subdir1.mkdir(exist_ok=True)\n",
"pq.write_table(table1, subdir1 / \"data.parquet\")\n",
"\n",
"table2 = pa.table({\"part\": [2, 2], \"col1\": [3, 4]})\n",
"subdir2 = case / \"part=2\"\n",
"subdir2.mkdir(exist_ok=True)\n",
"pq.write_table(table2, subdir2 / \"data.parquet\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This already fails when reading with an appropriate error message:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"ename": "ArrowInvalid",
"evalue": "Unable to merge: Field part has incompatible types: int64 vs int32\nIn ../src/arrow/type.cc, line 1501, code: (_error_or_value7).status()\nIn ../src/arrow/type.cc, line 1564, code: AddField(field)\nIn ../src/arrow/type.cc, line 1629, code: builder.AddSchema(schema)\nIn ../src/arrow/dataset/discovery.cc, line 235, code: (_error_or_value14).status()",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mArrowInvalid\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-20-60d351911622>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdataset\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mds\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcase\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpartitioning\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"hive\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/dataset.py\u001b[0m in \u001b[0;36mdataset\u001b[0;34m(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)\u001b[0m\n\u001b[1;32m 669\u001b[0m \u001b[0;31m# TODO(kszucs): support InMemoryDataset for a table input\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 670\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0m_is_path_like\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 671\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0m_filesystem_dataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 672\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mtuple\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 673\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mall\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_is_path_like\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0melem\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0melem\u001b[0m \u001b[0;32min\u001b[0m \u001b[0msource\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/dataset.py\u001b[0m in \u001b[0;36m_filesystem_dataset\u001b[0;34m(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)\u001b[0m\n\u001b[1;32m 436\u001b[0m \u001b[0mfactory\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mFileSystemDatasetFactory\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpaths_or_selector\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mformat\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0moptions\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 437\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 438\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfactory\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfinish\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 439\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 440\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/_dataset.pyx\u001b[0m in \u001b[0;36mpyarrow._dataset.DatasetFactory.finish\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/error.pxi\u001b[0m in \u001b[0;36mpyarrow.lib.pyarrow_internal_check_status\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/error.pxi\u001b[0m in \u001b[0;36mpyarrow.lib.check_status\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mArrowInvalid\u001b[0m: Unable to merge: Field part has incompatible types: int64 vs int32\nIn ../src/arrow/type.cc, line 1501, code: (_error_or_value7).status()\nIn ../src/arrow/type.cc, line 1564, code: AddField(field)\nIn ../src/arrow/type.cc, line 1629, code: builder.AddSchema(schema)\nIn ../src/arrow/dataset/discovery.cc, line 235, code: (_error_or_value14).status()"
]
}
],
"source": [
"dataset = ds.dataset(case, partitioning=\"hive\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This last example is something we should be able to get working."
]
},
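{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible workaround (a sketch, not verified here) is to pass an explicit partitioning schema, so the partition field gets the same type as the `\"part\"` column stored in the parquet files:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of a possible workaround (not verified here): declare the partition\n",
"# field as int64 explicitly, matching the type of the \"part\" column in the files.\n",
"part_schema = pa.schema([(\"part\", pa.int64())])\n",
"dataset = ds.dataset(case, partitioning=ds.partitioning(part_schema, flavor=\"hive\"))\n",
"dataset.to_table().to_pandas()"
]
},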
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (arrow-dev)",
"language": "python",
"name": "arrow-dev"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}