{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# `pyarrow.dataset` handling of duplicate partition field / data column\n",
"\n",
"Related to a discussion on the user@arrow.apache.org mailing list on having partition information both in the partition fields (file paths) as in a column in the actual data files."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pyarrow as pa\n",
"import pyarrow.parquet as pq\n",
"import pyarrow.dataset as ds\n",
"\n",
"import pathlib"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"basedir = pathlib.Path(\".\") / \"dataset_experiments\"\n",
"basedir.mkdir(exist_ok=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic example"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A basic example seems to work fine. First creating a small toy dataset with partition field `\"part\"` (with values \"a\" and \"b\"), where this is also included as a column in the parquet files:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"case = basedir / \"duplicated_column_partition_field\"\n",
"case.mkdir(exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"table1 = pa.table({\"part\": [\"a\", \"a\"], \"col1\": [1, 2]})\n",
"subdir1 = case / \"part=a\"\n",
"subdir1.mkdir(exist_ok=True)\n",
"pq.write_table(table1, subdir1 / \"data.parquet\")\n",
"\n",
"table2 = pa.table({\"part\": [\"b\", \"b\"], \"col1\": [3, 4]})\n",
"subdir2 = case / \"part=b\"\n",
"subdir2.mkdir(exist_ok=True)\n",
"pq.write_table(table2, subdir2 / \"data.parquet\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Opening this dataset with `pyarrow.dataset` works fine:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"dataset = ds.dataset(case, partitioning=\"hive\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The schema of the dataset indicates it gets the type information for the partition columns from the actual file (the parquet \"field_id\" metadata is included):"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"part: string\n",
" -- field metadata --\n",
" PARQUET:field_id: '1'\n",
"col1: int64\n",
" -- field metadata --\n",
" PARQUET:field_id: '2'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.schema"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And reading the full dataset also works as expected:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>part</th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>a</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>b</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>b</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" part col1\n",
"0 a 1\n",
"1 a 2\n",
"2 b 3\n",
"3 b 4"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.to_table().to_pandas()"
]
},
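{
"cell_type": "markdown",
"metadata": {},
"source": [
"For completeness, filtering on the duplicated `\"part\"` field could be tried here as well (a sketch only, output not shown); in this consistent case the partition field and the column values should give the same result:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch (not executed here): filter on the duplicated \"part\" field.\n",
"# With consistent values, the partition information in the file paths and\n",
"# the column stored in the parquet files should select the same subset.\n",
"dataset.to_table(filter=ds.field(\"part\") == \"a\").to_pandas()"
]
},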
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Which values are used for filtering?\n",
"\n",
"Using a variant of the above, but creating a dataset with inconsistent values in the partition fields vs the actual data columns, to experiment with which is being used to actually filter datasets:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"case = basedir / \"duplicated_column_partition_field_mismatch_values\"\n",
"case.mkdir(exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"table1 = pa.table({\"part\": [\"a\", \"b\"], \"col1\": [1, 2]})\n",
"subdir1 = case / \"part=a\"\n",
"subdir1.mkdir(exist_ok=True)\n",
"pq.write_table(table1, subdir1 / \"data.parquet\")\n",
"\n",
"table2 = pa.table({\"part\": [\"c\", \"d\"], \"col1\": [3, 4]})\n",
"subdir2 = case / \"part=b\"\n",
"subdir2.mkdir(exist_ok=True)\n",
"pq.write_table(table2, subdir2 / \"data.parquet\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"dataset = ds.dataset(case, partitioning=\"hive\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"part: string\n",
" -- field metadata --\n",
" PARQUET:field_id: '1'\n",
"col1: int64\n",
" -- field metadata --\n",
" PARQUET:field_id: '2'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.schema"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading the data gives the values from the actual parquet file:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>part</th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>c</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>d</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" part col1\n",
"0 a 1\n",
"1 b 2\n",
"2 c 3\n",
"3 d 4"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.to_table().to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Filtering on the colum/partition fields, seems to use the partition field (i.e. the file path information) and doesn't do further filtering using the actual physical parquet data:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>part</th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>a</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" part col1\n",
"0 a 1\n",
"1 b 2"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.to_table(filter=ds.field(\"part\") == \"a\").to_pandas()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>part</th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>c</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>d</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" part col1\n",
"0 c 3\n",
"1 d 4"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.to_table(filter=ds.field(\"part\") == \"b\").to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How are incompatible types handled?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What if the partition field in the file path and the actual column have contradicting type information?\n",
"\n",
"Using incompatible data types to see if this is checked when discovering the dataset (the file paths use strings (`/part=a/`) and the parquet file integers):"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"case = basedir / \"duplicated_column_partition_field_mismatch_type\"\n",
"case.mkdir(exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"table1 = pa.table({\"part\": [1, 1], \"col1\": [1, 2]})\n",
"subdir1 = case / \"part=a\"\n",
"subdir1.mkdir(exist_ok=True)\n",
"pq.write_table(table1, subdir1 / \"data.parquet\")\n",
"\n",
"table2 = pa.table({\"part\": [2, 2], \"col1\": [3, 4]})\n",
"subdir2 = case / \"part=b\"\n",
"subdir2.mkdir(exist_ok=True)\n",
"pq.write_table(table2, subdir2 / \"data.parquet\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This already fails when reading with an appropriate error message:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"ename": "ArrowInvalid",
"evalue": "Unable to merge: Field part has incompatible types: int64 vs string\nIn ../src/arrow/type.cc, line 1501, code: (_error_or_value7).status()\nIn ../src/arrow/type.cc, line 1564, code: AddField(field)\nIn ../src/arrow/type.cc, line 1629, code: builder.AddSchema(schema)\nIn ../src/arrow/dataset/discovery.cc, line 235, code: (_error_or_value14).status()",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mArrowInvalid\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-17-60d351911622>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdataset\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mds\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcase\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpartitioning\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"hive\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/dataset.py\u001b[0m in \u001b[0;36mdataset\u001b[0;34m(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)\u001b[0m\n\u001b[1;32m 669\u001b[0m \u001b[0;31m# TODO(kszucs): support InMemoryDataset for a table input\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 670\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0m_is_path_like\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 671\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0m_filesystem_dataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 672\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mtuple\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 673\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mall\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_is_path_like\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0melem\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0melem\u001b[0m \u001b[0;32min\u001b[0m \u001b[0msource\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/dataset.py\u001b[0m in \u001b[0;36m_filesystem_dataset\u001b[0;34m(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)\u001b[0m\n\u001b[1;32m 436\u001b[0m \u001b[0mfactory\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mFileSystemDatasetFactory\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpaths_or_selector\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mformat\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0moptions\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 437\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 438\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfactory\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfinish\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 439\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 440\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/_dataset.pyx\u001b[0m in \u001b[0;36mpyarrow._dataset.DatasetFactory.finish\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/error.pxi\u001b[0m in \u001b[0;36mpyarrow.lib.pyarrow_internal_check_status\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/error.pxi\u001b[0m in \u001b[0;36mpyarrow.lib.check_status\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mArrowInvalid\u001b[0m: Unable to merge: Field part has incompatible types: int64 vs string\nIn ../src/arrow/type.cc, line 1501, code: (_error_or_value7).status()\nIn ../src/arrow/type.cc, line 1564, code: AddField(field)\nIn ../src/arrow/type.cc, line 1629, code: builder.AddSchema(schema)\nIn ../src/arrow/dataset/discovery.cc, line 235, code: (_error_or_value14).status()"
]
}
],
"source": [
"dataset = ds.dataset(case, partitioning=\"hive\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But what with types that could be compatible? For example for integer partition fields, we use int32, but the actual data could use int64:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"case = basedir / \"duplicated_column_partition_field_mismatch_type_int\"\n",
"case.mkdir(exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"table1 = pa.table({\"part\": [1, 1], \"col1\": [1, 2]})\n",
"subdir1 = case / \"part=1\"\n",
"subdir1.mkdir(exist_ok=True)\n",
"pq.write_table(table1, subdir1 / \"data.parquet\")\n",
"\n",
"table2 = pa.table({\"part\": [2, 2], \"col1\": [3, 4]})\n",
"subdir2 = case / \"part=2\"\n",
"subdir2.mkdir(exist_ok=True)\n",
"pq.write_table(table2, subdir2 / \"data.parquet\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This already fails when reading with an appropriate error message:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"ename": "ArrowInvalid",
"evalue": "Unable to merge: Field part has incompatible types: int64 vs int32\nIn ../src/arrow/type.cc, line 1501, code: (_error_or_value7).status()\nIn ../src/arrow/type.cc, line 1564, code: AddField(field)\nIn ../src/arrow/type.cc, line 1629, code: builder.AddSchema(schema)\nIn ../src/arrow/dataset/discovery.cc, line 235, code: (_error_or_value14).status()",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mArrowInvalid\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-20-60d351911622>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdataset\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mds\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcase\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpartitioning\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"hive\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/dataset.py\u001b[0m in \u001b[0;36mdataset\u001b[0;34m(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)\u001b[0m\n\u001b[1;32m 669\u001b[0m \u001b[0;31m# TODO(kszucs): support InMemoryDataset for a table input\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 670\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0m_is_path_like\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 671\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0m_filesystem_dataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 672\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mtuple\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 673\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mall\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_is_path_like\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0melem\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0melem\u001b[0m \u001b[0;32min\u001b[0m \u001b[0msource\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/dataset.py\u001b[0m in \u001b[0;36m_filesystem_dataset\u001b[0;34m(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)\u001b[0m\n\u001b[1;32m 436\u001b[0m \u001b[0mfactory\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mFileSystemDatasetFactory\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpaths_or_selector\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mformat\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0moptions\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 437\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 438\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfactory\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfinish\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 439\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 440\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/_dataset.pyx\u001b[0m in \u001b[0;36mpyarrow._dataset.DatasetFactory.finish\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/error.pxi\u001b[0m in \u001b[0;36mpyarrow.lib.pyarrow_internal_check_status\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m~/scipy/repos/arrow/python/pyarrow/error.pxi\u001b[0m in \u001b[0;36mpyarrow.lib.check_status\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mArrowInvalid\u001b[0m: Unable to merge: Field part has incompatible types: int64 vs int32\nIn ../src/arrow/type.cc, line 1501, code: (_error_or_value7).status()\nIn ../src/arrow/type.cc, line 1564, code: AddField(field)\nIn ../src/arrow/type.cc, line 1629, code: builder.AddSchema(schema)\nIn ../src/arrow/dataset/discovery.cc, line 235, code: (_error_or_value14).status()"
]
}
],
"source": [
"dataset = ds.dataset(case, partitioning=\"hive\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This last example is something we should be able to get working."
]
},
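{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible workaround (a sketch, not verified here) is to pass an explicit partitioning schema, so the partition field gets the same type as the `\"part\"` column stored in the parquet files:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of a possible workaround (not verified here): declare the partition\n",
"# field as int64 explicitly, matching the type of the \"part\" column in the files.\n",
"part_schema = pa.schema([(\"part\", pa.int64())])\n",
"dataset = ds.dataset(case, partitioning=ds.partitioning(part_schema, flavor=\"hive\"))\n",
"dataset.to_table().to_pandas()"
]
},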
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (arrow-dev)",
"language": "python",
"name": "arrow-dev"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}