Skip to content

Instantly share code, notes, and snippets.

@lmangani
Created May 26, 2023 17:52
Show Gist options
  • Save lmangani/638f8296845a8770cc688d9136548e8b to your computer and use it in GitHub Desktop.
Save lmangani/638f8296845a8770cc688d9136548e8b to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# chDB Colab\n",
"\n",
"Compare DuckDB, chDB and Pandas performance dealing with .Parquet files"
],
"metadata": {
"id": "AxstrLjxEEL2"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "-kqtWblkoujV"
},
"source": [
"## Setup\n",
"First we download some files and install the required libraries."
]
},
{
"cell_type": "code",
"metadata": {
"id": "39OPDvzADNMz",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "f8df1f10-13d4-4d5e-ba00-bc8eb83672e1"
},
"source": [
"!pip install pyarrow pandas\n",
"!pip install chdb --pre --upgrade\n",
"!pip install chdb --pre --upgrade\n",
"\n",
"!mkdir -p taxi\n",
"!wget https://github.com/cwida/duckdb-data/releases/download/v1.0/taxi_2019_04.parquet -O taxi/201904.parquet\n",
"!wget https://github.com/cwida/duckdb-data/releases/download/v1.0/taxi_2019_05.parquet -O taxi/201905.parquet\n",
"!wget https://github.com/cwida/duckdb-data/releases/download/v1.0/taxi_2019_06.parquet -O taxi/201906.parquet"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
"Requirement already satisfied: pyarrow in /usr/local/lib/python3.10/dist-packages (9.0.0)\n",
"Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (1.5.3)\n",
"Requirement already satisfied: numpy>=1.16.6 in /usr/local/lib/python3.10/dist-packages (from pyarrow) (1.22.4)\n",
"Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2.8.2)\n",
"Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2022.7.1)\n",
"Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)\n",
"Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
"Collecting chdb\n",
" Downloading chdb-0.8.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (109.0 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m109.0/109.0 MB\u001b[0m \u001b[31m3.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hInstalling collected packages: chdb\n",
"Successfully installed chdb-0.8.0\n",
"--2023-05-26 17:18:34-- https://github.com/cwida/duckdb-data/releases/download/v1.0/taxi_2019_04.parquet\n",
"Resolving github.com (github.com)... 20.27.177.113\n",
"Connecting to github.com (github.com)|20.27.177.113|:443... connected.\n",
"HTTP request sent, awaiting response... 301 Moved Permanently\n",
"Location: https://github.com/duckdb/duckdb-data/releases/download/v1.0/taxi_2019_04.parquet [following]\n",
"--2023-05-26 17:18:35-- https://github.com/duckdb/duckdb-data/releases/download/v1.0/taxi_2019_04.parquet\n",
"Reusing existing connection to github.com:443.\n",
"HTTP request sent, awaiting response... 302 Found\n",
"Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/263853960/d481ee00-d5d4-11eb-85f7-c0e266116772?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230526%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230526T171835Z&X-Amz-Expires=300&X-Amz-Signature=f8f907c66b4a5ed7c68e0d461f6b9871c165635bae26ffa0c9495041595ae14f&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=263853960&response-content-disposition=attachment%3B%20filename%3Dtaxi_2019_04.parquet&response-content-type=application%2Foctet-stream [following]\n",
"--2023-05-26 17:18:35-- https://objects.githubusercontent.com/github-production-release-asset-2e65be/263853960/d481ee00-d5d4-11eb-85f7-c0e266116772?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230526%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230526T171835Z&X-Amz-Expires=300&X-Amz-Signature=f8f907c66b4a5ed7c68e0d461f6b9871c165635bae26ffa0c9495041595ae14f&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=263853960&response-content-disposition=attachment%3B%20filename%3Dtaxi_2019_04.parquet&response-content-type=application%2Foctet-stream\n",
"Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...\n",
"Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.109.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 127056503 (121M) [application/octet-stream]\n",
"Saving to: ‘taxi/201904.parquet’\n",
"\n",
"taxi/201904.parquet 100%[===================>] 121.17M 38.4MB/s in 3.2s \n",
"\n",
"2023-05-26 17:18:39 (38.4 MB/s) - ‘taxi/201904.parquet’ saved [127056503/127056503]\n",
"\n",
"--2023-05-26 17:18:39-- https://github.com/cwida/duckdb-data/releases/download/v1.0/taxi_2019_05.parquet\n",
"Resolving github.com (github.com)... 20.27.177.113\n",
"Connecting to github.com (github.com)|20.27.177.113|:443... connected.\n",
"HTTP request sent, awaiting response... 301 Moved Permanently\n",
"Location: https://github.com/duckdb/duckdb-data/releases/download/v1.0/taxi_2019_05.parquet [following]\n",
"--2023-05-26 17:18:39-- https://github.com/duckdb/duckdb-data/releases/download/v1.0/taxi_2019_05.parquet\n",
"Reusing existing connection to github.com:443.\n",
"HTTP request sent, awaiting response... 302 Found\n",
"Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/263853960/f2525180-d5da-11eb-83dc-697493d2d923?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230526%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230526T171840Z&X-Amz-Expires=300&X-Amz-Signature=beba2fad556dfef8f99a2a880396e43cfd220c52c53346efae8d44350bf4876a&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=263853960&response-content-disposition=attachment%3B%20filename%3Dtaxi_2019_05.parquet&response-content-type=application%2Foctet-stream [following]\n",
"--2023-05-26 17:18:40-- https://objects.githubusercontent.com/github-production-release-asset-2e65be/263853960/f2525180-d5da-11eb-83dc-697493d2d923?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230526%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230526T171840Z&X-Amz-Expires=300&X-Amz-Signature=beba2fad556dfef8f99a2a880396e43cfd220c52c53346efae8d44350bf4876a&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=263853960&response-content-disposition=attachment%3B%20filename%3Dtaxi_2019_05.parquet&response-content-type=application%2Foctet-stream\n",
"Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...\n",
"Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.110.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 130095527 (124M) [application/octet-stream]\n",
"Saving to: ‘taxi/201905.parquet’\n",
"\n",
"taxi/201905.parquet 100%[===================>] 124.07M 35.8MB/s in 3.5s \n",
"\n",
"2023-05-26 17:18:44 (35.8 MB/s) - ‘taxi/201905.parquet’ saved [130095527/130095527]\n",
"\n",
"--2023-05-26 17:18:44-- https://github.com/cwida/duckdb-data/releases/download/v1.0/taxi_2019_06.parquet\n",
"Resolving github.com (github.com)... 20.27.177.113\n",
"Connecting to github.com (github.com)|20.27.177.113|:443... connected.\n",
"HTTP request sent, awaiting response... 301 Moved Permanently\n",
"Location: https://github.com/duckdb/duckdb-data/releases/download/v1.0/taxi_2019_06.parquet [following]\n",
"--2023-05-26 17:18:44-- https://github.com/duckdb/duckdb-data/releases/download/v1.0/taxi_2019_06.parquet\n",
"Reusing existing connection to github.com:443.\n",
"HTTP request sent, awaiting response... 302 Found\n",
"Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/263853960/288fd100-d5db-11eb-887a-36389f711f61?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230526%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230526T171844Z&X-Amz-Expires=300&X-Amz-Signature=42a62f517dc5ced56c78a571843a7d1af2187c84d91625f46560ebc49c64c02d&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=263853960&response-content-disposition=attachment%3B%20filename%3Dtaxi_2019_06.parquet&response-content-type=application%2Foctet-stream [following]\n",
"--2023-05-26 17:18:44-- https://objects.githubusercontent.com/github-production-release-asset-2e65be/263853960/288fd100-d5db-11eb-887a-36389f711f61?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230526%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230526T171844Z&X-Amz-Expires=300&X-Amz-Signature=42a62f517dc5ced56c78a571843a7d1af2187c84d91625f46560ebc49c64c02d&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=263853960&response-content-disposition=attachment%3B%20filename%3Dtaxi_2019_06.parquet&response-content-type=application%2Foctet-stream\n",
"Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...\n",
"Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.109.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 120790979 (115M) [application/octet-stream]\n",
"Saving to: ‘taxi/201906.parquet’\n",
"\n",
"taxi/201906.parquet 100%[===================>] 115.19M 32.9MB/s in 3.5s \n",
"\n",
"2023-05-26 17:18:49 (32.9 MB/s) - ‘taxi/201906.parquet’ saved [120790979/120790979]\n",
"\n"
]
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "imYHieGpD7_e"
},
"source": [
"import pyarrow.parquet as pq\n",
"import pandas\n",
"import glob\n",
"import chdb\n",
"import duckdb\n",
"\n",
"# no setup for chDB\n",
"chdb.query(\"SELECT version()\")\n",
"\n",
"# some setup for DuckDB \n",
"con = duckdb.connect()"
],
"execution_count": 16,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "sGtaQW0L11LT"
},
"source": [
"## Reading Multiple Parquet Files\n",
"\n",
"chDB can read multiple parquet files using the glob syntax.\n",
"\n",
"In Pandas, we need to load the files separately and concatenate them together into a single DataFrame."
]
},
{
"cell_type": "code",
"metadata": {
"id": "1_Kl55nvhfPy",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 391
},
"outputId": "b899a86d-37e7-4b3f-b761-bbf23ed28774"
},
"source": [
"%%time\n",
"chdb.query('SELECT * FROM file(\"taxi/*.parquet\", Parquet) LIMIT 5', 'Dataframe');\n"
],
"execution_count": 17,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"CPU times: user 45.5 ms, sys: 14.2 ms, total: 59.6 ms\n",
"Wall time: 65.7 ms\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
" vendor_id pickup_at dropoff_at \\\n",
"0 b'1' 2019-06-01 00:55:13+00:00 2019-06-01 00:56:17+00:00 \n",
"1 b'1' 2019-06-01 00:06:31+00:00 2019-06-01 00:06:52+00:00 \n",
"2 b'1' 2019-06-01 00:17:05+00:00 2019-06-01 00:36:38+00:00 \n",
"3 b'1' 2019-06-01 00:59:02+00:00 2019-06-01 00:59:12+00:00 \n",
"4 b'1' 2019-06-01 00:03:25+00:00 2019-06-01 00:15:42+00:00 \n",
"\n",
" passenger_count trip_distance rate_code_id store_and_fwd_flag \\\n",
"0 1 0.0 b'1' b'N' \n",
"1 1 0.0 b'1' b'N' \n",
"2 1 4.4 b'1' b'N' \n",
"3 0 0.8 b'1' b'N' \n",
"4 1 1.7 b'1' b'N' \n",
"\n",
" pickup_location_id dropoff_location_id payment_type fare_amount extra \\\n",
"0 145 145 b'2' 3.0 0.5 \n",
"1 262 263 b'2' 2.5 3.0 \n",
"2 74 7 b'2' 17.5 0.5 \n",
"3 145 145 b'2' 2.5 1.0 \n",
"4 113 148 b'1' 9.5 3.0 \n",
"\n",
" mta_tax tip_amount tolls_amount improvement_surcharge total_amount \\\n",
"0 0.5 0.00 0.0 0.3 4.300000 \n",
"1 0.5 0.00 0.0 0.3 6.300000 \n",
"2 0.5 0.00 0.0 0.3 18.799999 \n",
"3 0.5 0.00 0.0 0.3 4.300000 \n",
"4 0.5 2.65 0.0 0.3 15.950000 \n",
"\n",
" congestion_surcharge \n",
"0 0.0 \n",
"1 2.5 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 2.5 "
],
"text/html": [
"\n",
" <div id=\"df-5c8191f0-bcfa-4ca9-aab3-96bdb26a9f5f\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>vendor_id</th>\n",
" <th>pickup_at</th>\n",
" <th>dropoff_at</th>\n",
" <th>passenger_count</th>\n",
" <th>trip_distance</th>\n",
" <th>rate_code_id</th>\n",
" <th>store_and_fwd_flag</th>\n",
" <th>pickup_location_id</th>\n",
" <th>dropoff_location_id</th>\n",
" <th>payment_type</th>\n",
" <th>fare_amount</th>\n",
" <th>extra</th>\n",
" <th>mta_tax</th>\n",
" <th>tip_amount</th>\n",
" <th>tolls_amount</th>\n",
" <th>improvement_surcharge</th>\n",
" <th>total_amount</th>\n",
" <th>congestion_surcharge</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>b'1'</td>\n",
" <td>2019-06-01 00:55:13+00:00</td>\n",
" <td>2019-06-01 00:56:17+00:00</td>\n",
" <td>1</td>\n",
" <td>0.0</td>\n",
" <td>b'1'</td>\n",
" <td>b'N'</td>\n",
" <td>145</td>\n",
" <td>145</td>\n",
" <td>b'2'</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>4.300000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b'1'</td>\n",
" <td>2019-06-01 00:06:31+00:00</td>\n",
" <td>2019-06-01 00:06:52+00:00</td>\n",
" <td>1</td>\n",
" <td>0.0</td>\n",
" <td>b'1'</td>\n",
" <td>b'N'</td>\n",
" <td>262</td>\n",
" <td>263</td>\n",
" <td>b'2'</td>\n",
" <td>2.5</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>6.300000</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>b'1'</td>\n",
" <td>2019-06-01 00:17:05+00:00</td>\n",
" <td>2019-06-01 00:36:38+00:00</td>\n",
" <td>1</td>\n",
" <td>4.4</td>\n",
" <td>b'1'</td>\n",
" <td>b'N'</td>\n",
" <td>74</td>\n",
" <td>7</td>\n",
" <td>b'2'</td>\n",
" <td>17.5</td>\n",
" <td>0.5</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>18.799999</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>b'1'</td>\n",
" <td>2019-06-01 00:59:02+00:00</td>\n",
" <td>2019-06-01 00:59:12+00:00</td>\n",
" <td>0</td>\n",
" <td>0.8</td>\n",
" <td>b'1'</td>\n",
" <td>b'N'</td>\n",
" <td>145</td>\n",
" <td>145</td>\n",
" <td>b'2'</td>\n",
" <td>2.5</td>\n",
" <td>1.0</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>4.300000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>b'1'</td>\n",
" <td>2019-06-01 00:03:25+00:00</td>\n",
" <td>2019-06-01 00:15:42+00:00</td>\n",
" <td>1</td>\n",
" <td>1.7</td>\n",
" <td>b'1'</td>\n",
" <td>b'N'</td>\n",
" <td>113</td>\n",
" <td>148</td>\n",
" <td>b'1'</td>\n",
" <td>9.5</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>2.65</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>15.950000</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5c8191f0-bcfa-4ca9-aab3-96bdb26a9f5f')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-5c8191f0-bcfa-4ca9-aab3-96bdb26a9f5f button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-5c8191f0-bcfa-4ca9-aab3-96bdb26a9f5f');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
]
},
"metadata": {},
"execution_count": 17
}
]
},
{
"cell_type": "code",
"source": [
"%%time\n",
"con.execute(\"SELECT * FROM 'taxi/*.parquet' LIMIT 5\").df()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 478
},
"id": "u6glJZKDKiXN",
"outputId": "5079aaa4-731f-4a36-af50-99cfe663a1c0"
},
"execution_count": 18,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"CPU times: user 37.2 ms, sys: 6.72 ms, total: 43.9 ms\n",
"Wall time: 63.5 ms\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
" vendor_id pickup_at dropoff_at passenger_count \\\n",
"0 1 2019-06-01 00:55:13 2019-06-01 00:56:17 1 \n",
"1 1 2019-06-01 00:06:31 2019-06-01 00:06:52 1 \n",
"2 1 2019-06-01 00:17:05 2019-06-01 00:36:38 1 \n",
"3 1 2019-06-01 00:59:02 2019-06-01 00:59:12 0 \n",
"4 1 2019-06-01 00:03:25 2019-06-01 00:15:42 1 \n",
"\n",
" trip_distance rate_code_id store_and_fwd_flag pickup_location_id \\\n",
"0 0.0 1 N 145 \n",
"1 0.0 1 N 262 \n",
"2 4.4 1 N 74 \n",
"3 0.8 1 N 145 \n",
"4 1.7 1 N 113 \n",
"\n",
" dropoff_location_id payment_type fare_amount extra mta_tax tip_amount \\\n",
"0 145 2 3.0 0.5 0.5 0.00 \n",
"1 263 2 2.5 3.0 0.5 0.00 \n",
"2 7 2 17.5 0.5 0.5 0.00 \n",
"3 145 2 2.5 1.0 0.5 0.00 \n",
"4 148 1 9.5 3.0 0.5 2.65 \n",
"\n",
" tolls_amount improvement_surcharge total_amount congestion_surcharge \n",
"0 0.0 0.3 4.300000 0.0 \n",
"1 0.0 0.3 6.300000 2.5 \n",
"2 0.0 0.3 18.799999 0.0 \n",
"3 0.0 0.3 4.300000 0.0 \n",
"4 0.0 0.3 15.950000 2.5 "
],
"text/html": [
"\n",
" <div id=\"df-0bcfc294-8ccd-442d-b76c-b90b7a584b12\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>vendor_id</th>\n",
" <th>pickup_at</th>\n",
" <th>dropoff_at</th>\n",
" <th>passenger_count</th>\n",
" <th>trip_distance</th>\n",
" <th>rate_code_id</th>\n",
" <th>store_and_fwd_flag</th>\n",
" <th>pickup_location_id</th>\n",
" <th>dropoff_location_id</th>\n",
" <th>payment_type</th>\n",
" <th>fare_amount</th>\n",
" <th>extra</th>\n",
" <th>mta_tax</th>\n",
" <th>tip_amount</th>\n",
" <th>tolls_amount</th>\n",
" <th>improvement_surcharge</th>\n",
" <th>total_amount</th>\n",
" <th>congestion_surcharge</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2019-06-01 00:55:13</td>\n",
" <td>2019-06-01 00:56:17</td>\n",
" <td>1</td>\n",
" <td>0.0</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>145</td>\n",
" <td>145</td>\n",
" <td>2</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>4.300000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>2019-06-01 00:06:31</td>\n",
" <td>2019-06-01 00:06:52</td>\n",
" <td>1</td>\n",
" <td>0.0</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>262</td>\n",
" <td>263</td>\n",
" <td>2</td>\n",
" <td>2.5</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>6.300000</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>2019-06-01 00:17:05</td>\n",
" <td>2019-06-01 00:36:38</td>\n",
" <td>1</td>\n",
" <td>4.4</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>74</td>\n",
" <td>7</td>\n",
" <td>2</td>\n",
" <td>17.5</td>\n",
" <td>0.5</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>18.799999</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>2019-06-01 00:59:02</td>\n",
" <td>2019-06-01 00:59:12</td>\n",
" <td>0</td>\n",
" <td>0.8</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>145</td>\n",
" <td>145</td>\n",
" <td>2</td>\n",
" <td>2.5</td>\n",
" <td>1.0</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>4.300000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>2019-06-01 00:03:25</td>\n",
" <td>2019-06-01 00:15:42</td>\n",
" <td>1</td>\n",
" <td>1.7</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>113</td>\n",
" <td>148</td>\n",
" <td>1</td>\n",
" <td>9.5</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>2.65</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>15.950000</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0bcfc294-8ccd-442d-b76c-b90b7a584b12')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-0bcfc294-8ccd-442d-b76c-b90b7a584b12 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-0bcfc294-8ccd-442d-b76c-b90b7a584b12');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
]
},
"metadata": {},
"execution_count": 18
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "iudIWgW-e29W",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "b658f4ea-f143-488f-f437-c3eb5ff875b9"
},
"source": [
"%%time\n",
"df = pandas.concat(\n",
"\t[pandas.read_parquet(file)\n",
"\t for file\n",
"\t in glob.glob('taxi/*.parquet')])\n",
"print(df.head(5))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" vendor_id pickup_at dropoff_at passenger_count \\\n",
"0 1 2019-06-01 00:55:13 2019-06-01 00:56:17 1 \n",
"1 1 2019-06-01 00:06:31 2019-06-01 00:06:52 1 \n",
"2 1 2019-06-01 00:17:05 2019-06-01 00:36:38 1 \n",
"3 1 2019-06-01 00:59:02 2019-06-01 00:59:12 0 \n",
"4 1 2019-06-01 00:03:25 2019-06-01 00:15:42 1 \n",
"\n",
" trip_distance rate_code_id store_and_fwd_flag pickup_location_id \\\n",
"0 0.0 1 N 145 \n",
"1 0.0 1 N 262 \n",
"2 4.4 1 N 74 \n",
"3 0.8 1 N 145 \n",
"4 1.7 1 N 113 \n",
"\n",
" dropoff_location_id payment_type fare_amount extra mta_tax tip_amount \\\n",
"0 145 2 3.0 0.5 0.5 0.00 \n",
"1 263 2 2.5 3.0 0.5 0.00 \n",
"2 7 2 17.5 0.5 0.5 0.00 \n",
"3 145 2 2.5 1.0 0.5 0.00 \n",
"4 148 1 9.5 3.0 0.5 2.65 \n",
"\n",
" tolls_amount improvement_surcharge total_amount congestion_surcharge \n",
"0 0.0 0.3 4.300000 0.0 \n",
"1 0.0 0.3 6.300000 2.5 \n",
"2 0.0 0.3 18.799999 0.0 \n",
"3 0.0 0.3 4.300000 0.0 \n",
"4 0.0 0.3 15.950000 2.5 \n",
"CPU times: user 10.4 s, sys: 5.19 s, total: 15.6 s\n",
"Wall time: 12 s\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "08pmu4IN2TPA"
},
"source": [
"## Concatenate the three files into a single large file \n",
"\n",
"As Pandas does not have native support for reading multiple files, we perform the remaining experiments on a single large file.\n",
"\n",
"We use the pyarrow library to concatenate the three files into a single file."
]
},
{
"cell_type": "code",
"metadata": {
"id": "CISvtKWrfgJP"
},
"source": [
"# concatenate all three parquet files into a single file\n",
"pq.write_table(pq.ParquetDataset('taxi/').read(), 'alltaxi.parquet', row_group_size=100000)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "_ER2RRs24Ikx"
},
"source": [
"### Querying the Single File\n",
"\n",
"Now let's query the single file and check the achieved performance."
]
},
{
"cell_type": "code",
"metadata": {
"id": "vujLPgUqHztl",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 391
},
"outputId": "c0c6f074-0fae-45ac-c6fc-ad00cb850fd9"
},
"source": [
"%%time\n",
"# chDB\n",
"chdb.query('SELECT * FROM file(\"alltaxi.parquet\", Parquet) LIMIT 5', 'Dataframe');\n"
],
"execution_count": 19,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"CPU times: user 66.7 ms, sys: 28.1 ms, total: 94.7 ms\n",
"Wall time: 117 ms\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
" vendor_id pickup_at dropoff_at \\\n",
"0 b'1' 2019-04-01 00:04:09+00:00 2019-04-01 00:06:35+00:00 \n",
"1 b'1' 2019-04-01 00:22:45+00:00 2019-04-01 00:25:43+00:00 \n",
"2 b'1' 2019-04-01 00:39:48+00:00 2019-04-01 01:19:39+00:00 \n",
"3 b'1' 2019-04-01 00:35:32+00:00 2019-04-01 00:37:11+00:00 \n",
"4 b'1' 2019-04-01 00:44:05+00:00 2019-04-01 00:57:58+00:00 \n",
"\n",
" passenger_count trip_distance rate_code_id store_and_fwd_flag \\\n",
"0 1 0.5 b'1' b'N' \n",
"1 1 0.7 b'1' b'N' \n",
"2 1 10.9 b'1' b'N' \n",
"3 1 0.2 b'1' b'N' \n",
"4 1 4.8 b'1' b'N' \n",
"\n",
" pickup_location_id dropoff_location_id payment_type fare_amount extra \\\n",
"0 239 239 b'1' 4.0 3.0 \n",
"1 230 100 b'2' 4.5 3.0 \n",
"2 68 127 b'1' 36.0 3.0 \n",
"3 68 68 b'2' 3.5 3.0 \n",
"4 50 42 b'1' 15.5 3.0 \n",
"\n",
" mta_tax tip_amount tolls_amount improvement_surcharge total_amount \\\n",
"0 0.5 1.00 0.0 0.3 8.80 \n",
"1 0.5 0.00 0.0 0.3 8.30 \n",
"2 0.5 7.95 0.0 0.3 47.75 \n",
"3 0.5 0.00 0.0 0.3 7.30 \n",
"4 0.5 3.85 0.0 0.3 23.15 \n",
"\n",
" congestion_surcharge \n",
"0 2.5 \n",
"1 2.5 \n",
"2 2.5 \n",
"3 2.5 \n",
"4 2.5 "
],
"text/html": [
"\n",
" <div id=\"df-6813280a-6daa-40e6-9869-68e4e1e45324\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>vendor_id</th>\n",
" <th>pickup_at</th>\n",
" <th>dropoff_at</th>\n",
" <th>passenger_count</th>\n",
" <th>trip_distance</th>\n",
" <th>rate_code_id</th>\n",
" <th>store_and_fwd_flag</th>\n",
" <th>pickup_location_id</th>\n",
" <th>dropoff_location_id</th>\n",
" <th>payment_type</th>\n",
" <th>fare_amount</th>\n",
" <th>extra</th>\n",
" <th>mta_tax</th>\n",
" <th>tip_amount</th>\n",
" <th>tolls_amount</th>\n",
" <th>improvement_surcharge</th>\n",
" <th>total_amount</th>\n",
" <th>congestion_surcharge</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>b'1'</td>\n",
" <td>2019-04-01 00:04:09+00:00</td>\n",
" <td>2019-04-01 00:06:35+00:00</td>\n",
" <td>1</td>\n",
" <td>0.5</td>\n",
" <td>b'1'</td>\n",
" <td>b'N'</td>\n",
" <td>239</td>\n",
" <td>239</td>\n",
" <td>b'1'</td>\n",
" <td>4.0</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>1.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>8.80</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>b'1'</td>\n",
" <td>2019-04-01 00:22:45+00:00</td>\n",
" <td>2019-04-01 00:25:43+00:00</td>\n",
" <td>1</td>\n",
" <td>0.7</td>\n",
" <td>b'1'</td>\n",
" <td>b'N'</td>\n",
" <td>230</td>\n",
" <td>100</td>\n",
" <td>b'2'</td>\n",
" <td>4.5</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>8.30</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>b'1'</td>\n",
" <td>2019-04-01 00:39:48+00:00</td>\n",
" <td>2019-04-01 01:19:39+00:00</td>\n",
" <td>1</td>\n",
" <td>10.9</td>\n",
" <td>b'1'</td>\n",
" <td>b'N'</td>\n",
" <td>68</td>\n",
" <td>127</td>\n",
" <td>b'1'</td>\n",
" <td>36.0</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>7.95</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>47.75</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>b'1'</td>\n",
" <td>2019-04-01 00:35:32+00:00</td>\n",
" <td>2019-04-01 00:37:11+00:00</td>\n",
" <td>1</td>\n",
" <td>0.2</td>\n",
" <td>b'1'</td>\n",
" <td>b'N'</td>\n",
" <td>68</td>\n",
" <td>68</td>\n",
" <td>b'2'</td>\n",
" <td>3.5</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>7.30</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>b'1'</td>\n",
" <td>2019-04-01 00:44:05+00:00</td>\n",
" <td>2019-04-01 00:57:58+00:00</td>\n",
" <td>1</td>\n",
" <td>4.8</td>\n",
" <td>b'1'</td>\n",
" <td>b'N'</td>\n",
" <td>50</td>\n",
" <td>42</td>\n",
" <td>b'1'</td>\n",
" <td>15.5</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>3.85</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>23.15</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-6813280a-6daa-40e6-9869-68e4e1e45324')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-6813280a-6daa-40e6-9869-68e4e1e45324 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-6813280a-6daa-40e6-9869-68e4e1e45324');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
]
},
"metadata": {},
"execution_count": 19
}
]
},
{
"cell_type": "code",
"source": [
"%%time\n",
"# duckDB\n",
"con.execute(\"SELECT * FROM 'alltaxi.parquet' LIMIT 5\").df()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 478
},
"id": "xTIvWbQgKoLN",
"outputId": "03dbb9eb-65c1-4044-99a7-7ef513cbd8de"
},
"execution_count": 20,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"CPU times: user 27 ms, sys: 15.2 ms, total: 42.2 ms\n",
"Wall time: 52.2 ms\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
" vendor_id pickup_at dropoff_at passenger_count \\\n",
"0 1 2019-04-01 00:04:09 2019-04-01 00:06:35 1 \n",
"1 1 2019-04-01 00:22:45 2019-04-01 00:25:43 1 \n",
"2 1 2019-04-01 00:39:48 2019-04-01 01:19:39 1 \n",
"3 1 2019-04-01 00:35:32 2019-04-01 00:37:11 1 \n",
"4 1 2019-04-01 00:44:05 2019-04-01 00:57:58 1 \n",
"\n",
" trip_distance rate_code_id store_and_fwd_flag pickup_location_id \\\n",
"0 0.5 1 N 239 \n",
"1 0.7 1 N 230 \n",
"2 10.9 1 N 68 \n",
"3 0.2 1 N 68 \n",
"4 4.8 1 N 50 \n",
"\n",
" dropoff_location_id payment_type fare_amount extra mta_tax tip_amount \\\n",
"0 239 1 4.0 3.0 0.5 1.00 \n",
"1 100 2 4.5 3.0 0.5 0.00 \n",
"2 127 1 36.0 3.0 0.5 7.95 \n",
"3 68 2 3.5 3.0 0.5 0.00 \n",
"4 42 1 15.5 3.0 0.5 3.85 \n",
"\n",
" tolls_amount improvement_surcharge total_amount congestion_surcharge \n",
"0 0.0 0.3 8.80 2.5 \n",
"1 0.0 0.3 8.30 2.5 \n",
"2 0.0 0.3 47.75 2.5 \n",
"3 0.0 0.3 7.30 2.5 \n",
"4 0.0 0.3 23.15 2.5 "
],
"text/html": [
"\n",
" <div id=\"df-3829ba77-526f-416d-bb76-807553dfde5e\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>vendor_id</th>\n",
" <th>pickup_at</th>\n",
" <th>dropoff_at</th>\n",
" <th>passenger_count</th>\n",
" <th>trip_distance</th>\n",
" <th>rate_code_id</th>\n",
" <th>store_and_fwd_flag</th>\n",
" <th>pickup_location_id</th>\n",
" <th>dropoff_location_id</th>\n",
" <th>payment_type</th>\n",
" <th>fare_amount</th>\n",
" <th>extra</th>\n",
" <th>mta_tax</th>\n",
" <th>tip_amount</th>\n",
" <th>tolls_amount</th>\n",
" <th>improvement_surcharge</th>\n",
" <th>total_amount</th>\n",
" <th>congestion_surcharge</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2019-04-01 00:04:09</td>\n",
" <td>2019-04-01 00:06:35</td>\n",
" <td>1</td>\n",
" <td>0.5</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>239</td>\n",
" <td>239</td>\n",
" <td>1</td>\n",
" <td>4.0</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>1.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>8.80</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>2019-04-01 00:22:45</td>\n",
" <td>2019-04-01 00:25:43</td>\n",
" <td>1</td>\n",
" <td>0.7</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>230</td>\n",
" <td>100</td>\n",
" <td>2</td>\n",
" <td>4.5</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>8.30</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>2019-04-01 00:39:48</td>\n",
" <td>2019-04-01 01:19:39</td>\n",
" <td>1</td>\n",
" <td>10.9</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>68</td>\n",
" <td>127</td>\n",
" <td>1</td>\n",
" <td>36.0</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>7.95</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>47.75</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>2019-04-01 00:35:32</td>\n",
" <td>2019-04-01 00:37:11</td>\n",
" <td>1</td>\n",
" <td>0.2</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>68</td>\n",
" <td>68</td>\n",
" <td>2</td>\n",
" <td>3.5</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>7.30</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>2019-04-01 00:44:05</td>\n",
" <td>2019-04-01 00:57:58</td>\n",
" <td>1</td>\n",
" <td>4.8</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>50</td>\n",
" <td>42</td>\n",
" <td>1</td>\n",
" <td>15.5</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>3.85</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>23.15</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-3829ba77-526f-416d-bb76-807553dfde5e')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-3829ba77-526f-416d-bb76-807553dfde5e button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-3829ba77-526f-416d-bb76-807553dfde5e');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
]
},
"metadata": {},
"execution_count": 20
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "6plLAHdN39y5",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 478
},
"outputId": "5a3efc88-b60d-4b00-f32c-ac29adc160cd"
},
"source": [
"%%time\n",
"pandas.read_parquet('alltaxi.parquet').head(5)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"CPU times: user 11.3 s, sys: 4.08 s, total: 15.4 s\n",
"Wall time: 9.81 s\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
" vendor_id pickup_at dropoff_at passenger_count \\\n",
"0 1 2019-04-01 00:04:09 2019-04-01 00:06:35 1 \n",
"1 1 2019-04-01 00:22:45 2019-04-01 00:25:43 1 \n",
"2 1 2019-04-01 00:39:48 2019-04-01 01:19:39 1 \n",
"3 1 2019-04-01 00:35:32 2019-04-01 00:37:11 1 \n",
"4 1 2019-04-01 00:44:05 2019-04-01 00:57:58 1 \n",
"\n",
" trip_distance rate_code_id store_and_fwd_flag pickup_location_id \\\n",
"0 0.5 1 N 239 \n",
"1 0.7 1 N 230 \n",
"2 10.9 1 N 68 \n",
"3 0.2 1 N 68 \n",
"4 4.8 1 N 50 \n",
"\n",
" dropoff_location_id payment_type fare_amount extra mta_tax tip_amount \\\n",
"0 239 1 4.0 3.0 0.5 1.00 \n",
"1 100 2 4.5 3.0 0.5 0.00 \n",
"2 127 1 36.0 3.0 0.5 7.95 \n",
"3 68 2 3.5 3.0 0.5 0.00 \n",
"4 42 1 15.5 3.0 0.5 3.85 \n",
"\n",
" tolls_amount improvement_surcharge total_amount congestion_surcharge \n",
"0 0.0 0.3 8.80 2.5 \n",
"1 0.0 0.3 8.30 2.5 \n",
"2 0.0 0.3 47.75 2.5 \n",
"3 0.0 0.3 7.30 2.5 \n",
"4 0.0 0.3 23.15 2.5 "
],
"text/html": [
"\n",
" <div id=\"df-ddfa6d56-7c5f-4ba5-914b-0c4dcba7f4b1\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>vendor_id</th>\n",
" <th>pickup_at</th>\n",
" <th>dropoff_at</th>\n",
" <th>passenger_count</th>\n",
" <th>trip_distance</th>\n",
" <th>rate_code_id</th>\n",
" <th>store_and_fwd_flag</th>\n",
" <th>pickup_location_id</th>\n",
" <th>dropoff_location_id</th>\n",
" <th>payment_type</th>\n",
" <th>fare_amount</th>\n",
" <th>extra</th>\n",
" <th>mta_tax</th>\n",
" <th>tip_amount</th>\n",
" <th>tolls_amount</th>\n",
" <th>improvement_surcharge</th>\n",
" <th>total_amount</th>\n",
" <th>congestion_surcharge</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2019-04-01 00:04:09</td>\n",
" <td>2019-04-01 00:06:35</td>\n",
" <td>1</td>\n",
" <td>0.5</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>239</td>\n",
" <td>239</td>\n",
" <td>1</td>\n",
" <td>4.0</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>1.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>8.80</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>2019-04-01 00:22:45</td>\n",
" <td>2019-04-01 00:25:43</td>\n",
" <td>1</td>\n",
" <td>0.7</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>230</td>\n",
" <td>100</td>\n",
" <td>2</td>\n",
" <td>4.5</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>8.30</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>2019-04-01 00:39:48</td>\n",
" <td>2019-04-01 01:19:39</td>\n",
" <td>1</td>\n",
" <td>10.9</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>68</td>\n",
" <td>127</td>\n",
" <td>1</td>\n",
" <td>36.0</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>7.95</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>47.75</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>2019-04-01 00:35:32</td>\n",
" <td>2019-04-01 00:37:11</td>\n",
" <td>1</td>\n",
" <td>0.2</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>68</td>\n",
" <td>68</td>\n",
" <td>2</td>\n",
" <td>3.5</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>7.30</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>2019-04-01 00:44:05</td>\n",
" <td>2019-04-01 00:57:58</td>\n",
" <td>1</td>\n",
" <td>4.8</td>\n",
" <td>1</td>\n",
" <td>N</td>\n",
" <td>50</td>\n",
" <td>42</td>\n",
" <td>1</td>\n",
" <td>15.5</td>\n",
" <td>3.0</td>\n",
" <td>0.5</td>\n",
" <td>3.85</td>\n",
" <td>0.0</td>\n",
" <td>0.3</td>\n",
" <td>23.15</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-ddfa6d56-7c5f-4ba5-914b-0c4dcba7f4b1')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-ddfa6d56-7c5f-4ba5-914b-0c4dcba7f4b1 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-ddfa6d56-7c5f-4ba5-914b-0c4dcba7f4b1');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
]
},
"metadata": {},
"execution_count": 8
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SQZyP6Er6B-8"
},
"source": [
"## Counting the Rows\n",
"\n",
"Now suppose we want to figure out how many rows are in our data set. We can do that using the following code snippets.\n",
"\n",
"Note that by default Pandas will read the entire Parquet file into memory again. We can manually optimize the query by specifying that only a single column should be loaded.\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "7usVA6OSlsOE",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 115
},
"outputId": "2fa36eb6-7c29-442f-d9af-1707436d00a2"
},
"source": [
"%%time\n",
"# chDB\n",
"chdb.query('SELECT count(*) FROM file(\"alltaxi.parquet\", Parquet) LIMIT 5', 'Dataframe');\n"
],
"execution_count": 21,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"CPU times: user 177 ms, sys: 234 ms, total: 411 ms\n",
"Wall time: 977 ms\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
" count()\n",
"0 21939424"
],
"text/html": [
"\n",
" <div id=\"df-cd6400db-2793-406a-b083-785c577cb52c\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count()</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>21939424</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-cd6400db-2793-406a-b083-785c577cb52c')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-cd6400db-2793-406a-b083-785c577cb52c button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-cd6400db-2793-406a-b083-785c577cb52c');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
]
},
"metadata": {},
"execution_count": 21
}
]
},
{
"cell_type": "code",
"source": [
"%%time\n",
"# DuckDB\n",
"print(con.execute(\"SELECT COUNT(*) FROM 'alltaxi.parquet'\").df())"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "7VeRcMmlKwfk",
"outputId": "5bc34063-4b56-4ef7-8a6e-4605d0406cd5"
},
"execution_count": 22,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" count_star()\n",
"0 21939424\n",
"CPU times: user 31.6 ms, sys: 1.78 ms, total: 33.4 ms\n",
"Wall time: 28.7 ms\n"
]
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "t4-v5dL5H921",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "e41d38ab-2784-4818-f8ff-2a6870669465"
},
"source": [
"%%time\n",
"# Pandas (naive)\n",
"print(len(pandas.read_parquet('alltaxi.parquet')))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"21939424\n",
"CPU times: user 8.36 s, sys: 3.95 s, total: 12.3 s\n",
"Wall time: 7.56 s\n"
]
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "JwYp_WhvIkW0",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "7cf2f932-dcbf-443d-87c9-8858d1b2f2d7"
},
"source": [
"%%time\n",
"# Pandas (projection pushdown)\n",
"print(len(pandas.read_parquet('alltaxi.parquet', columns=['vendor_id'])))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"21939424\n",
"CPU times: user 1.4 s, sys: 543 ms, total: 1.95 s\n",
"Wall time: 1.63 s\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "frXCfU-u8p9R"
},
"source": [
"# Filtering Rows\n",
"It is common to use some sort of filtering predicate to only look at the interesting parts of a data set. For example, imagine we want to know how many taxi rides occur after the 30th of June 2019. We can do that using the following queries in both Pandas and chDB.\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "TVQpHs4DJHAQ",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 115
},
"outputId": "8154fb68-69b2-491d-c0b1-9612ef26119e"
},
"source": [
"%%time\n",
"# chDB\n",
"chdb.query(\"SELECT count(*) FROM file('alltaxi.parquet', Parquet) WHERE pickup_at > '2019-06-30'\", 'Dataframe');"
],
"execution_count": 24,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"CPU times: user 446 ms, sys: 172 ms, total: 618 ms\n",
"Wall time: 1.41 s\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
" count()\n",
"0 167022"
],
"text/html": [
"\n",
" <div id=\"df-d391d58d-2917-48ef-b296-739b8f947b2c\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count()</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>167022</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d391d58d-2917-48ef-b296-739b8f947b2c')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-d391d58d-2917-48ef-b296-739b8f947b2c button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-d391d58d-2917-48ef-b296-739b8f947b2c');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
]
},
"metadata": {},
"execution_count": 24
}
]
},
{
"cell_type": "code",
"source": [
"%%time\n",
"# DuckDB\n",
"con.execute(\"SELECT COUNT(*) FROM 'alltaxi.parquet' WHERE pickup_at > '2019-06-30'\").df()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 115
},
"id": "k9w3oHAMK0Hb",
"outputId": "b93516ea-5344-4e6a-a6b0-4d0f9791fb45"
},
"execution_count": 23,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"CPU times: user 100 ms, sys: 4.85 ms, total: 105 ms\n",
"Wall time: 72.7 ms\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
" count_star()\n",
"0 167022"
],
"text/html": [
"\n",
" <div id=\"df-05607803-262e-4b52-a294-f1983d04c1b7\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count_star()</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>167022</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-05607803-262e-4b52-a294-f1983d04c1b7')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-05607803-262e-4b52-a294-f1983d04c1b7 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-05607803-262e-4b52-a294-f1983d04c1b7');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
]
},
"metadata": {},
"execution_count": 23
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "nH-nuAfCJTRr",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "0631ab50-99cf-4547-f40d-840abdc26b76"
},
"source": [
"%%time\n",
"# Pandas (naive)\n",
"len(pandas.read_parquet('alltaxi.parquet')\n",
" .query(\"pickup_at > '2019-06-30'\"))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"CPU times: user 8.34 s, sys: 3.66 s, total: 12 s\n",
"Wall time: 8.19 s\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"167022"
]
},
"metadata": {},
"execution_count": 13
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "hUdGqKBTJutf",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "4e872513-249d-4f9d-8da9-3b52e04ad846"
},
"source": [
"%%time\n",
"# Pandas (projection pushdown)\n",
"len(pandas.read_parquet('alltaxi.parquet', columns=['pickup_at'])\n",
" .query(\"pickup_at > '2019-06-30'\"))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"CPU times: user 624 ms, sys: 350 ms, total: 974 ms\n",
"Wall time: 940 ms\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"167022"
]
},
"metadata": {
"tags": []
},
"execution_count": 27
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "RQFlOTKCKGro"
},
"source": [
"%%time\n",
"# Pandas (projection + filter pushdown)\n",
"len(pandas.read_parquet('alltaxi.parquet', columns=['pickup_at'], filters=[('pickup_at', '>', '2019-06-30')]))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "pDVawGeiCCdR"
},
"source": [
"df = pandas.read_parquet('alltaxi.parquet')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "td7w1Pi-CGR8"
},
"source": [
"%%time\n",
"# Pandas native\n",
"print(len(df[['pickup_at']].query(\"pickup_at > '2019-06-30'\")))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "7vfhR9FHDYrp"
},
"source": [
"## Aggregates\n",
"\n",
"Now suppose we want to figure out how many rows are in our data set. We can do that using the following code snippets.\n",
"\n",
"Note that by default Pandas will read the entire Parquet file into memory again. We can manually optimize the query by specifying that only a single column should be loaded.\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "bPlEXfuIKbf2"
},
"source": [
"%%time\n",
"# chDB (SQL)\n",
"chdb.query(\"SELECT passenger_count, count(*) FROM file('alltaxi.parquet', Parquet) GROUP BY passenger_count\", 'Dataframe');"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"%%time\n",
"# DuckDB (SQL)\n",
"con.execute(\"SELECT passenger_count, COUNT(*) FROM 'alltaxi.parquet' GROUP BY passenger_count\").df()"
],
"metadata": {
"id": "BoEePwlkK32s"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "xchbsEwoL5ll"
},
"source": [
"%%time\n",
"# Pandas (naive)\n",
"pandas.read_parquet('alltaxi.parquet').groupby('passenger_count').agg({'passenger_count' : 'count'})"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "K8WEZBEYMEja"
},
"source": [
"%%time\n",
"# Pandas (projection pushdown)\n",
"pandas.read_parquet('alltaxi.parquet', columns=['passenger_count']).groupby('passenger_count').agg({'passenger_count' : 'count'})"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "cTdx54ecw7x9"
},
"source": [
"%%time\n",
"# Pandas (native)\n",
"df[['passenger_count']].groupby('passenger_count').agg({'passenger_count' : 'count'})"
],
"execution_count": null,
"outputs": []
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment