@lsloan
Last active July 19, 2023 02:46
umich-udp-bigquery-to-pandas-dataframe.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyP8PIE5O2Mkb+bYHkgl3yA9",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/lsloan/b6e481c2d045ee6d1b768b04f8fa9a34/umich-udp-bigquery-to-pandas-dataframe.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# UMich UDP BigQuery to Pandas DataFrame\n",
"\n",
"These examples, based on code snippets from Google Colab, show how to load data directly from UMich's UDP instance in Google BigQuery into Pandas DataFrames."
],
"metadata": {
"id": "GTCG0CG7q6eI"
}
},
{
"cell_type": "code",
"source": [
"# Grant this Colab runtime access to BigQuery with your Google account\n",
"from google.colab import auth\n",
"auth.authenticate_user()"
],
"metadata": {
"id": "5JFB6Lz6PVKT"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"cloudProjectId = 'umich-udp-exploration'\n",
"bigqueryTableId = 'lecture_capture_exploration.640658_export'\n",
"sql = f'SELECT * FROM `{bigqueryTableId}`'"
],
"metadata": {
"id": "bl-phOllco5J"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Pandas API\n",
"\n",
"Using the Google BigQuery (`gbq`) module from Pandas requires very little code: just one line in this example. It lacks some of the Cloud API's flashier features, but it gets the job done. It may not allow defining default column types, but the `project_id` can apparently be set once, via `pandas_gbq.context.project`, instead of being passed with every call. Fortunately, it supports the `tqdm` progress bar via the `progress_bar_type='tqdm_notebook'` argument.\n",
"\n",
"See: [`read_gbq()` documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_gbq.html)"
],
"metadata": {
"id": "FitC1Ot2pr_n"
}
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_gbq(sql, project_id=cloudProjectId, progress_bar_type='tqdm_notebook')\n",
"display(df.head())"
],
"metadata": {
"id": "YPvTe3c7chnd"
},
"execution_count": null,
"outputs": []
},
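{
"cell_type": "markdown",
"source": [
"### Setting `project_id` once\n",
"\n",
"A sketch of the environment-level setting mentioned above: `pandas_gbq` exposes a module-level `context` object, and setting its `project` attribute makes it the default project for later `read_gbq()` calls. (This assumes the `pandas-gbq` package is available, as it is in Colab.)"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import pandas_gbq\n",
"\n",
"# Set the default project once; later read_gbq() calls may omit project_id\n",
"pandas_gbq.context.project = cloudProjectId\n",
"\n",
"df1 = pd.read_gbq(sql, progress_bar_type='tqdm_notebook')\n",
"display(df1.head())"
],
"metadata": {},
"execution_count": null,
"outputs": []
},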
{
"cell_type": "markdown",
"source": [
"## Cloud API\n",
"\n",
"Google's Cloud API module for Python offers a few advantages, though it requires a little more code. For example…\n",
"* The project ID may be given once, when constructing the client object; subsequent queries then need not be told which project to use.\n",
"* `to_dataframe()` lets Python types be specified for the results, e.g., per-column dtypes via its `dtypes` argument.\n",
"* `progress_bar_type='tqdm_notebook'` shows a progress bar while the query runs and the data loads, which is helpful with large datasets.\n",
"\n",
"See…\n",
"* [Download query results to DataFrame](https://cloud.google.com/bigquery/docs/samples/bigquery-query-results-dataframe) BigQuery code sample\n",
"* [`to_dataframe()` documentation](https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.QueryJob#google_cloud_bigquery_job_QueryJob_to_dataframe)"
],
"metadata": {
"id": "GkXTFPMHoAx6"
}
},
{
"cell_type": "code",
"source": [
"from google.cloud import bigquery\n",
"\n",
"client = bigquery.Client(project=cloudProjectId)\n",
"df0 = client.query(sql).to_dataframe(progress_bar_type='tqdm_notebook')\n",
"display(df0.head())"
],
"metadata": {
"id": "TwdbSbHyPVKU"
},
"execution_count": null,
"outputs": []
}
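,
{
"cell_type": "markdown",
"source": [
"### Specifying column dtypes\n",
"\n",
"A sketch of the per-column type control: `to_dataframe()` accepts a `dtypes` argument mapping column names to pandas dtypes. The column name `duration` below is hypothetical; replace it with a real column from the table. A nullable dtype such as `Int64` avoids errors when the column contains NULLs."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"client = bigquery.Client(project=cloudProjectId)\n",
"\n",
"# 'duration' is a hypothetical column name; use one from the actual table\n",
"df2 = client.query(sql).to_dataframe(\n",
"    dtypes={'duration': 'Int64'},\n",
"    progress_bar_type='tqdm_notebook')\n",
"display(df2.dtypes)"
],
"metadata": {},
"execution_count": null,
"outputs": []
}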
]
}