@lsloan
Last active July 19, 2023 02:46
umich-udp-bigquery-to-pandas-dataframe.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyP8PIE5O2Mkb+bYHkgl3yA9",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/lsloan/b6e481c2d045ee6d1b768b04f8fa9a34/umich-udp-bigquery-to-pandas-dataframe.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# UMich UDP BigQuery to Pandas DataFrame\n",
"\n",
"These examples, based on code snippets from Google Colab, show how to load data directly from UMich's UDP instance in Google BigQuery into Pandas DataFrames."
],
"metadata": {
"id": "GTCG0CG7q6eI"
}
},
{
"cell_type": "code",
"source": [
"# Grant this Colab runtime access to BigQuery with your Google account\n",
"from google.colab import auth\n",
"auth.authenticate_user()"
],
"metadata": {
"id": "5JFB6Lz6PVKT"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"cloudProjectId = 'umich-udp-exploration'\n",
"bigqueryTableId = 'lecture_capture_exploration.640658_export'\n",
"sql = f'SELECT * FROM `{bigqueryTableId}`'"
],
"metadata": {
"id": "bl-phOllco5J"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Pandas API\n",
"\n",
"Using the Google BigQuery (`gbq`) module from Pandas requires very little code: just one line in this example. It lacks some of the Cloud API's flashier features, but it gets the job done. It may not allow defining default column types, but the `project_id` can apparently be set once, via `pandas_gbq.context.project`, instead of being passed with every call. Fortunately, it supports the `tqdm` progress bar via the `progress_bar_type='tqdm_notebook'` argument.\n",
"\n",
"See: [`read_gbq()` documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_gbq.html)"
],
"metadata": {
"id": "FitC1Ot2pr_n"
}
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_gbq(sql, project_id=cloudProjectId, progress_bar_type='tqdm_notebook')\n",
"display(df.head())"
],
"metadata": {
"id": "YPvTe3c7chnd"
},
"execution_count": null,
"outputs": []
},
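{
"cell_type": "markdown",
"source": [
"### Setting `project_id` once\n",
"\n",
"A sketch of the environment-level setting mentioned above: `pandas_gbq` exposes a module-level `context` object, and setting its `project` attribute makes it the default project for later `read_gbq()` calls. (This assumes the `pandas-gbq` package is available, as it is in Colab.)"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import pandas_gbq\n",
"\n",
"# Set the default project once; later read_gbq() calls may omit project_id\n",
"pandas_gbq.context.project = cloudProjectId\n",
"\n",
"df1 = pd.read_gbq(sql, progress_bar_type='tqdm_notebook')\n",
"display(df1.head())"
],
"metadata": {},
"execution_count": null,
"outputs": []
},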
{
"cell_type": "markdown",
"source": [
"## Cloud API\n",
"\n",
"Google's Cloud API module for Python offers a few advantages, though it requires a little more code. For example…\n",
"* The project ID may be given once, when constructing the client object; subsequent queries then need not be told which project to use.\n",
"* `to_dataframe()` lets Python types be specified for the results, e.g., per-column dtypes via its `dtypes` argument.\n",
"* `progress_bar_type='tqdm_notebook'` shows a progress bar while the query runs and the data loads, which is helpful with large datasets.\n",
"\n",
"See…\n",
"* [Download query results to DataFrame](https://cloud.google.com/bigquery/docs/samples/bigquery-query-results-dataframe) BigQuery code sample\n",
"* [`to_dataframe()` documentation](https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.QueryJob#google_cloud_bigquery_job_QueryJob_to_dataframe)"
],
"metadata": {
"id": "GkXTFPMHoAx6"
}
},
{
"cell_type": "code",
"source": [
"from google.cloud import bigquery\n",
"\n",
"client = bigquery.Client(project=cloudProjectId)\n",
"df0 = client.query(sql).to_dataframe(progress_bar_type='tqdm_notebook')\n",
"display(df0.head())"
],
"metadata": {
"id": "TwdbSbHyPVKU"
},
"execution_count": null,
"outputs": []
}
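,
{
"cell_type": "markdown",
"source": [
"### Specifying column dtypes\n",
"\n",
"A sketch of the per-column type control: `to_dataframe()` accepts a `dtypes` argument mapping column names to pandas dtypes. The column name `duration` below is hypothetical; replace it with a real column from the table. A nullable dtype such as `Int64` avoids errors when the column contains NULLs."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"client = bigquery.Client(project=cloudProjectId)\n",
"\n",
"# 'duration' is a hypothetical column name; use one from the actual table\n",
"df2 = client.query(sql).to_dataframe(\n",
"    dtypes={'duration': 'Int64'},\n",
"    progress_bar_type='tqdm_notebook')\n",
"display(df2.dtypes)"
],
"metadata": {},
"execution_count": null,
"outputs": []
}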
]
}