@johngrinalds
Created March 1, 2023 18:28

How to Use BigQuery's Object Tables

Subtitle: Use SQL to Analyze Unstructured Data Like Images, Video, and Audio!

Introduction

Traditionally, database SQL queries have only applied to structured, relational data. Unstructured data like photos or audio would require manual inspection or separate ML pipelines.

However, with the recent release of Google Cloud Object Tables for BigQuery, we can now run SQL queries over unstructured object data in Google Cloud Storage! While object tables simply reflect the metadata of a GCS bucket's content, once in place, they can be extended to all sorts of innovative use cases like:

  1. NLP Sentiment Analysis
  2. Video Embeddings
  3. Image Object Detection and Classification
  4. Transcription and Translation

This how-to guide will illustrate a step-by-step example of creating and extending a BigQuery object table.

Important Note: This feature is pre-GA (as of March 2023), so you will need to allowlist your project by emailing: bq-unstructured-help@google.com. Additionally, you will need to select a cloud region that specifically supports this feature; I found that us-east1 worked well for me.

Names and References

Names and IDs I'm using throughout this how-to are:

  • Project ID: bq-object-tables
  • Object Bucket Name: bq-object-tables-sports-bucket
  • Cloud Function Name: cloud_ai_vision
  • Cloud Function Entrypoint: cloud_ai_vision
  • BigQuery Dataset: bq_ot_dataset
  • BigQuery Table: bq_object_tables_external_table
  • BigQuery Remote Function Connection: biglake-connection

Step 0: Upload Images to GCS

For this tutorial, I will be working with unstructured image data. If you already have the images you want to analyze in a Google Cloud Storage bucket, you're all set. Otherwise, you can use some example sports images from this Kaggle dataset.

Step 1: Enable Needed APIs

Open Cloud Shell and set your project with gcloud config set project [PROJECT_ID] if needed.

Enable the necessary APIs:

gcloud services enable cloudfunctions.googleapis.com --async
gcloud services enable cloudbuild.googleapis.com --async
gcloud services enable bigqueryconnection.googleapis.com --async
gcloud services enable vision.googleapis.com --async

Step 2: Create BigLake Connection

Create the BigQuery BigLake remote connection. This will allow you to connect BigQuery to Google Cloud Storage:

bq mk --connection --location=us-east1 --project_id=bq-object-tables \
--connection_type=CLOUD_RESOURCE biglake-connection

Show the details of the newly created connection using this command:

bq show --connection bq-object-tables.us-east1.biglake-connection

In those details, take note of the serviceAccountId under properties. It should look something like this:

{"serviceAccountId": "bqcx-012345678901-abcd@gcp-sa-bigquery-condel.iam.gserviceaccount.com"}

Now, grant GCS access to the SA shown in the previous command's output:

gsutil iam ch serviceAccount:bqcx-012345678901-abcd@gcp-sa-bigquery-condel.iam.gserviceaccount.com:objectViewer gs://bq-object-tables-sports-bucket

Step 3: Create Sample BigQuery Dataset and Table

Create a dataset in BigQuery to hold your object table:

bq mk -d --data_location=us-east1 bq_ot_dataset

Finally, create the external object table. IMPORTANT: This query will fail if the project is not allowlisted for Object Tables (see introduction above):

CREATE OR REPLACE EXTERNAL TABLE
  `bq-object-tables.bq_ot_dataset.bq_object_tables_external_table`
WITH CONNECTION `bq-object-tables.us-east1.biglake-connection` OPTIONS ( 
    object_metadata="DIRECTORY",
    uris = ['gs://bq-object-tables-sports-bucket/*'],
    max_staleness=INTERVAL 30 MINUTE,
    metadata_cache_mode="AUTOMATIC");

Congrats! You've now created a metadata table reflecting the objects in your GCS bucket. Feel free to run a SELECT statement to check it out.
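For example, a quick query over the object metadata (the column names here reflect the object table schema at the time of writing, so your output may differ slightly):

SELECT uri, size, content_type, updated
FROM `bq-object-tables.bq_ot_dataset.bq_object_tables_external_table`
LIMIT 10;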

At this point, though, it's honestly not super exciting. To really leverage the power of object tables, we need to couple them with some ML inference. To learn how, read on.

Step 4: Write Function Code

We will use BigQuery's Remote Functions feature to generate annotations for our GCS objects. The following steps mirror those in my post on How to Create Remote Functions in BigQuery.

To start, create a folder in your Cloud Shell called function/.

In that folder, using your editor of choice, save the following Cloud Function code as main.py:

#
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from google.protobuf.json_format import MessageToJson
from google.cloud import vision
import json

def cloud_ai_vision(request):
    try:
        client = vision.ImageAnnotatorClient()
        request_json = request.get_json()

        calls = request_json['calls']
        return_value = []
        for call in calls:
            # Each call is one input row; call[0] is the image's signed URL
            image = vision.Image(
                source=vision.ImageSource(image_uri=call[0]))
            response = client.object_localization(image=image)
            # Round-trip through proto-plus serialization so the raw
            # protobuf can be rendered as JSON
            serialized_proto_plus = vision.AnnotateImageResponse.serialize(response)
            response = vision.AnnotateImageResponse.deserialize(serialized_proto_plus)
            response_json = MessageToJson(response._pb)
            return_value.append(str(response_json))
        # BigQuery expects one reply per call, wrapped in a "replies" array
        return json.dumps({"replies": return_value})
    except Exception as e:
        return str(e)

This program will identify and annotate objects in your images using the Google Cloud Vision API.
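The request/response contract BigQuery uses for remote functions can be sketched locally. The payload below is a hand-built example (the bucket and file names are hypothetical), not real Vision API traffic; it just mirrors the shape the function above receives and returns:

```python
import json

# A minimal stand-in for the JSON body BigQuery POSTs to the function:
# each element of "calls" is one row's argument list.
request_body = {
    "calls": [
        ["gs://example-bucket/image1.jpg"],  # hypothetical URIs
        ["gs://example-bucket/image2.jpg"],
    ]
}

def handle(request_json):
    # Echo back one reply string per call, mirroring the function's shape
    # (the real function returns Vision API annotations instead).
    replies = [f"annotations for {call[0]}" for call in request_json["calls"]]
    return json.dumps({"replies": replies})

result = json.loads(handle(request_body))
print(len(result["replies"]))  # prints 2: one reply per input row
```

The key invariant is that the "replies" array must be the same length and order as "calls"; BigQuery zips the two back together row by row.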

Also in that folder, add a requirements.txt file with the following contents:

wheel
functions-framework==3.2.0
protobuf
google-cloud-vision
google-cloud-logging

Step 5: Deploy Code to Cloud Function

Deploy the Cloud Function with this command:

gcloud functions deploy cloud_ai_vision \
--runtime python310 \
--trigger-http \
--ingress-settings all \
--region us-east1 \
--source ./function/

If asked "Allow unauthenticated invocations of new function [remote_hash]? (y/N)?", select N.

Note that --ingress-settings needs to be all; otherwise, the remote function call from BigQuery will be blocked.

Step 6: Add Role to Service Account

Add the invoker role to the SA displayed above under serviceAccountId:

gcloud projects add-iam-policy-binding bq-object-tables --member="serviceAccount:bqcx-012345678901-abcd@gcp-sa-bigquery-condel.iam.gserviceaccount.com" \
--role='roles/cloudfunctions.invoker'

Step 7: Create BigQuery Remote Function

Find and take note of the Cloud Functions endpoint URL using:

gcloud functions describe cloud_ai_vision

Run this command in BQ to create the remote function:

CREATE FUNCTION `bq-object-tables`.bq_ot_dataset.localize_objects_uri (uri STRING) RETURNS STRING
REMOTE WITH CONNECTION `bq-object-tables.us-east1.biglake-connection`
OPTIONS (endpoint = 'https://us-east1-bq-object-tables.cloudfunctions.net/cloud_ai_vision');

Step 8: Generate Annotations

Finally, run this query to use the remote function to generate annotations for the images (for me, it took about 2 minutes to process 500 images):

CREATE OR REPLACE TABLE
  `bq-object-tables.bq_ot_dataset.bq_object_tables_results` AS
SELECT
  uri,
  signed_url,
  `bq-object-tables`.bq_ot_dataset.localize_objects_uri(signed_url) AS objects,
FROM
  EXTERNAL_OBJECT_TRANSFORM(TABLE `bq-object-tables.bq_ot_dataset.bq_object_tables_external_table`,
    ['SIGNED_URL']);

Note that the EXTERNAL_OBJECT_TRANSFORM function generates signed URLs from the GCS object URIs; these signed URLs are passed to the remote function for inference.
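The objects column holds each Vision API response as a JSON string. A quick sketch of pulling the detected object names out of one cell; the annotation below is a hand-made sample in the AnnotateImageResponse JSON shape, not real API output:

```python
import json

# Hand-built sample in the shape of a Vision API AnnotateImageResponse
# rendered as JSON (field names follow the API's JSON mapping).
objects_json = json.dumps({
    "localizedObjectAnnotations": [
        {"name": "Car", "score": 0.92},
        {"name": "Wheel", "score": 0.87},
    ]
})

def annotation_names(raw):
    """Return the detected object names from one 'objects' cell."""
    annotations = json.loads(raw).get("localizedObjectAnnotations", [])
    return [a["name"] for a in annotations]

print(annotation_names(objects_json))  # prints ['Car', 'Wheel']
```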

The new table has one row per image, with the Vision API annotations stored as a JSON string in the objects column.

Step 9: Analyze Results

At this point, the table can be analyzed, joined with other data, or processed like any other table, using the power of SQL and BigQuery with the unstructured objects in GCS!
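For instance, the JSON in the objects column can be unpacked with BigQuery's JSON functions. A sketch (the field names assume the Vision API's AnnotateImageResponse JSON shape):

SELECT
  uri,
  JSON_VALUE(annotation, '$.name') AS object_name,
  CAST(JSON_VALUE(annotation, '$.score') AS FLOAT64) AS score
FROM
  `bq-object-tables.bq_ot_dataset.bq_object_tables_results`,
  UNNEST(JSON_QUERY_ARRAY(objects, '$.localizedObjectAnnotations')) AS annotation;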

One example is search. Let's create a search index so we can filter the annotations down to relevant images.

Create a search index:

CREATE SEARCH INDEX vision_index
ON `bq-object-tables.bq_ot_dataset.bq_object_tables_results` (objects);

Now, we can search for a given term within the annotations:

SELECT * 
FROM `bq-object-tables.bq_ot_dataset.bq_object_tables_results`
WHERE SEARCH(objects, 'car')

When finished, you can delete the search index with:

DROP SEARCH INDEX vision_index ON `bq-object-tables.bq_ot_dataset.bq_object_tables_results`;

Conclusion

I hope this has been a helpful introduction to the exciting new Object Tables feature for BigQuery. We've looked at how to create the metadata tables for GCS buckets and augment them with ML via Remote Functions.

References and Further Reading

  1. BigQuery object tables in a minute
  2. https://cloud.google.com/blog/products/data-analytics/how-to-manage-and-process-unstructured-data-in-bigquery
  3. https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions
  4. https://cloud.google.com/blog/products/data-analytics/extending-bigquery-functions