Featuring the Metropolitan Museum of Art and the Cloud Vision API
BigQuery is a data warehousing solution provided by Google Cloud.
The raison d'être of data warehouses is to gather all of an organization's data sources into a single entity and reshape them into SQL databases with business-oriented schemas.
This gives collaborators across an organization access to multiple sources of analysis-ready data through a single service.
Thus, cross-source data analysis is easily enabled.
This type of service proves very useful for any data-driven company function, in particular Business Intelligence Analysts or Data Scientists.
BigQuery also comes with public datasets (e.g. hackernews, stackoverflow or github_repos). What is best is that the list keeps being updated on a regular basis.
If you are new to BigQuery and would like to explore these open datasets, you can find valuable information here: try BigQuery for free.
In addition, here are some pointers to interesting analyses of BigQuery public datasets from Felipe Hoffa:
- All the open source in Github now shared within BigQuery: Analyze all the code!
- These are the real Stack Overflow trends: Use the pageviews
- Who contributed the most to open source in 2017? Let's analyze Github's data and find out.
I have personally been working with BigQuery for almost a year as part of my job, and here are some lessons I picked up along the way which you may find useful.
To support my statements, I will use the public dataset bigquery-public-data.the_met.
The the_met dataset gathers data on 200k art objects from the Metropolitan Museum of Art in New York.
The data consists of object metadata as well as picture representations.
On top of that, all the images have been run through a pretrained computer vision model via the Cloud Vision API, providing rich image annotations (among which image classification, face detection and localization, and Optical Character Recognition).
These annotations are made available in the dataset and contain a lot of nested fields, making it a great playground for data wrangling.
The dataset consists of three tables:
the_met.images
the_met.objects
the_met.vision_api_data
The common primary key is object_id.
Illustration here
More information about the_met is available in this post from Sara Robinson: When art meets big data.
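As a quick sketch of how the three tables relate, here is a query (written against the schema described above; only object_id is assumed to be shared) counting objects that appear in all three tables:

```sql
-- Sketch: how many objects have both an image and Vision API annotations?
SELECT
  COUNT(DISTINCT o.object_id) AS fully_covered_objects
FROM
  `bigquery-public-data.the_met.objects` o
INNER JOIN
  `bigquery-public-data.the_met.images` i
ON
  o.object_id = i.object_id
INNER JOIN
  `bigquery-public-data.the_met.vision_api_data` v
ON
  o.object_id = v.object_id
```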
The tricks described here do not necessarily address complicated cases, but rather intend to help you write shorter and more efficient queries, oftentimes with overlooked commands.
The topics we will cover are as follows:
- Idempotently split tables at random
- Shorten your queries with EXCEPT
- Modify arrays inline with ARRAY
- When SQL cannot handle it, just JS it
- Bonus part on BigQuery ML
Disclaimer: The following examples use Standard SQL which, in general, provides more features than BigQuery Legacy SQL. They assume you are already familiar with BigQuery, row aggregation, records, repeated fields and subqueries.
When experimenting with machine learning on BigQuery data, it is useful to be able to split tables at random idempotently (the resulting sets of rows must remain the same whenever the query is run). The use case arises when splitting a dataset into training and development sets.
For instance, let us perform an 80-20 split of the_met.objects.
The idea is to hash a column that is present in all rows and unique to each row.
The arithmetic properties of the integer hashes can then be exploited to partition rows idempotently.
For instance, for 20% splits, the value modulo 5 can be used.
In our case, this field could be object_number.
SELECT
*
FROM
`bigquery-public-data.the_met.objects`
WHERE
MOD(FARM_FINGERPRINT(object_number),5) = 0 -- for the development set
# MOD(FARM_FINGERPRINT(object_number),5) != 0 -- for the training set
The above query returned 80,221 rows out of 401,596 (i.e. 19.97%). Yay!
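The same trick generalizes to other ratios. Here is a sketch (still keyed on object_number) of a 70-20-10 train/dev/test split using modulo 10; note that FARM_FINGERPRINT can return negative values, so ABS is needed when comparing against ranges rather than against zero:

```sql
-- Sketch: idempotent 70-20-10 split using modulo 10 of the hash
SELECT
  *,
  CASE
    WHEN MOD(ABS(FARM_FINGERPRINT(object_number)), 10) < 7 THEN 'train'
    WHEN MOD(ABS(FARM_FINGERPRINT(object_number)), 10) < 9 THEN 'dev'
    ELSE 'test'
  END AS split
FROM
  `bigquery-public-data.the_met.objects`
```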
An often overlooked keyword is EXCEPT. It allows you to query all columns except a subset of them.
For example:
SELECT
* EXCEPT(is_highlight, is_public_domain)
FROM
`bigquery-public-data.the_met.objects`
Let us now look at the image annotations available in the_met.vision_api_data.
Among the available annotations are the faceAnnotations:
What if we were interested in all columns except faceAnnotations.surpriseLikelihood and faceAnnotations.sorrowLikelihood? Would this query work?
SELECT
* EXCEPT(faceAnnotations.surpriseLikelihood, faceAnnotations.sorrowLikelihood)
FROM
`bigquery-public-data.the_met.vision_api_data`
In practice, this query is not allowed, as it references nested fields in EXCEPT.
faceAnnotations needs to be UNNESTed before being referenced.
SELECT
* EXCEPT(surpriseLikelihood, sorrowLikelihood)
FROM
`bigquery-public-data.the_met.vision_api_data`, UNNEST(faceAnnotations) f
The above query works, but faceAnnotations is now unnested.
The next section shows how to perform the task while preserving the nested structure, thanks to the ARRAY keyword.
In general, maintaining nested structures turns out to be more cost-effective in terms of storage and processing power, compared to fully flat tables.
You may know about ARRAY_AGG or ARRAY_LENGTH, but have you heard of the often overlooked ARRAY?
According to the documentation, ARRAY(subquery) returns an ARRAY with one element for each row of the subquery.
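A minimal toy example of this behavior:

```sql
-- ARRAY(subquery) collects the subquery's rows into an array
SELECT ARRAY(
  SELECT x FROM UNNEST([3, 1, 2]) AS x ORDER BY x
) AS sorted_values
-- returns [1, 2, 3]
```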
Let us look at several use cases.
Let us revisit the previous example.
SELECT
* EXCEPT(faceAnnotations),
ARRAY(
SELECT AS STRUCT
* EXCEPT(surpriseLikelihood, sorrowLikelihood)
FROM
data.faceAnnotations
) AS faceAnnotations
FROM
`bigquery-public-data.the_met.vision_api_data` data
And I never unnested any column.
Notice that SELECT AS STRUCT is necessary when querying multiple columns within an ARRAY clause.
This time, instead of filtering out columns, we will filter out rows within an ARRAY, based on a condition on a nested column.
Assume we are not interested in faces that are very likely to be under-exposed (i.e. underExposedLikelihood = 'VERY_LIKELY').
SELECT
* EXCEPT(faceAnnotations),
ARRAY(
SELECT AS STRUCT
*
FROM
data.faceAnnotations
WHERE underExposedLikelihood != 'VERY_LIKELY'
) AS faceAnnotations
FROM
`bigquery-public-data.the_met.vision_api_data` data
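The same pattern can also reorder array elements in place. Here is a sketch that sorts each object's faces by confidence, assuming the detectionConfidence field of the Vision API face annotations:

```sql
-- Sketch: keep the nested structure, but sort faces by detection confidence
SELECT
  * EXCEPT(faceAnnotations),
  ARRAY(
    SELECT AS STRUCT
      *
    FROM
      data.faceAnnotations
    ORDER BY detectionConfidence DESC
  ) AS faceAnnotations
FROM
  `bigquery-public-data.the_met.vision_api_data` data
```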
On top of the face annotations, the Cloud Vision API provides web detection features, similar to Google Images reverse search.
Let us enrich this schema with some matching object_ids from the_met.images (image URLs for the museum objects).
This is the target schema:
The corresponding query:
SELECT
* EXCEPT(webDetection),
STRUCT(
webDetection.partialMatchingImages,
webDetection.pagesWithMatchingImages,
webDetection.fullMatchingImages,
ARRAY(
SELECT AS STRUCT
fmi.score,
fmi.url,
i.object_id
FROM
data.webDetection.fullMatchingImages fmi
INNER JOIN
`bigquery-public-data.the_met.images` i
ON
fmi.url = i.original_image_url
) AS fullMatchingImages_from_met,
webDetection.webEntities
) AS webDetection
FROM
`bigquery-public-data.the_met.vision_api_data` data
From there you could, for instance, evaluate image retrieval performance of Google's reverse image search.
BigQuery allows you to write your own User-Defined Functions (UDFs) in JavaScript, which can prove useful when Standard SQL does not support what you are trying to achieve.
For instance, we could be interested in extracting, from vision_api_data, the labelAnnotation with the second-highest score for each object_id.
In Standard SQL, you can easily select the top rows, but extracting the second-highest row per group is not straightforward.
This can be achieved with
CREATE TEMPORARY FUNCTION get_second_best(
labelAnnotations ARRAY<
STRUCT
<
score FLOAT64,
description STRING
>
>
)
RETURNS
STRUCT<score FLOAT64, description STRING>
LANGUAGE js AS """
labelAnnotations.sort(function(a, b){return b.score-a.score});
return labelAnnotations[1]
""";
SELECT
object_id,
get_second_best(
ARRAY(
SELECT AS STRUCT
score,
description
FROM
data.labelAnnotations
)
) AS second_most_confident_labelAnnotation
FROM
`bigquery-public-data.the_met.vision_api_data` data
Note that UDFs are not currently supported in BigQuery views.
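As a side note, a pure Standard SQL sketch can get close to the same result by ordering the array and indexing it with SAFE_OFFSET (which returns NULL when fewer than two labels exist), at the cost of the UDF's flexibility:

```sql
-- Sketch: second-highest label per object without a JS UDF
SELECT
  object_id,
  ARRAY(
    SELECT AS STRUCT score, description
    FROM data.labelAnnotations
    ORDER BY score DESC
  )[SAFE_OFFSET(1)] AS second_most_confident_labelAnnotation
FROM
  `bigquery-public-data.the_met.vision_api_data` data
```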
BigQuery ML is one of the newest features of BigQuery. It allows you to build, train, and deploy machine learning models with only a few SQL commands. This can save you the time of building data pipelines and perhaps spinning up other services like Google App Engine.
For the sake of example, let us see whether some of an object's metadata are good predictors of whether or not a face is represented on it. We will rely on the Cloud Vision API for ground-truth labeling of the presence of a face:
IF(ARRAY_LENGTH(faceAnnotations)=0 OR faceAnnotations IS NULL, 0, 1) AS label
We will use a linear logistic regression for this purpose.
Creating a model on BigQuery is as simple as pre-pending your training set query with a few commands.
Provided you have created a dataset named bqml, the syntax looks like the following:
CREATE MODEL `bqml.the_met_model`
OPTIONS(model_type='logistic_reg') AS
SELECT IF(ARRAY_LENGTH(faceAnnotations)=0 OR faceAnnotations IS NULL, 0, 1) as label
, IFNULL(department, "") department
, IFNULL(culture, "") culture
, IFNULL(classification, "") classification
, is_highlight
, is_public_domain
, object_begin_date
, object_end_date
FROM
`bigquery-public-data.the_met.objects` o
INNER JOIN
`bigquery-public-data.the_met.vision_api_data` data
ON
o.object_id = data.object_id
WHERE
MOD(FARM_FINGERPRINT(o.object_number),5) != 0
Notice the mixture of types in the input features: STRING, BOOLEAN, INTEGER.
By default, BigQuery handles this diversity with no problem by one-hot encoding the non-numerical features.
Also notice we have used the idempotent splitting trick from earlier to train the model on 80% of the data.
After training, let us evaluate the model on the development set with:
SELECT
*
FROM
ML.EVALUATE(MODEL `bqml.the_met_model`,
(
SELECT IF(ARRAY_LENGTH(faceAnnotations)=0 OR faceAnnotations IS NULL, 0, 1) as label
, IFNULL(department, "") department
, IFNULL(culture, "") culture
, IFNULL(classification, "") classification
, is_highlight
, is_public_domain
, object_begin_date
, object_end_date
FROM
`bigquery-public-data.the_met.objects` o
INNER JOIN
`bigquery-public-data.the_met.vision_api_data` data
ON
o.object_id = data.object_id
WHERE
MOD(FARM_FINGERPRINT(o.object_number),5) = 0
)
)
The evaluation query returns
[
{
"precision": "0.6363636363636364",
"recall": "4.160228218233686E-4",
"accuracy": "0.8950143845832215",
"f1_score": "8.315020490586209E-4",
"log_loss": "0.3046493172393632",
"roc_auc": "0.857339"
}
]
As a side note, the prior distribution is the following:
[
{
"pc_with_faces": "0.1050043372170668",
"pc_no_faces": "0.8949956627829332"
}
]
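(A sketch of how such a prior could be computed, mirroring the label expression used for training:)

```sql
-- Sketch: class balance of the face/no-face label over the whole dataset
SELECT
  COUNTIF(ARRAY_LENGTH(faceAnnotations) > 0) / COUNT(*) AS pc_with_faces,
  COUNTIF(faceAnnotations IS NULL OR ARRAY_LENGTH(faceAnnotations) = 0) / COUNT(*) AS pc_no_faces
FROM
  `bigquery-public-data.the_met.vision_api_data`
```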
Given the very low recall score, and the fact that the accuracy barely exceeds the prior distribution, we cannot say the model has learnt much from the input features 😊. Nevertheless, this example can serve as an intro to BigQuery ML.
To improve model performance, we could have looked at string preprocessing and factoring for fields like culture and classification.
Other unstructured description text fields could have been exploited but this goes beyond the current scope of BigQuery ML.
At the time this post was written, BigQuery ML was available as a beta release.
And that is a wrap! Thank you for reading this far. I really enjoyed sharing those learnings so I hope they will be useful to some.
At Sephora South-East Asia (Sephora SEA), we leverage BigQuery to enable easier data-backed decisions as well as to better understand our customers. The Data Team at Sephora SEA takes care of internally democratizing data as much as possible, in part through SQL trainings on BigQuery. This article is dedicated to them, as well as to all of our trainees and past graduates.
And as my colleague Aurélien would proudly say:
SELECT thanks FROM you