@polleyg
polleyg / list_all_bigquery_jobs.md
Created December 2, 2017 03:30
List BigQuery jobs from all users

Sometimes you need to troubleshoot and inspect the details of jobs (load, query, etc.) in BigQuery. Inspecting the job history in the BigQuery web UI will only show the jobs that you have run. The same is true when you run `bq ls -j` on the command line.

But what if you need to see all jobs that have been run, for example automated jobs run by service accounts? A quick tip is to use the `--all` flag:

-a,--[no]all: Show all results. For jobs, will show jobs from all users. For datasets, will list hidden datasets. For transfer configs and runs, this flag is redundant and not necessary.

bq ls -j --all

(this tip originated from a question on Stack Overflow: https://stackoverflow.com/questions/47583485/bigquery-history-of-jobs-submitted-through-python-api)
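
If you're hitting the same limitation from the Python API (the situation in that Stack Overflow question), the client library exposes the same switch. The following is only a minimal sketch, assuming the google-cloud-bigquery library is installed and using a placeholder project ID:

# Sketch only: list recent jobs from *all* users, not just your own.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project ID

# all_users=True is the API-side equivalent of `bq ls -j --all`
for job in client.list_jobs(all_users=True, max_results=50):
    print(job.job_id, job.job_type, job.state, job.user_email)

As with the `--all` flag, you'll need sufficient permissions on the project to see other users' jobs.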

@polleyg
polleyg / coalesce_stackdriver_logs_bigquery.md
Last active June 22, 2018 17:48
How to coalesce Stackdriver logs/sinks into just one BigQuery project/dataset

Exporting Stackdriver logs back to BigQuery for analysis is a very straightforward task (and a very good pattern to follow too, BTW). However, when you are managing lots of projects, you'll probably want to coalesce the logs from all your projects into one dedicated BigQuery project/dataset for easier querying and maintainability.

However, setting this up in the GCP console isn't that obvious to most users. This is because the Stackdriver sink configuration doesn't actually let you select a different project/dataset to send the logs to. It only offers datasets in the currently selected (working) project:

[Screenshot: the Stackdriver sink configuration drop-down, which only offers datasets in the current project]

The trick is to select "Custom destination" from the drop-down, and then use the following format to specify the project/dataset where you want the logs to go. Plug in the project and dataset that you want to use:

`bigquery.googleapis.com/projects/[PROJECT_ID]/datasets/[DATASET_ID]`
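
If you'd rather script the sink instead of clicking through the console, the same custom destination string can be passed when creating the sink programmatically. This is only a rough sketch, assuming the google-cloud-logging client library; the project, dataset, and sink names below are placeholders:

# Sketch only: create a sink in the source project that exports its logs to a
# BigQuery dataset living in a *different* (central) project.
from google.cloud import logging

client = logging.Client(project="source-project-id")  # project emitting the logs

destination = "bigquery.googleapis.com/projects/central-logging-project/datasets/all_the_logs"

sink = client.sink(
    "coalesced-logs-to-bq",     # placeholder sink name
    filter_="resource.type:*",  # catch-all filter: export everything
    destination=destination,
)
sink.create()

Just like with a console-created sink, you'll still need to grant the sink's writer identity access to the destination dataset before any rows start appearing.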

@polleyg
polleyg / triggering_cf_folders.md
Last active December 15, 2017 11:31
Triggering Cloud Functions on folders in GCS

Cloud Functions are an extremely powerful and elegant way to solve problems. You don't need to worry about any of the infrastructure, nor about scaling. It's a win-win! However, when deploying your Cloud Function to trigger on GCS, you can currently only set the --trigger-bucket parameter to an actual GCS bucket.

But what if you only want to trigger on files in a specific folder (or folders) within that bucket?

Not to worry! There's a little trick for this. The object/file name (i.e. the new file that's been uploaded to your folder) actually contains the full path, including any folder names. So, in your Cloud Function, all you need to do is this:

function (event, callback) {
  const file = event.data;
  if (file.resourceState === 'exists' && file.name && file.name.indexOf('my_lovely_folder/') !== -1) {
    // ...only files under my_lovely_folder/ end up here...
  }
  callback();
}
@polleyg
polleyg / common_sinks_and_metrics.md
Last active February 2, 2021 01:29
Common sinks and metrics we use for GCP
#!/usr/bin/env bash
PROJECT="<project-id>"

### Sinks ###

# Base sink for all logs to BigQuery (note: $PROJECT matches the variable defined above)
gcloud logging sinks create --quiet --project $PROJECT ${PROJECT}_bigquery_sink \
  bigquery.googleapis.com/projects/billing-exports-184901/datasets/CLOUD_LOGGING_EXPORTS \
  --log-filter "resource.type:*"

#### Metrics ####
@polleyg
polleyg / cloudbuild.yaml
Created September 6, 2018 11:47
A Cloud Build configuration
steps:
# 1. Fetch the source code
- name: gcr.io/cloud-builders/git
  args: ['clone', 'https://github.com/polleyg/gcp-batch-ingestion-bigquery.git']
# 2a. Set up GCS & BQ etc. using public terraform Docker image
- name: hashicorp/terraform
  args: ['init']
  dir: 'terraform'
@polleyg
polleyg / index.js
Created September 6, 2018 12:11
Cloud Function for triggering templated Dataflow pipelines
//gcloud --project=grey-sort-challenge functions deploy goWithTheDataFlow --stage-bucket gs://batch-pipeline --trigger-bucket gs://batch-pipeline
const google = require('googleapis');
exports.goWithTheDataFlow = function(event, callback) {
  const file = event.data;
  const context = event.context;
  console.log("File is: ", file);
  console.log("State is: ", context.eventType);
  if (context.eventType === 'google.storage.object.finalize' && file.name.indexOf('upload/') !== -1) {
@polleyg
polleyg / TemplatePipeline.java
Created September 6, 2018 12:14
Templated Dataflow pipeline for reading Wikipedia page views from a file
package org.polleyg;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.Description;
@polleyg
polleyg / infra.tf
Created September 6, 2018 13:24
Terraform script to create a bucket
terraform {
  backend "gcs" {
    bucket = "tf-state-gcp-batch-ingestion"
    region = "australia-southeast1-a"
    prefix = "terraform/state"
  }
}
provider "google" {
  project = "grey-sort-challenge"
@polleyg
polleyg / twitter_to_pubsub.py
Last active March 5, 2023 23:37
Twitter to PubSub
"""This script uses the Twitter Streaming API, via the tweepy library,
to pull in tweets and publish them to a PubSub topic.
"""
import base64
import datetime
import utils
from tweepy import OAuthHandler
from tweepy import Stream
from tweepy.streaming import StreamListener
@polleyg
polleyg / Dockerfile
Created September 26, 2018 11:01
Docker for Twitter
FROM python:2
RUN pip install --upgrade pip
RUN pip install tweepy
RUN pip install --upgrade google-api-python-client
RUN pip install --upgrade oauth2client
RUN pip install python-dateutil
ADD twitter-to-pubsub.py /twitter-to-pubsub.py
ADD utils.py /utils.py