@polleyg
polleyg / list_all_bigquery_jobs.md
Created December 2, 2017 03:30
List BigQuery jobs from all users

Sometimes you need to troubleshoot and inspect the details of jobs (load, query, etc.) in BigQuery. Inspecting the job history in the BigQuery web UI will only show the jobs that you have run. The same is true when you run `bq ls -j` on the command line.

But what if you need to see all jobs that have been run, for example automated jobs run by service accounts? A quick tip is to use the `--all` flag:

-a,--[no]all: Show all results. For jobs, will show jobs from all users. For datasets, will list hidden datasets. For transfer configs and runs, this flag is redundant and not necessary.

bq ls -j --all

(this tip originated from a question on Stack Overflow: https://stackoverflow.com/questions/47583485/bigquery-history-of-jobs-submitted-through-python-api)
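
If you're hitting the same limitation from the Python API (the situation in that Stack Overflow question), the client library exposes the same switch. The following is only a minimal sketch, assuming the google-cloud-bigquery library is installed and using a placeholder project ID:

# Sketch only: list recent jobs from *all* users, not just your own.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project ID

# all_users=True is the API-side equivalent of `bq ls -j --all`
for job in client.list_jobs(all_users=True, max_results=50):
    print(job.job_id, job.job_type, job.state, job.user_email)

As with the `--all` flag, you'll need sufficient permissions on the project to see other users' jobs.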

@polleyg
polleyg / coalesce_stackdriver_logs_bigquery.md
Last active June 22, 2018 17:48
How to coalesce Stackdriver logs/sinks into just one BigQuery project/dataset

Exporting Stackdriver logs back to BigQuery for analysis is a very straightforward task (and a very good pattern to follow too, BTW). However, when you are managing lots of projects, you'll probably want to coalesce the logs from all your projects into one dedicated BigQuery project/dataset for easier querying and maintainability.

However, setting this up in the GCP console isn't that obvious to most users. This is because the Stackdriver sink configuration doesn't actually let you select a different project/dataset to send the logs to. It only offers datasets in the currently selected (working) project:

[Screenshot: the Stackdriver sink configuration drop-down, which only offers datasets in the current project]

The trick is to select "Custom destination" from the drop-down, and then use the following format to specify the project/dataset where you want the logs to go. Plug in the project and dataset that you want to use:

`bigquery.googleapis.com/projects/[PROJECT_ID]/datasets/[DATASET_ID]`
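
If you'd rather script the sink instead of clicking through the console, the same custom destination string can be passed when creating the sink programmatically. This is only a rough sketch, assuming the google-cloud-logging client library; the project, dataset, and sink names below are placeholders:

# Sketch only: create a sink in the source project that exports its logs to a
# BigQuery dataset living in a *different* (central) project.
from google.cloud import logging

client = logging.Client(project="source-project-id")  # project emitting the logs

destination = "bigquery.googleapis.com/projects/central-logging-project/datasets/all_the_logs"

sink = client.sink(
    "coalesced-logs-to-bq",     # placeholder sink name
    filter_="resource.type:*",  # catch-all filter: export everything
    destination=destination,
)
sink.create()

Just like with a console-created sink, you'll still need to grant the sink's writer identity access to the destination dataset before any rows start appearing.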

@polleyg
polleyg / triggering_cf_folders.md
Last active December 15, 2017 11:31
Triggering Cloud Functions on folders in GCS

Cloud Functions are an extremely powerful and elegant way to solve problems. You don't need to worry about any of the infrastructure, nor about scaling. It's a win-win! However, when deploying your Cloud Function to trigger on GCS, you can currently only set the --trigger-bucket parameter to an actual GCS bucket.

But what if you only want to trigger on files in a specific folder (or folders) within that bucket?

Not to worry! There's a little trick for this. The object/file name (i.e. the new file that's been uploaded to your folder) actually contains the full path, including any folder names. So, in your Cloud Function, all you need to do is this:

function (event, callback) {
  const file = event.data;
  if (file.resourceState === 'exists' && file.name && file.name.indexOf('my_lovely_folder/') !== -1) {
    // ...only files under my_lovely_folder/ end up here...
  }
  callback();
}
@polleyg
polleyg / common_sinks_and_metrics.md
Last active February 2, 2021 01:29
Common sinks and metrics we use for GCP
#!/usr/bin/env bash
PROJECT="<project-id>"

### Sinks ###

# Base sink for all logs to BigQuery (note: $PROJECT matches the variable defined above)
gcloud logging sinks create --quiet --project $PROJECT ${PROJECT}_bigquery_sink \
  bigquery.googleapis.com/projects/billing-exports-184901/datasets/CLOUD_LOGGING_EXPORTS \
  --log-filter "resource.type:*"

#### Metrics ####
@polleyg
polleyg / cloudbuild.yaml
Created September 6, 2018 11:47
A Cloud Build configuration
steps:
# 1. Fetch the source code
- name: gcr.io/cloud-builders/git
  args: ['clone', 'https://github.com/polleyg/gcp-batch-ingestion-bigquery.git']
# 2a. Set up GCS & BQ etc. using public terraform Docker image
- name: hashicorp/terraform
  args: ['init']
  dir: 'terraform'
@polleyg
polleyg / index.js
Created September 6, 2018 12:11
Cloud Function for triggering templated Dataflow pipelines
//gcloud --project=grey-sort-challenge functions deploy goWithTheDataFlow --stage-bucket gs://batch-pipeline --trigger-bucket gs://batch-pipeline
const google = require('googleapis');
exports.goWithTheDataFlow = function(event, callback) {
  const file = event.data;
  const context = event.context;
  console.log("File is: ", file);
  console.log("State is: ", context.eventType);
  if (context.eventType === 'google.storage.object.finalize' && file.name.indexOf('upload/') !== -1) {
@polleyg
polleyg / TemplatePipeline.java
Created September 6, 2018 12:14
Templated Dataflow pipeline for reading Wikipedia page views from a file
package org.polleyg;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.Description;
@polleyg
polleyg / infra.tf
Created September 6, 2018 13:24
Terraform script to create a bucket
terraform {
  backend "gcs" {
    bucket = "tf-state-gcp-batch-ingestion"
    region = "australia-southeast1-a"
    prefix = "terraform/state"
  }
}
provider "google" {
  project = "grey-sort-challenge"
@polleyg
polleyg / twitter_to_pubsub.py
Last active March 5, 2023 23:37
Twitter to PubSub
"""This script uses the Twitter Streaming API, via the tweepy library,
to pull in tweets and publish them to a PubSub topic.
"""
import base64
import datetime
import utils
from tweepy import OAuthHandler
from tweepy import Stream
from tweepy.streaming import StreamListener
@polleyg
polleyg / Dockerfile
Created September 26, 2018 11:01
Docker for Twitter
FROM python:2
RUN pip install --upgrade pip
RUN pip install tweepy
RUN pip install --upgrade google-api-python-client
RUN pip install --upgrade oauth2client
RUN pip install python-dateutil
ADD twitter-to-pubsub.py /twitter-to-pubsub.py
ADD utils.py /utils.py