BigQuery QuickRef

Big Data keeps evolving. Stone Age Hadoop was a lot of Java boilerplate for defining HDFS access, Mappers & Reducers. This was superseded by Bronze Age Spark, which provided a succinct Scala unification of:

  • ML pipelines
  • in-memory structured DataSets over RDDs via a SparkSession SQL API
  • Distributed Streams

(Note: you can run such jobs easily, in a dynamically scalable manner, on Google Dataproc.) Technology keeps evolving - the Big-Iron Age has arrived in the form of Google Cloud Platform's SPARK KILLER - a next-gen Big Data stack consisting of:

  • BigQuery - massively parallel, blazing fast data analytics (see https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-bigquery-query)
  • TensorFlow - Deep Learning framework (the biggest Machine Learning breakthrough of the decade), distributed and accelerated over custom Tensor Processing Units (TPUs)
  • DataFlow - (Apache Beam) a structured streaming windowing paradigm with a clean, rich API that matches this metaphor (see https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison): build up a transform graph connecting sources and sinks, many of which come out of the box - e.g. Cloud Pub/Sub, BigQuery
  • DataLab - Jupyter Notebooks with some extra IPython integration-convenience magics: %%sql, %%bigquery, %%storage, %%chart, %%mlalpha, %tensorboard, %%monitoring

BigQuery has 4 components which you can read about here: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood

  • Borg - Resource Manager
  • Colossus - Distributed FS
  • Jupiter - 1 Pb/s network
  • Dremel - query engine; its design paper inspired the open source Apache Drill

Anyhow, IMHO Spark cannot compete, but I want to focus here on the BigQuery API. Here's my QuickRef:

bq CLI

  • cancel - Request job cancellation and optionally wait for it to complete
bq --nosync cancel job_id
  • cp - Copies a table
bq cp dataset.old_table dataset2.new_table
  • extract - Extract source_table into Cloud Storage destination_uris.
bq extract ds.summary gs://mybucket/summary.csv
  • head - Displays rows in a table. params: --[no]job, --max_rows, --start_row, --[no]table
bq head -s 5 -n 10 dataset.table
  • init - Authenticate and create a default .bigqueryrc file.
  • insert - Insert JSON rows (from file or string) into a table
bq insert dataset.table /tmp/mydata.json
echo '{"a":1, "b":2}' | bq insert dataset.table
  • load - load source file / uris into destination_table, with optional json schema file / string

bq load ds.new_tbl ./info.csv ./info_schema.json
bq load ds.small gs://mybucket/small.csv name:integer,value:string
  • ls - List objects contained in a project or dataset. Flags: -j show jobs, -p show all projects, -a show all
bq ls mydataset
bq ls --filter labels.color:red -a -j -p -n 1000
  • mk - Create a dataset, table or view
bq mk -d --data_location=EU new_dataset
bq --dataset_id=new_dataset mk -t new_dataset.newtable name:integer,value:string
bq mk --view='select 1 as num' new_dataset.newview
  • mkdef - Emits JSON definition for a GCS backed table.
bq mkdef 'gs://bucket/file.csv' field1:integer,field2:string
  • partition - Copies source tables of the format <source_table_prefix> into partitioned tables, with the date suffix of the source tables becoming the partition date of the destination table partitions.
bq partition dataset1.sharded_ dataset2.partitioned_table
  • query - Execute a query
bq query 'select count(*) from publicdata:samples.shakespeare'
  • rm - Delete dataset / table (-d / -t flags signify target type, -f force, -r remove all tables of a dataset)
bq rm ds.table
bq rm -r -f old_dataset
  • shell - Start an interactive bq session.
  • show - Show all information about an object.
  • update - Updates a dataset, table or view.
bq update --description "My table" existing_dataset.existing_table
bq update -t existing_dataset.existing_table name:integer,value:string
bq update --view='select 1 as num' existing_dataset.existing_view --view_udf_resource=path/to/file.js
  • version - Print the version of the bq CLI
  • wait - Wait #seconds for a job to finish
bq wait --fail_on_error job_id 100

SQL Syntax

query_statement:

[ WITH with_query_name AS ( select ) [, ...] ]
query_expr

query_expr:

{ select | ( query_expr ) | query_expr set_op query_expr }
[ ORDER BY expression [{ ASC | DESC }] [, ...] ]
[ LIMIT count [ OFFSET skip_rows ] ]
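
A minimal self-contained query exercising WITH, ORDER BY and LIMIT (the inline data and aliases are invented for this quickref):

-- name a subquery, then query it
WITH squares AS (
  SELECT x, x * x AS sq
  FROM UNNEST([1, 2, 3, 4]) AS x
)
SELECT *
FROM squares
ORDER BY sq DESC
LIMIT 2;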

select:

SELECT  [{ ALL | DISTINCT }]
  { [ expression. ]* [ EXCEPT ( column_name [, ...] ) ]
  [ REPLACE ( expression [ AS ] column_name [, ...] ) ]
  | expression [ [ AS ] alias ] } [, ...]
[ FROM from_item  [, ...] ]
[ WHERE bool_expression ]
[ GROUP BY expression [, ...] ]
[ HAVING bool_expression ]
[ WINDOW window_name AS ( window_definition ) [, ...] ]

  • ALL - returns all rows (default), including duplicate rows; opt out with DISTINCT
  • EXCEPT - specifies columns to exclude from the result
  • REPLACE - replaces a column's value with an expression
  • HAVING - similar to WHERE, but evaluated over GROUP BY aggregates
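
A quick sketch of EXCEPT and REPLACE (the columns a and b are invented here):

-- drop column b from the result
SELECT * EXCEPT (b) FROM (SELECT 1 AS a, 2 AS b);
-- keep the shape of *, but rewrite column a
SELECT * REPLACE (a * 10 AS a) FROM (SELECT 1 AS a, 2 AS b);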

set_op:

UNION { ALL | DISTINCT }

from_item:

{
    table_name [ [ AS ] alias ] |
    join |
    ( select ) [ [ AS ] alias ] |
    field_path |
    { UNNEST( array_expression ) | UNNEST( array_path ) | array_path }
    [ [ AS ] alias ] [ WITH OFFSET [ [ AS ] alias ] ] |
    with_query_name [ [ AS ] alias ]
}

UNNEST iterates through an ARRAY and produces one row for each ARRAY element. WITH OFFSET adds an index column for the UNNEST rows.
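
For example (the array literal is invented for illustration):

SELECT element, idx
FROM UNNEST(['a', 'b', 'c']) AS element WITH OFFSET AS idx;
-- 3 rows: ('a', 0), ('b', 1), ('c', 2)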

join:

from_item join_type JOIN from_item
[ { ON bool_expression | USING ( join_column [, ...] ) } ]

The **USING** clause specifies a column_list of one or more columns which occur in both input tables, and performs an equality comparison on those columns - see the sketch after the join_type grammar.

join_type:

{ [INNER] | CROSS | FULL [OUTER] | LEFT [OUTER] | RIGHT [OUTER] }
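
A minimal JOIN ... USING sketch; the two WITH tables and their columns are invented for illustration:

WITH t1 AS (SELECT 1 AS id, 'left' AS v1),
     t2 AS (SELECT 1 AS id, 'right' AS v2)
SELECT id, v1, v2
FROM t1
INNER JOIN t2 USING (id);   -- equivalent to ON t1.id = t2.id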

Data types

  • INT64
  • FLOAT64
  • BOOL
  • STRING
  • BYTES
  • DATE
  • TIME
  • TIMESTAMP
  • ARRAY
  • STRUCT - e.g. ARRAY<STRUCT<ARRAY<STRUCT<a INT64, b STRING>>>> - a STRUCT is needed between the 2 ARRAYs because ARRAYs cannot directly hold other ARRAYs! (sketch below)
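
To make the ARRAY/STRUCT nesting rule concrete, a small sketch (the field names a, b, pairs are invented):

-- ARRAY<STRUCT<a INT64, b STRING>>
SELECT [STRUCT(1 AS a, 'x' AS b), STRUCT(2 AS a, 'y' AS b)] AS pairs;
-- nesting an ARRAY inside an ARRAY requires a STRUCT in between:
SELECT [STRUCT([STRUCT(1 AS a, 'x' AS b)] AS pairs)] AS nested;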

Functions

Converters

  • CAST

Aggregate functions

  • ANY_VALUE - returns an arbitrary value from the group (NULL if all input values are NULL)
  • APPROX_COUNT_DISTINCT - statistical estimate for big data
  • APPROX_QUANTILES - get approximate boundaries for n quantiles
  • APPROX_TOP_COUNT - get the approximate top n values and their approximate counts
  • APPROX_TOP_SUM - get approximate top n elements, based on the sum of an assigned weight
  • ARRAY_AGG - Returns an ARRAY of all expression values
  • ARRAY_CONCAT_AGG Concats ARRAYs into a single ARRAY
  • AVG, MIN, MAX, SUM
  • BIT_AND , BIT_OR, BIT_XOR
  • COUNT , COUNT_IF
  • LOGICAL_AND, LOGICAL_OR
  • STRING_AGG - concats strings
SELECT ARRAY_AGG(x*2) as array_agg
FROM UNNEST([1, 2, 3, 4, 5]) as x;
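
A few more of these aggregates in the same self-contained style (the fruit list is invented):

SELECT
  STRING_AGG(fruit, ', ') AS csv,             -- 'apple, pear, apple'
  APPROX_COUNT_DISTINCT(fruit) AS uniq,       -- 2
  COUNT_IF(fruit = 'apple') AS apples         -- 2
FROM UNNEST(['apple', 'pear', 'apple']) AS fruit;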

Statistical Aggregate Functions

  • CORR - Pearson coefficient of correlation of (x,y) pair set: -1 <= r <= 1
  • COVAR_POP - population covariance of (x,y) pair set: -Inf <= r <= +Inf
  • COVAR_SAMP - sample covariance of (x,y) pair set: -Inf <= r <= +Inf
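
A self-contained sketch of the (x, y) pair-set signature (the pairs are invented; with perfectly linear data CORR returns 1.0):

SELECT CORR(x, y) AS pearson_r, COVAR_SAMP(x, y) AS sample_cov
FROM UNNEST([STRUCT(1 AS x, 10 AS y),
             STRUCT(2 AS x, 20 AS y),
             STRUCT(3 AS x, 30 AS y)]);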

Window-Frame Analytic Functions

OLAP time-slice aggregations

Analytic Function Syntax

analytic_function_name ( [ argument_list ] )
  OVER { window_name | ( [ window_specification ] ) }

window_specification:

  [ window_name ]
  [ PARTITION BY partition_expression_list ]
  [ ORDER BY expression [{ ASC | DESC }] [, ...] ]
  [ window_frame_clause ]

window_frame_clause:

{ ROWS | RANGE }
{
  { UNBOUNDED PRECEDING | numeric_expression PRECEDING | CURRENT ROW }
  |
  { BETWEEN window_frame_boundary_start AND window_frame_boundary_end }
}

e.g. ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING

window_frame_boundary_start:

{ UNBOUNDED PRECEDING | numeric_expression { PRECEDING | FOLLOWING } | CURRENT ROW }

window_frame_boundary_end:

{ UNBOUNDED FOLLOWING | numeric_expression { PRECEDING | FOLLOWING } | CURRENT ROW }

  • PARTITION BY - similar to GROUP BY
  • RANK - computed ordinal column
  • DENSE_RANK - like RANK, but with no gaps: peer rows share a rank and the next distinct value takes the next consecutive rank
  • PERCENT_RANK - percentile rank of a row defined as (RK-1)/(NR-1)
  • CUME_DIST - relative rank of a row defined as NP/NR
  • NTILE - bucket number for constant bucket size
  • ROW_NUMBER
  • LEAD - window offset
  • LAG - window offset
  • FIRST_VALUE
  • LAST_VALUE
  • NTH_VALUE
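
A minimal sketch tying the window grammar to two of these functions (the input array is invented):

SELECT
  x,
  SUM(x) OVER (ORDER BY x ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS moving_sum,
  ROW_NUMBER() OVER (ORDER BY x DESC) AS rn
FROM UNNEST([10, 20, 30, 40]) AS x;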

Mathematical functions

  • ABS
  • SIGN
  • IS_INF, IS_NAN
  • IEEE_DIVIDE(X, Y) - if the result overflows, returns +/-inf. If Y=0 and X=0, returns NaN.
  • RAND
  • SQRT
  • POW, POWER
  • EXP
  • LN, LOG, LOG10
  • GREATEST, LEAST
  • DIV
  • SAFE_DIVIDE(X, Y) - Returns NULL on error
  • MOD
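
A sketch contrasting the error-handling variants (the literal operands are invented):

SELECT
  SAFE_DIVIDE(10, 0) AS safe_div,   -- NULL instead of a divide-by-zero error
  IEEE_DIVIDE(10, 0) AS ieee_div,   -- +inf
  DIV(10, 3) AS int_div,            -- 3
  MOD(10, 3) AS remainder;          -- 1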

Rounding Functions

  • ROUND
  • TRUNC
  • CEIL
  • CEILING
  • FLOOR

Trig Functions

  • COS, COSH, ACOS, ACOSH
  • SIN, SINH, ASIN, ASINH
  • TAN, TANH, ATAN, ATANH, ATAN2

String functions

  • BYTE_LENGTH
  • CHAR_LENGTH
  • CONCAT
  • ENDS_WITH
  • FORMAT
  • LENGTH
  • LOWER
  • LTRIM
  • REGEXP_CONTAINS
  • REGEXP_EXTRACT - returns the first substring of value that matches the regex (REGEXP_EXTRACT_ALL returns an array of all matches - see the sketch after this list)
  • REGEXP_REPLACE
  • REPLACE
  • RTRIM
  • SAFE_CONVERT_BYTES_TO_STRING - Any invalid UTF-8 chars are replaced with Unicode replacement char, U+FFFD.
  • SPLIT
  • STARTS_WITH
  • STRPOS
  • SUBSTR
  • TRIM
  • UPPER
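
A sketch of the regex trio on an invented input string:

SELECT
  REGEXP_CONTAINS('a1b2c3', r'\d') AS has_digit,        -- true
  REGEXP_EXTRACT('a1b2c3', r'\d') AS first_digit,       -- '1'
  REGEXP_EXTRACT_ALL('a1b2c3', r'\d') AS all_digits;    -- ['1', '2', '3']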

Array functions

  • ARRAY_CONCAT
  • ARRAY_LENGTH
  • ARRAY_TO_STRING
  • GENERATE_ARRAY - range
  • GENERATE_DATE_ARRAY
  • OFFSET, SAFE_OFFSET
  • ORDINAL, SAFE_ORDINAL
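
A sketch of 0-based OFFSET vs 1-based ORDINAL indexing (the table alias t is invented):

WITH t AS (SELECT GENERATE_ARRAY(1, 5) AS arr)
SELECT
  ARRAY_LENGTH(arr) AS len,               -- 5
  arr[OFFSET(0)] AS first_element,        -- 0-based: 1
  arr[SAFE_ORDINAL(10)] AS out_of_range,  -- 1-based; NULL instead of an error
  ARRAY_TO_STRING(['a', 'b'], '-') AS joined
FROM t;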

Date / Time functions

  • CURRENT_DATE
  • EXTRACT(part FROM date_expression)
  • DATE
  • DATE_ADD, DATE_SUB, DATE_DIFF
  • DATE_TRUNC
  • DATE_FROM_UNIX_DATE
  • FORMAT_DATE
  • PARSE_DATE
  • UNIX_DATE
  • DATETIME
  • DATETIME_ADD, DATETIME_SUB, DATETIME_DIFF
  • TIME
  • TIME_ADD, TIME_SUB, TIME_DIFF
  • TIME_TRUNC
  • CURRENT_TIMESTAMP
  • STRING
  • TIMESTAMP_ADD, TIMESTAMP_SUB, TIMESTAMP_DIFF
  • TIMESTAMP_TRUNC
  • FORMAT_TIMESTAMP
  • PARSE_TIMESTAMP
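
A few of the DATE functions in one self-contained sketch (the dates are invented):

SELECT
  CURRENT_DATE() AS today,
  DATE_ADD(DATE '2016-12-14', INTERVAL 1 WEEK) AS next_week,
  DATE_DIFF(DATE '2016-12-25', DATE '2016-12-14', DAY) AS days_to_go,
  EXTRACT(YEAR FROM DATE '2016-12-14') AS yr,
  FORMAT_DATE('%Y%m%d', DATE '2016-12-14') AS yyyymmdd;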

Security functions

  • SESSION_USER

Misc DSL features

  • CASE ... WHEN ... THEN
  • COALESCE
  • IF
  • IFNULL
  • BETWEEN
  • LIKE
  • IN
  • IS
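
Most of these in one invented sketch:

SELECT
  x,
  CASE WHEN x >= 3 THEN 'high' WHEN x = 2 THEN 'mid' ELSE 'low' END AS bucket,
  IF(x = 2, 'two', 'not two') AS label,
  IFNULL(NULL, x) AS fallback,
  COALESCE(NULL, NULL, x) AS first_non_null,
  x BETWEEN 1 AND 2 AS in_range,
  x IN (1, 3) AS in_set,
  x IS NOT NULL AS not_null
FROM UNNEST([1, 2, 3]) AS x;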