BigQuery QuickRef

Big Data keeps evolving. Stone Age Hadoop was a lot of Java boilerplate for defining HDFS access, Mappers & Reducers. This was superseded by Bronze Age Spark, which provided a succinct Scala unification of:

  • ML pipelines
  • in-memory structured DataSets over RDDs via a SparkSession SQL API
  • Distributed Streams

(Note: you can run such jobs easily, in a dynamically scalable manner, on Google Dataproc.) Technology keeps evolving - the Big-Iron Age has arrived in the form of Google Cloud Platform's Spark killer, a next-gen Big Data stack consisting of:

  • BigQuery - massively parallel, blazing-fast data analytics (see https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-bigquery-query)
  • TensorFlow - Deep Learning framework (the biggest Machine Learning breakthrough of the decade), distributed and accelerated over custom TPU (Tensor Processing Unit) hardware
  • DataFlow - (Apache Beam) a structured streaming / windowing paradigm with a clean, rich API that matches this metaphor (see https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison): build up a transform graph connecting sources and sinks, many of which come out of the box (e.g. Cloud Pub/Sub, BigQuery)
  • DataLab - Jupyter Notebooks with some extra IPython integration-convenience magics: %%sql, %%bigquery, %%storage, %%chart, %%mlalpha, %tensorboard, %%monitoring

BigQuery has 4 components which you can read about here: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood

  • Borg - Resource Manager
  • Colossus - Distributed FS
  • Jupiter - 1 Pb/s network
  • Dremel - query engine (its design inspired the open-source Apache Drill)

Anyhow, IMHO Spark cannot compete, but I want to focus here on the BigQuery API. Here's my QuickRef:

bq CLI

  • cancel - Requests cancellation of a job and optionally waits for it
bq --nosync cancel job_id
  • cp - Copies a table
bq cp dataset.old_table dataset2.new_table
  • extract - Extracts source_table into Cloud Storage destination_uris.
bq extract ds.summary gs://mybucket/summary.csv
  • head - Displays rows in a table. params: --[no]job, --max_rows, --start_row, --[no]table
bq head -s 5 -n 10 dataset.table
  • init - Authenticates and creates a default .bigqueryrc file.
  • insert - Insert JSON rows (from file or string) into a table
bq insert dataset.table /tmp/mydata.json
echo '{"a":1, "b":2}' | bq insert dataset.table
  • load - load source file / uris into destination_table, with optional json schema file / string

bq load ds.new_tbl ./info.csv ./info_schema.json
bq load ds.small gs://mybucket/small.csv name:integer,value:string
  • ls - List objects contained in a project or dataset. Flags: -j show jobs, -p show all projects
bq ls mydataset
bq ls --filter labels.color:red -a -j -p -n 1000
  • mk - Create a dataset, table or view
bq mk -d --data_location=EU new_dataset
bq --dataset_id=new_dataset mk -t new_dataset.newtable name:integer,value:string
bq mk --view='select 1 as num' new_dataset.newview
  • mkdef - Emits JSON definition for a GCS backed table.
bq mkdef 'gs://bucket/file.csv' field1:integer,field2:string
  • partition - Copies source tables of the format <source_table_prefix> into partitioned tables, with the date suffix of the source tables becoming the partition date of the destination table partitions.
bq partition dataset1.sharded_ dataset2.partitioned_table
  • query - Execute a query
bq query 'select count(*) from publicdata:samples.shakespeare'
  • rm - Delete dataset / table (-d / -t flags signify target type, -f force, -r remove all tables of a dataset)
bq rm ds.table
bq rm -r -f old_dataset
  • shell - Start an interactive bq session.
  • show - Show all information about an object.
  • update - Updates a dataset, table or view.
bq update --description "My table" existing_dataset.existing_table
bq update -t existing_dataset.existing_table name:integer,value:string
bq update --view='select 1 as num' existing_dataset.existing_view --view_udf_resource=path/to/file.js
  • version
  • wait - Waits a given number of seconds for a job to finish
bq wait --fail_on_error job_id 100
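
Putting a few of these together - a minimal sketch of a typical workflow (the dataset, table, file and bucket names here are hypothetical):

bq mk -d my_dataset
bq load my_dataset.events ./events.csv name:string,ts:timestamp,value:integer
bq query 'select count(*) from my_dataset.events'
bq extract my_dataset.events gs://my_bucket/events_backup.csv
bq rm -t my_dataset.events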

SQL Syntax

query_statement:

[ WITH with_query_name AS ( select ) [, ...] ]
query_expr

query_expr:

{ select | ( query_expr ) | query_expr set_op query_expr }
[ ORDER BY expression [{ ASC | DESC }] [, ...] ]
[ LIMIT count [ OFFSET skip_rows ] ]

select:

SELECT  [{ ALL | DISTINCT }]
  { [ expression. ]* [ EXCEPT ( column_name [, ...] ) ]
  [ REPLACE ( expression [ AS ] column_name [, ...] ) ]
  | expression [ [ AS ] alias ] } [, ...]
[ FROM from_item  [, ...] ]
[ WHERE bool_expression ]
[ GROUP BY expression [, ...] ]
[ HAVING bool_expression ]
[ WINDOW window_name AS ( window_definition ) [, ...] ]

  • ALL - returns all rows (default), including duplicate rows; opt out with DISTINCT
  • EXCEPT - specifies columns to exclude from the result
  • REPLACE - replaces a column value with an expression
  • HAVING - similar to WHERE, but evaluated over GROUP BY aggregates
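
For example, a minimal sketch of these modifiers (the dataset, table and column names are hypothetical):

SELECT category, SUM(price) AS total
FROM my_dataset.sales
GROUP BY category
HAVING SUM(price) > 100;

SELECT * EXCEPT (internal_id) REPLACE (UPPER(name) AS name)
FROM my_dataset.products;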

set_op:

UNION { ALL | DISTINCT }

from_item:

{
    table_name [ [ AS ] alias ] |
    join |
    ( select ) [ [ AS ] alias ] |
    field_path |
    { UNNEST( array_expression ) | UNNEST( array_path ) | array_path }
    [ [ AS ] alias ] [ WITH OFFSET [ [ AS ] alias ] ] |
    with_query_name [ [ AS ] alias ]
}

  • UNNEST - iterates through an ARRAY and produces one row for each ARRAY element
  • WITH OFFSET - adds an index column for the UNNEST rows
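
A minimal, self-contained sketch of UNNEST with WITH OFFSET:

SELECT x, idx
FROM UNNEST([10, 20, 30]) AS x WITH OFFSET AS idx;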

join:

from_item join_type JOIN from_item
[ { ON bool_expression | USING ( join_column [, ...] ) } ]

The **USING** clause specifies a list of one or more columns that occur in both input tables, and performs an equality comparison on them (see the sketch after the join_type grammar).

join_type:

{ [INNER] | CROSS | FULL [OUTER] | LEFT [OUTER] | RIGHT [OUTER] }
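
For example, a sketch of an equality join with USING (table names are hypothetical):

SELECT o.order_id, c.name
FROM my_dataset.orders AS o
JOIN my_dataset.customers AS c USING (customer_id);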

Data types

  • INT64
  • FLOAT64
  • BOOL
  • STRING
  • BYTES
  • DATE
  • TIME
  • TIMESTAMP
  • ARRAY
  • STRUCT - e.g. ARRAY<STRUCT<ARRAY<STRUCT<a INT64, b STRING>>>> - the STRUCT between the 2 ARRAYs is required because ARRAYs cannot directly contain other ARRAYs!
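
A minimal sketch of querying an ARRAY<STRUCT<...>> literal - the same UNNEST pattern extends to deeper nesting:

SELECT item.a, item.b
FROM UNNEST([STRUCT(1 AS a, 'x' AS b), STRUCT(2 AS a, 'y' AS b)]) AS item;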

Functions

Converters

  • CAST
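
A minimal sketch (SAFE_CAST behaves the same but returns NULL instead of erroring on a bad conversion):

SELECT CAST(42 AS STRING) AS s, CAST('2016-12-14' AS DATE) AS d;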

Aggregate functions

  • ANY_VALUE - returns the expression value of some row in the group, or NULL if all rows evaluate to NULL
  • APPROX_COUNT_DISTINCT - statistical estimate for big data
  • APPROX_QUANTILES - get approximate boundaries for n quantiles
  • APPROX_TOP_COUNT - returns the approximate top n values of an expression, together with their counts
  • APPROX_TOP_SUM - get approximate top n elements, based on the sum of an assigned weight
  • ARRAY_AGG - Returns an ARRAY of all expression values
  • ARRAY_CONCAT_AGG Concats ARRAYs into a single ARRAY
  • AVG, MIN, MAX, SUM
  • BIT_AND , BIT_OR, BIT_XOR
  • COUNT , COUNT_IF
  • LOGICAL_AND, LOGICAL_OR
  • STRING_AGG - concats strings
SELECT ARRAY_AGG(x*2) as array_agg
FROM UNNEST([1, 2, 3, 4, 5]) as x;
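
And a similar sketch for STRING_AGG:

SELECT STRING_AGG(name, ', ') AS names
FROM UNNEST(['a', 'b', 'c']) AS name;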

Statistical Aggregate Functions

  • CORR - Pearson coefficient of correlation of (x,y) pair set: -1 <= r <= 1
  • COVAR_POP - population covariance of (x,y) pair set: -Inf <= r <= +Inf
  • COVAR_SAMP - sample covariance of (x,y) pair set: -Inf <= r <= +Inf
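
A minimal sketch over a hypothetical inline point set:

WITH pts AS (
  SELECT 1.0 AS x, 2.0 AS y UNION ALL
  SELECT 2.0, 4.0 UNION ALL
  SELECT 3.0, 9.0
)
SELECT CORR(x, y) AS r, COVAR_SAMP(x, y) AS cov
FROM pts;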

Window-Frame Analytic Functions

OLAP time-slice aggregations

Analytic Function Syntax

analytic_function_name ( [ argument_list ] )
  OVER { window_name | ( [ window_specification ] ) }

window_specification:

  [ window_name ]
  [ PARTITION BY partition_expression_list ]
  [ ORDER BY expression [{ ASC | DESC }] [, ...] ]
  [ window_frame_clause ]

window_frame_clause:

{ ROWS | RANGE }
{
  { UNBOUNDED PRECEDING | numeric_expression PRECEDING | CURRENT ROW }
  |
  { BETWEEN window_frame_boundary_start AND window_frame_boundary_end }
}

e.g. ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING

window_frame_boundary_start:

{ UNBOUNDED PRECEDING | numeric_expression { PRECEDING | FOLLOWING } | CURRENT ROW }

window_frame_boundary_end:

{ UNBOUNDED FOLLOWING | numeric_expression { PRECEDING | FOLLOWING } | CURRENT ROW }
  • PARTITION BY - similar to GROUP BY
  • RANK - ordinal rank of each row within its partition; peer rows receive the same rank, and the following rank skips ahead
  • DENSE_RANK - like RANK, but with no gaps: the rank after a group of peers is the next consecutive integer
  • PERCENT_RANK - percentile rank of a row defined as (RK-1)/(NR-1)
  • CUME_DIST - relative rank of a row defined as NP/NR
  • NTILE - bucket number for constant bucket size
  • ROW_NUMBER
  • LEAD - value of a following row within the window, at a given offset
  • LAG - value of a preceding row within the window, at a given offset
  • FIRST_VALUE
  • LAST_VALUE
  • NTH_VALUE
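
Putting the syntax and a few of these functions together - a sketch over a hypothetical employees table:

SELECT
  name,
  dept,
  RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dept_rank,
  AVG(salary) OVER (PARTITION BY dept ORDER BY salary
                    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS rolling_avg
FROM my_dataset.employees;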

Mathematical functions

  • ABS
  • SIGN
  • IS_INF, IS_NAN
  • IEEE_DIVIDE(X,Y) - If the result overflows, returns +/-inf. If Y=0 and X=0, returns NaN.
  • RAND
  • SQRT
  • POW, POWER
  • EXP
  • LN, LOG, LOG10
  • GREATEST, LEAST
  • DIV
  • SAFE_DIVIDE(X, Y) - Returns NULL on error
  • MOD
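
A minimal sketch contrasting the error-handling variants:

SELECT
  SAFE_DIVIDE(10, 0) AS safe_div,   -- NULL instead of a divide-by-zero error
  IEEE_DIVIDE(10, 0) AS ieee_div,   -- +inf
  MOD(10, 3) AS remainder;          -- 1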

Rounding Functions

  • ROUND
  • TRUNC
  • CEIL
  • CEILING
  • FLOOR

Trig Functions

  • COS, COSH, ACOS, ACOSH
  • SIN, SINH, ASIN, ASINH
  • TAN, TANH, ATAN, ATANH, ATAN2

String functions

  • BYTE_LENGTH
  • CHAR_LENGTH
  • CONCAT
  • ENDS_WITH
  • FORMAT
  • LENGTH
  • LOWER
  • LTRIM
  • REGEXP_CONTAINS
  • REGEXP_EXTRACT - returns the first substring of value that matches the regex (use REGEXP_EXTRACT_ALL for an array of all matches)
  • REGEXP_REPLACE
  • REPLACE
  • RTRIM
  • SAFE_CONVERT_BYTES_TO_STRING - Any invalid UTF-8 chars are replaced with Unicode replacement char, U+FFFD.
  • SPLIT
  • STARTS_WITH
  • STRPOS
  • SUBSTR
  • TRIM
  • UPPER
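
A minimal sketch of a few of these:

SELECT
  REGEXP_EXTRACT('user@example.com', r'@(.+)$') AS domain,
  REGEXP_CONTAINS('abc123', r'\d+') AS has_digits,
  SPLIT('a,b,c', ',') AS parts;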

Array functions

  • ARRAY_CONCAT
  • ARRAY_LENGTH
  • ARRAY_TO_STRING
  • GENERATE_ARRAY - generates an array of values over a range (start, end, optional step)
  • GENERATE_DATE_ARRAY
  • OFFSET, SAFE_OFFSET - zero-based array element access (the SAFE_ variant returns NULL when out of bounds)
  • ORDINAL, SAFE_ORDINAL - one-based array element access (the SAFE_ variant returns NULL when out of bounds)
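
A minimal sketch of array generation and element access:

SELECT
  GENERATE_ARRAY(1, 5) AS nums,
  GENERATE_ARRAY(1, 5)[OFFSET(0)] AS first_zero_based,
  GENERATE_ARRAY(1, 5)[ORDINAL(1)] AS first_one_based,
  ARRAY_TO_STRING(['a', 'b', 'c'], '-') AS joined;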

Date / Time functions

  • CURRENT_DATE
  • EXTRACT(part FROM date_expression)
  • DATE
  • DATE_ADD, DATE_SUB, DATE_DIFF
  • DATE_TRUNC
  • DATE_FROM_UNIX_DATE
  • FORMAT_DATE
  • PARSE_DATE
  • UNIX_DATE
  • DATETIME
  • DATETIME_ADD, DATETIME_SUB, DATETIME_DIFF
  • TIME
  • TIME_ADD, TIME_SUB, TIME_DIFF
  • TIME_TRUNC
  • CURRENT_TIMESTAMP
  • STRING
  • TIMESTAMP_ADD, TIMESTAMP_SUB, TIMESTAMP_DIFF
  • TIMESTAMP_TRUNC
  • FORMAT_TIMESTAMP
  • PARSE_TIMESTAMP
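
A minimal sketch of a few of these:

SELECT
  CURRENT_DATE() AS today,
  DATE_ADD(DATE '2016-12-14', INTERVAL 7 DAY) AS next_week,
  DATE_DIFF(DATE '2016-12-25', DATE '2016-12-14', DAY) AS days_until,
  FORMAT_DATE('%Y%m%d', DATE '2016-12-14') AS yyyymmdd,
  EXTRACT(YEAR FROM DATE '2016-12-14') AS yr;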

Security functions

  • SESSION_USER

Misc DSL features

  • CASE ... WHEN ... THEN
  • COALESCE
  • IF
  • IFNULL
  • BETWEEN
  • LIKE
  • IN
  • IS
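
A minimal sketch combining a few of these:

SELECT
  CASE WHEN x > 2 THEN 'big' ELSE 'small' END AS size,
  IF(x BETWEEN 1 AND 2, 'low', 'high') AS band,
  IFNULL(NULL, x) AS fallback,
  COALESCE(NULL, NULL, x) AS first_non_null
FROM UNNEST([1, 2, 3]) AS x
WHERE x IN (1, 2, 3) AND x IS NOT NULL;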