Big Data keeps evolving. Stone Age Hadoop was a lot of Java boilerplate for defining HDFS access, Mappers and Reducers. It was superseded by Bronze Age Spark, which provided a succinct Scala unification of:
- ML pipelines
- in-memory structured DataSets over RDDs via a SparkSession SQL API
- Distributed Streams
(Note: you can run such jobs easily, and scale them dynamically, on Google Dataproc.) Technology keeps evolving - the Big-Iron Age has arrived in the form of Google Cloud Platform's SPARK KILLER - a next-gen Big Data stack, consisting of:
- BigQuery - massively parallel, blazing fast data analytics (see https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-bigquery-query)
- TensorFlow - Deep Learning framework (the biggest Machine Learning breakthrough of the decade), distributed and accelerated on custom TPU 'TensorChips'
- DataFlow - (Apache Beam) a unified batch/streaming windowing paradigm with a clean, rich API to match (see https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison) - you build up a transform graph connecting sources and sinks, many of which come out of the box (eg Pub/Sub, BigQuery)
- DataLab - Jupyter Notebooks with some extra IPython integration-convenience magic: %%sql, %%bigquery, %%storage, %%chart, %%mlalpha, %tensorboard, %%monitoring
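For example, a Datalab cell (a sketch from memory - I'm assuming the %%sql magic takes a legacy-SQL query as the cell body; the table is the public Shakespeare sample):
%%sql
SELECT word, word_count
FROM [publicdata:samples.shakespeare]
LIMIT 5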
BigQuery has 4 components which you can read about here: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood
- Borg - Resource Manager
- Colossus - Distributed FS
- Jupiter - 1 Pb/s network
- Dremel - query engine (Apache Drill is an independent open-source implementation of the Dremel paper, not the engine itself)
Anyhow, IMHO Spark cannot compete, but I want to focus here on the BigQuery API. Here's my QuickRef:
- cancel - Request cancellation of a job, optionally waiting for it to complete
bq --nosync cancel job_id
- cp - Copies a table
bq cp dataset.old_table dataset2.new_table
- extract - extract source_table into Cloud Storage destination_uris.
bq extract ds.summary gs://mybucket/summary.csv
- head - Displays rows in a table. params: --[no]job, --max_rows, --start_row, --[no]table
bq head -s 5 -n 10 dataset.table
- init - Authenticate and create a default .bigqueryrc file
- insert - Insert JSON rows (from file or string) into a table
bq insert dataset.table /tmp/mydata.json
echo '{"a":1, "b":2}' | bq insert dataset.table
- load - load source file / uris into destination_table, with optional json schema file / string
bq load ds.new_tbl ./info.csv ./info_schema.json
bq load ds.small gs://mybucket/small.csv name:integer,value:string
- ls - List objects contained in a project or dataset. Flags: -j show jobs, -p show projects, -a show all
bq ls mydataset
bq ls --filter labels.color:red -a -j -p -n 1000
- mk - Create a dataset, table or view
bq mk -d --data_location=EU new_dataset
bq --dataset_id=new_dataset mk -t new_dataset.newtable name:integer,value:string
bq mk --view='select 1 as num' new_dataset.newview
- mkdef - Emits JSON definition for a GCS backed table.
bq mkdef 'gs://bucket/file.csv' field1:integer,field2:string
- partition - Copies source tables of the format <source_table_prefix> into partitioned tables, with the date suffix of the source tables becoming the partition date of the destination table partitions.
bq partition dataset1.sharded_ dataset2.partitioned_table
- query - Execute a query
bq query 'select count(*) from publicdata:samples.shakespeare'
- rm - Delete dataset / table (-d / -t flags signify target type, -f force, -r remove all tables of a dataset)
bq rm ds.table
bq rm -r -f old_dataset
- shell - Start an interactive bq session.
- show - Show all information about an object.
- update - Updates a dataset, table or view.
bq update --description "My table" existing_dataset.existing_table
bq update -t existing_dataset.existing_table name:integer,value:string
bq update --view='select 1 as num' existing_dataset.existing_view --view_udf_resource=path/to/file.js
- version
- wait - Wait up to a given number of seconds for a job to finish
bq wait --fail_on_error job_id 100
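These compose - eg a sketch of an async workflow (ds.out and the query are made up; --nosync prints a job id, which is what you then pass to wait or cancel):
bq --nosync query --destination_table=ds.out 'select 17 as lucky'
bq wait --fail_on_error job_id 100
Now on to Standard SQL - first, query expression syntax: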
query_expr:
[ WITH with_query_name AS ( select ) [, ...] ]
{ select | ( query_expr ) | query_expr set_op query_expr }
[ ORDER BY expression [{ ASC | DESC }] [, ...] ]
[ LIMIT count [ OFFSET skip_rows ] ]
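For example, a minimal query_expr exercising WITH, a set op, ORDER BY and LIMIT (all inline data, no real tables):
WITH nums AS (
  SELECT 1 AS x UNION ALL
  SELECT 2 UNION ALL
  SELECT 3
)
SELECT x FROM nums
ORDER BY x DESC
LIMIT 2 OFFSET 1;  -- skips 3, returns 2 then 1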
SELECT [{ ALL | DISTINCT }]
{ [ expression. ]* [ EXCEPT ( column_name [, ...] ) ]
[ REPLACE ( expression [ AS ] column_name [, ...] ) ]
| expression [ [ AS ] alias ] } [, ...]
[ FROM from_item [, ...] ]
[ WHERE bool_expression ]
[ GROUP BY expression [, ...] ]
[ HAVING bool_expression ]
[ WINDOW window_name AS ( window_definition ) [, ...] ]
- ALL - returns all rows (default), including duplicate rows - opt out with DISTINCT
- EXCEPT - specifies columns to exclude from the result
- REPLACE - replaces a column's value with an expression
- HAVING - similar to WHERE, but evaluated over GROUP BY aggs
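A quick sketch of EXCEPT and REPLACE over an inline table (t, a, b, c are made up):
WITH t AS (SELECT 1 AS a, 2 AS b, 3 AS c)
SELECT * EXCEPT (c) REPLACE (a * 10 AS a) FROM t;  -- a=10, b=2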
set_op: UNION { ALL | DISTINCT } | INTERSECT DISTINCT | EXCEPT DISTINCT
from_item:
{
table_name [ [ AS ] alias ] |
join |
( select ) [ [ AS ] alias ] |
field_path |
{ UNNEST( array_expression ) | UNNEST( array_path ) | array_path }
[ [ AS ] alias ] [ WITH OFFSET [ [ AS ] alias ] ] |
with_query_name [ [ AS ] alias ]
}
UNNEST iterates through an ARRAY and produces one row for each ARRAY element. WITH OFFSET adds an index column to the UNNEST rows.
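eg (inline array, made-up aliases):
SELECT elem, pos
FROM UNNEST(['a', 'b', 'c']) AS elem WITH OFFSET AS pos;  -- ('a',0), ('b',1), ('c',2)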
from_item join_type JOIN from_item
[ { ON bool_expression | USING ( join_column [, ...] ) } ]
The **USING** clause specifies a column list of one or more columns that occur in both input tables; it performs an equality comparison on them.
join_type:
{ [INNER] | CROSS | FULL [OUTER] | LEFT [OUTER] | RIGHT [OUTER] }
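eg a JOIN ... USING over two inline tables (people and scores are made up):
WITH people AS (SELECT 1 AS id, 'ada' AS name),
     scores AS (SELECT 1 AS id, 99 AS score)
SELECT id, name, score
FROM people JOIN scores USING (id);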
- INT64
- FLOAT64
- BOOL
- STRING
- BYTES
- DATE
- TIME
- TIMESTAMP
- ARRAY
- STRUCT
eg
ARRAY<STRUCT<ARRAY<STRUCT<a int64, b string>>>>
- a STRUCT sits between the 2 ARRAYs because an ARRAY cannot directly contain another ARRAY!
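A sketch of constructing a value of that shape (the field name pairs is made up):
SELECT [STRUCT([STRUCT(1 AS a, 'x' AS b), STRUCT(2 AS a, 'y' AS b)] AS pairs)] AS nested;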
- CAST
- ANY_VALUE - returns an arbitrary value from the group, or NULL if there are no input rows
- APPROX_COUNT_DISTINCT - statistical estimate for big data
- APPROX_QUANTILES - get approximate boundaries for n quantiles
- APPROX_TOP_COUNT - get the approximate top n values of an expression, with their counts
- APPROX_TOP_SUM - get approximate top n elements, based on the sum of an assigned weight
- ARRAY_AGG - Returns an ARRAY of all expression values
- ARRAY_CONCAT_AGG Concats ARRAYs into a single ARRAY
- AVG, MIN, MAX, SUM
- BIT_AND , BIT_OR, BIT_XOR
- COUNT , COUNT_IF
- LOGICAL_AND, LOGICAL_OR
- STRING_AGG - concats strings
SELECT ARRAY_AGG(x*2) as array_agg
FROM UNNEST([1, 2, 3, 4, 5]) as x;
- CORR - Pearson coefficient of correlation of (x,y) pair set: -1 <= r <= 1
- COVAR_POP - population covariance of (x,y) pair set: -Inf <= r <= +Inf
- COVAR_SAMP - sample covariance of (x,y) pair set: -Inf <= r <= +Inf
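eg the statistical aggs over inline (x, y) pairs:
SELECT CORR(x, y) AS r, COVAR_SAMP(x, y) AS cov
FROM UNNEST([STRUCT(1 AS x, 2 AS y), STRUCT(2 AS x, 4 AS y), STRUCT(3 AS x, 7 AS y)]);  -- r close to 1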
Analytic (window) functions - OLAP time-slice aggregations:
analytic_function_name ( [ argument_list ] )
OVER { window_name | ( [ window_specification ] ) }
window_specification:
[ window_name ]
[ PARTITION BY partition_expression_list ]
[ ORDER BY expression [{ ASC | DESC }] [, ...] ]
[ window_frame_clause ]
window_frame_clause:
{ ROWS | RANGE }
{
{ UNBOUNDED PRECEDING | numeric_expression PRECEDING | CURRENT ROW }
|
{ BETWEEN window_frame_boundary_start AND window_frame_boundary_end }
}
eg ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
window_frame_boundary_start:
{ UNBOUNDED PRECEDING | numeric_expression { PRECEDING | FOLLOWING } | CURRENT ROW }
window_frame_boundary_end:
{ UNBOUNDED FOLLOWING | numeric_expression { PRECEDING | FOLLOWING } | CURRENT ROW }
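eg a centered moving average using the frame clause above (inline data):
SELECT x,
  AVG(x) OVER (ORDER BY x ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS moving_avg
FROM UNNEST([1, 2, 3, 4, 5]) AS x;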
- PARTITION BY - similar to GROUP BY
- RANK - computed ordinal column
- DENSE_RANK - as RANK, but with no gaps after peer (tied) rows
- PERCENT_RANK - percentile rank of a row defined as (RK-1)/(NR-1)
- CUME_DIST - relative rank of a row defined as NP/NR
- NTILE - divides the rows into a constant number of buckets and returns the bucket number
- ROW_NUMBER
- LEAD - window offset: value of a row after the current row
- LAG - window offset: value of a row before the current row
- FIRST_VALUE
- LAST_VALUE
- NTH_VALUE
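eg the numbering/navigation functions side by side (note the RANK gap vs DENSE_RANK):
SELECT x,
  RANK()       OVER (ORDER BY x) AS rnk,   -- 1, 2, 2, 4
  DENSE_RANK() OVER (ORDER BY x) AS drnk,  -- 1, 2, 2, 3
  LAG(x)       OVER (ORDER BY x) AS prev   -- NULL, 10, 20, 20
FROM UNNEST([10, 20, 20, 30]) AS x;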
- ABS
- SIGN
- IS_INF, IS_NAN
- IEEE_DIVIDE(X, Y) - if the result overflows, returns +/-inf; if X=0 and Y=0, returns NaN
- RAND
- SQRT
- POW, POWER
- EXP
- LN, LOG, LOG10
- GREATEST, LEAST
- DIV
- SAFE_DIVIDE(X, Y) - Returns NULL on error
- MOD
- ROUND
- TRUNC
- CEIL
- CEILING
- FLOOR
- COS, COSH, ACOS, ACOSH
- SIN, SINH, ASIN, ASINH
- TAN, TANH, ATAN, ATANH, ATAN2
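eg the divide variants (vanilla X/Y would simply error on division by zero):
SELECT IEEE_DIVIDE(1, 0) AS inf_result,   -- inf
       SAFE_DIVIDE(1, 0) AS null_result,  -- NULL
       ROUND(2.718, 2)   AS rounded;      -- 2.72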
- BYTE_LENGTH
- CHAR_LENGTH
- CONCAT
- ENDS_WITH
- FORMAT
- LENGTH
- LOWER
- LTRIM
- REGEXP_CONTAINS
- REGEXP_EXTRACT - returns the first substring of value that matches the regex (REGEXP_EXTRACT_ALL returns an array of all matches)
- REGEXP_REPLACE
- REPLACE
- RTRIM
- SAFE_CONVERT_BYTES_TO_STRING - Any invalid UTF-8 chars are replaced with Unicode replacement char, U+FFFD.
- SPLIT
- STARTS_WITH
- STRPOS
- SUBSTR
- TRIM
- UPPER
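eg:
SELECT REGEXP_EXTRACT('ab123cd456', r'\d+')     AS first_match,  -- '123'
       REGEXP_EXTRACT_ALL('ab123cd456', r'\d+') AS all_matches,  -- ['123', '456']
       SPLIT('a,b,c', ',')                      AS parts;        -- ['a', 'b', 'c']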
- ARRAY_CONCAT
- ARRAY_LENGTH
- ARRAY_TO_STRING
- GENERATE_ARRAY - range
- GENERATE_DATE_ARRAY
- OFFSET, SAFE_OFFSET
- ORDINAL, SAFE_ORDINAL
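eg (OFFSET is 0-based, ORDINAL is 1-based; the SAFE_ variants return NULL instead of erroring on out-of-range indexes):
SELECT GENERATE_ARRAY(1, 5)                  AS arr,      -- [1, 2, 3, 4, 5]
       ARRAY_LENGTH(GENERATE_ARRAY(1, 5))    AS len,      -- 5
       GENERATE_ARRAY(1, 5)[OFFSET(0)]       AS first_0,  -- 1
       GENERATE_ARRAY(1, 5)[SAFE_ORDINAL(9)] AS oops;     -- NULL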
- CURRENT_DATE
- EXTRACT(part FROM date_expression)
- DATE
- DATE_ADD, DATE_SUB, DATE_DIFF
- DATE_TRUNC
- DATE_FROM_UNIX_DATE
- FORMAT_DATE
- PARSE_DATE
- UNIX_DATE
- DATETIME
- DATETIME_ADD, DATETIME_SUB, DATETIME_DIFF
- TIME
- TIME_ADD, TIME_SUB, TIME_DIFF
- TIME_TRUNC
- CURRENT_TIMESTAMP
- STRING
- TIMESTAMP_ADD, TIMESTAMP_SUB, TIMESTAMP_DIFF
- TIMESTAMP_TRUNC
- FORMAT_TIMESTAMP
- PARSE_TIMESTAMP
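eg a few of the date functions together:
SELECT CURRENT_DATE()                               AS today,
       EXTRACT(YEAR FROM DATE '2016-01-01')         AS yr,        -- 2016
       DATE_ADD(DATE '2016-01-01', INTERVAL 30 DAY) AS later,     -- 2016-01-31
       FORMAT_DATE('%Y%m%d', DATE '2016-01-01')     AS yyyymmdd;  -- '20160101'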
- SESSION_USER
- CASE ... WHEN ... THEN
- COALESCE
- IF
- IFNULL
- BETWEEN
- LIKE
- IN
- IS
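eg the conditionals together over inline data:
SELECT x,
       CASE WHEN x > 0 THEN 'pos' WHEN x < 0 THEN 'neg' ELSE 'zero' END AS label,
       IF(x BETWEEN -1 AND 1, 'small', 'big') AS size,
       COALESCE(NULL, x, 0)                   AS first_non_null
FROM UNNEST([-2, 0, 3]) AS x;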