Big Data keeps evolving. Stone Age Hadoop was a lot of Java boilerplate for defining HDFS access, Mappers and Reducers. It was superseded by Bronze Age Spark, which provided a succinct Scala unification of:
- ML pipelines
- in-memory structured DataSets over RDDs via a SparkSession SQL API
- Distributed Streams
(Note: you can run such jobs easily, and scale them dynamically, on Google Dataproc.) Technology keeps evolving - the Big-Iron Age has arrived in the form of Google Cloud Platform's SPARK KILLER - a next-gen Big Data stack, consisting of:
- BigQuery - massively parallel, blazing fast data analytics (see https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-bigquery-query)
- TensorFlow - Deep Learning framework (the biggest Machine Learning breakthrough of the decade), distributed and accelerated on custom TPU 'TensorChips'
- DataFlow - (Apache Beam) a unified batch/streaming windowing paradigm with a clean, rich API to match (see https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison) - you build up a transform graph connecting sources and sinks, many of which come out of the box (eg Pub/Sub, BigQuery)
- DataLab - Jupyter Notebooks with some extra IPython integration-convenience magic: %%sql, %%bigquery, %%storage, %%chart, %%mlalpha, %tensorboard, %%monitoring
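For example, a Datalab cell (a sketch from memory - I'm assuming the %%sql magic takes a legacy-SQL query as the cell body; the table is the public Shakespeare sample):
%%sql
SELECT word, word_count
FROM [publicdata:samples.shakespeare]
LIMIT 5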
BigQuery has 4 components which you can read about here: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood
- Borg - Resource Manager
- Colossus - Distributed FS
- Jupiter - 1 Pb/s network
- Dremel - query engine (Apache Drill is an independent open-source implementation of the Dremel paper, not the engine itself)
Anyhow, IMHO Spark cannot compete, but I want to focus here on the BigQuery API. Here's my QuickRef:
- cancel - Request cancellation of a job, optionally waiting for it to complete
bq --nosync cancel job_id
- cp - Copies a table
bq cp dataset.old_table dataset2.new_table
- extract - extract source_table into Cloud Storage destination_uris.
bq extract ds.summary gs://mybucket/summary.csv
- head - Displays rows in a table. params: --[no]job, --max_rows, --start_row, --[no]table
bq head -s 5 -n 10 dataset.table
- init - Authenticate and create a default .bigqueryrc file
- insert - Insert JSON rows (from file or string) into a table
bq insert dataset.table /tmp/mydata.json
echo '{"a":1, "b":2}' | bq insert dataset.table
- load - load source file / uris into destination_table, with optional json schema file / string
bq load ds.new_tbl ./info.csv ./info_schema.json
bq load ds.small gs://mybucket/small.csv name:integer,value:string
- ls - List objects contained in a project or dataset. Flags: -j show jobs, -p show projects, -a show all
bq ls mydataset
bq ls --filter labels.color:red -a -j -p -n 1000
- mk - Create a dataset, table or view
bq mk -d --data_location=EU new_dataset
bq --dataset_id=new_dataset mk -t new_dataset.newtable name:integer,value:string
bq mk --view='select 1 as num' new_dataset.newview
- mkdef - Emits JSON definition for a GCS backed table.
bq mkdef 'gs://bucket/file.csv' field1:integer,field2:string
- partition - Copies source tables of the format <source_table_prefix> into partitioned tables, with the date suffix of the source tables becoming the partition date of the destination table partitions.
bq partition dataset1.sharded_ dataset2.partitioned_table
- query - Execute a query
bq query 'select count(*) from publicdata:samples.shakespeare'
- rm - Delete dataset / table (-d / -t flags signify target type, -f force, -r remove all tables of a dataset)
bq rm ds.table
bq rm -r -f old_dataset
- shell - Start an interactive bq session.
- show - Show all information about an object.
- update - Updates a dataset, table or view.
bq update --description "My table" existing_dataset.existing_table
bq update -t existing_dataset.existing_table name:integer,value:string
bq update --view='select 1 as num' existing_dataset.existing_view --view_udf_resource=path/to/file.js
- version
- wait - Wait up to a given number of seconds for a job to finish
bq wait --fail_on_error job_id 100
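These compose - eg a sketch of an async workflow (ds.out and the query are made up; --nosync prints a job id, which is what you then pass to wait or cancel):
bq --nosync query --destination_table=ds.out 'select 17 as lucky'
bq wait --fail_on_error job_id 100
Now on to Standard SQL - first, query expression syntax: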
query_expr:
[ WITH with_query_name AS ( select ) [, ...] ]
{ select | ( query_expr ) | query_expr set_op query_expr }
[ ORDER BY expression [{ ASC | DESC }] [, ...] ]
[ LIMIT count [ OFFSET skip_rows ] ]
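For example, a minimal query_expr exercising WITH, a set op, ORDER BY and LIMIT (all inline data, no real tables):
WITH nums AS (
  SELECT 1 AS x UNION ALL
  SELECT 2 UNION ALL
  SELECT 3
)
SELECT x FROM nums
ORDER BY x DESC
LIMIT 2 OFFSET 1;  -- skips 3, returns 2 then 1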
SELECT [{ ALL | DISTINCT }]
{ [ expression. ]* [ EXCEPT ( column_name [, ...] ) ]
[ REPLACE ( expression [ AS ] column_name [, ...] ) ]
| expression [ [ AS ] alias ] } [, ...]
[ FROM from_item [, ...] ]
[ WHERE bool_expression ]
[ GROUP BY expression [, ...] ]
[ HAVING bool_expression ]
[ WINDOW window_name AS ( window_definition ) [, ...] ]
- ALL - returns all rows (default), including duplicate rows - opt out with DISTINCT
- EXCEPT - specifies columns to exclude from the result
- REPLACE - replaces a column's value with an expression
- HAVING - similar to WHERE, but evaluated over GROUP BY aggs
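A quick sketch of EXCEPT and REPLACE over an inline table (t, a, b, c are made up):
WITH t AS (SELECT 1 AS a, 2 AS b, 3 AS c)
SELECT * EXCEPT (c) REPLACE (a * 10 AS a) FROM t;  -- a=10, b=2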
set_op: UNION { ALL | DISTINCT } | INTERSECT DISTINCT | EXCEPT DISTINCT
from_item:
{
table_name [ [ AS ] alias ] |
join |
( select ) [ [ AS ] alias ] |
field_path |
{ UNNEST( array_expression ) | UNNEST( array_path ) | array_path }
[ [ AS ] alias ] [ WITH OFFSET [ [ AS ] alias ] ] |
with_query_name [ [ AS ] alias ]
}
UNNEST iterates through an ARRAY and produces one row for each ARRAY element. WITH OFFSET adds an index column to the UNNEST rows.
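eg (inline array, made-up aliases):
SELECT elem, pos
FROM UNNEST(['a', 'b', 'c']) AS elem WITH OFFSET AS pos;  -- ('a',0), ('b',1), ('c',2)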
from_item join_type JOIN from_item
[ { ON bool_expression | USING ( join_column [, ...] ) } ]
The **USING** clause specifies a column list of one or more columns that occur in both input tables; it performs an equality comparison on them.
join_type:
{ [INNER] | CROSS | FULL [OUTER] | LEFT [OUTER] | RIGHT [OUTER] }
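eg a JOIN ... USING over two inline tables (people and scores are made up):
WITH people AS (SELECT 1 AS id, 'ada' AS name),
     scores AS (SELECT 1 AS id, 99 AS score)
SELECT id, name, score
FROM people JOIN scores USING (id);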
- INT64
- FLOAT64
- BOOL
- STRING
- BYTES
- DATE
- TIME
- TIMESTAMP
- ARRAY
- STRUCT
eg
ARRAY<STRUCT<ARRAY<STRUCT<a int64, b string>>>>
- a STRUCT sits between the 2 ARRAYs because an ARRAY cannot directly contain another ARRAY!
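A sketch of constructing a value of that shape (the field name pairs is made up):
SELECT [STRUCT([STRUCT(1 AS a, 'x' AS b), STRUCT(2 AS a, 'y' AS b)] AS pairs)] AS nested;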
- CAST
- ANY_VALUE - returns an arbitrary value from the group, or NULL if there are no input rows
- APPROX_COUNT_DISTINCT - statistical estimate for big data
- APPROX_QUANTILES - get approximate boundaries for n quantiles
- APPROX_TOP_COUNT - get the approximate top n values of an expression, with their counts
- APPROX_TOP_SUM - get approximate top n elements, based on the sum of an assigned weight
- ARRAY_AGG - Returns an ARRAY of all expression values
- ARRAY_CONCAT_AGG Concats ARRAYs into a single ARRAY
- AVG, MIN, MAX, SUM
- BIT_AND , BIT_OR, BIT_XOR
- COUNT , COUNT_IF
- LOGICAL_AND, LOGICAL_OR
- STRING_AGG - concats strings
SELECT ARRAY_AGG(x*2) as array_agg
FROM UNNEST([1, 2, 3, 4, 5]) as x;
- CORR - Pearson coefficient of correlation of (x,y) pair set: -1 <= r <= 1
- COVAR_POP - population covariance of (x,y) pair set: -Inf <= r <= +Inf
- COVAR_SAMP - sample covariance of (x,y) pair set: -Inf <= r <= +Inf
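eg the statistical aggs over inline (x, y) pairs:
SELECT CORR(x, y) AS r, COVAR_SAMP(x, y) AS cov
FROM UNNEST([STRUCT(1 AS x, 2 AS y), STRUCT(2 AS x, 4 AS y), STRUCT(3 AS x, 7 AS y)]);  -- r close to 1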
Analytic (window) functions - OLAP time-slice aggregations:
analytic_function_name ( [ argument_list ] )
OVER { window_name | ( [ window_specification ] ) }
window_specification:
[ window_name ]
[ PARTITION BY partition_expression_list ]
[ ORDER BY expression [{ ASC | DESC }] [, ...] ]
[ window_frame_clause ]
window_frame_clause:
{ ROWS | RANGE }
{
{ UNBOUNDED PRECEDING | numeric_expression PRECEDING | CURRENT ROW }
|
{ BETWEEN window_frame_boundary_start AND window_frame_boundary_end }
}
eg ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
window_frame_boundary_start:
{ UNBOUNDED PRECEDING | numeric_expression { PRECEDING | FOLLOWING } | CURRENT ROW }
window_frame_boundary_end:
{ UNBOUNDED FOLLOWING | numeric_expression { PRECEDING | FOLLOWING } | CURRENT ROW }
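eg a centered moving average using the frame clause above (inline data):
SELECT x,
  AVG(x) OVER (ORDER BY x ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS moving_avg
FROM UNNEST([1, 2, 3, 4, 5]) AS x;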
- PARTITION BY - similar to GROUP BY
- RANK - computed ordinal column
- DENSE_RANK - as RANK, but with no gaps after peer (tied) rows
- PERCENT_RANK - percentile rank of a row defined as (RK-1)/(NR-1)
- CUME_DIST - relative rank of a row defined as NP/NR
- NTILE - divides the rows into a constant number of buckets and returns the bucket number
- ROW_NUMBER
- LEAD - window offset: value of a row after the current row
- LAG - window offset: value of a row before the current row
- FIRST_VALUE
- LAST_VALUE
- NTH_VALUE
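eg the numbering/navigation functions side by side (note the RANK gap vs DENSE_RANK):
SELECT x,
  RANK()       OVER (ORDER BY x) AS rnk,   -- 1, 2, 2, 4
  DENSE_RANK() OVER (ORDER BY x) AS drnk,  -- 1, 2, 2, 3
  LAG(x)       OVER (ORDER BY x) AS prev   -- NULL, 10, 20, 20
FROM UNNEST([10, 20, 20, 30]) AS x;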
- ABS
- SIGN
- IS_INF, IS_NAN
- IEEE_DIVIDE(X, Y) - if the result overflows, returns +/-inf; if X=0 and Y=0, returns NaN
- RAND
- SQRT
- POW, POWER
- EXP
- LN, LOG, LOG10
- GREATEST, LEAST
- DIV
- SAFE_DIVIDE(X, Y) - Returns NULL on error
- MOD
- ROUND
- TRUNC
- CEIL
- CEILING
- FLOOR
- COS, COSH, ACOS, ACOSH
- SIN, SINH, ASIN, ASINH
- TAN, TANH, ATAN, ATANH, ATAN2
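eg the divide variants (vanilla X/Y would simply error on division by zero):
SELECT IEEE_DIVIDE(1, 0) AS inf_result,   -- inf
       SAFE_DIVIDE(1, 0) AS null_result,  -- NULL
       ROUND(2.718, 2)   AS rounded;      -- 2.72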
- BYTE_LENGTH
- CHAR_LENGTH
- CONCAT
- ENDS_WITH
- FORMAT
- LENGTH
- LOWER
- LTRIM
- REGEXP_CONTAINS
- REGEXP_EXTRACT - returns the first substring of value that matches the regex (REGEXP_EXTRACT_ALL returns an array of all matches)
- REGEXP_REPLACE
- REPLACE
- RTRIM
- SAFE_CONVERT_BYTES_TO_STRING - Any invalid UTF-8 chars are replaced with Unicode replacement char, U+FFFD.
- SPLIT
- STARTS_WITH
- STRPOS
- SUBSTR
- TRIM
- UPPER
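eg:
SELECT REGEXP_EXTRACT('ab123cd456', r'\d+')     AS first_match,  -- '123'
       REGEXP_EXTRACT_ALL('ab123cd456', r'\d+') AS all_matches,  -- ['123', '456']
       SPLIT('a,b,c', ',')                      AS parts;        -- ['a', 'b', 'c']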
- ARRAY_CONCAT
- ARRAY_LENGTH
- ARRAY_TO_STRING
- GENERATE_ARRAY - range
- GENERATE_DATE_ARRAY
- OFFSET, SAFE_OFFSET
- ORDINAL, SAFE_ORDINAL
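eg (OFFSET is 0-based, ORDINAL is 1-based; the SAFE_ variants return NULL instead of erroring on out-of-range indexes):
SELECT GENERATE_ARRAY(1, 5)                  AS arr,      -- [1, 2, 3, 4, 5]
       ARRAY_LENGTH(GENERATE_ARRAY(1, 5))    AS len,      -- 5
       GENERATE_ARRAY(1, 5)[OFFSET(0)]       AS first_0,  -- 1
       GENERATE_ARRAY(1, 5)[SAFE_ORDINAL(9)] AS oops;     -- NULL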
- CURRENT_DATE
- EXTRACT(part FROM date_expression)
- DATE
- DATE_ADD, DATE_SUB, DATE_DIFF
- DATE_TRUNC
- DATE_FROM_UNIX_DATE
- FORMAT_DATE
- PARSE_DATE
- UNIX_DATE
- DATETIME
- DATETIME_ADD, DATETIME_SUB, DATETIME_DIFF
- TIME
- TIME_ADD, TIME_SUB, TIME_DIFF
- TIME_TRUNC
- CURRENT_TIMESTAMP
- STRING
- TIMESTAMP_ADD, TIMESTAMP_SUB, TIMESTAMP_DIFF
- TIMESTAMP_TRUNC
- FORMAT_TIMESTAMP
- PARSE_TIMESTAMP
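eg a few of the date functions together:
SELECT CURRENT_DATE()                               AS today,
       EXTRACT(YEAR FROM DATE '2016-01-01')         AS yr,        -- 2016
       DATE_ADD(DATE '2016-01-01', INTERVAL 30 DAY) AS later,     -- 2016-01-31
       FORMAT_DATE('%Y%m%d', DATE '2016-01-01')     AS yyyymmdd;  -- '20160101'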
- SESSION_USER
- CASE ... WHEN ... THEN
- COALESCE
- IF
- IFNULL
- BETWEEN
- LIKE
- IN
- IS
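eg the conditionals together over inline data:
SELECT x,
       CASE WHEN x > 0 THEN 'pos' WHEN x < 0 THEN 'neg' ELSE 'zero' END AS label,
       IF(x BETWEEN -1 AND 1, 'small', 'big') AS size,
       COALESCE(NULL, x, 0)                   AS first_non_null
FROM UNNEST([-2, 0, 3]) AS x;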