Skip to content

Instantly share code, notes, and snippets.

@jiankuang
Last active November 29, 2016 19:25
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jiankuang/6a51ee855c41d1aa761cbb865b8a0f72 to your computer and use it in GitHub Desktop.
Save jiankuang/6a51ee855c41d1aa761cbb865b8a0f72 to your computer and use it in GitHub Desktop.
Google Cloud

Concepts

Tables

  • Native tables
  • External tables
  • Views
    • BigQuery's views are logical views, not materialized views, which means that the query that defines the view is re-executed every time the view is queried.
    • Views can only reference other tables and views with the same Dataset location.

Storage

BigQuery stores data in the Capacitor columnar data format.

Slots

Under the hood, analytics throughput is measured in BigQuery slots. A BigQuery slot is a unit of computational capacity required to execute SQL queries. BigQuery automatically calculates how many slots are required by each query, depending on query size and complexity.
To check how many slots your account uses, see Monitoring BigQuery Using Stackdriver.

Loading Data

Preparing Data for Loading

Loading denormalized data

In BigQuery, you typically should denormalize the data structure in order to enable superfast querying. While JOINs on small datasets are possible with BigQuery, they're not as performant as a denormalized structure.

Transforming XML data

When dealing with large XML files, it's important to not use a DOM-based parser on the XML file, as the parser will attempt to load the entire file as an in-memory tree.
Instead, we suggest using a pull-based parser or a SAX parser, which are more efficient. Parsing a large XML file is simple when using Python.

Import Data from Avro file

Schema automatically generated for Avro files

Querying Data

Legacy SQL

Legacy SQL & Standard SQL

To enable standard SQL: Click Compose Query -> Click Show Options -> Uncheck the Use Legacy SQL checkbox.

Data Types

STRING, BYTES, INTEGER, FLOAT, BOOLEAN, RECORD, TIMESTAMP, DATE, TIME, DATETIME

TOP() examples

TOP() is equal to GROUP BY...ORDER BY...LIMIT, but the TOP query will execute much faster.

  • with TOP():
SELECT
  TOP(word, 10), COUNT(*)
FROM
  ds.Table
WHERE
  word contains 'th';
  • without TOP():
SELECT
  word, COUNT(*) AS cnt
FROM
  ds.Table
WHERE
  word CONTAINS 'th'
GROUP BY
  word
ORDER BY
  cnt DESC LIMIT 10;

Table Decorators

Snapshot decorators

You can use snapshot decorators to undelete a table within 2 days of table deletion.

bq Command-Line Tool

bq load --autodetect <destination_table> <data_source_uri>
Enable autodetection of schema and options for formats—such as CSV and JSON—that are not self-describing. The default value is --noautodetect.
It can even automatically convert 'yes|no' to true|false
eg: bq load --autodetect healthcare.business_rules BusinessRules.csv

The relationship between BigQuery and Google Analytics

  • Google Analytics Premium can export website data into BigQuery
  • Not sure if Google Analytics can import data from BigQuery

Cloud Bigtable and other storage options

Google Cloud SQL : full SQL support for an online transaction processing (OLTP) system (not our use case)
Google BigQuery : interactive querying in an online analytical processing (OLAP) system (maybe our use case)
Google Cloud Storage : store immutable blobs larger than 10 MB, such as large images or movies (not our use case)
Cloud Datastore : store highly structured objects, or if you require support for ACID transactions and SQL-like queries (not sure)

SSH
gcloud compute ssh jian_kuang@compute-engine-instance-with-bigtable-access
Copy
gcloud compute copy-files Anaconda2-4.2.0-Linux-x86_64.sh jian_kuang@genia-datalab-compute:Anaconda2-4.2.0-Linux-x86_64.sh --zone us-central1-b

Create three nodes container engine:
gcloud container clusters create guestbook --num-nodes 3
Deploy NGINX in Kubernetes

kubectl run nginx --image=nginx --replicas=3

kubectl get pods -owide

expose the NGINX cluster as an external service:

kubectl expose deployment nginx --port=80 --target-port=80 --type=LoadBalancer

kubectl get service nginx
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment