Preparation

BigQuery

  • automatic schema detection. It is a useful feature if the data schema changes occasionally.
  • check analytics functions here
  • row_number window function. See here. A good way to filter duplicate records: partition by the unique id and keep only the rows where the row number equals one (see the sketch after this list).
    • a good tutorial about row_number, rank, and dense_rank functions.
  • make sure the data encoding is correct when importing an external data source. See here. The data will still load successfully if the encoding is wrong, but the imported data will not match the source byte-for-byte.
  • federated (external) data source. See here.
    • Save dirty data in Cloud Storage and load transformed data into BigQuery.
    • Save a small amount of frequently updated data in Cloud Storage to avoid reloading all the data.
  • run large-scale SQL aggregations
  • use Google Stackdriver Audit Logs for access control. See here and here
  • access control is at the dataset level; BigQuery does not manage access to individual tables or views within a dataset
  • use a view in a separate dataset as a mechanism to control access to a table's data: create a view over the table, but place the view in a dataset different from the table's dataset.
  • a columnar database.
  • append-only: existing values can't be changed or updated.
  • a list of window functions
  • to ensure data consistency when streaming inserts, you can supply an insertId for each inserted row (see the sketch after this list)
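
A minimal sketch of both dedupe angles, assuming the google-cloud-bigquery Python client; the `my_project.my_dataset.events` table and its `id`/`updated_at` columns are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Query-time dedupe: keep one row per unique id via ROW_NUMBER() over a partition.
dedup_sql = """
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM `my_project.my_dataset.events`
)
WHERE rn = 1
"""
for row in client.query(dedup_sql).result():
    print(row)

# Ingest-time consistency: row_ids are sent as insertId values, which
# BigQuery uses to best-effort dedupe retried streaming inserts.
errors = client.insert_rows_json(
    "my_project.my_dataset.events",
    [{"id": "a1", "updated_at": "2019-08-01T00:00:00Z"}],
    row_ids=["a1"],
)
assert not errors
```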

Dataproc

  • It is similar to AWS EMR. It is suitable for Hadoop/Spark-style map-reduce data processing.
  • It is recommended to keep clusters job-specific: one ephemeral cluster per job (see the sketch below).
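
A sketch of the ephemeral one-cluster-per-job pattern, assuming the google-cloud-dataproc Python client; the project, region, machine types, and jar path are placeholders:

```python
from google.cloud import dataproc_v1

project, region, name = "my-project", "us-central1", "ephemeral-job-cluster"
opts = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
clusters = dataproc_v1.ClusterControllerClient(client_options=opts)
jobs = dataproc_v1.JobControllerClient(client_options=opts)

# 1. Create a cluster sized for this one job.
cluster = {
    "project_id": project,
    "cluster_name": name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
clusters.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster}
).result()

# 2. Submit the Spark job and block until it finishes.
job = {
    "placement": {"cluster_name": name},
    "spark_job": {"main_jar_file_uri": "gs://my-bucket/my-job.jar"},
}
jobs.submit_job_as_operation(
    request={"project_id": project, "region": region, "job": job}
).result()

# 3. Tear the cluster down; the next job gets a fresh one.
clusters.delete_cluster(
    request={"project_id": project, "region": region, "cluster_name": name}
).result()
```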

Pub/Sub

  • Windows
    • session window
    • global window
    • sliding window
    • tumbling window (fixed size)
  • Ordering
    • ordering is not guaranteed; for the log-message use case, use the message timestamp to restore order
    • at-least-once delivery
    • Subscribers receive only messages published after the subscription is created. Earlier messages are lost.
    • push model
      • real-time performance.
      • uses the HTTP response as an implicit ack
      • Pub/Sub sends the message to a pre-configured subscriber endpoint.
    • pull model
      • the subscriber sends a pull request and Pub/Sub returns the message along with an ackId
      • after processing the message, the subscriber sends a request back with the same ackId to acknowledge it.
  • Deal with duplicates from the publisher side (see the sketch after this list):
    • attach a unique id attribute to each message sent to Pub/Sub
    • tell Dataflow about the unique id attribute when processing messages from Pub/Sub.
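
A sketch of that dedupe handshake, assuming the google-cloud-pubsub client and Beam's Python SDK; the topic, subscription, and the `unique_id` attribute name are placeholders:

```python
import uuid

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1

# Publisher side: attach a unique id attribute to every message.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")
publisher.publish(topic_path, b"payload", unique_id=str(uuid.uuid4())).result()

# Dataflow side: point the source at that attribute so the runner
# can discard duplicate deliveries of the same logical message.
with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    messages = p | beam.io.ReadFromPubSub(
        subscription="projects/my-project/subscriptions/my-sub",
        id_label="unique_id",
    )
```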

Bigtable

  • Bigtable can store and analyze time series data or data with a natural semantic ordering
  • do small range-scan lookups, getting a small number of rows out of TBs of data.
  • See schema design best practice here and avoid hotspotting.
  • tall and narrow table: good for scanning the data
  • short and wide table: good for getting individual rows
  • instance types: production vs. development?
  • Why Bigtable and not Cloud Spanner? Cost! Note that we can support 100,000 qps with 10 nodes in Bigtable, but would need ~150 nodes in Cloud Spanner.
  • performance increases linearly as you add nodes to the cluster.
  • Scalable, fast NoSQL with auto-balancing
  • data set size is larger than 1TB
  • unstructured key-value data, with values smaller than 10 MB each.
  • separate storage and query processing.
    • data is stored at one place
    • set up separate nodes to process queries for retrieving data.
    • no data loss when query processing nodes go down
    • fast recovery process for query processing nodes
  • operations are atomic at the row level. Avoid schema designs that require atomicity across rows.
  • group related columns into column families for better performance.
  • Queries that use the row key, a row prefix, or a row range are the most efficient
  • Why reverse the timestamp? So that the ascending order of row keys puts the latest records at the top of the table. One approach to getting a reverse timestamp is to compute LONG_MAX - timestamp.millisecondsSinceEpoch() (see the sketch after this list)
  • Distribute the writing load between tablets while allowing common queries to return consecutive rows
  • Note that ensuring an even distribution of reads has taken priority over evenly distributing storage across the cluster
  • Cloud Bigtable tries to store roughly the same amount of data on each Cloud Bigtable node
  • SSD: 10,000 QPS at 6ms
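
A sketch of a reverse-timestamp row key and a single-row write, assuming the google-cloud-bigtable Python client; the project, instance, table, and column-family names are placeholders:

```python
import datetime

from google.cloud import bigtable

LONG_MAX = 2**63 - 1  # Java's Long.MAX_VALUE, per the note above

def reverse_ts_key(device_id: str, ts: datetime.datetime) -> bytes:
    """Newest records sort first when row keys ascend."""
    millis = int(ts.timestamp() * 1000)
    return f"{device_id}#{LONG_MAX - millis}".encode()

table = bigtable.Client(project="my-project").instance("my-instance").table("metrics")

# Writes to a single row are atomic; "measurements" is the column family
# that groups related columns together.
row = table.direct_row(reverse_ts_key("sensor-42", datetime.datetime.utcnow()))
row.set_cell("measurements", "temp", b"21.5")
row.commit()
```

Prefixing the key with the device id spreads writes across tablets while keeping each device's newest rows contiguous, so common prefix scans still return consecutive rows.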

Dataflow

  • It doesn't support Spark.
  • PCollection: the immutable, distributed dataset abstraction that pipelines operate on.
  • Prefer Combine to GroupByKey, since Combine can be done in stages: the runner can aggregate results locally before shuffling.
  • Implement a custom Combine function by extending CombineFn (see the sketch after this list).
  • It’s at the group-by-key stage that the window has an impact.
  • In the real world, the watermark tracks how far behind the system is: it is the age of the oldest unprocessed record. The watermark tells us when, in processing time, the event-time windows are expected to be complete, and therefore when we can trigger the aggregation
  • configure triggering around the watermark so the pipeline can emit speculative partial aggregations.
  • configure triggering around the watermark so the pipeline can handle late data.
  • Where in Event Time to compute?
  • When in Processing Time to emit?
  • Triggers control when a result should be emitted for a window.
  • The watermark is based on arrival time into Pub/Sub, which is why Dataflow can guarantee whether or not there will be any late records.
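
A sketch pulling several of these notes together, assuming Beam's Python SDK: a custom CombineFn mean that aggregates in stages, applied per fixed (tumbling) window, with an early speculative firing and a late firing configured around the watermark. The key names, timestamps, and durations are placeholders:

```python
import apache_beam as beam
from apache_beam.transforms import trigger, window

class MeanFn(beam.CombineFn):
    """A mean that combines in stages: partial (sum, count) pairs merge cheaply."""
    def create_accumulator(self):
        return 0.0, 0
    def add_input(self, acc, x):
        s, n = acc
        return s + x, n + 1
    def merge_accumulators(self, accs):
        sums, counts = zip(*accs)
        return sum(sums), sum(counts)
    def extract_output(self, acc):
        s, n = acc
        return s / n if n else float("nan")

with beam.Pipeline() as p:
    means = (
        p
        | beam.Create([("sensor-1", 21.5), ("sensor-1", 22.0)])
        | beam.Map(lambda kv: window.TimestampedValue(kv, 1566000000))
        | beam.WindowInto(
            window.FixedWindows(60),                    # where in event time
            trigger=trigger.AfterWatermark(             # when in processing time
                early=trigger.AfterProcessingTime(30),  # speculative partial result
                late=trigger.AfterCount(1),             # re-emit on late data
            ),
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
            allowed_lateness=600,
        )
        | beam.CombinePerKey(MeanFn())  # Combine, not GroupByKey
    )
```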

Spanner

  • use CPU utilization as a metric to scale the service.
  • scalable SQL.
  • it is expensive.

Cloud SQL

  • less than 10 TB

Speech to Text

  • synchronous recognition for audio shorter than 1 minute
  • asynchronous (long-running) recognition for audio longer than 1 minute (see the sketch after this list)
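
A minimal sketch of both request styles, assuming the google-cloud-speech Python client; the bucket URIs and audio settings are placeholders:

```python
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

# Synchronous: blocks and returns directly, for audio under ~1 minute.
short = speech.RecognitionAudio(uri="gs://my-bucket/short.wav")
response = client.recognize(config=config, audio=short)

# Asynchronous: returns a long-running operation, for longer audio.
long_audio = speech.RecognitionAudio(uri="gs://my-bucket/long.wav")
operation = client.long_running_recognize(config=config, audio=long_audio)
result = operation.result(timeout=600)
```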

Data Transfer

  • Transfer Appliance: transfer data using hardware appliances shipped to the customer's data center.
  • Storage Transfer Service: transfer data into Cloud Storage over the network.

Cloud Storage

  • Nearline
    • for data accessed less than once a month
    • good for backup
  • Coldline
    • for data accessed less than once a year