Preparation

BigQuery

  • automatic schema detection. It is a useful feature if the data schema changes occasionally.
  • check analytics functions here
  • row_number window function. See here. A good way to filter duplicate records: partition by the unique id and keep only the rows where the row number equals one (see the sketch after this list).
    • a good tutorial about row_number, rank, and dense_rank functions.
  • make sure the data encoding is correct when importing an external data source. See here. The data will still load successfully if the encoding is wrong, but the imported data will not match the source byte-for-byte.
  • federated (external) data source. See here.
    • Save dirty data in Cloud Storage and load transformed data into BigQuery.
    • Save a small amount of frequently updated data in Cloud Storage to avoid reloading all the data.
  • run large-scale SQL aggregations
  • use Google Stackdriver Audit Logs for access control. See here and here
  • access control is at the dataset level; BigQuery does not manage access to individual tables or views within a dataset
  • use a view in a separate dataset as a mechanism to control access to a table's data: create a view over the table, but place the view in a dataset different from the table's dataset.
  • a columnar database.
  • append-only: existing values can't be changed or updated.
  • a list of window functions
  • to ensure data consistency when streaming inserts, you can supply an insertId for each inserted row (see the sketch after this list)
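
A minimal sketch of both dedupe angles, assuming the google-cloud-bigquery Python client; the `my_project.my_dataset.events` table and its `id`/`updated_at` columns are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Query-time dedupe: keep one row per unique id via ROW_NUMBER() over a partition.
dedup_sql = """
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM `my_project.my_dataset.events`
)
WHERE rn = 1
"""
for row in client.query(dedup_sql).result():
    print(row)

# Ingest-time consistency: row_ids are sent as insertId values, which
# BigQuery uses to best-effort dedupe retried streaming inserts.
errors = client.insert_rows_json(
    "my_project.my_dataset.events",
    [{"id": "a1", "updated_at": "2019-08-01T00:00:00Z"}],
    row_ids=["a1"],
)
assert not errors
```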

Dataproc

  • It is similar to AWS EMR. It is suitable for Hadoop/Spark-style map-reduce data processing.
  • It is recommended to keep clusters job-specific: one ephemeral cluster per job (see the sketch below).
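
A sketch of the ephemeral one-cluster-per-job pattern, assuming the google-cloud-dataproc Python client; the project, region, machine types, and jar path are placeholders:

```python
from google.cloud import dataproc_v1

project, region, name = "my-project", "us-central1", "ephemeral-job-cluster"
opts = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
clusters = dataproc_v1.ClusterControllerClient(client_options=opts)
jobs = dataproc_v1.JobControllerClient(client_options=opts)

# 1. Create a cluster sized for this one job.
cluster = {
    "project_id": project,
    "cluster_name": name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
clusters.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster}
).result()

# 2. Submit the Spark job and block until it finishes.
job = {
    "placement": {"cluster_name": name},
    "spark_job": {"main_jar_file_uri": "gs://my-bucket/my-job.jar"},
}
jobs.submit_job_as_operation(
    request={"project_id": project, "region": region, "job": job}
).result()

# 3. Tear the cluster down; the next job gets a fresh one.
clusters.delete_cluster(
    request={"project_id": project, "region": region, "cluster_name": name}
).result()
```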

Pub/Sub

  • Windows
    • session window
    • global window
    • sliding window
    • tumbling window (fixed size)
  • Ordering
    • ordering is not guaranteed; for the log-message use case, use the message timestamp to restore order
    • at-least-once delivery
    • Subscribers receive only messages published after the subscription is created. Earlier messages are lost.
    • push model
      • real-time performance.
      • uses the HTTP response as an implicit ack
      • Pub/Sub sends the message to a pre-configured subscriber endpoint.
    • pull model
      • the subscriber sends a pull request and Pub/Sub returns the message along with an ackId
      • after processing the message, the subscriber sends a request back with the same ackId to acknowledge it.
  • Deal with duplicates from the publisher side (see the sketch after this list):
    • attach a unique id attribute to each message sent to Pub/Sub
    • tell Dataflow about the unique id attribute when processing messages from Pub/Sub.
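
A sketch of that dedupe handshake, assuming the google-cloud-pubsub client and Beam's Python SDK; the topic, subscription, and the `unique_id` attribute name are placeholders:

```python
import uuid

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1

# Publisher side: attach a unique id attribute to every message.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")
publisher.publish(topic_path, b"payload", unique_id=str(uuid.uuid4())).result()

# Dataflow side: point the source at that attribute so the runner
# can discard duplicate deliveries of the same logical message.
with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    messages = p | beam.io.ReadFromPubSub(
        subscription="projects/my-project/subscriptions/my-sub",
        id_label="unique_id",
    )
```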

Bigtable

  • Bigtable can store and analyze time series data or data with a natural semantic ordering
  • do small range-scan lookups, getting a small number of rows out of TBs of data.
  • See schema design best practice here and avoid hotspotting.
  • tall and narrow table: good for scanning the data
  • short and wide table: good for getting individual rows
  • instance types: production vs. development?
  • Why Bigtable and not Cloud Spanner? Cost! Note that we can support 100,000 qps with 10 nodes in Bigtable, but would need ~150 nodes in Cloud Spanner.
  • performance increases linearly as you add nodes to the cluster.
  • Scalable, fast NoSQL with auto-balancing
  • data set size is larger than 1TB
  • unstructured key-value data, with values smaller than 10 MB each.
  • separate storage and query processing.
    • data is stored at one place
    • set up separate nodes to process queries for retrieving data.
    • no data loss when query processing nodes go down
    • fast recovery process for query processing nodes
  • operations are atomic at the row level. Avoid schema designs that require atomicity across rows.
  • group related columns into column families for better performance.
  • Queries that use the row key, a row prefix, or a row range are the most efficient
  • Why reverse the timestamp? So that the ascending order of row keys puts the latest records at the top of the table. One approach to getting a reverse timestamp is to compute LONG_MAX - timestamp.millisecondsSinceEpoch() (see the sketch after this list)
  • Distribute the writing load between tablets while allowing common queries to return consecutive rows
  • Note that ensuring an even distribution of reads has taken priority over evenly distributing storage across the cluster
  • Cloud Bigtable tries to store roughly the same amount of data on each Cloud Bigtable node
  • SSD: 10,000 QPS at 6ms
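
A sketch of a reverse-timestamp row key and a single-row write, assuming the google-cloud-bigtable Python client; the project, instance, table, and column-family names are placeholders:

```python
import datetime

from google.cloud import bigtable

LONG_MAX = 2**63 - 1  # Java's Long.MAX_VALUE, per the note above

def reverse_ts_key(device_id: str, ts: datetime.datetime) -> bytes:
    """Newest records sort first when row keys ascend."""
    millis = int(ts.timestamp() * 1000)
    return f"{device_id}#{LONG_MAX - millis}".encode()

table = bigtable.Client(project="my-project").instance("my-instance").table("metrics")

# Writes to a single row are atomic; "measurements" is the column family
# that groups related columns together.
row = table.direct_row(reverse_ts_key("sensor-42", datetime.datetime.utcnow()))
row.set_cell("measurements", "temp", b"21.5")
row.commit()
```

Prefixing the key with the device id spreads writes across tablets while keeping each device's newest rows contiguous, so common prefix scans still return consecutive rows.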

Dataflow

  • It doesn't support Spark.
  • PCollection: the immutable, distributed dataset abstraction that pipelines operate on.
  • Prefer Combine to GroupByKey, since Combine can be done in stages: the runner can aggregate results locally before shuffling.
  • Implement a custom Combine function by extending CombineFn (see the sketch after this list).
  • It’s at the group-by-key stage that the window has an impact.
  • In the real world, the watermark tracks how far behind the system is: it is the age of the oldest unprocessed record. The watermark tells us when, in processing time, the event-time windows are expected to be complete, and therefore when we can trigger the aggregation
  • configure triggering around the watermark so the pipeline can emit speculative partial aggregations.
  • configure triggering around the watermark so the pipeline can handle late data.
  • Where in Event Time to compute?
  • When in Processing Time to emit?
  • Triggers control when a result should be emitted for a window.
  • The watermark is based on arrival time into Pub/Sub, which is why Dataflow can guarantee whether or not there will be any late records.
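
A sketch pulling several of these notes together, assuming Beam's Python SDK: a custom CombineFn mean that aggregates in stages, applied per fixed (tumbling) window, with an early speculative firing and a late firing configured around the watermark. The key names, timestamps, and durations are placeholders:

```python
import apache_beam as beam
from apache_beam.transforms import trigger, window

class MeanFn(beam.CombineFn):
    """A mean that combines in stages: partial (sum, count) pairs merge cheaply."""
    def create_accumulator(self):
        return 0.0, 0
    def add_input(self, acc, x):
        s, n = acc
        return s + x, n + 1
    def merge_accumulators(self, accs):
        sums, counts = zip(*accs)
        return sum(sums), sum(counts)
    def extract_output(self, acc):
        s, n = acc
        return s / n if n else float("nan")

with beam.Pipeline() as p:
    means = (
        p
        | beam.Create([("sensor-1", 21.5), ("sensor-1", 22.0)])
        | beam.Map(lambda kv: window.TimestampedValue(kv, 1566000000))
        | beam.WindowInto(
            window.FixedWindows(60),                    # where in event time
            trigger=trigger.AfterWatermark(             # when in processing time
                early=trigger.AfterProcessingTime(30),  # speculative partial result
                late=trigger.AfterCount(1),             # re-emit on late data
            ),
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
            allowed_lateness=600,
        )
        | beam.CombinePerKey(MeanFn())  # Combine, not GroupByKey
    )
```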

Spanner

  • use CPU utilization as a metric to scale the service.
  • scalable SQL.
  • it is expensive.

Cloud SQL

  • less than 10 TB

Speech to Text

  • synchronous recognition for audio shorter than 1 minute
  • asynchronous (long-running) recognition for audio longer than 1 minute (see the sketch after this list)
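
A minimal sketch of both request styles, assuming the google-cloud-speech Python client; the bucket URIs and audio settings are placeholders:

```python
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

# Synchronous: blocks and returns directly, for audio under ~1 minute.
short = speech.RecognitionAudio(uri="gs://my-bucket/short.wav")
response = client.recognize(config=config, audio=short)

# Asynchronous: returns a long-running operation, for longer audio.
long_audio = speech.RecognitionAudio(uri="gs://my-bucket/long.wav")
operation = client.long_running_recognize(config=config, audio=long_audio)
result = operation.result(timeout=600)
```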

Data Transfer

  • Transfer Appliance: transfer data using hardware appliances shipped to the customer's data center.
  • Storage Transfer Service: transfer data into Cloud Storage over the network.

Cloud Storage

  • Nearline
    • for data accessed less than once a month
    • good for backup
  • Coldline
    • for data accessed less than once a year