Study Guide for GCP Professional Cloud Architect exam (notes from refresher course)

Architecting for the cloud

  • Architect solutions to be scalable and resilient
  • Business requirements involve lowering costs / enhancing user experience
  • Keep an eye on technical needs during development and operation

3 Major Questions To Ask

  1. Where is the company coming from?

    • business, technical, personnel
  2. Where is the company going to?

    • on GCP, hybrid, multi-cloud / regional, national, global
  3. What's next?

    • allow for future changes

Key Data Lifecycle Steps (4)

  1. Ingest - pull in raw data via streaming, batch, or app processes
  2. Store - keep the retrieved data in a durable and accessible environment
  3. Process/Analyze - transform the data into actionable information
  4. Explore/Visualize - convert processed data into shareable, relatable content

Ingesting Data (11 services)

Streaming

  • Cloud Pub/Sub - messaging middleware system

Batch

  • Cloud Storage - object storage in buckets
  • Storage Transfer Service - move data from one place to another
  • BigQuery Transfer Service - move structured data from one place to another
  • Transfer Appliance - move very large amounts of data (physically shipped to the cloud)

Application

  • Cloud Logging - application and service log output
  • Cloud Pub/Sub - event and message streams from applications
  • Cloud SQL - structured data
  • Cloud Firestore - serverless document data for NoSQL data
  • Cloud Bigtable - large amounts of NoSQL data
  • Cloud Spanner - fully managed relational database for structured SQL data
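
A concrete sketch of the streaming and batch ingestion paths above (topic, file, and bucket names are hypothetical):

```sh
# Streaming ingest: publish an event to a Pub/Sub topic
gcloud pubsub topics create ingest-events
gcloud pubsub topics publish ingest-events --message='{"sensor":"a1","temp":21.4}'

# Batch ingest: copy a local export into a Cloud Storage bucket
gsutil cp ./export-2024-01-01.csv gs://example-ingest-bucket/raw/
```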

Storing Data

Objects

  • Cloud Storage
  • Cloud Storage for Firebase - mostly mobile / web apps with some overlap

Databases

  • Cloud SQL - relational DB for MySQL, Postgres, SQL Server
  • Cloud Spanner - large distributed SQL
  • Cloud Bigtable - large NoSQL
  • Cloud Firestore - serverless NoSQL

Warehouse

  • BigQuery - serverless highly-scalable multi-cloud data warehouse

Processing and Analyzing Data

  • big data, ETL pipelines, machine learning

Compute

  • Compute Engine - virtual compute machines
  • Kubernetes Engine - orchestration of containerized workloads
  • App Engine - quickly get apps up and running

Large-Scale

  • Cloud Dataproc - modern data lake, ETL (Hadoop, Spark, Flink, Presto, and 30+ tools/frameworks)
  • Cloud Dataflow - based on Apache Beam
  • Cloud Dataprep - intelligent cloud data service to visually explore, clean, and prepare for analysis/ML

Analysis

  • BigQuery - analyze petabytes of data at incredible speeds with zero operational overhead

Exploring and Visualizing Data

Science

  • Cloud Datalab - uses Jupyter notebooks to explore, analyze, and visualize data

Visualizing

  • BigQuery BI Engine - business intelligence functionality for BQ
  • Cloud Data Studio - dashboards and reports over a host of data sources
  • Looker - front-end enterprise platform for BI, apps, embedded data analytics

Key points:

  • 4 phases: ingest, store, process/analyze, explore/visualize
  • Data ingested via streaming, batch, or application processes
  • Data structure can change, depending on its source and destination
  • Google offers a wide range of services to manage data in every phase of its lifecycle

Overall Principles

Grasping Key Tech Fundamentals

  • Describing distributed systems
  • Core networking fundamentals
  • Applying HTTP/HTTPS
  • Understanding SRE principles

Keeping in Compliance - follow spirit and letter of "the law"

  • Compliance with what?
  • Getting help with compliance
  • Relevant products and services

Annotating Resources Properly

  • Understanding annotation options
  • Applying security marks
  • Working with labels
  • Implementing networking tags
  • Choosing the right annotation

Managing Quotas & Costs

  • Working with quota limits
  • Cost optimization principles
  • Best practices (overall, compute, storage and data analysis)

Key Fundamentals

Distributed System - group of servers working together so as to appear as a single server to the end user

  • Scale Horizontally - increase capacity by adding more servers that work together
  • Scale Vertically - increase capacity by adding more memory or using a faster CPU
  • Sharding - splitting a server into multiple servers, a.k.a. "partitioning"

Networking - be familiar with 7-layer OSI model

  • 7 Layer OSI model
    • Application - End user layer (human comp interaction): HTTP, FTP, IRC, SSH, DNS
    • Presentation - Syntax layer: SSL, SSH, IMAP, FTP, MPEG, JPEG
    • Session - Sync and send to port: APIs, Sockets, WinSock
    • Transport - End to end Connections: TCP, UDP
    • Network - Packets: IP, ICMP, IPSec, IGMP
    • Data Link - Frames: Ethernet, PPP, Switch, Bridge
    • Physical - coax, fiber, wireless, hubs, repeaters
  • TCP/IP - primary way data gets around the Internet
    • Handshaking with syn/ack
    • Addressing with IPv4 and IPv6
    • Public Internet and private RFC1918 addressing
    • SSL/TLS - encrypted comms
    • SSH - secure remote shell access
    • Ports
      • 80 - HTTP
      • 22 - SSH
      • 53 - DNS
      • 443 - HTTPS
      • 25 - SMTP
      • 3306 - MySQL

Applying HTTP/HTTPS - works on L7 (Application Layer)

  • Understand your resources (URL/URI) and how parameters are applied
  • Know verbs: GET, POST, PUT, DELETE & PATCH, OPTIONS, TRACE, CONNECT
  • Have firm grasp of caching: headers and locations (browsers, proxies, CDN, memory cache)
  • Be familiar with CORS
  • HTTP/HTTPS status codes
    • 100 Informational
      • 100 - Continue
      • 101 - Switching protocol
    • 200 Successful response
      • 200 - Okay
      • 201 - Created
      • 202 - Accepted
      • 204 - No content
      • 206 - Partial content
    • 300 Redirection
      • 301 - Moved permanently
      • 304 - Not modified (caching)
      • 307 - Temporary redirect
      • 308 - Permanent redirect
    • 400 Client Errors
      • 400 - Bad request
      • 401 - Unauthorized
      • 403 - Forbidden
      • 408 - Request timeout
      • 429 - Too many requests
    • 500 Server Error
      • 500 - Internal server error
      • 501 - Not implemented
      • 502 - Bad gateway
      • 503 - Service unavailable / quota exceeded
      • 504 - Gateway timeout
      • 511 - Network authentication required
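
A quick way to observe these codes in practice, using standard curl flags (the URLs are placeholders):

```sh
# Print only the HTTP status code returned by a request
curl -s -o /dev/null -w "%{http_code}\n" https://example.com/
# Follow redirects (3xx) and report the final status code
curl -s -o /dev/null -w "%{http_code}\n" -L https://example.com/old-path
```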

Understanding SRE Principles - What happens when a software engineer is tasked with what used to be called operations (Ben Treynor, ~2003)

  • SLI - Service Level Indicator (carefully defined quantitative measure of level of service provided over time)
    • Request latency - how long to return a response to a request
    • Failure rate - fraction of all requests received that result in errors
    • Batch throughput - proportion of time that data processing rate > threshold set
  • SLO - Service Level Objective (specify target level for reliability of service)
    • 100% is unrealistic and more expensive, and often not required by users; best to find the point past which users don't notice a difference, so more resources can focus on the service's value-add (see the error-budget sketch after this list)
  • SLA - contractual obligation
    • includes consequences of meeting or missing SLOs it contains
  • SLIs drive SLOs, which inform SLAs
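
A back-of-envelope error-budget calculation for a 99.9% monthly availability SLO (assuming a 30-day month):

```sh
# Allowed downtime per month at a 99.9% availability SLO:
# (1 - 0.999) * 30 days * 24 h * 60 min = 43.2 minutes
echo "scale=1; (1 - 0.999) * 30 * 24 * 60" | bc
```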

Compliance

Compliance with what

  • Legislation - targeted areas (health regs, privacy, children's privacy, ownership)
  • Commercial - protect sensitive data, credit cards / PII
  • Industry certifications - ensure following health, safety, and environmental regulations
  • Audits - create necessary structure to allow for 3rd-party audits

Getting help with compliance

  • Visit the Compliance Center - sortable by region, industry, and focus area
  • General Data Protection Regulations (GDPR) - continue to have major impact on web services around the world
  • BAA - Google Business Associate Agreement (customer must request a BAA from their account manager for HIPAA compliance)

Relevant products and services

  • 2-factor authentication
  • Cloud Security Command Center (CSCC)
  • Cloud IAM (global across all Google Cloud)
  • Cloud Logging
  • Cloud DLP (de-identification routines to protect PII)
  • Cloud Monitoring (surface compliance missteps / alerts in real time)

Annotations

Understanding annotations

  • Security Marks - assigned and utilized through Cloud Security Command Center (CSCC)
  • Labels - key-value pairs that help you organize cloud resources
  • Network tags - applied to VM instances and used for routing traffic to/from them

Applying security marks

  • Adds business context to assets for compliance
  • Enhanced security focused insights into resources
  • Unique to CSCC
  • Set at org, project, or individually
  • Works with labels and network tags

Working with labels

  • Key-value pairs supported by a wide range of GCP resources
  • Used for many scenarios
    • Identify individual teams or cost center resources
    • Distinguish deployment environments
    • Cost allocation and billing breakdowns
    • Monitor resource groups for metadata
    • Labels can be applied to projects, but NOT folders

Implementing network tags

  • Control traffic to/from VM instances
  • Identify VM instances subject to firewall rules and network routes
    • Use tags as source and destination values in firewall rules
    • Identify instances on a certain route
  • Configured with gcloud, console, or API
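
A minimal sketch of tags and labels in action (all resource names are hypothetical); the tag on the VM is what the firewall rule targets:

```sh
# Create a VM carrying a network tag (and a label for cost reporting)
gcloud compute instances create web-1 \
  --zone=us-central1-a \
  --tags=web-server \
  --labels=env=dev,team=storefront

# Allow inbound HTTP only to instances carrying that tag
gcloud compute firewall-rules create allow-http \
  --allow=tcp:80 \
  --target-tags=web-server \
  --source-ranges=0.0.0.0/0
```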

Choosing right annotation

  • Need to group/classify for compliance?
    • Yes : use Security Marks
    • No : Need billing breakdown?
      • Yes : use Labels
      • No : Need to manage network traffic to/from VMs?
        • Yes : use Network Tags

Managing Quotas & Costs

Working within quota limits - restrict how much of a shared GCP resource you can use

  • Not to be confused with fixed constraints, which cannot be increased or decreased (i.e. max file size, database schema limits)
  • Two types of quotas:
    • Rate quotas - limit number of API or service requests
    • Allocation quotas - restrict the resource available at any one time
  • Limits are specific to your org
  • Add your own limits to impose spending limits
  • Exceeded quotas generate a quota error and a 503 status for HTTP requests
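
To inspect current quota limits and usage, two standard gcloud calls:

```sh
# Region-level quotas (CPUs, addresses, disks, ...) with limits and usage
gcloud compute regions describe us-central1

# Project-wide quotas
gcloud compute project-info describe
```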

Cost optimization principles

  • Understand the total cost of ownership (TCO)
  • Commonly misunderstood when moving from on-prem (CapEx) model to cloud-based (OpEx)
  • Organize costs in relation to business needs
  • Maximize value of all expenses while eliminating waste
  • Implement standardized processes at the start

Best practices: use cost management tools

  • Organize and Structure - set up folders, projects, and use labels to structure costs in relation to business needs

  • Billing Reports - view costs and analyze trends and filter as needed

  • Custom dashboards - can also export to BigQuery, then visualize in Cloud Data Studio

  • Compute - pay for the compute you need

    • Identify idle VMs
      • use Idle VM recommender service to identify inactive VMs
      • Snapshot them before deleting
      • Stop without deleting
    • Start/stop VMs automatically or via Cloud Functions
    • Create custom VMs with right size CPUs and memory
    • Make the most of preemptible/spot VMs (often is an option - consider it for exam)
  • Cloud Storage - ways to keep more of your company's hard-earned money

    • Choose the right storage class: Nearline (30-day minimum), Coldline (90-day), Archive (365-day)
    • Modify storage class as needed with lifecycle policies (see the sketch after this list)
    • Deduplicate data wherever possible (i.e. Cloud Dataflow)
    • Choose single-region rather than multi-region buckets where viable (multi-region costs more)
    • Set object versioning policies to keep copies down (i.e. delete oldest after 2 versions)
  • Keep BigQuery from BigCosts

    • Limit query costs with the maximum bytes billed setting (see the sketch after this list)
    • Partition tables based on ingestion time, date/timestamp, or integer range column
    • Switch from on-demand to flat-rate pricing to process unlimited bytes for a fixed, predictable cost
    • Combine Flex Slots (like preemptible) with annual and monthly commitments (blended)
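
Two of the practices above, sketched with standard tooling (bucket name is hypothetical). The lifecycle policy demotes objects to Coldline after 90 days and deletes them after a year; the bq flag makes a query fail rather than bill past a byte cap:

```sh
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 90}},
    {"action": {"type": "Delete"}, "condition": {"age": 365}}
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://example-bucket

# Cap BigQuery spend per query (~1 GB here); exceeding it aborts the job
bq query --nouse_legacy_sql --maximum_bytes_billed=1000000000 \
  'SELECT word FROM `bigquery-public-data.samples.shakespeare` LIMIT 10'
```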

Case Studies

EHR Healthcare

Who is EHR Healthcare - leading provider of EHR software to medical industry (SaaS to multi-national medical offices, hospitals, and insurance providers)

  • Big company, medical industry, multi-national (regulations), hospitals/insurance (HIPAA)

Primary concerns

  • Growing exponentially
  • Scaling their environment
  • Disaster recovery plan
  • New continuous deployment
  • Replace colocation facilities with GCP

Lay of the land (existing tech)

  • Multiple colocation facilities; lease on one about to expire
  • Apps are in containers; candidate for Kubernetes
  • MySQL, MSSQL, Redis, Mongo DB
  • Legacy integrations (no current plan to move short term)
  • Users managed by Microsoft AD; monitoring via open source; email alerts often ignored

Business requirements

  • Onboard new insurance providers ASAP
  • Minimum 99.9% availability for customer apps
  • Centralized visibility and proactive action on performance and usage
  • Provide insights into healthcare trends (AI platform)
  • Reduce latency for all customers
  • Maintain regulatory compliance
  • Decrease infra administration costs (can be handled through cloud computing)
  • Make predictions and generate reports on industry trends based on provider data (models from external data sources)

Technical requirements

  • Maintain legacy interfaces to insurance providers for both on-premises systems and cloud providers
  • Provide a consistent way to manage customer-facing, container-based applications (Anthos GKE)
  • Secure and high-performance connection between on-premises systems and GCP
  • Consistent logging, log retention, monitoring, and alerting capabilities
  • Maintain and manage multiple container-based environments
  • Dynamically scale and provision new environments
  • Create interfaces to ingest and process data from new providers (Dataproc or Dataflow)

Big picture (exec statement)

  • Our on-prem strategy has worked for years but has required major investment of time and money in training our team on distinctly different systems, managing similar, but separate environments, and responding to outages.

    • CapEx and OpEx way too high (too many diverse systems increasing mgmt and training costs)
  • Many of these outages have been a result of misconfigured systems, inadequate capacity to manage spikes in traffic, and inconsistent monitoring practices.

    • Too old or broken to deal with customer load; off/on monitoring; seeking change
  • We want to use Google Cloud to leverage a scalable and resilient platform that can span multiple environments seamlessly and provide a consistent and stable user experience that positions us for future growth.

    • They see light at end of tunnel which is Google Cloud, capable of handling legacy to modern

Key takeaways:

  • governance and compliance play a significant role
  • while dedicated to cloud computing, must maintain legacy integrations and high speed connections between GCP and on-prem
  • attention to security concerns is strong thread, containers and protecting patient data

Helicopter Racing League

Who is Helicopter Racing League - HRL is a global sports league for competitive helicopter racing. Each year HRL holds the world championship and several regional league competitions where teams compete to earn a spot on the world championship. HRL offers a paid service to stream the races all over the world with live telemetry and predictions throughout the race.

  • Global (covering a lot of territory with lots of regional focus); caters to the entire globe at once but also breaks down into smaller targeted services; a commercial enterprise, so uptime is important; gathers a lot of data in real time for analysis and forecasting.

Primary concerns

  • Migrate to new platform
  • Expand use of AI and ML
  • Fans in emerging regions
  • Move serving of content, real-time and recorded
  • Closer to viewers to keep latency down

Lay of the land

  • Already in the cloud (unnamed)
  • Existing content stored in Object Storage service on cloud
  • Video recording and editing handled at race tracks
  • Video encoding/transcoding handled in the cloud on VMs created for each job
  • TensorFlow predictions run on other VMs in cloud

Business requirements

  • Expose the predictive models to partners (API and private connectivity)
  • Increase predictive capabilities during and before races
  • Increase telemetry and create additional insights (enhance experience)
  • Measure fan engagement and new predictions
  • Enhance global availability and quality of broadcasts
  • Increase the number of concurrent viewers (streaming capacity increase)
  • Minimize operational complexity (standardize)
  • Ensure compliance with regulations
  • Create a merchandising revenue stream (e-comm app or connection to one)

Technical requirements

  • Maintain or increase prediction throughput and accuracy (ramp up efficiency)
  • Reduce viewer latency (get content closer to viewers)
  • Increase transcoding performance (vertically scale up VMs)
  • Create real-time analytics of viewer consumption patterns and engagement (streaming data and pipeline)
  • Create data mart to enable processing of large volumes of race data (batch data)

Big picture (exec statement) Our CEO, S. Hawke, wants to bring high-adrenaline racing to fans all around the world. We listen to our fans, and they want enhanced video streams that include predictions of events within the race (e.g., overtaking).

  • Global, ramped-up graphics processing, heavily data dependent and may include video analysis

Our current platform allows us to predict race outcomes but lacks the facility to support real-time predictions during races and the capacity to process season-long results.

  • Streaming data analysis, batch analysis

Key takeaways:

  • emphasizes numerous scenarios involving data predictions and forecasts that would entail significant use of AI and ML
  • global org and intent on extending their reach and market while maintaining high quality and low latency
  • HRL must process a tremendous amount of data in near real-time and output the results worldwide to specific regions

Mountkirk Games

Who is Mountkirk Games - makes online, session-based, multiplayer games for mobile platforms. They have recently started expanding to other platforms after successfully migrating their on-premises environments to Google Cloud. Their most recent endeavor is to create a retro-style first-person shooter (FPS) game that allows hundreds of simultaneous players to join a geo-specific digital arena from multiple platforms and locations. A real-time digital banner will display a global leaderboard of all the top players across every active arena.

Primary concerns

  • Building a new multiplayer game
  • Want to use GKE
  • Use global load balancer to keep latency down
  • Keep global leader board in sync (streaming data)
  • Willing to use Cloud Spanner as their database engine

Lay of the land

  • Recently lifted & shifted 5 games to GCP
  • Each game in its own project under one folder (which maintains most permissions and network policies)
  • Some legacy games with little traffic consolidated to single project
  • Separate environments for development and testing

Business requirements

  • Support multiple gaming platforms (from mobile only to multiple platforms)
  • Support multiple regions (protect data and diff compliance regs)
  • Support rapid iteration of game features (CICD)
  • Minimize latency
  • Optimize for dynamic scaling
  • Use managed services and pooled resources (standardization)
  • Minimize costs

Technical requirements

  • Dynamically scale based on game activity
  • Publish scoring data on near real-time global leaderboard
  • Store game activity logs in structured files for future analysis
  • Use GPU processing to render graphics server-side for multi-platform support
  • Support eventual migration of legacy games to this new platform

Big picture (exec statement) Our last game was the first time we used Google Cloud and it was a success. We were able to analyze player behavior and game telemetry in ways that we never could before. This success allowed us to bet on a full migration to the cloud and to start building all new games using cloud native design principles.

  • See advantage reviewing user actions and game responses; going completely cloud native

Our new game is our most ambitious to date and will open doors for us to support more gaming platforms beyond mobile. Latency is our top priority, although cost management is the next most important challenge.

  • Higher performance; lower cost

As with our first cloud-based game, we have grown to expect the cloud to enable advanced analytics capabilities so we can rapidly iterate on our deployments of bug fixes and new functionality.

  • Double down on analytical approach that gave them an edge; invest in Cloud Spanner to achieve goals

Key takeaways

  • Wants to expand reach to other gaming platforms and other regions of the world
  • Very specific ideas on how to architect their next steps, including Kubernetes, Load Balancer, and Cloud Spanner
  • Latency as top priority and cost management as second; happy users while keeping eye on bottom line

TerramEarth

Who is TerramEarth - manufactures heavy equipment for the mining and agriculture industries. They have over 500 dealers and service centers in 100 countries. Their mission is to build products that make their customers more productive.

  • Sophisticated earth-moving equipment; solid network; customer focused

Primary concerns

  • 2 million TE vehicles in operation
  • Collect telemetry data from many sensors (IoT)
  • Subset of critical data in real time
  • Rest of data collected, compressed, and uploaded daily
  • 200-500MB of data per vehicle per day (1 PB each day)

Lay of the land

  • Infra in GCP serving clients all around the world (data gathering and analysis)
  • Private data center integration (feeds from 2 main manufacturing plants) with multiple Interconnects

Business requirements

  • Predict and detect vehicle malfunction
  • Ship parts to dealerships for just-in-time repair with little/no downtime
  • Decrease cloud operational costs and adapt to seasonality
  • Increase speed and reliability of developer workflow (SRE)
  • Allow remote developers to be productive without compromising code or data security
  • Create flexible and scalable platform for custom API Services for dealers and partners (Apigee)

Technical requirements

  • Create a new abstraction layer for HTTP API access to legacy systems to enable a gradual migration without disrupting operations (API gateway)
  • Modernize all CI/CD pipelines to allow developers to deploy container-based workloads in highly scalable environments (GKE, Cloud Run, Cloud Build)
  • Allow developers to experiment without compromising security and governance (new test project)
  • Create a self-service portal for internal and partner developers to create new projects, request resources for data analytics jobs, and centrally manage access to the API endpoints (secure new web front end with ability to spin up resources; network tags)
  • Use cloud-native solutions for keys and secrets management and optimize for identity-based access (IAM, Secret Manager, and KMS)
  • Improve and standardize tools necessary for application and network monitoring and troubleshooting (Cloud Operations: Monitoring, Logging, Debugging)

Big picture (exec statement) Our advantage has always been our focus on the customer, with our ability to provide excellent customer service and minimize vehicle downtime. After moving multiple systems to Google Cloud, we are seeking new ways to provide best-in-class online fleet management services to our customers and improve operations of our dealerships.

  • If the customer is successful, so are they; keeping vehicles operational leads to success; always improving

5-year strategic plan is to create a partner ecosystem of new products by enabling access to our data, increasing autonomous operation capabilities of our vehicles, and creating a path to move the remaining legacy systems to the cloud.

  • Moving physical and digital information daily

Key takeaways

  • places great emphasis on customer and partner support which requires consistent and secure communication between systems and devices
  • after success of initial migration, TE seeks to expand their global integration without disrupting operations or regulations
  • company's equipment must be able to transmit and analyze a great deal of telemetry data to maintain high-performance levels and just-in-time repairs

Processing Data

Compute Services

Overview: [diagram of compute services]

Compute Engine

  • fast-booting VMs
  • highly configurable, zonal service
  • choose machine types: general-purpose, compute-optimized, memory-optimized, accelerator-optimized (GPU)
  • select public or private disk image
  • options include preemptible (or spot)
  • also good to know about sole-tenant nodes (BYOL / dedicated hardware requirements), instance groups (MIG/UIG)

Kubernetes Engine (GKE) Container orchestration system with clusters, node pools, and control plane

  • regional, managed container service
  • standard (total control), autopilot (fully managed)
  • supports auto repair and auto upgrade
  • know the following:
    • kubectl syntax
    • private clusters (VPC native w/ RFC1918 IP addresses)
    • how to deploy, scale, expose services
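
For the deploy/scale/expose bullet above, a minimal kubectl sequence (image and names are illustrative):

```sh
kubectl create deployment web --image=nginx:1.25                    # deploy
kubectl scale deployment web --replicas=3                           # scale manually
kubectl expose deployment web --type=LoadBalancer --port=80         # expose via LB
kubectl autoscale deployment web --cpu-percent=60 --min=3 --max=10  # HPA
```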

App Engine Oldest of all GCP services; comes in 2 environments: Standard and Flexible

  • Standard
    • regional, platform as a service for serverless apps
    • zero server mgmt and config
    • instantaneous scaling, down to zero VMs
    • features
      • second gen runtimes: python 3, java 11, nodejs, php 7, ruby, go 1.12+
      • 1st gen is limited
  • Flex
    • for containerized apps
    • zero server mgmt and config
    • best for apps with consistent traffic, gradual scaling is acceptable
    • robust runtimes
      • python 2.7/3.6, java 8, nodejs, php 5/7, ruby, go, .net

Cloud Run

  • great for modern websites, REST APIs, back-office administration
  • regional, fully managed serverless service for containers
  • integrated support for cloud operations
  • built on Knative open-source standards for easy portability
  • supports any language, library, or binary
  • scales from zero and back in an instant
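
A single-command deployment sketch, using Google's public "hello" sample container:

```sh
gcloud run deploy hello \
  --image=us-docker.pkg.dev/cloudrun/container/hello \
  --region=us-central1 \
  --allow-unauthenticated   # public endpoint; omit to require IAM auth
```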

Cloud Functions

  • regional, event-driven, serverless functions as a service (FaaS)
  • triggers
    • HTTP
    • Cloud Storage
    • Cloud Pub/Sub
    • Cloud Firestore
    • Audit Logs
    • Cloud Scheduler
  • totally serverless
  • automatic horizontal scaling
  • networks well with hybrid and multi-cloud
  • acts as glue between services
  • great for streaming data and IoT apps
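
A deployment sketch wired to one of the triggers above (function, topic, and handler names are hypothetical):

```sh
# Deploy a function that fires on every message published to a topic
gcloud functions deploy process_event \
  --runtime=python39 \
  --trigger-topic=ingest-events \
  --entry-point=handler \
  --region=us-central1
```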

Choosing the correct compute option

[diagrams: compute option decision trees]

Summary

  • Mobile apps: Firebase
  • event-driven functions: Cloud Functions
  • specific OS or kernel: Compute Engine
  • no hybrid or multi-cloud: App Engine Standard (rapid scale) or Flex
  • containers: Cloud Run or Kubernetes Engine

Compute autoscaling comparison

[diagram: compute autoscaling comparison]

Summary

  • when working with Compute Engine, remember that MIGs coupled with Cloud Load Balancer results in faster autoscaling response
  • for HA, Kubernetes Engine node pool is best used with minimum of 3 nodes in production
  • Cloud Run scales almost as fast as App Engine Standard, and you are only charged when a request is made

Evolving the Cloud with AI and ML services

AI Data Lifecycle Steps

Key DATA lifecycle steps (covered earlier)

  1. Ingest
  2. Store
  3. Process / Analyze
  4. Explore / Visualize

Key AI Data lifecycle steps

  1. Ingest
  2. Store
  3. Process / Analyze
  4. Train
  5. Model
  6. Evaluate
  7. Deploy
  8. Predict

Reviewing AI and ML Services AI on Google Cloud has been evolving, and the platform is "currently" called "Vertex AI"

ML Services

  • Vision API (OCR, tagging)
  • Video Intelligence API (local, cloud storage, track objects, recognize text)
  • Translation API (Cloud Translation for 100 language pairs, with auto-detect)
    • Basic / Advanced (also includes batch requests, custom models, glossaries)
  • Text-to-speech / Speech-to-text
  • Natural Language API
  • Cloud TPU (hardware behind the APIs above)
    • 8 VMs w/ GPU took 200 minutes vs 1 TPU 8 minutes; faster and cheaper for some tasks

ML Best Practices Setting up the ML environment

  • use Notebooks for development
  • create a Notebook instance for each teammate
    • treat each notebook instance as virtual workspace
    • stop when not in use
  • store prepared data and model in same project

ML development

  • prepare a good amount of training data
  • store tabular data in BigQuery
  • store unstructured data (images, video, audio) in Cloud Storage
    • includes TFRecord files, Avro, etc.
    • aim for files > 100MB and between 100 - 10,000 shards

During data processing

  • use Tensorflow Extended for TF projects
    • NEW: Vertex AI Pipelines (replacement in future)
  • process tabular data with BigQuery
    • can use BigQuery ML and save results in BQ permanent table
  • process unstructured data with Cloud Dataflow (based on Apache Beam)
    • can generate TF record
    • if using Apache Spark, then can use Dataproc
  • Link data to model with managed datasets

Putting the model into production

  • specify appropriate (virtual) hardware
    • may be straight VMs or with GPU/TPU
  • plan for additional inputs (features) to model
    • i.e. data lake, messaging
  • enable autoscaling

Summary

  • AI data lifecycle expands the traditional lifecycle
    • ingest, store, transform, train, model, evaluate, deploy, and predict
  • Vertex AI is Google Cloud's AI platform, incorporating all machine learning APIs, such as Vision API, its AutoML services, and even related hardware, like Cloud TPU
  • Be sure to use the proper GCP service for the various stages in the AI data lifecycle, such as using BigQuery for storing and processing tabular data, and Dataflow / Dataproc for processing unstructured data

Handling Big Data and IoT

Working with Cloud IoT Core Devices

  • remember TerramEarth
  • Cloud IoT Core - fully managed
    • Device manager (identity, auth, config, control)
    • Protocol bridge (publishes incoming telemetry data to Pub/Sub for processing)
  • Features
    • Secure connection via HTTPS or MQTT
    • CA signed certs verify device ownership
    • 2-way comms allow updates, on and offline
  • How it works
    • Devices -> Cloud IoT Core -> Pub/Sub -> Cloud Functions or Dataflow (update device config after processing)

Massive Messaging via Cloud Pub/Sub

  • Scalable, durable, global messaging and ingestion service, based on at-least-once publish/subscribe model
  • Connects many services together and helps small increments of data to flow better
  • Supports both push and pull modes, with exactly-once processing
  • Pull mode delivers message and waits for ACK
  • Features
    • Truly global: consistent latency from anywhere
    • Messages can be ordered and/or filtered
    • Lower-cost Pub/Sub Lite is available, requiring more management and offering lower availability and durability
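
The pull model described above, end to end (topic and subscription names are hypothetical):

```sh
gcloud pubsub subscriptions create telemetry-sub \
  --topic=ingest-events --ack-deadline=30   # seconds before redelivery
# Pull one message and acknowledge it in the same call
gcloud pubsub subscriptions pull telemetry-sub --auto-ack --limit=1
```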

The Big Data Dog: Cloud BigQuery

  • Serverless, multi-regional, multi-cloud SQL column-store data warehouse
  • Scales to handle terabytes in seconds and petabytes in minutes
  • Built-in integration for ML and backbone for Business Intelligence Engine
  • Supports real-time analytics with streams from Pub/Sub, Dataflow, and Datastream
  • Automatically replicates data and keeps seven-day history of changes
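
A serverless query requires no provisioning at all; for example, against a BigQuery public dataset:

```sh
bq query --nouse_legacy_sql \
  'SELECT corpus, SUM(word_count) AS words
   FROM `bigquery-public-data.samples.shakespeare`
   GROUP BY corpus ORDER BY words DESC LIMIT 5'
```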

Transforming Big Data

Cloud Dataprep

  • visually explore, clean, and prepare data for analysis and ML, used by data analysts
  • integrated partner service offered by Trifacta in conjunction with Google
  • automatically detects schemas, data types, possible joins, and anomalies like missing values, outliers, and duplicates
  • interprets data transformation intent by user selection and predicts next transformation
  • transformation functions include
    • aggregation, pivot, unpivot, joins, union, extraction, calculation, comparison, condition, merge, and regex
  • works with CSV, JSON, or relational data from Cloud Storage, BigQuery, or upload
  • outputs to Dataflow, BigQuery, or exports to other file formats

Cloud Dataproc (map reduce)

  • Zonal resource that manages Spark and Hadoop clusters for batch MapReduce processing
  • Can be scaled (up or down) while running jobs
  • Offers image versioning to switch between versions of Spark
  • Best for migrating existing Spark or Hadoop jobs to the cloud
  • Most VMs in cluster can be preemptible, but at least one node must be non-preemptible
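
A cost-conscious cluster sketch per the preemptible note above (cluster name is hypothetical); secondary workers are preemptible by default, while primary workers stay non-preemptible:

```sh
gcloud dataproc clusters create etl-cluster \
  --region=us-central1 \
  --num-workers=2 \
  --num-secondary-workers=4
```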

Cloud Dataflow (more recent approach)

  • Unified Data Processing
  • Serverless, fast, and cost-effective
  • Handles both batch and streaming data with one processing model (other tools handle only one)
  • Fully managed service, suitable for a wide variety of data processing patterns
  • Horizontal autoscaling with reliable, consistent, exactly-once processing
  • Based on open-source Apache Beam
    • Beam is an open-source, unified model for defining both batch and streaming data-parallel processing pipelines
    • Use Beam SDK to build a program that defines a pipeline
      • Java, Python, Go
    • a supported distributed processing backend, such as Cloud Dataflow, executes the pipeline
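
One low-effort way to run a Beam pipeline on Dataflow is to launch one of Google's stock templates (the output bucket is hypothetical):

```sh
gcloud dataflow jobs run wordcount-demo \
  --gcs-location=gs://dataflow-templates/latest/Word_Count \
  --region=us-central1 \
  --parameters=inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,output=gs://example-bucket/wordcount/out
```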

Choosing the right tool: [diagram: big data service decision tree]

Summary

  • Cloud IoT Core - global, fully-managed service to connect, manage and ingest data from Internet-connected devices and a primary source for streaming data
  • Cloud Pub/Sub - global messaging and ingestion service that supports both push and pull modes with exactly-once-processing for many GCP services
  • Cloud BigQuery - serverless, multi-regional, multi-cloud, SQL column-store data warehouse used for data analytics and ML capable of scaling to petabytes in minutes
  • GCP has a number of big data processing services
    • Cloud Dataprep for visually preparing data
    • Cloud Dataproc for working with Spark and Hadoop-based workloads
    • Cloud Dataflow for both batch and streaming data with one processing model

Containers and Specialized Workloads

Kubernetes Engine

Coordinating Clusters

  • includes at least one control plane and multiple worker machines (a.k.a. nodes)
  • can create zonal or regional clusters
    • single or multi-zonal (single control plane replica)
    • regional cluster (control plane replicated across multiple zones in a region)
  • private clusters are VPC-native, dependent on internal IP addresses
  • for HA apps, distribute your workload using multi-zonal node pools
  • Horizontal Pod Autoscaler (HPA) checks the workload's metrics against target thresholds
  • Configure horizontal pod autoscaling on deployment, rather than ReplicaSet
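
Creating a regional (HA) cluster versus a zonal one differs only by a flag (cluster names are hypothetical):

```sh
# Regional: control plane + nodes replicated across the region's zones
gcloud container clusters create prod-cluster --region=us-central1 --num-nodes=1
# Zonal: single control plane replica in one zone
gcloud container clusters create dev-cluster --zone=us-central1-a --num-nodes=3
```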

Working with Workloads (application running on Kubernetes)

  • custom and external metrics let the HPA scale on conditions other than the workload's own resource usage
    • custom metric reported from your app running in K8S
    • external metric reported from service outside cluster
  • configuring limits for Pods based on workload is highly recommended
  • ConfigMaps bind non-sensitive configuration artifacts to your pod containers at runtime (see the sketch after this list)
  • Deployments are best for stateless apps with ReadOnlyMany or ReadWriteMany volumes
  • DaemonSets are good for ongoing background tasks that do not require user intervention
    • attempt to adhere to 1 pod/node model (across cluster or subset of nodes)
  • StatefulSets are pods with unique persistent identities and hostnames
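
A ConfigMap sketch matching the bullet above (names and values are illustrative):

```sh
# Bind non-sensitive config to pods at runtime
kubectl create configmap app-config \
  --from-literal=LOG_LEVEL=info \
  --from-literal=FEATURE_FLAGS=beta
# Pods reference it via envFrom/configMapRef (or a mounted volume)
kubectl get configmap app-config -o yaml
```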

Networking Pods, Services and External Clients

  • VPC-native clusters scale better than routes-based clusters and are needed for private clusters
    • VPC native uses alias IP
    • routes-based uses static routes
  • Shared VPC networks are best for orgs with centralized management team
    • attach Service Projects to the Host Project (sharing selected subnets/ranges)
  • GKE Ingress (internal or external) implements Ingress resources as Google Cloud load balancers for HTTP(S) workloads
  • Workload Identity links Kubernetes service accounts to Google service accounts to safely access other Google services
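
The Workload Identity link boils down to one IAM binding between a Kubernetes service account (KSA) and a Google service account (GSA); all names below are placeholders:

```sh
# Let the KSA "app-ksa" in namespace "default" impersonate the GSA
gcloud iam service-accounts add-iam-policy-binding \
  app-gsa@example-project.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:example-project.svc.id.goog[default/app-ksa]"
```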

Keeping an eye on Operations

  • monitoring and logging can be enabled for both new and existing clusters
  • GKE container logs are removed when the host pod is removed, when their disk runs out of space, or when replaced by newer logs
  • GKE generates two types of metrics:
    • System metrics - metrics from essential system components describing CPU, memory, storage
    • Workload metrics - exposed by any GKE workload like cronjob, etc.
  • use Istio Fault Injection to test app resiliency (chaos engineering)

Summary

  • clusters can be zonal (single or multi-zonal), regional, or private. Use regional clusters for high-availability production workloads
  • keep in mind the best use cases and scenarios for Deployments, StatefulSets, DaemonSets, and ConfigMaps
  • remember that VPC-native networks are required for private clusters, and Ingress objects create load balancers for both external and internal HTTP traffic. Remember to use Workload Identity to connect clusters to other Google services

Anthos: Closer Look

Uncovering the Anthos 411

  • application deployment anywhere: GCP, on-prem, hybrid, and multicloud
  • supports K8S clusters, Cloud Run, Compute Engine VMs
  • use Migrate for Anthos to migrate and modernize existing workloads to containers
  • enhance app development and delivery with up-to-date CI/CD automated pipelines
    • uses open source
  • enables a defense-in-depth security strategy with comprehensive security controls across all deployments
  • fully integrated with GCP Monitoring and Logging, including hybrid and on-prem configs

Managing Microservices with Anthos Service Mesh

  • suite of tools to monitor and manage service mesh on-prem or Google Cloud
  • ASM enables managed, observable, and secure communication across microservices, on-prem, and GCP
  • Powered by open-source Istio, ASM is one or more control planes and a data plane that monitors all traffic through proxies
  • ASM controls traffic flow between services as well as ingress and egress
    • supports canary and blue-green deployments
    • configure load balancing between services
    • provides in-depth telemetry with Cloud Monitoring, Logging, and Trace

The Kubernetes Engine Connection

  • Anthos Clusters provide a unified way to work with K8S clusters as part of Anthos, extending GKE to work in multiple environments
  • Anthos on GCP uses "traditional" GKE, while on-premises uses VMWare and Bare Metal
  • Logically group and normalize multiple clusters via Fleets to manage multi-cluster capabilities and apply consistent policies
  • Anthos Config Management (ACM) creates a common configuration across all infra, including custom policies, applied both on-premises and in the cloud
  • Binary Authorization configures a validation policy enforced when deploying a container image
    • only explicitly-authorized images are deployed, verified by an "Attester"

Accessing Cloud Run for Anthos

  • flexible serverless development platform for hybrid and on-prem environments
  • managed with Knative, which enables serverless workloads on K8S
  • streamlines operational needs with advanced workload autoscaling and automatic networking
  • Scale idle workloads to zero or set min instance count for baseline availability
  • Out-of-the-box integration with Monitoring, Logging, and Error Reporting
  • Easily perform A/B tests with traffic splitting and quickly roll back to known working services

Summary

  • Anthos makes it possible to deploy, manage, and monitor applications anywhere and in multiple locations: GCP, on-prem, multicloud, or hybrid
  • in addition to supporting GKE, Cloud Run, and VMs, Anthos offers system-spanning services such as Migrate for Anthos, Anthos Service Mesh (ASM), and Anthos Config Management (ACM)
  • familiarize yourself with special features that Anthos offers, particularly in securing CI/CD pipelines like Binary Authorization, Service Mesh testing and reporting, and Cloud Run for Anthos traffic splitting

Bare Metal: Closer Look

All about Anthos Bare Metal

  • Anthos clusters on bare metal allow you to directly deploy applications on your own hardware
  • manages app deployment and health across existing datacenters for more efficient operations
  • control system security without compatibility issues for virtual machines and OS
  • scale up apps while maintaining reliability regardless of fluctuations in workload and network traffic thanks to advanced monitoring
  • security can be customized with minimal connections to outside resources

Discovering Deployment Options

  • Admin Cluster: manages user clusters
  • User Cluster: control plane + workers
  • 3 basic models to choose from
    • Standalone: single cluster both user and admin
      • best for single teams or workloads
      • no need for separate admin clusters
      • works great for edge locations
    • Multi-cluster: one admin and one or more user clusters
      • works well for fleet of clusters with central mgmt
      • provides separation between teams
      • isolates development and production workloads
    • Hybrid: runs user workloads on admin
      • create from standalone by adding more user clusters
      • use only if no security concerns with user workloads on admin
      • configure HA for user clusters independently

Operating Bare Metal Clusters

  • use Connect to associate your bare metal clusters to Google Cloud
  • access is enabled for workload management and unified UI (Cloud Console)
  • Cloud Console displays health of all connected workloads and allows modifications to all
  • Put nodes into maintenance mode to drain pods/workloads and exclude them from pod scheduling

Summary

  • Anthos on bare metal gives the best flexibility, using a company's own hardware
  • Bare metal offers 3 kinds of deployment for admin/user clusters: standalone, multi-cluster, and hybrid
  • once Connect has been used to associate your clusters with Google Cloud, the Cloud Console is enabled and provides a unified user interface for all clusters, regardless of location

Storing Data

Storing Objects and Files

Going straight to Local SSD

  • fastest block storage option (physical disk attached to computer)
  • very fast zonal resource, 375 GB solid state disk directly attached to server hosting VM instance
  • expandable to 3, 6, 9 TB with increasing performance up to 2.4M reads and 1.2M write IOPS
  • all data encrypted at rest; data is lost when the VM stops but survives live migration
  • best for transient data (media rendering, analytics, high-perf computing, caches)

Persevering with Persistent Disks

  • major benefit: persistence, available after VM shutdown
  • independent of VMs where data is distributed across disks for redundancy
  • highly durable (up to six 9s) and secure: data encrypted at rest and in transit
  • configurations
    • zonal
      • data in a single zone
      • 4 types: Standard, Balanced, SSD, Extreme
      • can be used for both snapshot and boot disk
      • can add more storage space, throughput, and IOPS
    • regional
      • data in 2 zones in same region
      • 3 types: Standard, Balanced, SSD
      • can be used for snapshots but not for boot disks
      • ONLY storage can be changed, not throughput or IOPS

Managing File-based Storage

  • file stored as whole unit without data being broken down into blocks
  • fully managed file-based storage service (like a NAS)
  • provision instance in specific zone
  • access using NFSv3 protocol
  • consistently fast and good for lift and shift migration
  • read-only snapshots are supported
  • 3 tiers
    • Basic: best for file sharing, k8s, dev, web hosting (1-63.9 TiB)
    • Enterprise: best for critical large-scale ops, GCE, K8S (1-10 TiB)
    • High Scale: best for high-perf computing (i.e. genome sequencing: 10-100 TiB)

Keeping Objects in Cloud Storage

  • infinitely scalable, fully-managed, highly durable object storage service (11 9s of durability)
  • for mutable, unstructured data such as images, videos, and documents
  • all objects stored in buckets
    • can be regional or multi-regional
    • support folders/sub-folders
    • supports versioning per bucket, with live object and noncurrent versions
  • permissions granted by bucket or by object and limited to teams, or people, or fully public
  • storage classes
    • standard: most frequently accessed or for brief time
    • nearline: for data you plan to access once a month or less
    • coldline: access at most once every 90 days
    • archive: access less than once/year
    • use lifecycle management rules to move objects between classes
      • age
      • creation date (created before or after a given date)
      • current version
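
A bucket sketch exercising the class and versioning options above (bucket name is hypothetical):

```sh
gsutil mb -l us-central1 -c nearline gs://example-archive-bucket  # storage class at creation
gsutil versioning set on gs://example-archive-bucket              # keep noncurrent versions
```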

Summary

  • local SSDs are highest performance block storage but lose data if VM is stopped
  • file-based storage, go with Filestore: Basic, Enterprise, High Scale
  • control costs with bucket locations and lifecycle rules

Selecting the proper storage

Screen Shot 2022-07-25 at 2 27 42 PM

Buckets

Screen Shot 2022-07-25 at 2 28 53 PM


Saving your data on GCP

Cloud SQL

  • regional, fully-managed relational db service for SQL Server, MySQL, and PostgreSQL
  • automatic replication with automatic failover, backup, point-in-time recovery
  • scale manually up to 96 cores, more than 624GB RAM, add replicas as needed
  • features
    • built-in high availability
    • automatically scales storage up to 30TB
    • connects with GAE, GCE, GKE, and BigQuery, among other services
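
An HA instance sketch reflecting the features above (instance name is hypothetical):

```sh
# db-custom-2-7680 = 2 vCPUs / 7.5 GB RAM;
# REGIONAL availability enables built-in HA with automatic failover
gcloud sql instances create app-db \
  --database-version=POSTGRES_14 \
  --tier=db-custom-2-7680 \
  --region=us-central1 \
  --availability-type=REGIONAL
```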

Cloud Spanner

  • fully-managed relational DB with up to 5 9s availability and unlimited scale (Mountkirk Games)
  • create a Spanner instance by defining an instance config and compute capacity
  • best practice: use query parameters to increase efficiency and lower costs
  • features
    • automatic sharding
    • external consistency (all transactions sequentially [even though distributed])
    • backup/restore and PITR
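
Instance config and compute capacity are the two creation-time decisions noted above (instance name is hypothetical); processing units allow sizing below a full node:

```sh
gcloud spanner instances create game-db \
  --config=regional-us-central1 \
  --processing-units=100 \
  --description="Leaderboard backend"
```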

Cloud Bigtable

  • fully-managed, scalable NoSQL db service used for large analytical and operational workloads
    • no related tables, primary or foreign keys
    • key / value store
  • handles large amount of data in key-value store and supports high read and write at low latency
  • tables stored in instances that contain up to 4 clusters located in different zones
  • use cases
    • time-series data
    • marketing and/or financial data
    • IoT data

Firestore

  • fully-managed, scalable, NoSQL serverless database
  • live synchronization and offline mode enable multi-user, collaborative applications on mobile and web
  • supports Datastore dbs and Datastore API
  • workloads include:
    • live asset and activity tracking
    • real-time analytics
    • media and product catalogs
    • social user profiles and gaming leaderboards (Mountkirk Games?)

Examining other DB options

  • Datastream
    • serverless CDC and replication service
    • synchronizes data across heterogeneous databases and applications reliably
  • Firebase Realtime DB
    • serverless NoSQL database for storing and syncing data
    • enhances collaboration among users across devices and the web in real time
  • MemoryStore
    • in-memory service for Redis and Memcached
    • provides low-latency access and high-throughput for heavily-accessed data

Summary

  • relational services: Cloud SQL, Cloud Spanner
  • NoSQL services: Firestore (document-based) and Bigtable (key-value)
  • cached, gaming, streaming data use Memorystore, which supports both Redis and Memcached

Deciding on the best databases

[diagram: database selection decision tree]

Summary

  • first question is whether the data is structured or not; if not, go with Cloud Storage unless you need mobile SDKs
  • if workload primarily data analytics, best options are Bigtable (if NoSQL and low latency), and otherwise BigQuery
  • if the workload is structured data, Cloud SQL for basic relational DB needs and Cloud Spanner for horizontal scalability

Networking Data

Google has 75K+ miles of networking cable and 150+ POPs around the globe

Globally Connecting with external networking

Cloud Domains

  • global registrar for domain names using Google Domains
  • uses built-in DNS or allows custom nameservers
  • supports DNSSEC and private WhoIs records
  • integration features
    • managed as a GCP project, including billing
    • automatic domain verification with Search Console, App Engine, Cloud Run, etc.
    • works with Cloud IAM for access management
    • partners including Shopify, Wix, Squarespace, Bluehost, Weebly, and others

Cloud DNS

  • hierarchical DB linking domain names to IP addresses
  • global, scalable, fully-managed authoritative domain name service
  • 100% SLA uptime guarantee
  • offers both public and private managed zones
    • private visible to one or more private VPC that you specify
  • features
    • Cloud IAM and Cloud Logging integration
    • DNS peering and DNS forwarding
    • Anycast nameservers - allows multiple machines to share same IP (nearest machine)
    • DNSSEC support
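
A managed zone plus a single A record, sketched with standard gcloud (domain and IP are placeholders):

```sh
gcloud dns managed-zones create example-zone \
  --dns-name="example.com." \
  --description="Public zone for example.com"

gcloud dns record-sets create www.example.com. \
  --zone=example-zone --type=A --ttl=300 --rrdatas=203.0.113.10
```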

Static External IP Addresses

  • reserve static external IP addresses in projects and assign to resources
  • GCP supports two types: regional and global
  • regional IP addresses can be assigned to compute engine VMs and network load balancers
  • global IP addresses are assigned to Anycast IPs and global load balancers (HTTP/S, SSL, and TCP proxies)
    • IPv6 addresses are global only, and only for global load balancers
  • static IP addresses can be assigned through console, gcloud command line, API, or Terraform
  • no charge except for IP addresses that are reserved but not used
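
Reserving the two address types is symmetric (names are hypothetical):

```sh
# Global: for HTTP(S)/SSL/TCP proxy load balancers
gcloud compute addresses create web-ip --global
# Regional: for VMs and network load balancers
gcloud compute addresses create nlb-ip --region=us-central1
# Read back the reserved address
gcloud compute addresses describe web-ip --global --format="get(address)"
```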

Cloud Load Balancing

  • fully distributed, software-defined managed service that spreads network traffic across multiple instances of your apps
  • Layer 4 and Layer 7 load balancing with Cloud CDN integration and automatic scaling
    • L4: transport layer (TCP, UDP)
    • L7: application layer (HTTP, HTTPS)
  • Regional load balancing features
    • health checks
    • session affinity
    • IPv4 only
    • good for single region, IPv4 only, or compliance
  • Global load balancing features
    • multi-region failover
    • connects to closest region for lowest latency
    • IPv4 and IPv6
    • good for backends distributed across multiple regions, want to deliver nearest to user with Anycast

Cloud CDN

  • serves content closer to the user
  • relies on Google Cloud's global edge network
  • works with external GCP HTTPS load balancing
  • manage cache rules with the Cache-Control header or allow it to automatically cache static content
  • content sources include
    • instance groups
    • zonal network endpoint groups (NEGs)
    • serverless NEGs like GAE or Cloud Functions
    • Cloud Storage buckets

Summary

  • optimal external networking requires establishing a domain, connection to DNS, reserving external IP, integrating load balancer, and accessing a CDN
  • because Cloud Load Balancing is software-defined and does not rely on any devices - virtual or physical - it can handle spikes with no prewarming
  • cloud CDN works only with content delivered from sources within Google Cloud, such as GCE instance groups or GCS buckets. Custom origins are not allowed.

Networking internally

Virtual Private Cloud (VPC)

  • delivers networking for all your orgs GCP resources
  • global IPv4 unicast software-defined network
  • automatic or custom creation (configure subnets, firewall rules, routes, VPNs, and BGP)
  • VPC is global, but subnets are regional
  • Options include
    • Shared VPC
    • VPC peering (make services available privately across different VPC networks)
    • Bring your own IP addresses
    • Packet mirroring

Cloud Interconnect

  • extends on-premises network to VPC through HA, low-latency connection
  • dedicated
    • direct physical connection to GCP
    • best for high bandwidth needs
    • 10 Gbps or 100 Gbps circuits
    • traffic not encrypted, but encryption can be added
    • cannot use Cloud VPN with it
  • partner
    • connects to GCP via a supported partner
    • better for lower bandwidth needs
    • depends on partner capabilities
    • capacities from 50 Mbps to 50 Gbps
    • traffic not encrypted, but encryption can be added
    • cannot use Cloud VPN with it

Cloud VPN

  • securely connects peer network to VPC
    • any network, including those on other providers
  • traffic is encrypted by one IPsec VPN gateway and decrypted by another
  • requires static IP address for persistence and does not support dynamic, e.g. "dial-in" VPN
  • best practices
    • keep cloud VPN resource in own project
    • use dynamic routing and BGP
    • establish secure firewall rules for VPN
    • generate strong pre-shared keys for tunnels

Examining Other Networking Services (Cloud Router, CDN Interconnect)

  • Cloud Router
    • provides dynamic routing for hybrid networks linking VPCs to external networks via BGP
    • works with Cloud VPN and Dedicated Interconnect
    • automatically learns subnets in VPC and announces them to on-premises network
    • works with router appliances
  • CDN Interconnect
    • direct low-latency connectivity to certain CDN providers, with lower egress fees
    • works for both pull and push cache fills
    • best for high-volume egress traffic (lowers cost) and frequent content updates (lower latency)
    • supports Akamai, Verizon, Cloudflare, Fastly, and a few others

Summary

  • because a GCP VPC is a software-defined network, a single VPC can cover multiple regions without traffic crossing the public internet
  • if your company requires direct connection from on-prem datacenters to their VPC, use either Dedicated Interconnect (highest capacity) or Partner Interconnect
  • for lower traffic requirements that require a secure connection, connect to VPC with Cloud VPN and Cloud Router using a static IP address

Finding a Load Balancer

External: [diagram: external load balancer decision tree]

Internal: [diagram: internal load balancer decision tree]

Summary

  • when handling HTTP or HTTPS traffic around the world, use external HTTP(S) Load Balancing
  • if you have TCP traffic and would prefer to offload the SSL/TLS, best choice is SSL Proxy Load Balancing
  • Internal TCP or UDP traffic should rely on regional internal TCP/UDP Load Balancing for lowest latency and most direct connection

Managing and Securing Data

Establishing Core Security (Cloud IAM)

Cloud IAM

  • determining WHO has ACCESS to WHICH resources
  • Who (principals or members)
    • Google account
    • Service account
    • Google group (best practice)
    • Google Workspace account (org)
    • Cloud Identity domain (an org without Workspace apps/features)
    • All authenticated users (users on Internet authenticated by Google)
    • All users (anyone on the Internet)
  • Access (roles)
    • Billing Account Administrator
    • Billing Account User
    • Storage Object Creator
    • Storage Object Viewer
    • Cloud SQL Editor
    • Cloud SQL Instance User
    • Security Admin (get/set any IAM policy)
  • Which resources
    • VM instance
    • GKE cluster
    • Storage bucket
    • Pub/Sub topic
    • Organization
    • Folder
    • Project
  • Roles
    • Primitive/basic (oldest, pre-date Cloud IAM, broadest permissions)
    • Predefined (target specific resources w/ actions at granular level)
    • Custom (unique set of permissions, most granular level)
      • requires Role Administrator role
  • Policies
    • Role binding - 1 or more principals bound to a role; a policy is a collection of such bindings (see the sketch after this list)
  • Summary (Part 1: Cloud IAM)
    • globally manages access control for organizations
    • resource access is granted to roles (collection of permissions), and roles are granted to principals
    • recommender helps identify excess or needed permissions from principals
    • grants IAM access to external identities (AD, etc.) with workload identity federation
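
A role-binding sketch (project and group are placeholders); granting to a group rather than to individual users is the best practice noted above:

```sh
gcloud projects add-iam-policy-binding example-project \
  --member="group:data-team@example.com" \
  --role="roles/storage.objectViewer"
```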

Resource Manager

  • centrally manages and secures an organization's projects with a custom folder hierarchy
  • example:
    • company
      • dept Y
        • team B
          • product 1
            • dev
            • test
              • GCE, GAE, GCS resources
            • production
  • modify Cloud IAM policies across an org
  • Cloud Asset Inventory monitors and analyzes all GCP assets, including IAM policies
  • Organization Policy Service sets constraints on resources and helps orgs stay in compliance
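
As a rough sketch of walking that hierarchy programmatically, the snippet below lists the projects under a folder, assuming the google-cloud-resource-manager (v3) Python client; the folder ID is hypothetical.

```python
from google.cloud import resourcemanager_v3

projects_client = resourcemanager_v3.ProjectsClient()

# List projects parented directly by a folder (hypothetical folder ID).
for project in projects_client.list_projects(request={"parent": "folders/123456789"}):
    print(project.project_id, project.display_name, project.state)
```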

Cloud Identity

  • fully-managed Identity as a Service (IDaaS) for provisioning and managing identity resources
  • each user and group is given a Cloud Identity account, allowing Cloud IAM to manage access
  • can be configured to federate identities with other identity providers (e.g. Active Directory)
  • features
    • SSO with other apps
    • Multi-factor authentication (MFA)
    • Device security with endpoint management
    • Context-aware access without VPN

Cloud Identity-Aware Proxy (IAP)

  • establishes a central authorization layer for apps accessed over HTTPS (and internal apps over HTTP)
  • enforces access control policies for apps and resources
  • built on the load balancer and IAM; permits only authenticated and authorized requests (see the sketch after this list)
  • supports
    • App Engine
    • Compute Engine
    • Kubernetes Engine
    • Cloud Run
    • On-premises
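
When IAP fronts an app, it attaches a signed JWT (the x-goog-iap-jwt-assertion header) to each request, which the app can verify for defense in depth. Below is a minimal sketch using the google-auth Python library's documented verification pattern; the audience value is a placeholder you would look up for your own backend.

```python
from google.auth.transport import requests
from google.oauth2 import id_token


def verify_iap_jwt(iap_jwt: str, expected_audience: str) -> dict:
    """Verify the JWT from the x-goog-iap-jwt-assertion request header."""
    return id_token.verify_token(
        iap_jwt,
        requests.Request(),
        audience=expected_audience,  # e.g. "/projects/NUM/global/backendServices/ID"
        certs_url="https://www.gstatic.com/iap/verify/public_key",
    )


# claims = verify_iap_jwt(header_value, expected_audience)
# claims["email"] identifies the authenticated principal.
```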

Summary

  • Cloud IAM defines how principals, roles, and resources relate, as well as how IAM policies are created and inherited
  • keep the principle of least privilege in mind and practice; GCP stresses this concept and offers the Recommender service to help implement it
  • controlling and managing access is critical to an org's security; GCP offers two services for it: Cloud Identity and Cloud Identity-Aware Proxy

Detecting and Responding to Security Threats

Cloud Security Command Center - hub for GCP protective resources

  • comprehensive security and risk management platform
  • two tiers: standard and premium
  • designed to prevent, detect, and respond to threats from a single pane of glass
  • integrates and monitors many security services on GCP as well as external services
  • identifies security compliance violations and misconfigurations in Google Cloud assets, surfacing them as findings (see the sketch after this list)
  • exports SCC data to Splunk as well as other SIEMs
  • standard
    • SHA: security health analytics
    • WSS: web security scanner
    • CA/WAF: cloud armor
    • DLP: cloud data loss prevention
    • anomaly detection
    • Forseti Security integration
  • premium
    • SHA: adds monitoring/reporting for compliance
    • WSS: adds managed scans
    • ETD: event threat detection
    • CTD: container threat detection
    • continuous exports to Pub/Sub
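
As a sketch of pulling SCC findings programmatically (for example, before forwarding them to a SIEM), the snippet below assumes the google-cloud-securitycenter Python client; the organization ID is hypothetical, and the "-" wildcard requests findings from all sources.

```python
from google.cloud import securitycenter

client = securitycenter.SecurityCenterClient()

# "-" wildcard: findings from every source in the org (hypothetical org ID).
all_sources = "organizations/123456789/sources/-"

for result in client.list_findings(request={"parent": all_sources}):
    finding = result.finding
    print(finding.category, finding.severity, finding.resource_name)
```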

Web Security Scanner - guarding frontlines of Internet traffic

  • detects key vulnerabilities in App Engine, Compute Engine, and Kubernetes Engine applications
  • crawler based, supports public URLs and IPs not behind a firewall
  • standard
    • custom scans
  • premium
    • managed scans
  • detects
    • Cross-site scripting (XSS)
    • Flash injection
    • mixed (HTTP/HTTPS) content
    • outdated and insecure JavaScript libraries
    • clear-text (unencrypted) passwords

Cloud Armor

  • edge-level, enterprise-grade DDoS protection and web application firewall (WAF)
  • leverages Google Cloud load balancing
  • helps mitigate the OWASP Top 10 risks
  • features
    • allow or deny traffic by IPs or CIDR ranges
    • preview changes before pushing policy live
    • configure WAF rules to reduce false positives
    • reference named IP address lists from CDN partners (Fastly, Cloudflare, Imperva)

Event Threat Detection - malware, crypto mining

  • identifies threats in near-real time by monitoring and analyzing Cloud Logging
  • threats are defined by rules, which specify needed logs
  • create custom rules by running queries on log data exported to BigQuery (see the sketch after this list)
  • quickly detect many types of attacks
    • malware
    • crypto mining
    • outgoing DDoS attacks
    • port scanning
    • IAM anomalous grant
    • brute-force SSH
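
As a rough sketch of that custom-rule workflow, the query below scans audit logs exported to BigQuery by a Cloud Logging sink for IAM policy changes. The project, dataset, and field names are illustrative (they follow the usual audit-log sink schema) and should be verified against your own dataset.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table populated by a Cloud Logging sink of admin activity logs.
query = """
    SELECT timestamp,
           protopayload_auditlog.authenticationInfo.principalEmail AS principal
    FROM `my-project.security_logs.cloudaudit_googleapis_com_activity`
    WHERE protopayload_auditlog.methodName = 'SetIamPolicy'
    ORDER BY timestamp DESC
    LIMIT 100
"""

for row in client.query(query):  # iterating the job waits for the result
    print(row.timestamp, row.principal)
```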

Cloud Data Loss Prevention

  • inspection, classification, and de-identification platform to protect sensitive data (see the sketch after this list)
  • includes over 150 data detectors for personally identifiable information (PII)
  • connect DLP results to SCC or Data Catalog, or export them to an external SIEM or governance tool
  • detects data in
    • streams of data or structured text
    • files in Cloud Storage or BigQuery
    • images
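
A minimal sketch of inspecting a text string for one built-in PII infoType with the google-cloud-dlp Python client; the project ID is hypothetical.

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

response = client.inspect_content(
    request={
        "parent": "projects/my-project",  # hypothetical project ID
        "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
        "item": {"value": "Contact me at jane.doe@example.com"},
    }
)

for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)
```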

Summary

  • the Cloud Security Command Center (SCC) platform monitors the majority of GCP's security services and is accessible through Standard and Premium tiers
  • if you use Google Cloud's external HTTPS load balancer, protect your web-based applications hosted on GAE, GCE, or GKE with Web Security Scanner
  • when Event Threat Detection (ETD) is enabled, GCP analyzes a range of logs from Cloud Logging to find signs of malware, crypto mining, outgoing DDoS attacks, brute-force SSH, and other threats

Managing Encrypted Keys

A crypto key is a string of characters that, when used with an encryption algorithm, makes ordinary text unreadable. When that key, or its paired key, is used with a decryption algorithm, it makes the text readable again.

To be effective, crypto keys have to be complex and are not something anyone should memorize. As such, we need a service such as KMS to maintain them.

Cloud Key Management Service (KMS)

  • highly available, low-latency service to generate, manage and apply cryptographic keys
  • Cloud KMS encrypts and decrypts - it does not store secrets itself - and controls access to keys (see the sketch after this list)
  • supports both symmetric (e.g. AES) and asymmetric (e.g. RSA or EC) algorithms
  • includes a 24-hour delay for key material destruction, to prevent accidental or malicious data loss
  • supports regulatory compliance and adds optional variations
    • Cloud HSM
    • Cloud EKM
    • CMEK
    • CSEK
  • Google recommends you regularly and automatically rotate symmetric keys
    • rotation of asymmetric keys cannot be automated, but regular manual rotation is still good practice
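
A minimal sketch of a symmetric encrypt/decrypt round trip with the google-cloud-kms Python client; the project, location, key ring, and key names are hypothetical, and the key must already exist.

```python
from google.cloud import kms

client = kms.KeyManagementServiceClient()

# Hypothetical resource names for an existing symmetric key.
key_name = client.crypto_key_path("my-project", "us-east1", "my-key-ring", "my-key")

plaintext = b"sensitive data"
encrypted = client.encrypt(request={"name": key_name, "plaintext": plaintext})
decrypted = client.decrypt(request={"name": key_name, "ciphertext": encrypted.ciphertext})
assert decrypted.plaintext == plaintext  # round trip succeeds
```

Note that the key material never leaves the service: the application sends data to Cloud KMS for each operation, which is how KMS can enforce IAM access control on every call.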

Cloud Hardware Security Module (HSM)

  • hosts encryption keys and performs cryptographic actions in a cluster of FIPS 140-2 Level 3 certified devices
  • enables compliance with hardware requirements
  • HSM keys are cryptographically bound to a region, with support for multi-regions
  • Cloud HSM properties
    • keys are non-exportable
    • tamper resistant
    • provides tamper evidence
    • auto-scales horizontally

Cloud External Key Management (EKM)

  • use keys from supported external key management partners instead of GCP
  • works only with supported CMEK integration services
    • BigQuery
    • Compute Engine
    • Cloud Run
    • Cloud Spanner
    • Cloud Storage
    • GKE
    • Pub/Sub
    • Secret Manager
  • key ring should be created in same location as external key management partner
  • benefits include
    • key provenance
    • access control
      • must grant GCP project access to key
    • centralized key management

Secret Manager

  • allows storage of passwords and variables for use in applications
  • fully managed service for storing, managing, and accessing secrets as binary blobs or text strings
  • used for storing sensitive runtime info such as database passwords, API keys, or TLS certificates
  • the data of each secret version is immutable; a new version is created each time the value is modified
  • best practices
    • follow principle of least privilege
    • limit access with IAM conditions
    • use the Secret Manager API instead of env vars
    • reference secrets by version number, not "latest" (see the sketch after this list)
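
A minimal sketch of reading a pinned secret version with the google-cloud-secret-manager Python client; the project and secret IDs are hypothetical, and the version is pinned explicitly rather than using "latest", per the best practice above.

```python
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()

# Hypothetical project/secret; pin an explicit version, not "latest".
name = "projects/my-project/secrets/db-password/versions/1"
response = client.access_secret_version(request={"name": name})

db_password = response.payload.data.decode("UTF-8")
```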

Encrypted Keys Flowchart: Service Types

  • key service selection flowchart (screenshot)

Summary

  • Cloud KMS offers a full range of key sources: Google-managed, Cloud HSM devices, or Cloud EKM partners, as well as customer-managed or customer-supplied keys
  • regular automatic rotation of symmetric algorithm keys is considered a best practice; Cloud KMS does not support automatic rotation of asymmetric keys
  • follow the principle of least privilege when assigning access to Secret Manager entries by using Cloud IAM conditions or secret-level binding
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment