Study Guide for GCP Professional Cloud Architect exam (notes from refresher course)

Architecting for the cloud

  • Architect solutions to be scalable and resilient
  • Business requirements involve lowering costs / enhancing user experience
  • Keep an eye on technical needs during development and operation

3 Major Questions To Ask

  1. Where is the company coming from?

    • business, technical, personnel
  2. Where is the company going to?

    • on GCP, hybrid, multi-cloud / regional, national, global
  3. What's next?

    • allow for future changes

Key Data Lifecycle Steps (4)

  1. Ingest - pull in raw data via streaming, batch, or app processes
  2. Store - keep the retrieved data in a durable and accessible environment
  3. Process/Analyze - transform the data into actionable information
  4. Explore/Visualize - convert processed data into shareable, relatable content

Ingesting Data (11 services)

Streaming

  • Cloud Pub/Sub - messaging middleware system

Batch

  • Cloud Storage - object storage in buckets
  • Storage Transfer Service - move data from one place to another
  • BigQuery Transfer Service - move structured data from one place to another
  • Transfer Appliance - move very large amounts of data (physically shipped to the cloud)

Application

  • Cloud Logging - application and service log output
  • Cloud Pub/Sub - event and message streams from applications
  • Cloud SQL - structured data
  • Cloud Firestore - serverless document data for NoSQL data
  • Cloud Bigtable - large amounts of NoSQL data
  • Cloud Spanner - fully managed relational database for structured SQL data
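
A concrete sketch of the streaming and batch ingestion paths above (topic, file, and bucket names are hypothetical):

```sh
# Streaming ingest: publish an event to a Pub/Sub topic
gcloud pubsub topics create ingest-events
gcloud pubsub topics publish ingest-events --message='{"sensor":"a1","temp":21.4}'

# Batch ingest: copy a local export into a Cloud Storage bucket
gsutil cp ./export-2024-01-01.csv gs://example-ingest-bucket/raw/
```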

Storing Data

Objects

  • Cloud Storage
  • Cloud Storage for Firebase - mostly mobile / web apps with some overlap

Databases

  • Cloud SQL - relational DB for MySQL, Postgres, SQL Server
  • Cloud Spanner - large distributed SQL
  • Cloud Bigtable - large NoSQL
  • Cloud Firestore - serverless NoSQL

Warehouse

  • BigQuery - serverless highly-scalable multi-cloud data warehouse

Processing and Analyzing Data

  • big data, ETL pipelines, machine learning

Compute

  • Compute Engine - virtual compute machines
  • Kubernetes Engine - orchestration of containerized workloads
  • App Engine - quickly get apps up and running

Large-Scale

  • Cloud Dataproc - modern data lake, ETL (Hadoop, Spark, Flink, Presto, and 30+ tools/frameworks)
  • Cloud Dataflow - based on Apache Beam
  • Cloud Dataprep - intelligent cloud data service to visually explore, clean, and prepare for analysis/ML

Analysis

  • BigQuery - analyze petabytes of data at incredible speeds with zero operational overhead

Exploring and Visualizing Data

Science

  • Cloud Datalab - uses Jupyter notebooks to explore, analyze, and visualize data

Visualizing

  • BigQuery BI Engine - business intelligence functionality for BQ
  • Cloud Data Studio - dashboards and reports over a host of data sources
  • Looker - front-end enterprise platform for BI, apps, embedded data analytics

Key points:

  • 4 phases: ingest, store, process/analyze, explore/visualize
  • Data ingested via streaming, batch, or application processes
  • Data structure can change, depending on its source and destination
  • Google offers a wide range of services to manage data in every phase of its lifecycle

Overall Principles

Grasping Key Tech Fundamentals

  • Describing distributed systems
  • Core networking fundamentals
  • Applying HTTP/HTTPS
  • Understanding SRE principles

Keeping in Compliance - follow spirit and letter of "the law"

  • Compliance with what?
  • Getting help with compliance
  • Relevant products and services

Annotating Resources Properly

  • Understanding annotation options
  • Applying security marks
  • Working with labels
  • Implementing networking tags
  • Choosing the right annotation

Managing Quotas & Costs

  • Working with quota limits
  • Cost optimization principles
  • Best practices (overall, compute, storage and data analysis)

Key Fundamentals

Distributed System - group of servers working together so as to appear as a single server to the end user

  • Scale Horizontally - increase capacity by adding more servers that work together
  • Scale Vertically - increase capacity by adding more memory or using a faster CPU
  • Sharding - splitting a server into multiple servers, a.k.a. "partitioning"

Networking - be familiar with 7-layer OSI model

  • 7 Layer OSI model
    • Application - End user layer (human comp interaction): HTTP, FTP, IRC, SSH, DNS
    • Presentation - Syntax layer: SSL, SSH, IMAP, FTP, MPEG, JPEG
    • Session - Sync and send to port: APIs, Sockets, WinSock
    • Transport - End to end Connections: TCP, UDP
    • Network - Packets: IP, ICMP, IPSec, IGMP
    • Data Link - Frames: Ethernet, PPP, Switch, Bridge
    • Physical - coax, fiber, wireless, hubs, repeaters
  • TCP/IP - primary way data gets around the Internet
    • Handshaking with syn/ack
    • Addressing with IPv4 and IPv6
    • Public Internet and private RFC1918 addressing
    • SSL/TLS - encrypted comms
    • SSH - secure remote shell access
    • Ports
      • 80 - HTTP
      • 22 - SSH
      • 53 - DNS
      • 443 - HTTPS
      • 25 - SMTP
      • 3306 - MySQL

Applying HTTP/HTTPS - works on L7 (Application Layer)

  • Understand your resources (URL/URI) and how parameters are applied
  • Know verbs: GET, POST, PUT, DELETE & PATCH, OPTIONS, TRACE, CONNECT
  • Have firm grasp of caching: headers and locations (browsers, proxies, CDN, memory cache)
  • Be familiar with CORS
  • HTTP/HTTPS status codes
    • 100 Informational
      • 100 - Continue
      • 101 - Switching protocol
    • 200 Successful response
      • 200 - Okay
      • 201 - Created
      • 202 - Accepted
      • 204 - No content
      • 206 - Partial content
    • 300 Redirection
      • 301 - Moved permanently
      • 304 - Not modified (caching)
      • 307 - Temporary redirect
      • 308 - Permanent redirect
    • 400 Client Errors
      • 400 - Bad request
      • 401 - Unauthorized
      • 403 - Forbidden
      • 408 - Request timeout
      • 429 - Too many requests
    • 500 Server Error
      • 500 - Internal server error
      • 501 - Not implemented
      • 502 - Bad gateway
      • 503 - Service unavailable / quota exceeded
      • 504 - Gateway timeout
      • 511 - Network authentication required
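
A quick way to observe these codes in practice, using standard curl flags (the URLs are placeholders):

```sh
# Print only the HTTP status code returned by a request
curl -s -o /dev/null -w "%{http_code}\n" https://example.com/
# Follow redirects (3xx) and report the final status code
curl -s -o /dev/null -w "%{http_code}\n" -L https://example.com/old-path
```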

Understanding SRE Principles - What happens when a software engineer is tasked with what used to be called operations (Ben Treynor, ~2003)

  • SLI - Service Level Indicator (carefully defined quantitative measure of level of service provided over time)
    • Request latency - how long to return a response to a request
    • Failure rate - fraction of all requests received that result in errors
    • Batch throughput - proportion of time that data processing rate > threshold set
  • SLO - Service Level Objective (specify target level for reliability of service)
    • 100% is unrealistic and more expensive, and often not required by users; best to find the point past which users don't notice a difference, so more resources can focus on the service's value-add (see the error-budget sketch after this list)
  • SLA - contractual obligation
    • includes consequences of meeting or missing SLOs it contains
  • SLIs drive SLOs, which inform SLAs
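
A back-of-envelope error-budget calculation for a 99.9% monthly availability SLO (assuming a 30-day month):

```sh
# Allowed downtime per month at a 99.9% availability SLO:
# (1 - 0.999) * 30 days * 24 h * 60 min = 43.2 minutes
echo "scale=1; (1 - 0.999) * 30 * 24 * 60" | bc
```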

Compliance

Compliance with what

  • Legislation - targeted areas (health regs, privacy, children's privacy, ownership)
  • Commercial - protect sensitive data, credit cards / PII
  • Industry certifications - ensure following health, safety, and environmental regulations
  • Audits - create necessary structure to allow for 3rd-party audits

Getting help with compliance

  • Visit the Compliance Center - sortable by region, industry, and focus area
  • General Data Protection Regulations (GDPR) - continue to have major impact on web services around the world
  • BAA - Google Business Associate Agreement (customer must request a BAA from their account manager for HIPAA compliance)

Relevant products and services

  • 2-factor authentication
  • Cloud Security Command Center (CSCC)
  • Cloud IAM (global across all Google Cloud)
  • Cloud Logging
  • Cloud DLP (de-identification routines to protect PII)
  • Cloud Monitoring (surface compliance missteps / alerts in real time)

Annotations

Understanding annotations

  • Security Marks - assigned and utilized through Cloud Security Command Center (CSCC)
  • Labels - key-value pairs that help you organize cloud resources
  • Network tags - applied to VM instances and used for routing traffic to/from them

Applying security marks

  • Adds business context to assets for compliance
  • Enhanced security focused insights into resources
  • Unique to CSCC
  • Set at org, project, or individually
  • Works with labels and network tags

Working with labels

  • Key-value pairs supported by a wide range of GCP resources
  • Used for many scenarios
    • Identify individual teams or cost center resources
    • Distinguish deployment environments
    • Cost allocation and billing breakdowns
    • Monitor resource groups for metadata
    • Labels can be applied to projects, but NOT folders

Implementing network tags

  • Control traffic to/from VM instances
  • Identify VM instances subject to firewall rules and network routes
    • Use tags as source and destination values in firewall rules
    • Identify instances on a certain route
  • Configured with gcloud, console, or API
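
A minimal sketch of tags and labels in action (all resource names are hypothetical); the tag on the VM is what the firewall rule targets:

```sh
# Create a VM carrying a network tag (and a label for cost reporting)
gcloud compute instances create web-1 \
  --zone=us-central1-a \
  --tags=web-server \
  --labels=env=dev,team=storefront

# Allow inbound HTTP only to instances carrying that tag
gcloud compute firewall-rules create allow-http \
  --allow=tcp:80 \
  --target-tags=web-server \
  --source-ranges=0.0.0.0/0
```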

Choosing right annotation

  • Need to group/classify for compliance?
    • Yes : use Security Marks
    • No : Need billing breakdown?
      • Yes : use Labels
      • No : Need to manage network traffic to/from VMs?
        • Yes : use Network Tags

Managing Quotas & Costs

Working within quota limits - restrict how much of a shared GCP resource you can use

  • Not to be confused with fixed constraints, which cannot be increased or decreased (i.e. max file size, database schema limits)
  • Two types of quotas:
    • Rate quotas - limit number of API or service requests
    • Allocation quotas - restrict the resource available at any one time
  • Limits are specific to your org
  • Add your own limits to impose spending limits
  • Exceeded quotas generate a quota error and a 503 status for HTTP requests
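
To inspect current quota limits and usage, two standard gcloud calls:

```sh
# Region-level quotas (CPUs, addresses, disks, ...) with limits and usage
gcloud compute regions describe us-central1

# Project-wide quotas
gcloud compute project-info describe
```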

Cost optimization principles

  • Understand the total cost of ownership (TCO)
  • Commonly misunderstood when moving from on-prem (CapEx) model to cloud-based (OpEx)
  • Organize costs in relation to business needs
  • Maximize value of all expenses while eliminating waste
  • Implement standardized processes at the start

Best practices: use cost management tools

  • Organize and Structure - set up folders, projects, and use labels to structure costs in relation to business needs

  • Billing Reports - view costs and analyze trends and filter as needed

  • Custom dashboards - can also export to BigQuery, then visualize in Cloud Data Studio

  • Compute - pay for the compute you need

    • Identify idle VMs
      • use Idle VM recommender service to identify inactive VMs
      • Snapshot them before deleting
      • Stop without deleting
    • Start/stop VMs automatically or via Cloud Functions
    • Create custom VMs with right size CPUs and memory
    • Make the most of preemptible/spot VMs (often is an option - consider it for exam)
  • Cloud Storage - ways to keep more of your company's hard-earned money

    • Choose the right storage class: Nearline (30-day minimum), Coldline (90-day), Archive (365-day)
    • Modify storage class as needed with lifecycle policies (see the sketch after this list)
    • Deduplicate data wherever possible (i.e. Cloud Dataflow)
    • Choose single-region rather than multi-region buckets where viable (multi-region costs more)
    • Set object versioning policies to keep copies down (i.e. delete oldest after 2 versions)
  • Keep BigQuery from BigCosts

    • Limit query costs with the maximum bytes billed setting (see the sketch after this list)
    • Partition tables based on ingestion time, date/timestamp, or integer range column
    • Switch from on-demand to flat-rate pricing to process unlimited bytes for a fixed, predictable cost
    • Combine Flex Slots (like preemptible) with annual and monthly commitments (blended)
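
Two of the practices above, sketched with standard tooling (bucket name is hypothetical). The lifecycle policy demotes objects to Coldline after 90 days and deletes them after a year; the bq flag makes a query fail rather than bill past a byte cap:

```sh
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 90}},
    {"action": {"type": "Delete"}, "condition": {"age": 365}}
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://example-bucket

# Cap BigQuery spend per query (~1 GB here); exceeding it aborts the job
bq query --nouse_legacy_sql --maximum_bytes_billed=1000000000 \
  'SELECT word FROM `bigquery-public-data.samples.shakespeare` LIMIT 10'
```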

Case Studies

EHR Healthcare

Who is EHR Healthcare - leading provider of EHR software to medical industry (SaaS to multi-national medical offices, hospitals, and insurance providers)

  • Big company, medical industry, multi-national (regulations), hospitals/insurance (HIPAA)

Primary concerns

  • Growing exponentially
  • Scaling their environment
  • Disaster recovery plan
  • New continuous deployment
  • Replace colocation facilities with GCP

Lay of the land (existing tech)

  • Multiple colocation facilities; lease on one about to expire
  • Apps are in containers; candidate for Kubernetes
  • MySQL, MSSQL, Redis, Mongo DB
  • Legacy integrations (no current plan to move short term)
  • Users managed by Microsoft AD; monitoring via open source; email alerts often ignored

Business requirements

  • Onboard new insurance providers ASAP
  • Minimum 99.9% availability for customer apps
  • Centralized visibility and proactive action on performance and usage
  • Provide insights into healthcare trends (AI platform)
  • Reduce latency for all customers
  • Maintain regulatory compliance
  • Decrease infra administration costs (can be handled through cloud computing)
  • Make predictions and generate reports on industry trends based on provider data (models from external data sources)

Technical requirements

  • Maintain legacy interfaces to insurance providers for both on-premises systems and cloud providers
  • Provide a consistent way to manage customer-facing, container-based applications (Anthos GKE)
  • Secure and high-performance connection between on-premises systems and GCP
  • Consistent logging, log retention, monitoring, and alerting capabilities
  • Maintain and manage multiple container-based environments
  • Dynamically scale and provision new environments
  • Create interfaces to ingest and process data from new providers (Dataproc or Dataflow)

Big picture (exec statement)

  • Our on-prem strategy has worked for years but has required major investment of time and money in training our team on distinctly different systems, managing similar, but separate environments, and responding to outages.

    • CapEx and OpEx way too high (too many diverse systems increasing mgmt and training costs)
  • Many of these outages have been a result of misconfigured systems, inadequate capacity to manage spikes in traffic, and inconsistent monitoring practices.

    • Too old or broken to deal with customer load; off/on monitoring; seeking change
  • We want to use Google Cloud to leverage a scalable and resilient platform that can span multiple environments seamlessly and provide a consistent and stable user experience that positions us for future growth.

    • They see light at end of tunnel which is Google Cloud, capable of handling legacy to modern

Key takeaways:

  • governance and compliance play a significant role
  • while dedicated to cloud computing, must maintain legacy integrations and high speed connections between GCP and on-prem
  • attention to security concerns is strong thread, containers and protecting patient data

Helicopter Racing League

Who is Helicopter Racing League - HRL is a global sports league for competitive helicopter racing. Each year HRL holds the world championship and several regional league competitions where teams compete to earn a spot on the world championship. HRL offers a paid service to stream the races all over the world with live telemetry and predictions throughout the race.

  • Global (covering a lot of territory with lots of regional focus); caters to the entire globe at once but also breaks down into smaller targeted services; a commercial enterprise, so uptime is important; gathers a lot of data in real time for analysis and forecasting.

Primary concerns

  • Migrate to new platform
  • Expand use of AI and ML
  • Fans in emerging regions
  • Move serving of content, real-time and recorded
  • Closer to viewers to keep latency down

Lay of the land

  • Already in the cloud (unnamed)
  • Existing content stored in Object Storage service on cloud
  • Video recording and editing handled at race tracks
  • Video encoding/transcoding handled in the cloud on VMs created for each job
  • TensorFlow predictions run on other VMs in cloud

Business requirements

  • Expose the predictive models to partners (API and private connectivity)
  • Increase predictive capabilities during and before races
  • Increase telemetry and create additional insights (enhance experience)
  • Measure fan engagement and new predictions
  • Enhance global availability and quality of broadcasts
  • Increase the number of concurrent viewers (streaming capacity increase)
  • Minimize operational complexity (standardize)
  • Ensure compliance with regulations
  • Create a merchandising revenue stream (e-comm app or connection to one)

Technical requirements

  • Maintain or increase prediction throughput and accuracy (ramp up efficiency)
  • Reduce viewer latency (get content closer to viewers)
  • Increase transcoding performance (vertically scale up VMs)
  • Create real-time analytics of viewer consumption patterns and engagement (streaming data and pipeline)
  • Create data mart to enable processing of large volumes of race data (batch data)

Big picture (exec statement) Our CEO, S. Hawke, wants to bring high-adrenaline racing to fans all around the world. We listen to our fans, and they want enhanced video streams that include predictions of events within the race (e.g., overtaking).

  • Global, ramped-up graphics processing, heavily data dependent and may include video analysis

Our current platform allows us to predict race outcomes but lacks the facility to support real-time predictions during races and the capacity to process season-long results.

  • Streaming data analysis, batch analysis

Key takeaways:

  • emphasizes numerous scenarios involving data predictions and forecasts that would entail significant use of AI and ML
  • global org and intent on extending their reach and market while maintaining high quality and low latency
  • HRL must process a tremendous amount of data in near real-time and output the results worldwide to specific regions

Mountkirk Games

Who is Mountkirk Games - makes online, session-based, multiplayer games for mobile platforms. They have recently started expanding to other platforms after successfully migrating their on-premises environments to Google Cloud. Their most recent endeavor is to create a retro-style first-person shooter (FPS) game that allows hundreds of simultaneous players to join a geo-specific digital arena from multiple platforms and locations. A real-time digital banner will display a global leaderboard of all the top players across every active arena.

Primary concerns

  • Building a new multiplayer game
  • Want to use GKE
  • Use global load balancer to keep latency down
  • Keep global leader board in sync (streaming data)
  • Willing to use Cloud Spanner as their database engine

Lay of the land

  • Recently lifted & shifted 5 games to GCP
  • Each game in its own project under one folder (which maintains most permissions and network policies)
  • Some legacy games with little traffic consolidated to single project
  • Separate environments for development and testing

Business requirements

  • Support multiple gaming platforms (from mobile only to multiple platforms)
  • Support multiple regions (protect data and diff compliance regs)
  • Support rapid iteration of game features (CICD)
  • Minimize latency
  • Optimize for dynamic scaling
  • Use managed services and pooled resources (standardization)
  • Minimize costs

Technical requirements

  • Dynamically scale based on game activity
  • Publish scoring data on near real-time global leaderboard
  • Store game activity logs in structured files for future analysis
  • Use GPU processing to render graphics server-side for multi-platform support
  • Support eventual migration of legacy games to this new platform

Big picture (exec statement) Our last game was the first time we used Google Cloud and it was a success. We were able to analyze player behavior and game telemetry in ways that we never could before. This success allowed us to bet on a full migration to the cloud and to start building all new games using cloud native design principles.

  • See advantage reviewing user actions and game responses; going completely cloud native

Our new game is our most ambitious to date and will open doors for us to support more gaming platforms beyond mobile. Latency is our top priority, although cost management is the next most important challenge.

  • Higher performance; lower cost

As with our first cloud-based game, we have grown to expect the cloud to enable advanced analytics capabilities so we can rapidly iterate on our deployments of bug fixes and new functionality.

  • Double down on analytical approach that gave them an edge; invest in Cloud Spanner to achieve goals

Key takeaways

  • Wants to expand reach to other gaming platforms and other regions of the world
  • Very specific ideas on how to architect their next steps, including Kubernetes, Load Balancer, and Cloud Spanner
  • Latency as top priority and cost management as second; happy users while keeping eye on bottom line

TerramEarth

Who is TerramEarth - manufactures heavy equipment for the mining and agriculture industries. They have over 500 dealers and service centers in 100 countries. Their mission is to build products that make their customers more productive.

  • Sophisticated earth-moving equipment; solid network; customer focused

Primary concerns

  • 2 million TE vehicles in operation
  • Collect telemetry data from many sensors (IoT)
  • Subset of critical data in real time
  • Rest of data collected, compressed, and uploaded daily
  • 200-500MB of data per vehicle per day (1 PB each day)

Lay of the land

  • Infra in GCP serving clients all around the world (data gathering and analysis)
  • Private data center integration (feeds from 2 main manufacturing plants) with multiple Interconnects

Business requirements

  • Predict and detect vehicle malfunction
  • Ship parts to dealerships for just-in-time repair with little/no downtime
  • Decrease cloud operational costs and adapt to seasonality
  • Increase speed and reliability of developer workflow (SRE)
  • Allow remote developers to be productive without compromising code or data security
  • Create flexible and scalable platform for custom API Services for dealers and partners (Apigee)

Technical requirements

  • Create a new abstraction layer for HTTP API access to legacy systems to enable a gradual migration without disrupting operations (API gateway)
  • Modernize all CI/CD pipelines to allow developers to deploy container-based workloads in highly scalable environments (GKE, Cloud Run, Cloud Build)
  • Allow developers to experiment without compromising security and governance (new test project)
  • Create a self-service portal for internal and partner developers to create new projects, request resources for data analytics jobs, and centrally manage access to the API endpoints (secure new web front end with ability to spin up resources; network tags)
  • Use cloud-native solutions for keys and secrets management and optimize for identity-based access (IAM, Secret Manager, and KMS)
  • Improve and standardize tools necessary for application and network monitoring and troubleshooting (Cloud Operations: Monitoring, Logging, Debugging)

Big picture (exec statement) Our advantage has always been our focus on the customer, with our ability to provide excellent customer service and minimize vehicle downtime. After moving multiple systems to Google Cloud, we are seeking new ways to provide best-in-class online fleet management services to our customers and improve operations of our dealerships.

  • If the customer is successful, so are they; keeping vehicles operational leads to success; always improving

5-year strategic plan is to create a partner ecosystem of new products by enabling access to our data, increasing autonomous operation capabilities of our vehicles, and creating a path to move the remaining legacy systems to the cloud.

  • Moving physical and digital information daily

Key takeaways

  • places great emphasis on customer and partner support which requires consistent and secure communication between systems and devices
  • after success of initial migration, TE seeks to expand their global integration without disrupting operations or regulations
  • company's equipment must be able to transmit and analyze a great deal of telemetry data to maintain high-performance levels and just-in-time repairs

Processing Data

Compute Services

Overview: [diagram of compute services]

Compute Engine

  • fast-booting VMs
  • highly configurable, zonal service
  • choose machine types: general-purpose, compute-optimized, memory-optimized, accelerator-optimized (GPU)
  • select public or private disk image
  • options include preemptible (or spot)
  • also good to know about sole-tenant nodes (BYOL / dedicated hardware requirements), instance groups (MIG/UIG)

Kubernetes Engine (GKE) Container orchestration system with clusters, node pools, and control plane

  • regional, managed container service
  • standard (total control), autopilot (fully managed)
  • supports auto repair and auto upgrade
  • know the following:
    • kubectl syntax
    • private clusters (VPC native w/ RFC1918 IP addresses)
    • how to deploy, scale, expose services
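
For the deploy/scale/expose bullet above, a minimal kubectl sequence (image and names are illustrative):

```sh
kubectl create deployment web --image=nginx:1.25                    # deploy
kubectl scale deployment web --replicas=3                           # scale manually
kubectl expose deployment web --type=LoadBalancer --port=80         # expose via LB
kubectl autoscale deployment web --cpu-percent=60 --min=3 --max=10  # HPA
```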

App Engine Oldest of all GCP services; comes in 2 environments: Standard and Flexible

  • Standard
    • regional, platform as a service for serverless apps
    • zero server mgmt and config
    • instantaneous scaling, down to zero VMs
    • features
      • second gen runtimes: python 3, java 11, nodejs, php 7, ruby, go 1.12+
      • 1st gen is limited
  • Flex
    • for containerized apps
    • zero server mgmt and config
    • best for apps with consistent traffic, gradual scaling is acceptable
    • robust runtimes
      • python 2.7/3.6, java 8, nodejs, php 5/7, ruby, go, .net

Cloud Run

  • great for modern websites, REST APIs, back-office administration
  • regional, fully managed serverless service for containers
  • integrated support for cloud operations
  • built on Knative open-source standards for easy portability
  • supports any language, library, or binary
  • scales from zero and back in an instant
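
A single-command deployment sketch, using Google's public "hello" sample container:

```sh
gcloud run deploy hello \
  --image=us-docker.pkg.dev/cloudrun/container/hello \
  --region=us-central1 \
  --allow-unauthenticated   # public endpoint; omit to require IAM auth
```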

Cloud Functions

  • regional, event-driven, serverless functions as a service (FaaS)
  • triggers
    • HTTP
    • Cloud Storage
    • Cloud Pub/Sub
    • Cloud Firestore
    • Audit Logs
    • Cloud Scheduler
  • totally serverless
  • automatic horizontal scaling
  • networks well with hybrid and multi-cloud
  • acts as glue between services
  • great for streaming data and IoT apps
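
A deployment sketch wired to one of the triggers above (function, topic, and handler names are hypothetical):

```sh
# Deploy a function that fires on every message published to a topic
gcloud functions deploy process_event \
  --runtime=python39 \
  --trigger-topic=ingest-events \
  --entry-point=handler \
  --region=us-central1
```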

Choosing the correct compute option

[diagrams: compute option decision trees]

Summary

  • Mobile apps: Firebase
  • event-driven functions: Cloud Functions
  • specific OS or kernel: Compute Engine
  • no hybrid or multi-cloud: App Engine Standard (rapid scale) or Flex
  • containers: Cloud Run or Kubernetes Engine

Compute autoscaling comparison

[diagram: compute autoscaling comparison]

Summary

  • when working with Compute Engine, remember that MIGs coupled with Cloud Load Balancer results in faster autoscaling response
  • for HA, Kubernetes Engine node pool is best used with minimum of 3 nodes in production
  • Cloud Run scales almost as fast as App Engine Standard, and you are only charged when a request is made

Evolving the Cloud with AI and ML services

AI Data Lifecycle Steps

Key DATA lifecycle steps (covered earlier)

  1. Ingest
  2. Store
  3. Process / Analyze
  4. Explore / Visualize

Key AI Data lifecycle steps

  1. Ingest
  2. Store
  3. Process / Analyze
  4. Train
  5. Model
  6. Evaluate
  7. Deploy
  8. Predict

Reviewing AI and ML Services AI on Google Cloud has been evolving, and the platform is "currently" called "Vertex AI"

ML Services

  • Vision API (OCR, tagging)
  • Video Intelligence API (local, cloud storage, track objects, recognize text)
  • Translation API (Cloud Translation for 100 language pairs, with auto-detect)
    • Basic / Advanced (also includes batch requests, custom models, glossaries)
  • Text-to-speech / Speech-to-text
  • Natural Language API
  • Cloud TPU (hardware behind the APIs above)
    • 8 VMs w/ GPU took 200 minutes vs 1 TPU 8 minutes; faster and cheaper for some tasks

ML Best Practices Setting up the ML environment

  • use Notebooks for development
  • create a Notebook instance for each teammate
    • treat each notebook instance as virtual workspace
    • stop when not in use
  • store prepared data and model in same project

ML development

  • prepare a good amount of training data
  • store tabular data in BigQuery
  • store unstructured data (images, video, audio) in Cloud Storage
    • includes TFRecord files, Avro, etc.
    • aim for files > 100MB and between 100 - 10,000 shards

During data processing

  • use Tensorflow Extended for TF projects
    • NEW: Vertex AI Pipelines (replacement in future)
  • process tabular data with BigQuery
    • can use BigQuery ML and save results in BQ permanent table
  • process unstructured data with Cloud Dataflow (based on Apache Beam)
    • can generate TF record
    • if using Apache Spark, then can use Dataproc
  • Link data to model with managed datasets

Putting the model into production

  • specify appropriate (virtual) hardware
    • may be straight VMs or with GPU/TPU
  • plan for additional inputs (features) to model
    • i.e. data lake, messaging
  • enable autoscaling

Summary

  • AI data lifecycle expands the traditional lifecycle
    • ingest, store, transform, train, model, evaluate, deploy, and predict
  • Vertex AI is Google Cloud's AI platform, incorporating all machine learning APIs, such as Vision API, its AutoML services, and even related hardware, like Cloud TPU
  • Be sure to use the proper GCP service for the various stages in the AI data lifecycle, such as using BigQuery for storing and processing tabular data, and Dataflow / Dataproc for processing unstructured data

Handling Big Data and IoT

Working with Cloud IoT Core Devices

  • remember TerramEarth
  • Cloud IoT Core - fully managed
    • Device manager (identity, auth, config, control)
    • Protocol bridge (publishes incoming telemetry data to Pub/Sub for processing)
  • Features
    • Secure connection via HTTPS or MQTT
    • CA signed certs verify device ownership
    • 2-way comms allow updates, on and offline
  • How it works
    • Devices -> Cloud IoT Core -> Pub/Sub -> Cloud Functions or Dataflow (update device config after processing)

Massive Messaging via Cloud Pub/Sub

  • Scalable, durable, global messaging and ingestion service, based on at-least-once publish/subscribe model
  • Connects many services together and helps small increments of data to flow better
  • Supports both push and pull modes, with exactly-once processing
  • Pull mode delivers message and waits for ACK
  • Features
    • Truly global: consistent latency from anywhere
    • Messages can be ordered and/or filtered
    • Lower-cost Pub/Sub Lite is available, requiring more management and offering lower availability and durability
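
The pull model described above, end to end (topic and subscription names are hypothetical):

```sh
gcloud pubsub subscriptions create telemetry-sub \
  --topic=ingest-events --ack-deadline=30   # seconds before redelivery
# Pull one message and acknowledge it in the same call
gcloud pubsub subscriptions pull telemetry-sub --auto-ack --limit=1
```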

The Big Data Dog: Cloud BigQuery

  • Serverless, multi-regional, multi-cloud SQL column-store data warehouse
  • Scales to handle terabytes in seconds and petabytes in minutes
  • Built-in integration for ML and backbone for Business Intelligence Engine
  • Supports real-time analytics with streams from Pub/Sub, Dataflow, and Datastream
  • Automatically replicates data and keeps seven-day history of changes
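
A serverless query requires no provisioning at all; for example, against a BigQuery public dataset:

```sh
bq query --nouse_legacy_sql \
  'SELECT corpus, SUM(word_count) AS words
   FROM `bigquery-public-data.samples.shakespeare`
   GROUP BY corpus ORDER BY words DESC LIMIT 5'
```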

Transforming Big Data

Cloud Dataprep

  • visually explore, clean, and prepare data for analysis and ML, used by data analysts
  • integrated partner service offered by Trifacta in conjunction with Google
  • automatically detects schemas, data types, possible joins, and anomalies like missing values, outliers, and duplicates
  • interprets data transformation intent by user selection and predicts next transformation
  • transformation functions include
    • aggregation, pivot, unpivot, joins, union, extraction, calculation, comparison, condition, merge, and regex
  • works with CSV, JSON, or relational data from Cloud Storage, BigQuery, or upload
  • outputs to Dataflow, BigQuery, or exports to other file formats

Cloud Dataproc (map reduce)

  • Zonal resource that manages Spark and Hadoop clusters for batch MapReduce processing
  • Can be scaled (up or down) while running jobs
  • Offers image versioning to switch between versions of Spark
  • Best for migrating existing Spark or Hadoop jobs to the cloud
  • Most VMs in cluster can be preemptible, but at least one node must be non-preemptible
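
A cost-conscious cluster sketch per the preemptible note above (cluster name is hypothetical); secondary workers are preemptible by default, while primary workers stay non-preemptible:

```sh
gcloud dataproc clusters create etl-cluster \
  --region=us-central1 \
  --num-workers=2 \
  --num-secondary-workers=4
```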

Cloud Dataflow (more recent approach)

  • Unified Data Processing
  • Serverless, fast, and cost-effective
  • Handles both batch and streaming data with one processing model (other tools handle only one)
  • Fully managed service, suitable for a wide variety of data processing patterns
  • Horizontal autoscaling with reliable, consistent, exactly-once processing
  • Based on open-source Apache Beam
    • Beam is an open-source, unified model for defining both batch and streaming data-parallel processing pipelines
    • Use Beam SDK to build a program that defines a pipeline
      • Java, Python, Go
    • a supported distributed processing backend, such as Cloud Dataflow, executes the pipeline
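
One low-effort way to run a Beam pipeline on Dataflow is to launch one of Google's stock templates (the output bucket is hypothetical):

```sh
gcloud dataflow jobs run wordcount-demo \
  --gcs-location=gs://dataflow-templates/latest/Word_Count \
  --region=us-central1 \
  --parameters=inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,output=gs://example-bucket/wordcount/out
```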

Choosing the right tool: [diagram: big data service decision tree]

Summary

  • Cloud IoT Core - global, fully-managed service to connect, manage and ingest data from Internet-connected devices and a primary source for streaming data
  • Cloud Pub/Sub - global messaging and ingestion service that supports both push and pull modes with exactly-once-processing for many GCP services
  • Cloud BigQuery - serverless, multi-regional, multi-cloud, SQL column-store data warehouse used for data analytics and ML capable of scaling to petabytes in minutes
  • GCP has a number of big data processing services
    • Cloud Dataprep for visually preparing data
    • Cloud Dataproc for working with Spark and Hadoop-based workloads
    • Cloud Dataflow for both batch and streaming data with one processing model

Containers and Specialized Workloads

Kubernetes Engine

Coordinating Clusters

  • includes at least one control plane and multiple worker machines (a.k.a. nodes)
  • can create zonal or regional clusters
    • single or multi-zonal (single control plane replica)
    • regional cluster (control plane replicated across multiple zones in a region)
  • private clusters are VPC-native, dependent on internal IP addresses
  • for HA apps, distribute your workload using multi-zonal node pools
  • Horizontal Pod Autoscaler (HPA) checks the workload's metrics against target thresholds
  • Configure horizontal pod autoscaling on deployment, rather than ReplicaSet
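
Creating a regional (HA) cluster versus a zonal one differs only by a flag (cluster names are hypothetical):

```sh
# Regional: control plane + nodes replicated across the region's zones
gcloud container clusters create prod-cluster --region=us-central1 --num-nodes=1
# Zonal: single control plane replica in one zone
gcloud container clusters create dev-cluster --zone=us-central1-a --num-nodes=3
```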

Working with Workloads (application running on Kubernetes)

  • custom and external metrics let the HPA scale on conditions other than the workload's own resource usage
    • custom metric reported from your app running in K8S
    • external metric reported from service outside cluster
  • configuring limits for Pods based on workload is highly recommended
  • ConfigMaps bind non-sensitive configuration artifacts to your pod containers at runtime (see the sketch after this list)
  • Deployments are best for stateless apps with ReadOnlyMany or ReadWriteMany volumes
  • DaemonSets are good for ongoing background tasks that do not require user intervention
    • attempt to adhere to 1 pod/node model (across cluster or subset of nodes)
  • StatefulSets are pods with unique persistent identities and hostnames
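
A ConfigMap sketch matching the bullet above (names and values are illustrative):

```sh
# Bind non-sensitive config to pods at runtime
kubectl create configmap app-config \
  --from-literal=LOG_LEVEL=info \
  --from-literal=FEATURE_FLAGS=beta
# Pods reference it via envFrom/configMapRef (or a mounted volume)
kubectl get configmap app-config -o yaml
```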

Networking Pods, Services and External Clients

  • VPC-native clusters scale better than routes-based clusters and are needed for private clusters
    • VPC native uses alias IP
    • routes-based uses static routes
  • Shared VPC networks are best for orgs with centralized management team
    • attach Service Projects to the Host Project (sharing selected subnets/ranges)
  • GKE Ingress (internal or external) implements Ingress resources as Google Cloud load balancers for HTTP(S) workloads
  • Workload Identity links Kubernetes service accounts to Google service accounts to safely access other Google services
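
The Workload Identity link boils down to one IAM binding between a Kubernetes service account (KSA) and a Google service account (GSA); all names below are placeholders:

```sh
# Let the KSA "app-ksa" in namespace "default" impersonate the GSA
gcloud iam service-accounts add-iam-policy-binding \
  app-gsa@example-project.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:example-project.svc.id.goog[default/app-ksa]"
```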

Keeping an eye on Operations

  • monitoring and logging can be enabled for both new and existing clusters
  • GKE container logs are removed when the host pod is removed, when their disk runs out of space, or when replaced by newer logs
  • GKE generates two types of metrics:
    • System metrics - metrics from essential system components describing CPU, memory, storage
    • Workload metrics - exposed by any GKE workload like cronjob, etc.
  • use Istio Fault Injection to test app resiliency (chaos engineering)

Summary

  • clusters can be zonal (single or multi-zonal), regional, or private. Use regional clusters for high-availability production workloads
  • keep in mind the best use cases and scenarios for Deployments, StatefulSets, DaemonSets, and ConfigMaps
  • remember that VPC-native networks are required for private clusters, and Ingress objects create load balancers for both external and internal HTTP traffic. Remember to use Workload Identity to connect clusters to other Google services

Anthos: Closer Look

Uncovering the Anthos 411

  • application deployment anywhere: GCP, on-prem, hybrid, and multicloud
  • supports K8S clusters, Cloud Run, Compute Engine VMs
  • use Migrate for Anthos to migrate and modernize existing workloads to containers
  • enhance app development and delivery with up-to-date CI/CD automated pipelines
    • uses open source
  • enables a defense-in-depth security strategy with comprehensive security controls across all deployments
  • fully integrated with GCP Monitoring and Logging, including hybrid and on-prem configs

Managing Microservices with Anthos Service Mesh

  • suite of tools to monitor and manage service mesh on-prem or Google Cloud
  • ASM enables managed, observable, and secure communication across microservices, on-prem, and GCP
  • Powered by open-source Istio, ASM is one or more control planes and a data plane that monitors all traffic through proxies
  • ASM controls traffic flow between services as well as ingress and egress
    • supports canary and blue-green deployments
    • configure load balancing between services
    • provides in-depth telemetry with Cloud Monitoring, Logging, and Trace

The Kubernetes Engine Connection

  • Anthos Clusters provide a unified way to work with K8S clusters as part of Anthos, extending GKE to work in multiple environments
  • Anthos on GCP uses "traditional" GKE, while on-premises uses VMWare and Bare Metal
  • Logically group and normalize multiple clusters via Fleets to manage multi-cluster capabilities and apply consistent policies
  • Anthos Config Management (ACM) creates a common configuration across all infra, including custom policies, applied both on-premises and in the cloud
  • Binary Authorization configures a validation policy enforced when deploying a container image
    • only explicitly-authorized images are deployed, verified by an "Attester"

Accessing Cloud Run for Anthos

  • flexible serverless development platform for hybrid and on-prem environments
  • managed with Knative, which enables serverless workloads on K8S
  • streamlines operational needs with advanced workload autoscaling and automatic networking
  • Scale idle workloads to zero or set min instance count for baseline availability
  • Out-of-the-box integration with Monitoring, Logging, and Error Reporting
  • Easily perform A/B tests with traffic splitting and quickly roll back to known working services

Summary

  • Anthos makes it possible to deploy, manage, and monitor applications anywhere and in multiple locations: GCP, on-prem, multicloud, or hybrid
  • in addition to supporting GKE, Cloud Run, and VMs, Anthos offers system-spanning services such as Migrate for Anthos, Anthos Service Mesh (ASM), and Anthos Config Management (ACM)
  • familiarize yourself with special features that Anthos offers, particularly in securing CI/CD pipelines like Binary Authorization, Service Mesh testing and reporting, and Cloud Run for Anthos traffic splitting

Bare Metal: Closer Look

All about Anthos Bare Metal

  • Anthos clusters on bare metal allow you to directly deploy applications on your own hardware
  • manages app deployment and health across existing datacenters for more efficient operations
  • control system security without compatibility issues for virtual machines and OS
  • scale up apps while maintaining reliability regardless of fluctuations in workload and network traffic thanks to advanced monitoring
  • security can be customized with minimal connections to outside resources

Discovering Deployment Options

  • Admin Cluster: manages user clusters
  • User Cluster: control plane + workers
  • 3 basic models to choose from
    • Standalone: single cluster both user and admin
      • best for single teams or workloads
      • no need for separate admin clusters
      • works great for edge locations
    • Multi-cluster: one admin and one or more user clusters
      • works well for fleet of clusters with central mgmt
      • provides separation between teams
      • isolates development and production workloads
    • Hybrid: runs user workloads on admin
      • create from standalone by adding more user clusters
      • use only if no security concerns with user workloads on admin
      • configure HA for user clusters independently

Operating Bare Metal Clusters

  • use Connect to associate your bare metal clusters to Google Cloud
  • access is enabled for workload management and unified UI (Cloud Console)
  • Cloud Console displays health of all connected workloads and allows modifications to all
  • Put nodes into maintenance mode to drain pods/workloads and exclude them from pod scheduling

Summary

  • Anthos on bare metal gives the best flexibility, using a company's own hardware
  • Bare metal offers 3 kinds of deployment for admin/user clusters: standalone, multi-cluster, and hybrid
  • once Connect has been used to associate your clusters with Google Cloud, the Cloud Console is enabled and provides a unified user interface for all clusters, regardless of location

Storing Data

Storing Objects and Files

Going straight to Local SSD

  • fastest block storage option (physical disk attached to computer)
  • very fast zonal resource, 375 GB solid state disk directly attached to server hosting VM instance
  • expandable to 3, 6, 9 TB with increasing performance up to 2.4M reads and 1.2M write IOPS
  • all data encrypted at rest; data is lost when the VM stops but survives live migration
  • best for transient data (media rendering, analytics, high-perf computing, caches)

Persevering with Persistent Disks

  • major benefit: persistence, available after VM shutdown
  • independent of VMs where data is distributed across disks for redundancy
  • highly durable (up to six 9s) and secure: data encrypted at rest and in transit
  • configurations
    • zonal
      • data in a single zone
      • 4 types: Standard, Balanced, SSD, Extreme
      • can be used for both snapshot and boot disk
      • can add more storage space, throughput, and IOPS
    • regional
      • data in 2 zones in same region
      • 3 types: Standard, Balanced, SSD
      • can be used for snapshots but not for boot disks
      • ONLY storage can be changed, not throughput or IOPS

Managing File-based Storage

  • file stored as whole unit without data being broken down into blocks
  • fully managed file-based storage service (like a NAS)
  • provision instance in specific zone
  • access using NFSv3 protocol
  • consistently fast and good for lift and shift migration
  • read-only snapshots are supported
  • 3 tiers
    • Basic: best for file sharing, k8s, dev, web hosting (1-63.9 TiB)
    • Enterprise: best for critical large-scale ops, GCE, K8S (1-10 TiB)
    • High Scale: best for high-perf computing (i.e. genome sequencing: 10-100 TiB)

Keeping Objects in Cloud Storage

  • infinitely scalable, fully-managed, highly durable object storage service (11 9s of durability)
  • for mutable, unstructured data such as images, videos, and documents
  • all objects stored in buckets
    • can be regional or multi-regional
    • support folders/sub-folders
    • supports versioning per bucket, with live object and noncurrent versions
  • permissions granted by bucket or by object and limited to teams, or people, or fully public
  • storage classes
    • standard: most frequently accessed or for brief time
    • nearline: for data you plan to access once a month or less
    • coldline: access at most once every 90 days
    • archive: access less than once/year
    • use lifecycle management rules to move objects between classes
      • age
      • creation date (created before or after a given date)
      • current version
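
A bucket sketch exercising the class and versioning options above (bucket name is hypothetical):

```sh
gsutil mb -l us-central1 -c nearline gs://example-archive-bucket  # storage class at creation
gsutil versioning set on gs://example-archive-bucket              # keep noncurrent versions
```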

Summary

  • local SSDs are highest performance block storage but lose data if VM is stopped
  • file-based storage, go with Filestore: Basic, Enterprise, High Scale
  • control costs with bucket locations and lifecycle rules

Selecting the proper storage

Screen Shot 2022-07-25 at 2 27 42 PM

Buckets

Screen Shot 2022-07-25 at 2 28 53 PM


Saving your data on GCP

Cloud SQL

  • regional, fully-managed relational db service for SQL Server, MySQL, and PostgreSQL
  • automatic replication with automatic failover, backup, point-in-time recovery
  • scale manually up to 96 cores, more than 624GB RAM, add replicas as needed
  • features
    • built-in high availability
    • automatically scales storage up to 30TB
    • connects with GAE, GCE, GKE, and BigQuery, among other services
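
An HA instance sketch reflecting the features above (instance name is hypothetical):

```sh
# db-custom-2-7680 = 2 vCPUs / 7.5 GB RAM;
# REGIONAL availability enables built-in HA with automatic failover
gcloud sql instances create app-db \
  --database-version=POSTGRES_14 \
  --tier=db-custom-2-7680 \
  --region=us-central1 \
  --availability-type=REGIONAL
```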

Cloud Spanner

  • fully-managed relational DB with up to 5 9s availability and unlimited scale (Mountkirk Games)
  • create a Spanner instance by defining an instance config and compute capacity
  • best practice: use query parameters to increase efficiency and lower costs
  • features
    • automatic sharding
    • external consistency (all transactions sequentially [even though distributed])
    • backup/restore and PITR
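
Instance config and compute capacity are the two creation-time decisions noted above (instance name is hypothetical); processing units allow sizing below a full node:

```sh
gcloud spanner instances create game-db \
  --config=regional-us-central1 \
  --processing-units=100 \
  --description="Leaderboard backend"
```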

Cloud Bigtable

  • fully-managed, scalable NoSQL db service used for large analytical and operational workloads
    • no related tables, primary or foreign keys
    • key / value store
  • handles large amount of data in key-value store and supports high read and write at low latency
  • tables stored in instances that contain up to 4 clusters located in different zones
  • use cases
    • time-series data
    • marketing and/or financial data
    • IoT data

Firestore

  • fully-managed, scalable, NoSQL serverless database
  • live synchronization and offline mode enable multi-user, collaborative applications on mobile and web
  • supports Datastore dbs and Datastore API
  • workloads include:
    • live asset and activity tracking
    • real-time analytics
    • media and product catalogs
    • social user profiles and gaming leaderboards (Mountkirk Games?)

Examining other DB options

  • Datastream
    • serverless CDC and replication service
    • synchronizes data across heterogeneous databases and applications reliably
  • Firebase Realtime DB
    • serverless NoSQL database for storing and syncing data
    • enhances collaboration among users across devices and the web in real time
  • MemoryStore
    • in-memory service for Redis and Memcached
    • provides low-latency access and high-throughput for heavily-accessed data

Summary

  • relational services: Cloud SQL, Cloud Spanner
  • NoSQL services: Firestore (document-based) and Bigtable (key-value)
  • cached, gaming, streaming data use Memorystore, which supports both Redis and Memcached

Deciding on the best databases

[diagram: database selection decision tree]

Summary

  • first question is whether the data is structured or not; if not, go with Cloud Storage unless you need mobile SDKs
  • if workload primarily data analytics, best options are Bigtable (if NoSQL and low latency), and otherwise BigQuery
  • if the workload is structured data, Cloud SQL for basic relational DB needs and Cloud Spanner for horizontal scalability

Networking Data

Google has 75K+ miles of networking cable and 150+ POPs around the globe

Globally Connecting with external networking

Cloud Domains

  • global registrar for domain names using Google Domains
  • uses built-in DNS or allows custom nameservers
  • supports DNSSEC and private WhoIs records
  • integration features
    • managed as a GCP project, including billing
    • automatic domain verification with Search Console, App Engine, Cloud Run, etc.
    • works with Cloud IAM for access management
    • partners including Shopify, Wix, Squarespace, Bluehost, Weebly, and others

Cloud DNS

  • hierarchical DB linking domain names to IP addresses
  • global, scalable, fully-managed authoritative domain name service
  • 100% SLA uptime guarantee
  • offers both public and private managed zones
    • private visible to one or more private VPC that you specify
  • features
    • Cloud IAM and Cloud Logging integration
    • DNS peering and DNS forwarding
    • Anycast nameservers - allows multiple machines to share same IP (nearest machine)
    • DNSSEC support
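
A managed zone plus a single A record, sketched with standard gcloud (domain and IP are placeholders):

```sh
gcloud dns managed-zones create example-zone \
  --dns-name="example.com." \
  --description="Public zone for example.com"

gcloud dns record-sets create www.example.com. \
  --zone=example-zone --type=A --ttl=300 --rrdatas=203.0.113.10
```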

Static External IP Addresses

  • reserve static external IP addresses in projects and assign to resources
  • GCP supports two types: regional and global
  • regional IP addresses can be assigned to compute engine VMs and network load balancers
  • global IP addresses are assigned to Anycast IPs and global load balancers (HTTP/S, SSL, and TCP proxies)
    • IPv6 addresses are global only, and only for global load balancers
  • static IP addresses can be assigned through console, gcloud command line, API, or Terraform
  • no charge except for IP addresses that are reserved but not used
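
Reserving the two address types is symmetric (names are hypothetical):

```sh
# Global: for HTTP(S)/SSL/TCP proxy load balancers
gcloud compute addresses create web-ip --global
# Regional: for VMs and network load balancers
gcloud compute addresses create nlb-ip --region=us-central1
# Read back the reserved address
gcloud compute addresses describe web-ip --global --format="get(address)"
```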

Cloud Load Balancing

  • fully distributed, software-defined managed service that spreads network traffic across multiple instances of your apps
  • Layer 4 and Layer 7 load balancing with Cloud CDN integration and automatic scaling
    • L4: transport layer (TCP, UDP)
    • L7: application layer (HTTP, HTTPS)
  • Regional load balancing features
    • health checks
    • session affinity
    • IPv4 only
    • good for single region, IPv4 only, or compliance
  • Global load balancing features
    • multi-region failover
    • connects to closest region for lowest latency
    • IPv4 and IPv6
    • good for backends distributed across multiple regions, want to deliver nearest to user with Anycast

Cloud CDN

  • serves content closer to the user
  • relies on Google Cloud's global edge network
  • works with external GCP HTTPS load balancing
  • manage cache rules with the Cache-Control header or allow it to automatically cache static content
  • content sources include
    • instance groups
    • zonal network endpoint groups (NEGs)
    • serverless NEGs like GAE or Cloud Functions
    • Cloud Storage buckets

Summary

  • optimal external networking requires establishing a domain, connection to DNS, reserving external IP, integrating load balancer, and accessing a CDN
  • because Cloud Load Balancing is software-defined and does not rely on any devices - virtual or physical - it can handle spikes with no prewarming
  • cloud CDN works only with content delivered from sources within Google Cloud, such as GCE instance groups or GCS buckets. Custom origins are not allowed.

Networking internally

Virtual Private Cloud (VPC)

  • delivers networking for all your orgs GCP resources
  • global IPv4 unicast software-defined network
  • automatic or custom creation (configure subnets, firewall rules, routes, VPNs, and BGP)
  • VPC is global, but subnets are regional
  • Options include
    • Shared VPC
    • VPC peering (make services available privately across different VPC networks)
    • Bring your own IP addresses
    • Packet mirroring

Cloud Interconnect

  • extends on-premises network to VPC through HA, low-latency connection
  • dedicated
    • direct physical connection to GCP
    • best for high bandwidth needs
    • 10 Gbps or 100 Gbps circuits
    • traffic not encrypted, but encryption can be added
    • cannot use Cloud VPN with it
  • partner
    • connects to GCP via a supported partner
    • better for lower bandwidth needs
    • depends on partner capabilities
    • capacities from 50 Mbps to 50 Gbps
    • traffic not encrypted, but encryption can be added
    • cannot use Cloud VPN with it

Cloud VPN

  • securely connects peer network to VPC
    • any network, including those on other providers
  • traffic is encrypted by one IPsec VPN gateway and decrypted by another
  • requires static IP address for persistence and does not support dynamic, e.g. "dial-in" VPN
  • best practices
    • keep cloud VPN resource in own project
    • use dynamic routing and BGP
    • establish secure firewall rules for VPN
    • generate strong pre-shared keys for tunnels

Examining Other Networking Services (Cloud Router, CDN Interconnect)

  • Cloud Router
    • provides dynamic routing for hybrid networks linking VPCs to external networks via BGP
    • works with Cloud VPN and Dedicated Interconnect
    • automatically learns subnets in VPC and announces them to on-premises network
    • works with router appliances
  • CDN Interconnect
    • direct low-latency connectivity to certain CDN providers, with lower egress fees
    • works for both pull and push cache fills
    • best for high-volume egress traffic (lowers cost) and frequent content updates (lower latency)
    • supports Akamai, Verizon, Cloudflare, Fastly, and a few others

Summary

  • because a GCP VPC is a software-defined network, a single VPC can cover multiple regions without traffic crossing the public internet
  • if your company requires direct connection from on-prem datacenters to their VPC, use either Dedicated Interconnect (highest capacity) or Partner Interconnect
  • for lower traffic requirements that require a secure connection, connect to VPC with Cloud VPN and Cloud Router using a static IP address

Finding a Load Balancer

External: [diagram: external load balancer decision tree]

Internal: [diagram: internal load balancer decision tree]

Summary

  • when handling HTTP or HTTPS traffic around the world, use external HTTP(S) Load Balancing
  • if you have TCP traffic and would prefer to offload the SSL/TLS, best choice is SSL Proxy Load Balancing
  • Internal TCP or UDP traffic should rely on regional internal TCP/UDP Load Balancing for lowest latency and most direct connection

Managing and Securing Data

Establishing Core Security (Cloud IAM)

Cloud IAM

  • determining WHO has ACCESS to WHICH resources
  • Who (principals or members)
    • Google account
    • Service account
    • Google group (best practice)
    • Google Workspace account (org)
    • Cloud Identity domain (an org without Workspace apps/features)
    • All authenticated users (users on Internet authenticated by Google)
    • All users (anyone on the Internet)
  • Access (roles)
    • Billing Account Administrator
    • Billing Account User
    • Storage Object Creator
    • Storage Object Viewer
    • Cloud SQL Editor
    • Cloud SQL Instance User
    • Security Admin (get/set any IAM policy)
  • Which resources
    • VM instance
    • GKE cluster
    • Storage bucket
    • Pub/Sub topic
    • Organization
    • Folder
    • Project
  • Roles
    • Primitive/basic (oldest, pre-date Cloud IAM, broadest permissions)
    • Predefined (target specific resources w/ actions at granular level)
    • Custom (unique set of permissions, most granular level)
      • requires Role Administrator role
  • Policies
    • Role binding - 1 or more principals bound to a role; a policy is a collection of such bindings (see the sketch after this list)
  • Summary (Part 1: Cloud IAM)
    • globally manages access control for organizations
    • resource access is granted to roles (collection of permissions), and roles are granted to principals
    • recommender helps identify excess or needed permissions from principals
    • grants IAM access to external identities (AD, etc.) with workload identity federation
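
A role-binding sketch (project and group are placeholders); granting to a group rather than to individual users is the best practice noted above:

```sh
gcloud projects add-iam-policy-binding example-project \
  --member="group:data-team@example.com" \
  --role="roles/storage.objectViewer"
```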

Resource Manager

  • centrally manages and secures an organization's projects with a custom folder hierarchy
  • example:
    • company
      • dept Y
        • team B
          • product 1
            • dev
            • test
              • GCE, GAE, GCS resources
            • production
  • modify Cloud IAM policies across an org
  • Cloud Asset Inventory monitors and analyzes all GCP assets, including IAM policies
  • Organization Policy Service sets constraints on resources and helps orgs stay in compliance
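
As a rough sketch of walking that hierarchy programmatically, the snippet below lists the projects under a folder, assuming the google-cloud-resource-manager (v3) Python client; the folder ID is hypothetical.

```python
from google.cloud import resourcemanager_v3

projects_client = resourcemanager_v3.ProjectsClient()

# List projects parented directly by a folder (hypothetical folder ID).
for project in projects_client.list_projects(request={"parent": "folders/123456789"}):
    print(project.project_id, project.display_name, project.state)
```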

Cloud Identity

  • fully-managed Identity as a Service (IDaaS) for provisioning and managing identity resources
  • each user and group is given a Cloud Identity account, allowing Cloud IAM to manage access
  • can be configured to federate identities with other identity providers (e.g. Active Directory)
  • features
    • SSO with other apps
    • Multi-factor authentication (MFA)
    • Device security with endpoint management
    • Context-aware access without VPN

Cloud Identity-Aware Proxy (IAP)

  • establishes a central authorization layer for apps accessed over HTTPS (and internal apps over HTTP)
  • enforces access control policies for apps and resources
  • built on the load balancer and IAM; permits only authenticated and authorized requests (see the sketch after this list)
  • supports
    • App Engine
    • Compute Engine
    • Kubernetes Engine
    • Cloud Run
    • On-premises
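
When IAP fronts an app, it attaches a signed JWT (the x-goog-iap-jwt-assertion header) to each request, which the app can verify for defense in depth. Below is a minimal sketch using the google-auth Python library's documented verification pattern; the audience value is a placeholder you would look up for your own backend.

```python
from google.auth.transport import requests
from google.oauth2 import id_token


def verify_iap_jwt(iap_jwt: str, expected_audience: str) -> dict:
    """Verify the JWT from the x-goog-iap-jwt-assertion request header."""
    return id_token.verify_token(
        iap_jwt,
        requests.Request(),
        audience=expected_audience,  # e.g. "/projects/NUM/global/backendServices/ID"
        certs_url="https://www.gstatic.com/iap/verify/public_key",
    )


# claims = verify_iap_jwt(header_value, expected_audience)
# claims["email"] identifies the authenticated principal.
```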

Summary

  • Cloud IAM defines how principals, roles, and resources relate, as well as how IAM policies are created and inherited
  • keep the principle of least privilege in mind and practice; GCP stresses this concept and offers the Recommender service to help implement it
  • controlling and managing access is critical to an org's security; GCP offers two services for it: Cloud Identity and Cloud Identity-Aware Proxy

Detecting and Responding to Security Threats

Cloud Security Command Center - hub for GCP protective resources

  • comprehensive security and risk management platform
  • two tiers: standard and premium
  • designed to prevent, detect, and respond to threats from a single pane of glass
  • integrates and monitors many security services on GCP as well as external services
  • identifies security compliance violations and misconfigurations in Google Cloud assets, surfacing them as findings (see the sketch after this list)
  • exports SCC data to Splunk as well as other SIEMs
  • standard
    • SHA: security health analytics
    • WSS: web security scanner
    • CA/WAF: cloud armor
    • DLP: cloud data loss prevention
    • anomaly detection
    • Forseti Security integration
  • premium
    • SHA: adds monitoring/reporting for compliance
    • WSS: adds managed scans
    • ETD: event threat detection
    • CTD: container threat detection
    • continuous exports to Pub/Sub
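
As a sketch of pulling SCC findings programmatically (for example, before forwarding them to a SIEM), the snippet below assumes the google-cloud-securitycenter Python client; the organization ID is hypothetical, and the "-" wildcard requests findings from all sources.

```python
from google.cloud import securitycenter

client = securitycenter.SecurityCenterClient()

# "-" wildcard: findings from every source in the org (hypothetical org ID).
all_sources = "organizations/123456789/sources/-"

for result in client.list_findings(request={"parent": all_sources}):
    finding = result.finding
    print(finding.category, finding.severity, finding.resource_name)
```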

Web Security Scanner - guarding frontlines of Internet traffic

  • detects key vulnerabilities in App Engine, Compute Engine, and Kubernetes Engine applications
  • crawler based, supports public URLs and IPs not behind a firewall
  • standard
    • custom scans
  • premium
    • managed scans
  • detects
    • Cross-site scripting (XSS)
    • Flash injection
    • mixed (HTTP/HTTPS) content
    • outdated and insecure JavaScript libraries
    • clear-text (unencrypted) passwords

Cloud Armor

  • edge-level, enterprise-grade DDoS protection and web application firewall (WAF)
  • leverages Google Cloud load balancing
  • helps mitigate the OWASP Top 10 risks
  • features
    • allow or deny traffic by IPs or CIDR ranges
    • preview changes before pushing policy live
    • configure WAF rules to reduce false positives
    • reference named IP address lists from CDN partners (Fastly, Cloudflare, Imperva)

Event Threat Detection - malware, crypto mining

  • identifies threats in near-real time by monitoring and analyzing Cloud Logging
  • threats are defined by rules, which specify needed logs
  • create custom rules by running queries on log data exported to BigQuery (see the sketch after this list)
  • quickly detect many types of attacks
    • malware
    • crypto mining
    • outgoing DDoS attacks
    • port scanning
    • IAM anomalous grant
    • brute-force SSH
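
As a rough sketch of that custom-rule workflow, the query below scans audit logs exported to BigQuery by a Cloud Logging sink for IAM policy changes. The project, dataset, and field names are illustrative (they follow the usual audit-log sink schema) and should be verified against your own dataset.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table populated by a Cloud Logging sink of admin activity logs.
query = """
    SELECT timestamp,
           protopayload_auditlog.authenticationInfo.principalEmail AS principal
    FROM `my-project.security_logs.cloudaudit_googleapis_com_activity`
    WHERE protopayload_auditlog.methodName = 'SetIamPolicy'
    ORDER BY timestamp DESC
    LIMIT 100
"""

for row in client.query(query):  # iterating the job waits for the result
    print(row.timestamp, row.principal)
```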

Cloud Data Loss Prevention

  • inspection, classification, and de-identification platform to protect sensitive data (see the sketch after this list)
  • includes over 150 data detectors for personally identifiable information (PII)
  • connect DLP results to SCC or Data Catalog, or export them to an external SIEM or governance tool
  • detects data in
    • streams of data or structured text
    • files in Cloud Storage or BigQuery
    • images
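
A minimal sketch of inspecting a text string for one built-in PII infoType with the google-cloud-dlp Python client; the project ID is hypothetical.

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

response = client.inspect_content(
    request={
        "parent": "projects/my-project",  # hypothetical project ID
        "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
        "item": {"value": "Contact me at jane.doe@example.com"},
    }
)

for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)
```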

Summary

  • the Cloud Security Command Center (SCC) platform monitors the majority of GCP's security services and is accessible through Standard and Premium tiers
  • if you use Google Cloud's external HTTPS load balancer, protect your web-based applications hosted on GAE, GCE, or GKE with Web Security Scanner
  • when Event Threat Detection (ETD) is enabled, GCP analyzes a range of logs from Cloud Logging to find signs of malware, crypto mining, outgoing DDoS attacks, brute-force SSH, and other threats

Managing Encrypted Keys

A crypto key is a string of characters that, when used with an encryption algorithm, makes ordinary text unreadable. When that key, or its paired key, is used with a decryption algorithm, it makes the text readable again.

To be effective, crypto keys have to be complex and are not something anyone should memorize. As such, we need a service such as KMS to maintain them.

Cloud Key Management Service (KMS)

  • highly available, low-latency service to generate, manage and apply cryptographic keys
  • Cloud KMS encrypts and decrypts - it does not store secrets itself - and controls access to keys (see the sketch after this list)
  • supports both symmetric (e.g. AES) and asymmetric (e.g. RSA or EC) algorithms
  • includes a 24-hour delay for key material destruction, to prevent accidental or malicious data loss
  • supports regulatory compliance and adds optional variations
    • Cloud HSM
    • Cloud EKM
    • CMEK
    • CSEK
  • Google recommends you regularly and automatically rotate symmetric keys
    • rotation of asymmetric keys cannot be automated, but regular manual rotation is still good practice
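
A minimal sketch of a symmetric encrypt/decrypt round trip with the google-cloud-kms Python client; the project, location, key ring, and key names are hypothetical, and the key must already exist.

```python
from google.cloud import kms

client = kms.KeyManagementServiceClient()

# Hypothetical resource names for an existing symmetric key.
key_name = client.crypto_key_path("my-project", "us-east1", "my-key-ring", "my-key")

plaintext = b"sensitive data"
encrypted = client.encrypt(request={"name": key_name, "plaintext": plaintext})
decrypted = client.decrypt(request={"name": key_name, "ciphertext": encrypted.ciphertext})
assert decrypted.plaintext == plaintext  # round trip succeeds
```

Note that the key material never leaves the service: the application sends data to Cloud KMS for each operation, which is how KMS can enforce IAM access control on every call.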

Cloud Hardware Security Module (HSM)

  • hosts encryption keys and performs cryptographic actions in a cluster of FIPS 140-2 Level 3 certified devices
  • enables compliance with hardware requirements
  • HSM keys are cryptographically bound to a region, with support for multi-regions
  • Cloud HSM properties
    • keys are non-exportable
    • tamper resistant
    • provides tamper evidence
    • auto-scales horizontally

Cloud External Key Management (EKM)

  • use keys from supported external key management partners instead of GCP
  • works only with supported CMEK integration services
    • BigQuery
    • Compute Engine
    • Cloud Run
    • Cloud Spanner
    • Cloud Storage
    • GKE
    • Pub/Sub
    • Secret Manager
  • key ring should be created in same location as external key management partner
  • benefits include
    • key provenance
    • access control
      • must grant GCP project access to key
    • centralized key management

Secret Manager

  • allows storage of passwords and variables for use in applications
  • fully managed service for storing, managing, and accessing secrets as binary blobs or text strings
  • used for storing sensitive runtime info such as database passwords, API keys, or TLS certificates
  • the data of each secret version is immutable; a new version is created each time the value is modified
  • best practices
    • follow principle of least privilege
    • limit access with IAM conditions
    • use the Secret Manager API instead of env vars
    • reference secrets by version number, not "latest" (see the sketch after this list)
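
A minimal sketch of reading a pinned secret version with the google-cloud-secret-manager Python client; the project and secret IDs are hypothetical, and the version is pinned explicitly rather than using "latest", per the best practice above.

```python
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()

# Hypothetical project/secret; pin an explicit version, not "latest".
name = "projects/my-project/secrets/db-password/versions/1"
response = client.access_secret_version(request={"name": name})

db_password = response.payload.data.decode("UTF-8")
```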

Encrypted Keys Flowchart: Service Types

  • key service selection flowchart (screenshot)

Summary

  • Cloud KMS offers a full range of key sources: Google-managed, Cloud HSM devices, or Cloud EKM partners, as well as customer-managed or customer-supplied keys
  • regular automatic rotation of symmetric algorithm keys is considered a best practice; Cloud KMS does not support automatic rotation of asymmetric keys
  • follow the principle of least privilege when assigning access to Secret Manager entries by using Cloud IAM conditions or secret-level binding
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment