Framework Step | Details |
---|---|
Situation | |
Task | The urgent task was to stabilize and scale the bank’s data processing capabilities, both to retain the e-commerce client and to lay a foundation for scalable, compliant growth suited to high-volume transaction environments. |
- Data processing frameworks
- Batch and real-time streaming analytics
- SQL versus NoSQL use cases and use case patterns
- Enterprise data governance and metadata management
Category | Tools |
---|---|
Here's the table sorted chronologically based on the release date of each Google Cloud service:
Google Cloud Service | Release Date | Based on/Open-source Inspiration | Open-source Start Date | Notes |
---|---|---|---|---|
Google BigQuery | 2010 | Dremel (Internal Google Tech) | N/A | BigQuery is inspired by Dremel but is not directly based on open-source technology. |
Google Cloud Dataflow | 2014 | Apache Beam | 2016 (as Apache Beam) | Initially developed by Google as Google Dataflow, then donated to the Apache Software Foundation as Apache Beam. |
Google Cloud Composer | 2018 | Apache Airflow | 2015 | Developed by Airbnb and later open-sourced as Apache Airflow, which Google adopted for Cloud Composer. |
Google Data Fusion | 2019 | CDAP (Cask Data Application Platform) | 2011 | Built on the open-source CDAP platform from Cask Data, which Google acquired in 2018. |
Watermarks and allowed lateness are both vital techniques for managing late data in stream processing systems. They serve slightly different purposes and are often used together to balance data completeness against processing latency. Here’s an in-depth look at when and why you might choose each technique, or both together, along with real-world industry examples.
Purpose: Watermarks are primarily used to handle out-of-order data. They provide a way to estimate the "completeness" of data up to a certain point in time, based on event timestamps.
When to Use: Use watermarks when:
- You expect data to arrive out of order.
- You need a mechanism to know when to close a window and process its data.
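To make the interplay concrete, here is a minimal pure-Python sketch of how a streaming engine might combine a watermark with allowed lateness when counting events in fixed event-time windows. All names and constants (`WindowedCounter`, `MAX_OUT_OF_ORDER`, etc.) are illustrative, not from any specific framework; real engines such as Apache Beam or Flink express the same idea declaratively through windowing and trigger APIs.

```python
WINDOW_SIZE = 60        # fixed event-time windows of 60 s
MAX_OUT_OF_ORDER = 10   # watermark lags the max event time seen by 10 s
ALLOWED_LATENESS = 30   # events this far behind the watermark still count

class WindowedCounter:
    """Counts events per fixed window, honoring watermark + allowed lateness."""

    def __init__(self):
        self.watermark = float("-inf")
        self.windows = {}   # window start -> running event count
        self.emitted = {}   # window start -> final count, once fired

    def window_of(self, ts):
        return ts - ts % WINDOW_SIZE

    def on_event(self, ts):
        # Advance the watermark: max event time seen, minus the skew bound.
        self.watermark = max(self.watermark, ts - MAX_OUT_OF_ORDER)
        start = self.window_of(ts)
        window_end = start + WINDOW_SIZE
        if self.watermark >= window_end + ALLOWED_LATENESS:
            return "dropped"    # too late even for the lateness allowance
        self.windows[start] = self.windows.get(start, 0) + 1
        self._fire_complete_windows()
        return "late" if self.watermark >= window_end else "on_time"

    def _fire_complete_windows(self):
        # A window fires once the watermark passes its end + allowed lateness.
        for start in list(self.windows):
            if self.watermark >= start + WINDOW_SIZE + ALLOWED_LATENESS:
                self.emitted[start] = self.windows.pop(start)
```

The watermark alone decides whether an arriving event is "on time"; allowed lateness extends the window's lifetime past the watermark so that slightly late events are still counted before the result is finalized.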
GCP PDF 2: Data Engineering with Streaming Data
[Apache Spark Notes](https:
This document outlines the structured content of my learning journey through Apache Spark, covering various topics from installation to advanced data processing techniques.
Course name: Spark Programming in Python for Beginners with Apache Spark 3
- Chapter 1: Apache Spark Introduction
- Chapter 2: Installing and Using Apache Spark
- Chapter 3: Spark Execution Model and Architecture
Here's a concise table summarizing the key Hadoop ecosystem components along with their cloud service equivalents:
Component | Purpose | Created by | Language Support | Limitations | Alternatives | Fit | GCP Service | AWS Service | Azure Service |
---|---|---|---|---|---|---|---|---|---|
Apache Hive | SQL-like data querying in Hadoop. | Facebook | HiveQL | High latency for some queries. | P | | | | |