Framework Step | Details |
---|---|
Situation | |
Task | The urgent task was to stabilize and scale the bank’s data processing capabilities, both to retain the e-commerce client and to lay a foundation for scalable, compliant growth suited to high-volume transaction environments. |
- Data processing frameworks
- Batch and real-time streaming analytics
- SQL versus NoSQL use cases and use case patterns
- Enterprise data governance and metadata management
Category | Tools |
---|---|
Here's the table sorted chronologically based on the release date of each Google Cloud service:
Google Cloud Service | Release Date | Based on/Open-source Inspiration | Open-source Start Date | Notes |
---|---|---|---|---|
Google BigQuery | 2010 | Dremel (Internal Google Tech) | N/A | BigQuery is inspired by Dremel but is not directly based on open-source technology. |
Google Cloud Dataflow | 2014 | Apache Beam | 2016 (as Apache Beam) | Initially developed by Google as Google Dataflow, then donated to the Apache Software Foundation as Apache Beam. |
Google Cloud Composer | 2018 | Apache Airflow | 2015 | Developed by Airbnb and later open-sourced as Apache Airflow, which Google adopted for Cloud Composer. |
Google Data Fusion | 2019 | CDAP (Cask Data Application Platform) | 2011 | Built on the open-source CDAP platform from Cask Data, which Google acquired in 2018. |
Watermarks and allowed lateness are both vital techniques for managing late data in stream processing systems. They serve slightly different purposes and are often used together to balance data completeness against processing latency. Here’s an in-depth look at when and why you might choose each technique, or both together, along with real-world industry examples.
Purpose: Watermarks are primarily used to handle out-of-order data. They provide a way to estimate the "completeness" of data up to a certain point in time, based on event timestamps.
When to Use: Use watermarks when:
- You expect data to arrive out of order.
- You need a mechanism to know when to close a window and process its data.
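To make the interplay concrete, here is a minimal pure-Python sketch of how a streaming engine might combine a watermark with allowed lateness when counting events in fixed event-time windows. All names and constants (`WindowedCounter`, `MAX_OUT_OF_ORDER`, etc.) are illustrative, not from any specific framework; real engines such as Apache Beam or Flink express the same idea declaratively through windowing and trigger APIs.

```python
WINDOW_SIZE = 60        # fixed event-time windows of 60 s
MAX_OUT_OF_ORDER = 10   # watermark lags the max event time seen by 10 s
ALLOWED_LATENESS = 30   # events this far behind the watermark still count

class WindowedCounter:
    """Counts events per fixed window, honoring watermark + allowed lateness."""

    def __init__(self):
        self.watermark = float("-inf")
        self.windows = {}   # window start -> running event count
        self.emitted = {}   # window start -> final count, once fired

    def window_of(self, ts):
        return ts - ts % WINDOW_SIZE

    def on_event(self, ts):
        # Advance the watermark: max event time seen, minus the skew bound.
        self.watermark = max(self.watermark, ts - MAX_OUT_OF_ORDER)
        start = self.window_of(ts)
        window_end = start + WINDOW_SIZE
        if self.watermark >= window_end + ALLOWED_LATENESS:
            return "dropped"    # too late even for the lateness allowance
        self.windows[start] = self.windows.get(start, 0) + 1
        self._fire_complete_windows()
        return "late" if self.watermark >= window_end else "on_time"

    def _fire_complete_windows(self):
        # A window fires once the watermark passes its end + allowed lateness.
        for start in list(self.windows):
            if self.watermark >= start + WINDOW_SIZE + ALLOWED_LATENESS:
                self.emitted[start] = self.windows.pop(start)
```

The watermark alone decides whether an arriving event is "on time"; allowed lateness extends the window's lifetime past the watermark so that slightly late events are still counted before the result is finalized.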
GCP PDF 2: Data Engineering with Streaming Data
[Apache Spark Notes](https:
This document outlines the structured content of my learning journey through Apache Spark, covering various topics from installation to advanced data processing techniques.
Course name: Spark Programming in Python for Beginners with Apache Spark 3
- Chapter 1: Apache Spark Introduction
- Chapter 2: Installing and Using Apache Spark
- Chapter 3: Spark Execution Model and Architecture
Here's a concise table summarizing the key Hadoop ecosystem components along with their cloud service equivalents:
Component | Purpose | Created by | Language Support | Limitations | Alternatives | Fit | GCP Service | AWS Service | Azure Service |
---|---|---|---|---|---|---|---|---|---|
Apache Hive | SQL-like data querying in Hadoop. | Facebook | HiveQL | High latency for some queries. | P | | | | |