- Tags to look out for (for the OVO interview):
- Terraform to cloud
- Kafka
- Labelling
- Tagging
- Data ownership
- cataloging / metadata: advantages
- data quality
- monitoring for reliability
https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_1_basics_n_setup
- ✔️ Introduction
- ✅ Docker + Postgres
- 1.2.5 postgres, docker-compose
- 1.2.6
- 1.3.1 : Intro to Terraform
- GCP trial with feeds@howapped.com
- project ID: global-maxim-3380111
- service account credentials
- https://trello.com/c/QRlq6ZDZ
export GOOGLE_APPLICATION_CREDENTIALS="/home/jon/.google/credentials/google_credentials.json"
gcloud auth application-default login
# I ALSO NEEDED TO ADD THIS STEP WHICH WAS NOT IN THE VIDEO
gcloud auth activate-service-account --key-file="/home/jon/.google/credentials/google_credentials.json"
- 1.3.2 : Creating GCP infrastructure with terraform
- After terraform apply, see the created bucket
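- A quick hedged sketch for checking the bucket from Python instead of the console (assumes the google-cloud-storage client is installed and GOOGLE_APPLICATION_CREDENTIALS is exported as above):
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS set earlier
for bucket in client.list_buckets():
    print(bucket.name)  # the Terraform-created data-lake bucket should be listed here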
- 1.4.1 : set up Google cloud env with ssh
- Generate SSH key
- https://cloud.google.com/compute/docs/connect/create-ssh-keys
- Add at Metadata > SSH KEYS
- Compute Engine > VM Instances
ssh jon@35.242.175.10
- Download Anaconda onto VM
- connect with VSCode remote ssh
- port forwarding
- run the notebook on VM with forwarded port 8888
- Run the ingest code on a remotely hosted notebook
- 1.4.2 : Port mapping and networks
- 1.4.3 : Using Codespaces
- ✅ Dataset / SQL
- Section
- Q&A:
- DBT vs. Spark
- Docker was the most challenging of the weeks
- commitment: 6-8 hours per week, but it depends on your skill level in the subjects
- BigQuery on GCP vs. Snowflake on AWS
- chose GCP as it has a generous free tier which covers BigQuery
- Docker + Postgres
- we use pgAdmin on the host computer to administer Postgres within containers
- Can run pipelines in the cloud (AWS Batch, Kubernetes jobs) or serverless (AWS Lambda, Google Cloud Functions)
docker run -it --entrypoint=bash python:3.11 # or create a Dockerfile to give container 'memory'
pip install pgcli
pgcli -h localhost -p 5432 -u root -d ny_taxi
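- Minimal sketch of the ingest code for the ny_taxi database above (the CSV URL, the root/root credentials from my docker-compose setup, and the table name are assumptions, not from the course notes):
import pandas as pd
from sqlalchemy import create_engine  # pip install sqlalchemy psycopg2-binary

engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# read the taxi CSV in chunks so it fits in memory, parse timestamps, append to Postgres
df_iter = pd.read_csv(
    "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz",
    iterator=True,
    chunksize=100_000,
)
for df in df_iter:
    df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
    df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])
    df.to_sql("yellow_taxi_data", con=engine, if_exists="append", index=False)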
main
code
homework
- 2.1.1 - what is a data lake, ETL/ELT
- https://www.youtube.com/watch?v=W3Zm6rjOq70&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=17
- storing big data: structured and unstructured data on inexpensive hardware that can be accessed quickly, without waiting for teams to define relationships
- store unlimited data as quickly as possible; data lakes came about as the number of data scientists grew
- data lake vs. data warehouse
- use cases of DL:
- stream processing
- real time analytics
- use case of DWH:
- batch processing
- ETL (smaller amounts of data, into a DWH) vs. ELT (large amounts of data, into a data lake)
- the problem is incompatible file formats: the lake becomes a data swamp and there is no possibility to join datasets
- 2.2.1 : intro to workflow orchestration
- 2.2.2 : Prefect concepts
- https://www.youtube.com/watch?v=cdtN6dhp708&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=19
- my code:
- flow functions call task functions (see the sketch after this list)
- blocks - SQLAlchemy / Postgres connector [31m00s]
- use a block so you don't have to hardcode credentials
- collections are pip-installable packages
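- Minimal sketch of these concepts (Prefect 2.x; the function and block names are illustrative, not from the course repo):
from prefect import flow, task


@task(retries=3)
def extract(url: str) -> list[dict]:
    # a task is one retryable, observable unit of work
    return [{"id": 1}, {"id": 2}]


@task
def transform(rows: list[dict]) -> int:
    return len(rows)


@flow
def etl(url: str = "https://example.com/data.csv") -> None:
    # a flow function calls task functions (and can call sub-flows)
    rows = extract(url)
    print(f"processed {transform(rows)} rows")


# blocks keep credentials/config out of code, e.g. (hypothetical block name):
# from prefect.blocks.system import Secret
# password = Secret.load("postgres-password").get()

if __name__ == "__main__":
    etl()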
- 2.2.3 : ETL with GCP & Prefect
- Created an ETL flow that fetched some data from the web, cleaned it up, and uploaded it to Google Cloud Storage
- Flow one: ETL into a data lake
- Create a Prefect Flow for pandas df to parquet to Google Cloud Storage (GCS)
- once extracted and cleaned up, save a parquet file in data lake (Google Cloud Storage)
- 1,369,756 rows
- create a GCP service account block and use the credentials key file in Prefect (sketched below)
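- Hedged sketch of flow one, web ➡ GCS (uses the prefect-gcp collection; the "zoom-gcs" block name, file name and column name are assumptions):
from pathlib import Path

import pandas as pd
from prefect import flow, task
from prefect_gcp.cloud_storage import GcsBucket  # pip install prefect-gcp


@task(retries=3)
def fetch_and_clean(url: str) -> pd.DataFrame:
    df = pd.read_csv(url)
    df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
    return df


@task
def write_local(df: pd.DataFrame, path: Path) -> Path:
    df.to_parquet(path, compression="gzip")
    return path


@task
def write_gcs(path: Path) -> None:
    # GCS bucket block created in the Prefect UI with the service account key (assumed name)
    gcs_block = GcsBucket.load("zoom-gcs")
    gcs_block.upload_from_path(from_path=path, to_path=path)


@flow
def etl_web_to_gcs(url: str) -> None:
    df = fetch_and_clean(url)
    path = write_local(df, Path("yellow_tripdata_2021-01.parquet"))
    write_gcs(path)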
- 2.2.4 : Google Cloud Storage ➡ BigQuery
- Flow two: take data from GCS to the BigQuery data warehouse
- create the dataset in GCP and create the table 'rides' without data
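- Hedged sketch of flow two's load step (the dataset name "dezoomcamp" and credentials block name are assumptions; the project ID is the one noted above):
import pandas as pd
from prefect import flow
from prefect_gcp import GcpCredentials  # pip install prefect-gcp pandas-gbq


@flow
def etl_gcs_to_bq(df: pd.DataFrame) -> None:
    # service account credentials block created in the Prefect UI (assumed name)
    gcp_credentials = GcpCredentials.load("zoom-gcp-creds")
    df.to_gbq(
        destination_table="dezoomcamp.rides",  # dataset.table created beforehand, table 'rides' as above
        project_id="global-maxim-3380111",
        credentials=gcp_credentials.get_credentials_from_service_account(),
        chunksize=500_000,
        if_exists="append",
    )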
- 2.2.5 : Parametrizing flow
- To use pandas with NixOS (because pandas uses a C extension):
- Use doylestone02 (it uses Ubuntu 22.04)
- use an SSH container in vscode
#export PREFECT_ORION_API_HOST=0.0.0.0 # see vscode forwarded port
# Go to: http://127.0.0.1:4200/flow-runs
- SQL error: sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
- GitHub issue: PrefectHQ/prefect#10188
- I remember one time this error caused one of my flows to fail to run, so I have to set PREFECT_API_SERVICES_FLOW_RUN_NOTIFICATIONS_ENABLED=false to suppress it.
- Prefect deployments schedule and trigger flow runs via the API, with a deployment definition that outlines when, where and how a workflow should be run. This means multiple deployments can be scheduled to run with different parameters (e.g. yellow and green).
# to build this deployment on the command line (can also be done via Python)
prefect deployment build ./parameterized_flow.py:etl_parent_flow -n "Parameterized ETL" ...
prefect deployment apply etl_parent_flow-deployment.yaml
# now view at http://127.0.0.1:4200/deployments
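A rough sketch of how etl_parent_flow fans out over parameters (the year/months parameter names and defaults are my guess; only the yellow/green colours come from the notes above):
from prefect import flow


@flow
def etl_web_to_gcs(year: int, month: int, color: str) -> None:
    ...  # the single-month web -> GCS ETL from 2.2.3


@flow
def etl_parent_flow(months: list[int] = [1, 2], year: int = 2021, color: str = "yellow") -> None:
    # one deployment can be scheduled with color="yellow", another with color="green"
    for month in months:
        etl_web_to_gcs(year, month, color)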
Then use an agent to orchestrate when and where the flow will run; run a scheduled deployment with:
# http://127.0.0.1:4200/work-pools/work-pool/default-agent-pool?tab=Work+Queues
prefect agent start --pool "default-agent-pool" --work-queue "default"
- Set up a notification [15m32s]
- 2.2.6 : schedules and docker storage with infrastructure
# build and apply deployment
prefect deployment build flows/03_deployments/parameterized_flows.py:etl_parent_flow -n etl2 --cron "0 0 * * *" -a
- ways to productionise flows
- deploy flow code on GitHub/BitBucket/GitLab
- AWS S3 / GCS / Azure blob storage
- store code in docker image and deploy on Docker Hub
/home/jon/code/learn/data-engineering/data-engineering-zoomcamp/week_2_workflow_orchestration/prefect/design/prefect-zoomcamp/Dockerfile
docker image build -t jonwhittlestone/zoom-2.2.6 .
docker image push jonwhittlestone/zoom-2.2.6
- Once the Docker block is created, create a deployment programmatically with:
python flows/03_deployments/docker_deploy.py
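- A minimal sketch of what docker_deploy.py does (Prefect 2.x API; the "zoom" block name and the deployment name are assumptions):
from prefect.deployments import Deployment
from prefect.infrastructure import DockerContainer

from parameterized_flows import etl_parent_flow

# Docker container block created earlier, pointing at the pushed jonwhittlestone/zoom-2.2.6 image (assumed block name)
docker_block = DockerContainer.load("zoom")

deployment = Deployment.build_from_flow(
    flow=etl_parent_flow,
    name="docker-flow",
    infrastructure=docker_block,
)

if __name__ == "__main__":
    deployment.apply()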
prefect profile ls
- 2.2.7 : prefect cloud / additional resources
- Anna's guide on Prefect cloud and GCP
- 2.3.1 : Set up Airflow
- 2.3.2 : Ingesting data to GCP
- Partitioning and Clustering
- with Airflow
- Best Practices
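- Hedged sketch of partitioning + clustering in BigQuery via the Python client (the dataset/table/column names are illustrative, not from the notes):
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
client.query(
    """
    CREATE OR REPLACE TABLE dezoomcamp.rides_partitioned
    PARTITION BY DATE(tpep_pickup_datetime)
    CLUSTER BY VendorID AS
    SELECT * FROM dezoomcamp.rides
    """
).result()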