- Tags to look out for (for the OVO interview):
- Terraform to cloud
- Kafka
- Labelling
- Tagging
- Data ownership
- cataloging / metadata: advantages
- data quality
- monitoring for reliability
https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_1_basics_n_setup
- ✔️ Introduction
- ✅ Docker + Postgres
- 1.2.5 postgres, docker-compose
- 1.2.6
- 1.3.1 : Intro to Terraform
- GCP trial with feeds@howapped.com
- project ID: global-maxim-3380111
- service account credentials
- https://trello.com/c/QRlq6ZDZ
export GOOGLE_APPLICATION_CREDENTIALS="/home/jon/.google/credentials/google_credentials.json"
gcloud auth application-default login
# I ALSO NEEDED TO ADD THIS STEP WHICH WAS NOT IN THE VIDEO
gcloud auth activate-service-account --key-file="/home/jon/.google/credentials/google_credentials.json"
- 1.3.2 : Creating GCP infrastructure with terraform
- After terraform apply, see the created bucket
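- A quick hedged sketch for checking the bucket from Python instead of the console (assumes the google-cloud-storage client is installed and GOOGLE_APPLICATION_CREDENTIALS is exported as above):
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS set earlier
for bucket in client.list_buckets():
    print(bucket.name)  # the Terraform-created data-lake bucket should be listed here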
- 1.4.1 : set up Google cloud env with ssh
- Generate SSH key
- https://cloud.google.com/compute/docs/connect/create-ssh-keys
- Add at Metadata > SSH KEYS
- Compute Engine > VM Instances
ssh jon@35.242.175.10
- Download Anaconda onto VM
- connect with VSCode remote ssh
- port forwarding
- run the notebook on VM with forwarded port 8888
- Run the ingest code on a remotely hosted notebook
- 1.4.2 : Port mapping and networks
- 1.4.3 : Using Codespaces
- ✅ Dataset / SQL
- Section
- Q&A:
- DBT vs. Spark
- Docker was the most challenging of the weeks
- commitment: 6-8 hours per week, but it depends on your skill level in the subjects
- BigQuery on GCP vs. Snowflake on AWS
- chose GCP as it has a generous free tier which covers BigQuery
- Docker + Postgres
- we use pgAdmin on the host computer to administer Postgres within containers
- Can run pipelines in the cloud (AWS Batch, Kubernetes jobs) or serverless (AWS Lambda, Google Cloud Functions)
docker run -it --entrypoint=bash python:3.11 # or create a Dockerfile to give container 'memory'
pip install pgcli
pgcli -h localhost -p 5432 -u root -d ny_taxi
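- Minimal sketch of the ingest code for the ny_taxi database above (the CSV URL, the root/root credentials from my docker-compose setup, and the table name are assumptions, not from the course notes):
import pandas as pd
from sqlalchemy import create_engine  # pip install sqlalchemy psycopg2-binary

engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# read the taxi CSV in chunks so it fits in memory, parse timestamps, append to Postgres
df_iter = pd.read_csv(
    "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz",
    iterator=True,
    chunksize=100_000,
)
for df in df_iter:
    df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
    df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])
    df.to_sql("yellow_taxi_data", con=engine, if_exists="append", index=False)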
main
code
homework
- 2.1.1 - what is a data lake, ETL/ELT
- https://www.youtube.com/watch?v=W3Zm6rjOq70&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=17
- storing big data: structured and unstructured data on inexpensive hardware that can be accessed quickly, without waiting for teams to define relationships
- store unlimited data as quickly as possible; data lakes came about as the number of data scientists grew
- data lake vs. data warehouse
- use cases of DL:
- stream processing
- real time analytics
- use case of DWH:
- batch processing
- ETL (smaller amounts of data, into a DWH) vs. ELT (large amounts of data, into a data lake)
- the problem is incompatible file formats: the lake becomes a data swamp and there is no possibility to join datasets
- 2.2.1 : intro to workflow orchestration
- 2.2.2 : Prefect concepts
- https://www.youtube.com/watch?v=cdtN6dhp708&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=19
- my code:
- flow functions call task functions (see the sketch after this list)
- blocks - SQLAlchemy / Postgres connector [31m00s]
- use a block so you don't have to hardcode credentials
- collections are pip-installable packages
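- Minimal sketch of these concepts (Prefect 2.x; the function and block names are illustrative, not from the course repo):
from prefect import flow, task


@task(retries=3)
def extract(url: str) -> list[dict]:
    # a task is one retryable, observable unit of work
    return [{"id": 1}, {"id": 2}]


@task
def transform(rows: list[dict]) -> int:
    return len(rows)


@flow
def etl(url: str = "https://example.com/data.csv") -> None:
    # a flow function calls task functions (and can call sub-flows)
    rows = extract(url)
    print(f"processed {transform(rows)} rows")


# blocks keep credentials/config out of code, e.g. (hypothetical block name):
# from prefect.blocks.system import Secret
# password = Secret.load("postgres-password").get()

if __name__ == "__main__":
    etl()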
- 2.2.3 : ETL with GCP & Prefect
- Created an ETL flow that fetched some data from the web, cleaned it up, and uploaded it to Google Cloud Storage
- Flow one: ETL into a data lake
- Create a Prefect Flow for pandas df to parquet to Google Cloud Storage (GCS)
- once extracted and cleaned up, save a parquet file in data lake (Google Cloud Storage)
- 1,369,756 rows
- create a GCP service account block and use the credentials key file in Prefect (sketched below)
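- Hedged sketch of flow one, web ➡ GCS (uses the prefect-gcp collection; the "zoom-gcs" block name, file name and column name are assumptions):
from pathlib import Path

import pandas as pd
from prefect import flow, task
from prefect_gcp.cloud_storage import GcsBucket  # pip install prefect-gcp


@task(retries=3)
def fetch_and_clean(url: str) -> pd.DataFrame:
    df = pd.read_csv(url)
    df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
    return df


@task
def write_local(df: pd.DataFrame, path: Path) -> Path:
    df.to_parquet(path, compression="gzip")
    return path


@task
def write_gcs(path: Path) -> None:
    # GCS bucket block created in the Prefect UI with the service account key (assumed name)
    gcs_block = GcsBucket.load("zoom-gcs")
    gcs_block.upload_from_path(from_path=path, to_path=path)


@flow
def etl_web_to_gcs(url: str) -> None:
    df = fetch_and_clean(url)
    path = write_local(df, Path("yellow_tripdata_2021-01.parquet"))
    write_gcs(path)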
- 2.2.4 : Google Cloud Storage ➡ BigQuery
- Flow two: take data from GCS to the BigQuery data warehouse
- create the dataset in GCP and create the table 'rides' without data
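- Hedged sketch of flow two's load step (the dataset name "dezoomcamp" and credentials block name are assumptions; the project ID is the one noted above):
import pandas as pd
from prefect import flow
from prefect_gcp import GcpCredentials  # pip install prefect-gcp pandas-gbq


@flow
def etl_gcs_to_bq(df: pd.DataFrame) -> None:
    # service account credentials block created in the Prefect UI (assumed name)
    gcp_credentials = GcpCredentials.load("zoom-gcp-creds")
    df.to_gbq(
        destination_table="dezoomcamp.rides",  # dataset.table created beforehand, table 'rides' as above
        project_id="global-maxim-3380111",
        credentials=gcp_credentials.get_credentials_from_service_account(),
        chunksize=500_000,
        if_exists="append",
    )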
- 2.2.5 : Parametrizing flow
- To use pandas with NixOS (because pandas uses a C extension):
- Use doylestone02 (it uses Ubuntu 22.04)
- use an SSH container in vscode
#export PREFECT_ORION_API_HOST=0.0.0.0 # see vscode forwarded port
# Go to: http://127.0.0.1:4200/flow-runs
- SQL error: sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
- GitHub issue: PrefectHQ/prefect#10188
- I remember one time this error caused one of my flows to fail to run, so I have to set PREFECT_API_SERVICES_FLOW_RUN_NOTIFICATIONS_ENABLED=false to suppress it.
- Prefect deployments schedule and trigger flow runs via the API, with a deployment definition that outlines when, where and how a workflow should be run. This means multiple deployments can be scheduled to run with different parameters (e.g. yellow and green).
# to build this deployment on the command line (can also be done via Python)
prefect deployment build ./parameterized_flow.py:etl_parent_flow -n "Parameterized ETL" ...
prefect deployment apply etl_parent_flow-deployment.yaml
# now view at http://127.0.0.1:4200/deployments
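A rough sketch of how etl_parent_flow fans out over parameters (the year/months parameter names and defaults are my guess; only the yellow/green colours come from the notes above):
from prefect import flow


@flow
def etl_web_to_gcs(year: int, month: int, color: str) -> None:
    ...  # the single-month web -> GCS ETL from 2.2.3


@flow
def etl_parent_flow(months: list[int] = [1, 2], year: int = 2021, color: str = "yellow") -> None:
    # one deployment can be scheduled with color="yellow", another with color="green"
    for month in months:
        etl_web_to_gcs(year, month, color)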
Then use an agent to orchestrate when and where the flow will run; run a scheduled deployment with:
# http://127.0.0.1:4200/work-pools/work-pool/default-agent-pool?tab=Work+Queues
prefect agent start --pool "default-agent-pool" --work-queue "default"
- Set up a notification [15m32s]
- 2.2.6 : schedules and docker storage with infrastructure
# build and apply deployment
prefect deployment build flows/03_deployments/parameterized_flows.py:etl_parent_flow -n etl2 --cron "0 0 * * *" -a
- ways to productionise flows
- deploy flow code on GitHub/BitBucket/GitLab
- AWS S3 / GCS / Azure blob storage
- store code in docker image and deploy on Docker Hub
/home/jon/code/learn/data-engineering/data-engineering-zoomcamp/week_2_workflow_orchestration/prefect/design/prefect-zoomcamp/Dockerfile
docker image build -t jonwhittlestone/zoom-2.2.6 .
docker image push jonwhittlestone/zoom-2.2.6
- Once the Docker block is created, create a deployment programmatically with:
python flows/03_deployments/docker_deploy.py
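- A minimal sketch of what docker_deploy.py does (Prefect 2.x API; the "zoom" block name and the deployment name are assumptions):
from prefect.deployments import Deployment
from prefect.infrastructure import DockerContainer

from parameterized_flows import etl_parent_flow

# Docker container block created earlier, pointing at the pushed jonwhittlestone/zoom-2.2.6 image (assumed block name)
docker_block = DockerContainer.load("zoom")

deployment = Deployment.build_from_flow(
    flow=etl_parent_flow,
    name="docker-flow",
    infrastructure=docker_block,
)

if __name__ == "__main__":
    deployment.apply()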
prefect profile ls
- 2.2.7 : prefect cloud / additional resources
- Anna's guide on Prefect cloud and GCP
- 2.3.1 : Set up Airflow
- 2.3.2 : Ingesting data to GCP
- Partitioning and Clustering
- with Airflow
- Best Practices
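- Hedged sketch of partitioning + clustering in BigQuery via the Python client (the dataset/table/column names are illustrative, not from the notes):
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
client.query(
    """
    CREATE OR REPLACE TABLE dezoomcamp.rides_partitioned
    PARTITION BY DATE(tpep_pickup_datetime)
    CLUSTER BY VendorID AS
    SELECT * FROM dezoomcamp.rides
    """
).result()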