πŸ“š learn-data-engineering.md

  • tags to look out for (for OVO interview)
    • Terraform to cloud
    • Kafka
    • Labelling
    • Tagging
    • Data ownership
    • cataloging / metadata: advantages
    • data quality
    • monitoring for reliability

zoomcamp 2023

Week 1 - Intro and Prerequisites

https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_1_basics_n_setup

  1. βœ”οΈ Introduction
  2. βŒ› Docker + Postgres
  1. ❌ Dataset / SQL

  2. Section

My notes

  • Q&A:
    • DBT vs. Spark
    • Docker was the most challenging of the weeks
    • commitment: 6-8 hours per week, but it depends on your existing skill level in the subjects
    • BigQuery on GCP vs. Snowflake on AWS
    • chose GCP as it has a generous free tier which covers BigQuery
  • Docker + Postgres
    • we use pgAdmin on the host computer to administer Postgres running within containers
    • can run pipelines in the cloud (AWS Batch, Kubernetes jobs) or serverless (AWS Lambda, Google Cloud Functions)
    docker run -it --entrypoint=bash python:3.11  # or create a Dockerfile to give container 'memory'
  • pip install pgcli, then connect with: pgcli -h localhost -p 5432 -u root -d ny_taxi (ingestion sketch below)
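  • A minimal ingestion sketch (my own, not the course script) showing how a CSV could be loaded into that Postgres container with pandas + SQLAlchemy; the CSV URL is a placeholder, and the password is an assumption to match the pgcli command above:

    # requires: pandas, sqlalchemy, psycopg2-binary
    import pandas as pd
    from sqlalchemy import create_engine

    CSV_URL = "https://.../yellow_tripdata_2021-01.csv"  # placeholder URL

    # user/db match the pgcli command above; password "root" is an assumption
    engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

    # read the file in chunks so a large CSV does not exhaust memory
    for chunk in pd.read_csv(CSV_URL, iterator=True, chunksize=100_000):
        chunk.to_sql("yellow_taxi_data", engine, if_exists="append", index=False)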

Week 2 - Workflow Orchestration

  • main
  • code
  • homework

  • 2.1.1 : what is a data lake; ETL/ELT

  • 2.2.1 : intro workflow orchestration

  • 2.2.2 : Prefect concepts

  • 2.2.3 : ETL with GCP & Prefect

    • Created an ETL flow that fetched data from the web, cleaned it up, and uploaded it to Google Cloud Storage (a minimal sketch appears below)

    • Flow one: ETL into a data lake
    • Create a Prefect Flow for pandas df to parquet to Google Cloud Storage (GCS)
      • once extracted and cleaned up, save a parquet file in data lake (Google Cloud Storage)
    • 1,369,756 rows
    • create a GCP service account block and use credentials key file in Prefect
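    • A minimal sketch of such a flow, reconstructed from these notes (not the exact course code); the GcsBucket block name "zoom-gcs" and the dataset URL are assumptions:

      from pathlib import Path

      import pandas as pd
      from prefect import flow, task
      from prefect_gcp.cloud_storage import GcsBucket  # from the prefect-gcp collection


      @task(retries=3)
      def fetch(dataset_url: str) -> pd.DataFrame:
          """Download the taxi CSV into a dataframe."""
          return pd.read_csv(dataset_url)


      @task()
      def clean(df: pd.DataFrame) -> pd.DataFrame:
          """Fix dtypes, e.g. parse the pickup/dropoff timestamps."""
          df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
          df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])
          return df


      @task()
      def write_local(df: pd.DataFrame, path: Path) -> Path:
          """Save the cleaned dataframe locally as a parquet file."""
          df.to_parquet(path, compression="gzip")
          return path


      @task()
      def write_gcs(path: Path) -> None:
          """Upload the local parquet file to the data lake via the GcsBucket block."""
          gcs_block = GcsBucket.load("zoom-gcs")  # block name is an assumption
          gcs_block.upload_from_path(from_path=path, to_path=path)


      @flow()
      def etl_web_to_gcs() -> None:
          dataset_url = "https://.../yellow_tripdata_2021-01.csv.gz"  # placeholder URL
          df = clean(fetch(dataset_url))
          path = write_local(df, Path("yellow_tripdata_2021-01.parquet"))
          write_gcs(path)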
  • 2.2.4 : Google Cloud Storage ➑ BigQuery

    • Flow two: take data from GCS into the BigQuery data warehouse (a minimal sketch follows below)
      • create the dataset in GCP and create the table 'rides' without data
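    • A minimal sketch of flow two under the same assumptions (the block names "zoom-gcs" / "zoom-gcp-creds" and the project id are placeholders); it uses pandas' to_gbq, which needs the pandas-gbq package:

      from pathlib import Path

      import pandas as pd
      from prefect import flow, task
      from prefect_gcp import GcpCredentials
      from prefect_gcp.cloud_storage import GcsBucket


      @task(retries=3)
      def extract_from_gcs(path: str) -> Path:
          """Download the parquet file written by flow one from the data lake."""
          gcs_block = GcsBucket.load("zoom-gcs")  # block name is an assumption
          gcs_block.get_directory(from_path=path, local_path=".")
          return Path(path)


      @task()
      def write_bq(df: pd.DataFrame) -> None:
          """Append the dataframe to the 'rides' table created above."""
          creds = GcpCredentials.load("zoom-gcp-creds")  # block name is an assumption
          df.to_gbq(
              destination_table="dezoomcamp.rides",
              project_id="my-gcp-project",  # placeholder project id
              credentials=creds.get_credentials_from_service_account(),
              if_exists="append",
          )


      @flow()
      def etl_gcs_to_bq() -> None:
          path = extract_from_gcs("yellow_tripdata_2021-01.parquet")
          write_bq(pd.read_parquet(path))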
  • 2.2.5 : Parametrizing flow

    • original src

    • To use pandas on NixOS (pandas relies on a C extension):

      • Use doylestone02 (it uses Ubuntu 22.04)
      • use an SSH container in vscode
      #export PREFECT_ORION_API_HOST=0.0.0.0
      # see vscode forwarded port
      # Go to: http://127.0.0.1:4200/flow-runs
    • SQL error: sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked

      GitHub issue: PrefectHQ/prefect#10188

      I remember this error once caused one of my flows to fail to run, so I had to set PREFECT_API_SERVICES_FLOW_RUN_NOTIFICATIONS_ENABLED=false to suppress it.

    • Prefect deployments schedule and trigger flow runs via the API

      A deployment definition outlines when, where and how a workflow should be run.

      This means multiple deployments of the same flow can be scheduled to run with different parameters (e.g. yellow and green); a sketch of a parametrised parent flow follows below.
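
      For example (my own reconstruction; the parameter names and the etl_web_to_gcs import are assumptions based on the notes above):

      from prefect import flow

      # the sub-flow from 2.2.3, now assumed to be parametrised by (year, month, color)
      from etl_web_to_gcs import etl_web_to_gcs


      @flow()
      def etl_parent_flow(
          months: list[int] = [1, 2], year: int = 2021, color: str = "yellow"
      ) -> None:
          # each month becomes one run of the sub-flow; a deployment can override
          # months/year/color, which is how separate yellow and green deployments work
          for month in months:
              etl_web_to_gcs(year, month, color)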

      # to build this deployment on the command line (can also be done via python)
      prefect deployment build ./parameterized_flow.py:etl_parent_flow -n "Parameterized ETL"
      
      ...
      
      prefect deployment apply etl_parent_flow-deployment.yaml
      
      # now view at
      # http://127.0.0.1:4200/deployments

      Then use an agent to orchestrate when and where the flow will run; start one to pick up scheduled deployment runs with:

      # http://127.0.0.1:4200/work-pools/work-pool/default-agent-pool?tab=Work+Queues
      
      prefect agent start --pool "default-agent-pool" --work-queue "default"
      
    • Set up a notification [15m32s]

      http://127.0.0.1:4200/notifications/create

  • 2.2.6 : schedules and docker storage with infrastructure

    • # build and apply deployment
      prefect deployment build flows/03_deployments/parameterized_flows.py:etl_parent_flow -n etl2 --cron "0 0 * * *" -a
    • ways to productionise flows
      • deploy flow code on GitHub/BitBucket/GitLab
      • AWS S3 / GCS / Azure blob storage
      • store code in docker image and deploy on Docker Hub
        • /home/jon/code/learn/data-engineering/data-engineering-zoomcamp/week_2_workflow_orchestration/prefect/design/prefect-zoomcamp/Dockerfile
        • docker image build -t jonwhittlestone/zoom-2.2.6 .
        • docker image push jonwhittlestone/zoom-2.2.6
      • Once the Docker block is created, create a deployment programmatically with (a sketch of such a script follows below):
        python flows/03_deployments/docker_deploy.py
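      • a sketch of what docker_deploy.py might contain (my own reconstruction; the DockerContainer block name "zoom" is an assumption, and import paths can vary between Prefect 2.x releases):

        from prefect.deployments import Deployment
        from prefect.infrastructure import DockerContainer  # Prefect 2.x infrastructure block

        # the parametrised parent flow from 2.2.5 (module name taken from the CLI example above)
        from parameterized_flow import etl_parent_flow

        docker_block = DockerContainer.load("zoom")  # block name is an assumption

        docker_dep = Deployment.build_from_flow(
            flow=etl_parent_flow,
            name="docker-flow",
            infrastructure=docker_block,
        )

        if __name__ == "__main__":
            docker_dep.apply()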
      • prefect profile ls
  • 2.2.7 : prefect cloud / additional resources

Week 3 - Data Warehouse

Week 4 - Analytics Engineering

Week 5 - Batch Processing

Week 6 - Streaming

Week 7, 8 & 9 - Project

zoomcamp 2022 (w/ Airflow)

Week 1 - Introduction and prerequisites

Week 2 - Ingestion and orchestration

Week 3 - Data Warehouse (BigQuery)

  • Partitioning and Clustering
  • with Airflow
  • Best Practices

Week 4 - Analytics engineering (DBT)

Week 5 - Batch Processing (Spark)

Week 6 - Streaming (Kafka)

Weeks 7 - 10 Project
