From Software to Data Engineer

Data Engineer's Responsibilities (not exhaustive):

  • Build data platforms
  • Define data architecture and data models
  • Handle data in various formats
  • Create ETL or ELT pipelines as well as streaming data pipelines
  • Schedule and deploy pipelines (a minimal scheduling sketch follows this list)
  • Build frameworks or code for data management activities
  • Make data accessible with the right governance in place
  • Enable self-service access to data
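
To make "schedule and deploy pipelines" concrete, here is a minimal sketch of a daily ETL job expressed as an Airflow DAG (assuming Airflow 2.x). The DAG id, task names, and placeholder extract/transform/load functions are hypothetical examples, not something prescribed by this list.

```python
# Minimal daily ETL DAG sketch (assumes Airflow 2.x is installed and configured).
# The dag_id, task names, and placeholder callables are hypothetical examples.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # e.g. pull raw records from an API or an operational database
    print("extracting raw data")


def transform():
    # e.g. clean, deduplicate, and reshape the raw records
    print("transforming data")


def load():
    # e.g. write the prepared records to a warehouse table
    print("loading data into the warehouse")


with DAG(
    dag_id="daily_sales_etl",          # hypothetical pipeline name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",        # run once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # run the steps in order: extract -> transform -> load
    extract_task >> transform_task >> load_task
```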

Why does data engineering exist? It exists to answer questions like these from data analysts and data scientists:

  • How do I find my data?
  • Every dataset comes in its own format
  • How do I pull and prepare data for my model?
  • How can I get the data into an insight-ready format?

Essential skills:

  • Python (and/or R programming)
  • SQL (SQLZOO)
  • Basic Statistics
  • Data modeling (ETL/ELT) (a minimal sketch follows this list)
  • Data cleaning
  • A BI tool; at Tuft & Needle this is Looker and Metabase
  • Cloud platform experience (Google, Amazon, Microsoft, or IBM); at Tuft & Needle this is AWS and Docker containers
    • One of the hurdles in learning data engineering is setting up a distributed cluster to develop on. Amazon provides a free tier which can be used to learn the distributed technologies, rather than just using your local system.
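
As a small illustration of the data modeling and data cleaning skills above, here is a minimal ETL sketch using pandas plus SQLite from the standard library. The file name, column names, and table name are made-up examples.

```python
# Minimal ETL sketch: extract a CSV, clean it, and load it into a database.
# File name, column names, and table name are hypothetical examples;
# assumes pandas is installed (pip install pandas).
import sqlite3

import pandas as pd

# Extract: read raw data from a CSV file.
raw = pd.read_csv("orders_raw.csv")

# Transform: basic cleaning steps.
clean = (
    raw.dropna(subset=["order_id", "customer_id"])   # drop rows missing keys
       .drop_duplicates(subset=["order_id"])         # remove duplicate orders
       .assign(order_date=lambda df: pd.to_datetime(df["order_date"]))
)

# Load: write the cleaned table into a local SQLite database.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)
```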

Nice to haves:

  • Bayesian statistics and/or machine learning knowledge

Most important books:

Data Engineering Online video courses or MOOCs:

Data Science MOOCs (further education):

How to get into data engineering:

  • Look into AWS: Kinesis (buffering), Lambda (processing framework), S3 and/or DynamoDB (storage), Amazon API Gateway (see the sketch after this list)
  • BI tools: Tableau
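
To illustrate the AWS pieces above, here is a minimal sketch of a Lambda handler that decodes records from a Kinesis event and lands them in S3 via boto3. The bucket name and key prefix are hypothetical, and error handling, batching, and IAM setup are omitted.

```python
# Minimal sketch of a Lambda function triggered by a Kinesis stream.
# Decodes each record and writes it to S3. Bucket name and key prefix are
# hypothetical; assumes boto3 is available (it is bundled in the Lambda runtime).
import base64
import json
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-lake-bucket"  # hypothetical bucket name


def handler(event, context):
    for record in event["Records"]:
        # Kinesis record data arrives base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        doc = json.loads(payload)

        # Land each event as a JSON object in S3 (a very simple data lake write).
        key = f"raw/events/{uuid.uuid4()}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(doc).encode("utf-8"))

    return {"records_processed": len(event["Records"])}
```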

Learning Path - Level 1:

  • Programming Language - Python
  • SQL
  • Data Warehousing Concepts
  • Understand Distributed Computing
  • When to use a data lake vs. a data warehouse vs. an RDBMS vs. NoSQL
  • Master Apache Spark (not sure this is a thing anymore); a minimal PySpark sketch follows this list
  • Understand and pick one database, NoSQL or RDBMS
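
For the Apache Spark bullet, here is a minimal PySpark sketch: read a CSV, run a simple aggregation, and write the result as Parquet. Paths and column names are made-up examples.

```python
# Minimal PySpark sketch: read a CSV, aggregate, and write Parquet.
# File paths and column names are hypothetical; assumes pyspark is installed
# (pip install pyspark) or that this runs on a Spark cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_rollup").getOrCreate()

# Read raw orders with an inferred schema (explicit schemas are better in practice).
orders = spark.read.csv("s3a://my-bucket/raw/orders.csv", header=True, inferSchema=True)

# Aggregate: total revenue per customer.
revenue = (
    orders.groupBy("customer_id")
          .agg(F.sum("amount").alias("total_revenue"))
)

# Write the result as Parquet for downstream queries.
revenue.write.mode("overwrite").parquet("s3a://my-bucket/marts/customer_revenue")

spark.stop()
```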

Learning Path - Level 2:

  • Understand various data architectures (Real Time, Batch, Event Driven, etc.)
  • Learn one streaming platform and processing engine (a minimal consumer sketch follows this list)
  • Pick one cloud provider and master their native data engineering products
  • Focus on cloud data warehouses, cloud big data services, and managed Spark services
  • Create and deploy pipelines on the cloud with cloud-based CI/CD
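
For the streaming bullet, here is a minimal consumer sketch using the kafka-python library. The topic name and broker address are hypothetical; in practice you would use whichever streaming service your chosen cloud provider offers (Kinesis, Pub/Sub, Event Hubs, etc.).

```python
# Minimal streaming-consumer sketch using kafka-python (pip install kafka-python).
# Topic name and broker address are hypothetical examples.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",                               # hypothetical topic
    bootstrap_servers="localhost:9092",     # hypothetical broker
    group_id="orders-etl",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # In a real pipeline this step would clean/enrich the event and write it
    # to a warehouse, data lake, or downstream topic.
    print(f"partition={message.partition} offset={message.offset} event={event}")
```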

Learning Path - Level 3:

  • Deep dive into data architectures and data modeling
  • Understand and build cloud-native data architectures and sandboxes (containers and K8s)
  • Hybrid cloud
  • Focus on data management and data security architecture
  • Build platforms that can democratize data and accelerate analysis