Skip to content

Instantly share code, notes, and snippets.

@donigian
Last active February 16, 2024 10:56
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save donigian/d69afcfb640c4e01d779 to your computer and use it in GitHub Desktop.
Save donigian/d69afcfb640c4e01d779 to your computer and use it in GitHub Desktop.
building a production data warehouse

Building a Production Data Warehouse

So I hear business is going well...customers are flocking to your site/app...people just can't get enough? Congratulations, this is a huge step most companies don't reach.

What now? Well, business analysts, modelers, engineers, customer support and marketing want to access an authoritative, consistent, timely access to your organization's data. Perhaps we can build some beautiful dashboards to visualize KPIs for executive management.

You're in need of a Data Warehouse? Maybe a Data Lake? Where do you start? What sort of considerations & alternatives should you think about.

Turns out there are a variety of things you must take into consideration depending on your requirements. A good place to start is to interview & establish a relationship with team leads to better understand their needs. As you're going through the discovery process, here's an algorithm of the types of things you'll need to consider in your discovery phase.

What is a Data Warehouse? Data Lake?

Data Warehouse is a large store of data accumulated from a wide range of sources within a company and used to guide management decisions.

A Data Lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.

An important distinction I want to emphasize is that a Data Warehouse is a subset of a Data Lake. There are various types/formats of data which may not be available in your Enterprise Data Warehouse by design.

6 Steps to Production Data Warehouse/Lake

###1) Planning During the planning phase, here are some questions you'll want to think about

Cloud

  • Performance vs Price tradeoff
  • Qualified vendors
  • Cost vs on-premise solution
  • Cloud stack features

On-premise

  • How long will it take to build?
  • One-off vs platform
  • Do we have the staff/expertise?

As a Service

  • Identify the cons to this cloud agnostic, managed & automated service

Building blocks of a well designed Data Warehouse/Lake uses the following:

Component AWS
Compute Elastic Compute Cloud (EC2)
Disk Storage Elastic Block Storage (EBS)
Object Storage Amazon Simple Storage (S3)
Network Virtual Private Cloud (VPC)
Key Management AWS Key Management Service

###2) Provisioning ####Workload & Node types Bigger nodes aren't always better, find a sweet spot. Figure out if your workload is driven by EBS I/O or compute relative to cost.

Technology

Which technology? Which vendor?

  • ANSI SQL?

  • UDF?

  • What happens when queries run out of memory? Performance cliff?

  • Concurrency

  • Hadoop (catch all)

  • MPP SQL

  • Spark • Pipeline architecture? – Centralized

  • Distributed • Storage architecture

  • Persistent object store vs. instance store? Price/performance impact

  • Data collection, movement and ingest/extract architecture critical

  • What's the best way to perform configuration management?

###3) Enterprise Integration

Data Sources

  • Inernal
  • Third Party Data
  • Structured log files (batch or streaming)

Data Kinematics

  • Ingestion rate?
  • Update rate?
  • Compression?

Schema management

  • What format are files stored in?
  • Metadata needs to be associated with creation of schema to store the data

Monitoring & Alerting

  • What scenarios constitute a failure?
  • Can/How do we recover?
  • Can you undo?

###4) Compliance/Security

  • Stand up a VPC
  • Lock down single tenant VPC
  • Audit & log everything
  • Compliance to legal regulations
  • Encrypt data at rest
  • Encrypt data going in/out (in-flight)

###5) SLA

End-2-End Monitoring

  • Performance
  • Health
  • Dashboards for data pipelines

###6) Data Pipelines

  • Refresh rate
  • Which frameworks to use (if any)?
  • Would we ever need data marts? If so when?
  • Scheduler needs to setup to run jobs periodically

The steps above are simply a starting point to get your thinking of how your environment should look like!

@vastris
Copy link

vastris commented Feb 16, 2024

Your detailed breakdown on building a production data warehouse is a valuable guide for companies navigating the challenges of scaling their data infrastructure. The considerations you outline, from cloud options to technology choices and enterprise integration, align well with the broader understanding of an enterprise data warehouse and its value in business intelligence, as eloquently explained in this LinkedIn article: https://www.linkedin.com/pulse/what-enterprise-data-warehouse-its-value-business-intelligence-nhogf/. Your comprehensive approach serves as a roadmap for organizations seeking to establish a robust data environment to support their growing needs and enhance decision-making capabilities.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment