Data Vault 2.0 + Greenplum + dbtVault assignment

Assignment

Data Warehouse Analyst – Analytics Engineer @ OTUS.ru

Follow the instructions in data_vault_tutorial.md step by step.

Add a README.md with answers to the following questions:

1. Load 4-5 days into the Data Vault.

Experiment with materialization types: view vs table (see the sketch below).

(?) Measure and compare the results.
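
A minimal sketch of switching materialization per model. The file path and model name are illustrative, not the demo project's actual ones; only the config() call is standard dbt:

-- models/raw_vault/hubs/hub_customer.sql (hypothetical path)
-- Toggle between 'view' and 'table', then re-run the model and compare
-- build time (dbt log) and query latency on downstream queries.
{{ config(materialized='table') }}
-- ...the existing model body (e.g. the dbtvault macro call) stays unchanged.

Table materialization pays the build cost up front but should make the queries in step 3 faster; a view re-executes its logic on every query.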

2. Prepare Point-in-Time & Bridge Tables

Create models that combine Hubs + Satellites and Hubs + Links.

You can either develop one-off models or automate them with a dbt macro.

Get inspired by dbtVault; a minimal PIT sketch follows.
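
The sketch below assumes a hub hub_customer, a satellite sat_customer_details, and a date spine as_of_date (one row per snapshot date); all three names are placeholders, not the demo project's actual models. For every hub key and snapshot date it records the latest satellite load_date visible at that point in time:

-- Hypothetical PIT: every table and column name below is an assumption.
SELECT
    h.customer_pk,
    d.as_of_date,
    max(s.load_date) AS sat_customer_details_ldts
FROM hub_customer AS h
CROSS JOIN as_of_date AS d                 -- date spine: one row per snapshot date
LEFT JOIN sat_customer_details AS s
       ON s.customer_pk = h.customer_pk
      AND s.load_date <= d.as_of_date      -- only satellite rows already loaded at the snapshot
GROUP BY
    h.customer_pk,
    d.as_of_date;

A Bridge table can be built the same way, joining a hub to its links per snapshot date.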

3. Now run a couple of queries on top of the models you have built.

(?) Adapt these queries to your PIT and Bridge tables and attach the query results (an adaptation sketch follows the queries):

-- Q1: pricing summary report (TPC-H Q1)
SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
FROM
    lineitem
WHERE
    l_shipdate <= date '1998-12-01' - interval '90' day
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;


-- Q2: shipping priority (TPC-H Q3)
SELECT
    l_orderkey,
    sum(l_extendedprice * (1 - l_discount)) as revenue,
    o_orderdate,
    o_shippriority
FROM
    customer,
    orders,
    lineitem
WHERE
    c_mktsegment = 'BUILDING'
    AND c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate < date '1995-03-15'
    AND l_shipdate > date '1995-03-15'
GROUP BY
    l_orderkey,
    o_orderdate,
    o_shippriority
ORDER BY
    revenue desc,
    o_orderdate
LIMIT 20;

-- Q3: promotion effect (TPC-H Q14)
SELECT
    100.00 * sum(case
        when p_type like 'PROMO%'
            then l_extendedprice * (1 - l_discount)
        else 0
    end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
FROM
    lineitem,
    part
WHERE
    l_partkey = p_partkey
    AND l_shipdate >= date '1995-09-01'
    AND l_shipdate < date '1995-09-01' + interval '1' month;
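
A sketch of what an adaptation could look like, assuming raw-vault names pit_lineitem and sat_lineitem_details (both hypothetical, as are their columns); the idea is to resolve the current satellite version through the PIT instead of reading the raw lineitem table:

-- Hypothetical adaptation of Q1: table and column names are assumptions.
SELECT
    s.l_returnflag,
    s.l_linestatus,
    sum(s.l_quantity) AS sum_qty,
    sum(s.l_extendedprice) AS sum_base_price,
    sum(s.l_extendedprice * (1 - s.l_discount)) AS sum_disc_price,
    sum(s.l_extendedprice * (1 - s.l_discount) * (1 + s.l_tax)) AS sum_charge,
    avg(s.l_quantity) AS avg_qty,
    avg(s.l_extendedprice) AS avg_price,
    avg(s.l_discount) AS avg_disc,
    count(*) AS count_order
FROM pit_lineitem AS p                                  -- one row per lineitem key and snapshot date
JOIN sat_lineitem_details AS s
  ON s.lineitem_pk = p.lineitem_pk
 AND s.load_date = p.sat_lineitem_details_ldts          -- pick the satellite version valid at the snapshot
WHERE p.as_of_date = (SELECT max(as_of_date) FROM pit_lineitem)  -- latest snapshot
  AND s.l_shipdate <= date '1998-12-01' - interval '90' day
GROUP BY
    s.l_returnflag,
    s.l_linestatus
ORDER BY
    s.l_returnflag,
    s.l_linestatus;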

4. (optional) Compare disk space

Compare the space used to store the raw tables vs the Data Vault (see the size query below).

Don't forget to filter to an equal number of days.
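
One way to compare sizes with standard PostgreSQL/Greenplum size functions; sat_lineitem_details is a placeholder for one of your own Data Vault models:

-- The raw lineitem table is partitioned, so sum the sizes of the parent and its leaf partitions.
SELECT pg_size_pretty(sum(pg_total_relation_size(c.oid))) AS raw_lineitem_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
  AND c.relkind = 'r'
  AND c.relname LIKE 'lineitem%';   -- parent plus partition children (lineitem_1_prt_...)

-- A non-partitioned Data Vault model can be measured directly (name assumed):
SELECT pg_size_pretty(pg_total_relation_size('sat_lineitem_details')) AS sat_size;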

5. (optional) Play with ClickHouse

Is it possible to run Data Vault on ClickHouse? Will it be able to handle incremental materialization for the raw vault?
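
For reference, a hedged sketch of how a hub might be expressed in ClickHouse (all names and types below are assumptions). ClickHouse has no MERGE-style upserts, but a ReplacingMergeTree engine deduplicates rows with the same sorting key at merge time, which is one way to approximate incremental hub loads:

-- Hypothetical ClickHouse DDL; adapt names and types to your own staging layer.
CREATE TABLE hub_customer
(
    customer_pk   FixedString(32),   -- hex MD5 hash key, mirroring dbtVault's default hashing
    customer_id   Int64,             -- business key
    load_date     DateTime,
    record_source String
)
ENGINE = ReplacingMergeTree(load_date)   -- keeps the row with the latest load_date per sorting key
ORDER BY customer_pk;

Note that deduplication is eventual (it happens during background merges), so exactly-once semantics for the raw vault need extra care.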

1. Create a Greenplum cluster on Yandex.Cloud

Request access to the Managed Greenplum Service (beta), then create a cluster.

The configuration I used:

2. Spin up a VM + Prepare raw data

Spin up a VM: detailed instructions are available in the DWH: Airbyte + Yandex.Metrika assignment.

yc compute instance create \
    --name dbgen \
    --ssh-key ~/.ssh/key.pub \
    --create-boot-disk image-folder-id=standard-images,image-family=ubuntu-1804-lts,size=50,auto-delete=true \
    --network-interface subnet-name=default-ru-central1-a,nat-ip-version=ipv4 \
    --memory 2G \
    --cores 2 \
    --hostname dbgen

ssh -i ~/.ssh/key yc-user@{external_ip} # ssh to VM using public IP

sudo apt-get install -y gcc git awscli postgresql # install libs

2a. Generate data with DBGen

git clone https://github.com/electrum/tpch-dbgen.git # TPCH generator
cd tpch-dbgen
make -f makefile.suite # build dbgen using the provided makefile

./dbgen -v -h -s 10 # generate data

for i in `ls *.tbl`; do sed 's/|$//' $i > ${i/tbl/csv}; echo $i; done; # convert to a CSV format compatible with PostgreSQL

Read more at:

2b. Access generated data from Object Storage

Alternatively, just get all the files from s3://otus-dwh/tpch-dbgen/ (Yandex Object Storage) via curl or the AWS CLI.

mkdir tpch && cd tpch

# option 1 – curl
curl -O "https://storage.yandexcloud.net/otus-dwh/tpch-dbgen/{customer,lineitem,nation,orders,part,partsupp,region,supplier}.csv"

# option 2 – aws s3
aws configure # enter your Key ID / Secret Key
aws --endpoint-url=https://storage.yandexcloud.net s3 ls s3://otus-dwh/tpch-dbgen/ # list files
aws --endpoint-url=https://storage.yandexcloud.net s3 sync s3://otus-dwh/tpch-dbgen/ . # sync files

Configure AWS CLI to access Yandex Object Storage:

  • Create a service account.
  • Assign a role to a service account.
  • Create a static access key.

3. COPY to database

First create the table definitions.

Then load the data into them.

-- DDL scripts to create the tables

CREATE TABLE customer
(C_CUSTKEY INT, 
C_NAME VARCHAR(25),
C_ADDRESS VARCHAR(40),
C_NATIONKEY INTEGER,
C_PHONE CHAR(15),
C_ACCTBAL DECIMAL(15,2),
C_MKTSEGMENT CHAR(10),
C_COMMENT VARCHAR(117))
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (C_CUSTKEY);

CREATE TABLE lineitem
(L_ORDERKEY BIGINT,
L_PARTKEY INT,
L_SUPPKEY INT,
L_LINENUMBER INTEGER,
L_QUANTITY DECIMAL(15,2),
L_EXTENDEDPRICE DECIMAL(15,2),
L_DISCOUNT DECIMAL(15,2),
L_TAX DECIMAL(15,2),
L_RETURNFLAG CHAR(1),
L_LINESTATUS CHAR(1),
L_SHIPDATE DATE,
L_COMMITDATE DATE,
L_RECEIPTDATE DATE,
L_SHIPINSTRUCT CHAR(25),
L_SHIPMODE CHAR(10),
L_COMMENT VARCHAR(44))
WITH (appendonly=true, orientation=column, compresstype=ZSTD)
DISTRIBUTED BY (L_ORDERKEY,L_LINENUMBER)
PARTITION BY RANGE (L_SHIPDATE)
(start('1992-01-01') INCLUSIVE end ('1998-12-31') INCLUSIVE every (interval '30 days'),
default partition others);

CREATE TABLE nation
(N_NATIONKEY INTEGER, 
N_NAME CHAR(25), 
N_REGIONKEY INTEGER, 
N_COMMENT VARCHAR(152))
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (N_NATIONKEY);

CREATE TABLE orders
(O_ORDERKEY BIGINT,
O_CUSTKEY INT,
O_ORDERSTATUS CHAR(1),
O_TOTALPRICE DECIMAL(15,2),
O_ORDERDATE DATE,
O_ORDERPRIORITY CHAR(15), 
O_CLERK  CHAR(15), 
O_SHIPPRIORITY INTEGER,
O_COMMENT VARCHAR(79))
WITH (appendonly=true, orientation=column, compresstype=ZSTD)
DISTRIBUTED BY (O_ORDERKEY)
PARTITION BY RANGE (O_ORDERDATE)
(start('1992-01-01') INCLUSIVE end ('1998-12-31') INCLUSIVE every (interval '30 days'),
default partition others);

CREATE TABLE part
(P_PARTKEY INT,
P_NAME VARCHAR(55),
P_MFGR CHAR(25),
P_BRAND CHAR(10),
P_TYPE VARCHAR(25),
P_SIZE INTEGER,
P_CONTAINER CHAR(10),
P_RETAILPRICE DECIMAL(15,2),
P_COMMENT VARCHAR(23))
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (P_PARTKEY);

CREATE TABLE partsupp
(PS_PARTKEY INT,
PS_SUPPKEY INT,
PS_AVAILQTY INTEGER,
PS_SUPPLYCOST DECIMAL(15,2),
PS_COMMENT VARCHAR(199))
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (PS_PARTKEY,PS_SUPPKEY);

CREATE TABLE region
(R_REGIONKEY INTEGER, 
R_NAME CHAR(25),
R_COMMENT VARCHAR(152))
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (R_REGIONKEY);

CREATE TABLE supplier 
(S_SUPPKEY INT,
S_NAME CHAR(25),
S_ADDRESS VARCHAR(40),
S_NATIONKEY INTEGER,
S_PHONE CHAR(15),
S_ACCTBAL DECIMAL(15,2),
S_COMMENT VARCHAR(101))
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (S_SUPPKEY);

On a VM with psql installed, execute COPY commands pointing to the local CSV files (a quick row-count check follows the COPY block):

export GREENPLUM_URI="postgres://greenplum:<pass>@<host>:5432/postgres"
psql $GREENPLUM_URI

\copy customer from  '/home/dbgen/tpch-dbgen/data/customer.csv' WITH (FORMAT csv, DELIMITER '|');
\copy lineitem from  '/home/dbgen/tpch-dbgen/data/lineitem.csv' WITH (FORMAT csv, DELIMITER '|');
\copy nation from  '/home/dbgen/tpch-dbgen/data/nation.csv' WITH (FORMAT csv, DELIMITER '|');
\copy orders from  '/home/dbgen/tpch-dbgen/data/orders.csv' WITH (FORMAT csv, DELIMITER '|');
\copy part from  '/home/dbgen/tpch-dbgen/data/part.csv' WITH (FORMAT csv, DELIMITER '|');
\copy partsupp from  '/home/dbgen/tpch-dbgen/data/partsupp.csv' WITH (FORMAT csv, DELIMITER '|');
\copy region from  '/home/dbgen/tpch-dbgen/data/region.csv' WITH (FORMAT csv, DELIMITER '|');
\copy supplier from  '/home/dbgen/tpch-dbgen/data/supplier.csv' WITH (FORMAT csv, DELIMITER '|');
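
A quick sanity check after loading, assuming the table names from the DDL above:

-- Compare these counts with the line counts of the generated CSV files (wc -l).
SELECT 'customer' AS tbl, count(*) AS row_count FROM customer
UNION ALL SELECT 'orders', count(*) FROM orders
UNION ALL SELECT 'lineitem', count(*) FROM lineitem
ORDER BY tbl;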

4. Run dbtVault + Greenplum demo

1. First read the official guide:

dbtVault worked example

2. Clone the repo with the dbt project

Clone the demo repo: https://github.com/kzzzr/dbtvault_greenplum_demo

git clone https://github.com/kzzzr/dbtvault_greenplum_demo.git

3. Configure database connection

Example profiles.yml

config:
  send_anonymous_usage_stats: False
  use_colors: True
  partial_parse: True

dbtvault_greenplum_demo:
  outputs:
    dev:
      type: postgres
      threads: 2
      host: {yc-greenplum-host}
      port: 5432
      user: greenplum
      pass: {yc-greenplum-pass}
      dbname: postgres
      schema: public
  target: dev

4. Make sure you run on dbt==0.19.0

You may use the repo's Pipfile with pipenv or install dbt yourself.

pipenv install
pipenv shell

dbt debug # check if OK

5. Install dependencies

The initial repo is intended to run on Snowflake only.

I have forked it and adapted it to run on Greenplum/PostgreSQL. Check out what has changed: 47e0261cea67c3284ea409c86dacdc31b1175a39

packages.yml:

packages:
  # - package: Datavault-UK/dbtvault
  #   version: 0.7.3
  - git: "https://github.com/kzzzr/dbtvault.git"
    revision: master
    warn-unpinned: false

Install package:

dbt deps

6. Adapt models to Greenplum/PostgreSQL

Check out the commit history.

  • a97a224 - adapt prepared staging layer for greenplum - Artemiy Kozyr (HEAD -> master, kzzzr/master)
  • dfc5866 - configure raw layer for greenplum - Artemiy Kozyr
  • bba7437 - configure data sources for greenplum - Artemiy Kozyr
  • aa25600 - configure package (adapted dbt_vault) for greenplum - Artemiy Kozyr
  • eafed95 - configure dbt_project.yml for greenplum - Artemiy Kozyr

7. Run models step-by-step

Load one day into the Data Vault structures, then verify the load with the query below the commands:

dbt run -m tag:raw
dbt run -m tag:stage
dbt run -m tag:hub
dbt run -m tag:link
dbt run -m tag:satellite
dbt run -m tag:t_link
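
A hypothetical check that the first day actually landed; the model names are assumptions, substitute the ones created by the run above:

-- Each raw-vault model should now contain rows for the single load_date configured in vars.
SELECT 'hub_customer' AS model, count(*) AS row_count FROM hub_customer
UNION ALL SELECT 'link_customer_order', count(*) FROM link_customer_order
UNION ALL SELECT 'sat_customer_details', count(*) FROM sat_customer_details;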

8. Load next day

Simulate the next day's load by incrementing the load_date variable, then re-run the models and check the result with the query below:

# dbt_project.yml

vars:
  load_date: '1992-01-08' # increment by one day '1992-01-09'
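
After re-running the models for the second day, an incremental satellite should show one batch per load date (the satellite name and load_date column follow dbtVault conventions but are assumptions here):

-- Two load dates should appear after two runs; each additional day adds one more batch.
SELECT load_date, count(*) AS row_count
FROM sat_customer_details
GROUP BY load_date
ORDER BY load_date;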

kzzzr commented Feb 28, 2022:

  • Adapt to Postgres / ClickHouse (Greenplum is in Preview mode and not generally available)
  • Double-check the steps in the DBGen instructions or remove them

kzzzr commented Feb 28, 2022:

  • Pay attention to the dbtVault module (multiple macros, resulting SQL queries)
  • Generate PIT + Bridge tables via a macro (adapt dbtVault)
  • Dockerize the environment
