humorless/dbt-duckdb-tutorial.md

## dbt-duckdb-tutorial.md

      
    Raw
  

              dbt-duckdb-tutorial.md
            
          
    Modern data stack

Main features

REPLWARE 建議的資料分析技術棧 (modern data stack) 主要有下列特色：

ELT over ETL
SQL based analytics over non-SQL based analytics
Analytic Engineer as a new position
When the data is not exceeding 1T, your desktop/notebook is fast enough.

參考資料：

ELT方法的變革對數據驅動營運的重要性
Holistics Analytics Setup Guidebook
The analytic engineering guide: build data team
big data is dead

為何要用 SQL 取代 spreadsheet

當分析師在利用 spreadsheet 做分析工作時，每次向右邊拉出一個新的「欄」(column) ，就是在做一次的資料建模 (data modeling)。然而，由於上述的資料建模方式，每一次的建模只能表現一個維度 (dimension)，這樣子的方式，對於資料建模而言，表現的語彙是相當受限的。另一方面，如果是使用 SQL 的話，每一次的查詢 (query) 就是在做一次的資料建模 (data modeling)。由於，生成的資料是兩個維度的表格 (table)，表現能力自然會豐富許多。
需要的 technologies & tools


git
SQL formatter for Java
duckdb                        ;; data warehouse
dbt   ;; Transform - T
metabase                ;; Dashboard & Report
database schema graph  　　 ;; Database Schema Graph
Optional - nvim (用來編輯 yaml 與 sql)

課程規畫

lesson 1 - install software


config git
config nvim
安裝 pyenv (dbt 會需要 pip, python)
安裝 dbt
安裝 command-line duckdb

Optional

讀 Main features 那一段落的參考資料
https://www.startdataengineering.com/post/dbt-data-build-tool-tutorial/

lesson 2 - play in the example dbt project

玩一下 jaffle_shop 這個 example project 。
lesson 3 - start dbt from scratch


dbt init -> 生成 ~/.dbt/profiles.yml 與 project_name directory
duckdb 登入 database duckdb dbt.duckdb
登入 duckdb 之後，嘗試幾個指令： 參考

select * from duckdb_tables;
select * from duckdb_views;
select * from duckdb_schemas;
decribe [TABLE] 顯示 TABLE 的定義
.read [SQL_CMD_FILE_NAME]
.exit


Example content of ~/.dbt/profiles.yml
duck:
  outputs:
    dev:
      type: duckdb
      path: /Users/laurencechen/analytics/duck/dbt.duckdb  
  target: dev

lesson 4 - import data

dimension data


dbt seed
dbt seed --full-refresh

facts data


source
useful SQL command: CREATE TABLE, DROP TABLE, CREATE SCHEMA

TODO

補充其它有效的 load data into duckdb 方式
lesson 5 - basic SQL & modeling through dbt (source & ref)


The concept of primary key
3 types of table relationships: 1-to-1, 1-to-m, m-to-m
寫 dbt 的 model: 使用 ref, source
利用 dbt docs 來看 DAG
如何利用 SQL views 來做建模

lesson 6 - metabase


與 dbt 整合
Basic skill: dimension and measure
Metabase specific high-level semantic: segments and metrics
Core func 1: Visualization
Core func 2: Dashboard
Optional: Automation

command to execute metabase
java -jar metabase.jar

lesson 7 - basic SQL & reporting


4 種 SQL 的基礎語法: select *, select columns, where, join explaining join
group by, having, order by, aggregation functions
1Keydata SQL tutorial
distinct 的常見 4 種用法: distinct, distinct on, is distinct from, distinct in aggregation fun
Query Process Steps

Getting Data (From, Join)
Row Filter (Where)
Grouping (Group by)
Aggregate function
Group Filter (Having)
Window Function
SELECT
Distinct
Union
Order by
Offset
Limit/Fetch/Top


lesson 8 - intermediate SQL


NULL, IS NULL, IS NOT NULL, COALESCE
UNION, UNION ALL
GROUPING SET, ROLLUP, CUBE
There are 3 places to put filtering expressions:

case/end inside aggregation function
where
having


Pivot table created by SQL: using case end inside aggregation function with group by

lesson 9 - advanced SQL


Conceptual, Logical, And Physical Data Models
lateral join
window function: running sum, delta, rank

over

running => with order by, the default window frame is range between unbounded preceding and current row
moving => without order by, the default window frame is rows between unbounded preceding and unbounded following


aggregate functions, ranking functions, analytic functions


date spine & generate_series
ARRAY_AGG with GROUP BY

CREATE TABLE bar AS SELECT * FROM ( VALUES                                      
  (1, 2, 3),                                                                    
  (1, 2, 4),                                                                    
  (1, 2, 5),                                                                    
  (2, 2, 3),                                                                    
  (2, 2, 4),                                                                    
  (2, 3, 5)                                                                     
) AS t(x,y,z);                                                                  
                                                                                
select x, (array_agg(y))[3]  from bar group by x;                               
=>                                                                              
  x │ array_agg                                                                
──┼───────────                                                    
  2 │   {2,2,2} ->     2                                                       
  1 │   {2,2,3} ->     3     

lesson 10 - test & snapshot


dbt test & dbt test --select
四種 tests: unique, not_null, accepted_values, relationships
dbt snapshot

lesson 11 - jinja & macro


jinja

if
set
for


install plugin -> dbt deps (dbt-labs/dbt_utils, dbt-labs/codegen)
write your own macro
manage database UDF by dbt

lesson 12 - hook & operation & dbt project checklist


dbt hooks
dbt on-run-start, on-run-end
dbt run-operation
dbt project checkList

lesson 13 - Data Review


Profiling
Compare
Data lineage diff

lesson 14 - EL & CDC


Singer
Meltano
Real Time Analytics & stream processing

CDC: change data capture


Appendix - Analytic Engineering toolkit


SQL
Jinja/Macro
dbt test
Metabase - visualization & dashboard