Skip to content

Instantly share code, notes, and snippets.

@humorless
Last active August 25, 2023 12:24
Show Gist options
  • Save humorless/e1ca70f85a7174d1706ce3f9e87b6764 to your computer and use it in GitHub Desktop.
Save humorless/e1ca70f85a7174d1706ce3f9e87b6764 to your computer and use it in GitHub Desktop.
Modern data stack by REPLWARE

Modern data stack

Main features

REPLWARE 建議的資料分析技術棧 (modern data stack) 主要有下列特色:

  • ELT over ETL
  • SQL based analytics over non-SQL based analytics
  • Analytic Engineer as a new position
  • When the data is not exceeding 1T, your desktop/notebook is fast enough.

參考資料:

  1. ELT方法的變革對數據驅動營運的重要性
  2. Holistics Analytics Setup Guidebook
  3. The analytic engineering guide: build data team
  4. big data is dead

為何要用 SQL 取代 spreadsheet

當分析師在利用 spreadsheet 做分析工作時,每次向右邊拉出一個新的「欄」(column) ,就是在做一次的資料建模 (data modeling)。然而,由於上述的資料建模方式,每一次的建模只能表現一個維度 (dimension),這樣子的方式,對於資料建模而言,表現的語彙是相當受限的。另一方面,如果是使用 SQL 的話,每一次的查詢 (query) 就是在做一次的資料建模 (data modeling)。由於,生成的資料是兩個維度的表格 (table),表現能力自然會豐富許多。

需要的 technologies & tools

  1. git
  2. SQL formatter for Java
  3. duckdb ;; data warehouse
  4. dbt ;; Transform - T
  5. metabase ;; Dashboard & Report
  6. database schema graph    ;; Database Schema Graph
  7. Optional - nvim (用來編輯 yaml 與 sql)

課程規畫

lesson 1 - install software

  1. config git
  2. config nvim
  3. 安裝 pyenv (dbt 會需要 pip, python)
  4. 安裝 dbt
  5. 安裝 command-line duckdb

Optional

  1. 讀 Main features 那一段落的參考資料
  2. https://www.startdataengineering.com/post/dbt-data-build-tool-tutorial/

lesson 2 - play in the example dbt project

玩一下 jaffle_shop 這個 example project 。

lesson 3 - start dbt from scratch

  1. dbt init -> 生成 ~/.dbt/profiles.ymlproject_name directory
  2. duckdb 登入 database duckdb dbt.duckdb
  3. 登入 duckdb 之後,嘗試幾個指令: 參考
    • select * from duckdb_tables;
    • select * from duckdb_views;
    • select * from duckdb_schemas;
    • decribe [TABLE] 顯示 TABLE 的定義
    • .read [SQL_CMD_FILE_NAME]
    • .exit

Example content of ~/.dbt/profiles.yml

duck:
  outputs:
    dev:
      type: duckdb
      path: /Users/laurencechen/analytics/duck/dbt.duckdb  
  target: dev

lesson 4 - import data

dimension data

  1. dbt seed
  2. dbt seed --full-refresh

facts data

  1. source
  2. useful SQL command: CREATE TABLE, DROP TABLE, CREATE SCHEMA

TODO

補充其它有效的 load data into duckdb 方式

lesson 5 - basic SQL & modeling through dbt (source & ref)

  1. The concept of primary key
  2. 3 types of table relationships: 1-to-1, 1-to-m, m-to-m
  3. 寫 dbt 的 model: 使用 ref, source
  4. 利用 dbt docs 來看 DAG
  5. 如何利用 SQL views 來做建模

lesson 6 - metabase

  1. 與 dbt 整合
  2. Basic skill: dimension and measure
  3. Metabase specific high-level semantic: segments and metrics
  4. Core func 1: Visualization
  5. Core func 2: Dashboard
  6. Optional: Automation

command to execute metabase

java -jar metabase.jar

lesson 7 - basic SQL & reporting

  1. 4 種 SQL 的基礎語法: select *, select columns, where, join explaining join
  2. group by, having, order by, aggregation functions
  3. 1Keydata SQL tutorial
  4. distinct 的常見 4 種用法: distinct, distinct on, is distinct from, distinct in aggregation fun
  5. Query Process Steps
    1. Getting Data (From, Join)
    2. Row Filter (Where)
    3. Grouping (Group by)
    4. Aggregate function
    5. Group Filter (Having)
    6. Window Function
    7. SELECT
    8. Distinct
    9. Union
    10. Order by
    11. Offset
    12. Limit/Fetch/Top

lesson 8 - intermediate SQL

  1. NULL, IS NULL, IS NOT NULL, COALESCE
  2. UNION, UNION ALL
  3. GROUPING SET, ROLLUP, CUBE
  4. There are 3 places to put filtering expressions:
    • case/end inside aggregation function
    • where
    • having
  5. Pivot table created by SQL: using case end inside aggregation function with group by

lesson 9 - advanced SQL

  1. Conceptual, Logical, And Physical Data Models
  2. lateral join
  3. window function: running sum, delta, rank
    • over
      • running => with order by, the default window frame is range between unbounded preceding and current row
      • moving => without order by, the default window frame is rows between unbounded preceding and unbounded following
    • aggregate functions, ranking functions, analytic functions
  4. date spine & generate_series
  5. ARRAY_AGG with GROUP BY
CREATE TABLE bar AS SELECT * FROM ( VALUES                                      
  (1, 2, 3),                                                                    
  (1, 2, 4),                                                                    
  (1, 2, 5),                                                                    
  (2, 2, 3),                                                                    
  (2, 2, 4),                                                                    
  (2, 3, 5)                                                                     
) AS t(x,y,z);                                                                  
                                                                                
select x, (array_agg(y))[3]  from bar group by x;                               
=>                                                                              
  x │ array_agg                                                                
──┼───────────                                                    
  2 │   {2,2,2} ->     2                                                       
  1 │   {2,2,3} ->     3     

lesson 10 - test & snapshot

  1. dbt test & dbt test --select
  2. 四種 tests: unique, not_null, accepted_values, relationships
  3. dbt snapshot

lesson 11 - jinja & macro

  1. jinja
    • if
    • set
    • for
  2. install plugin -> dbt deps (dbt-labs/dbt_utils, dbt-labs/codegen)
  3. write your own macro
  4. manage database UDF by dbt

lesson 12 - hook & operation & dbt project checklist

  1. dbt hooks
  2. dbt on-run-start, on-run-end
  3. dbt run-operation
  4. dbt project checkList

lesson 13 - Data Review

lesson 14 - EL & CDC

  1. Singer
  2. Meltano
  3. Real Time Analytics & stream processing
    • CDC: change data capture

Appendix - Analytic Engineering toolkit

  1. SQL
  2. Jinja/Macro
  3. dbt test
  4. Metabase - visualization & dashboard
@humorless
Copy link
Author

humorless commented Aug 17, 2023

Configure Neovim for SQL/yaml

安裝 neovim 的插件管理 plug

在 terminal 先執行

sh -c 'curl -fLo "${XDG_DATA_HOME:-$HOME/.local/share}"/nvim/site/autoload/plug.vim --create-dirs \
       https://raw.githubusercontent.com/junegunn/vim-plug/master/plug.vim'

neovim config file

再把下列的 config 貼到 ~/.config/nvim/init.vim

" default yaml setup
autocmd FileType yaml setlocal ts=2 sts=2 sw=2 expandtab indentkeys-=0# indentkeys-=<:> foldmethod=indent nofoldenable

" Begin the plugin section
call plug#begin()
" Specify your required plugins here

Plug 'liuchengxu/vim-better-default'
call plug#end()

function! Sqlfmt()
 "`:call Sqlfmt()` to format the current sql file
 " The sqlfmt requires the installation of sqlparse
 " `pip install sqlparse`
 !sqlformat --reindent --keywords upper --identifiers lower % -o %
 :e
endfunction

function! Yamlfmt()
 "`:call Yamlfmt()` to format the current yaml file
 " pip install yamlfix
 !yamlfix %
 :e
endfunction

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment