Skip to content

Instantly share code, notes, and snippets.

@humorless
Last active August 25, 2023 12:24
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save humorless/e1ca70f85a7174d1706ce3f9e87b6764 to your computer and use it in GitHub Desktop.
Save humorless/e1ca70f85a7174d1706ce3f9e87b6764 to your computer and use it in GitHub Desktop.
Modern data stack by REPLWARE

Modern data stack

Main features

REPLWARE 建議的資料分析技術棧 (modern data stack) 主要有下列特色:

  • ELT over ETL
  • SQL based analytics over non-SQL based analytics
  • Analytic Engineer as a new position
  • When the data is not exceeding 1T, your desktop/notebook is fast enough.

參考資料:

  1. ELT方法的變革對數據驅動營運的重要性
  2. Holistics Analytics Setup Guidebook
  3. The analytic engineering guide: build data team
  4. big data is dead

為何要用 SQL 取代 spreadsheet

當分析師在利用 spreadsheet 做分析工作時,每次向右邊拉出一個新的「欄」(column) ,就是在做一次的資料建模 (data modeling)。然而,由於上述的資料建模方式,每一次的建模只能表現一個維度 (dimension),這樣子的方式,對於資料建模而言,表現的語彙是相當受限的。另一方面,如果是使用 SQL 的話,每一次的查詢 (query) 就是在做一次的資料建模 (data modeling)。由於,生成的資料是兩個維度的表格 (table),表現能力自然會豐富許多。

需要的 technologies & tools

  1. git
  2. SQL formatter for Java
  3. duckdb ;; data warehouse
  4. dbt ;; Transform - T
  5. metabase ;; Dashboard & Report
  6. database schema graph    ;; Database Schema Graph
  7. Optional - nvim (用來編輯 yaml 與 sql)

課程規畫

lesson 1 - install software

  1. config git
  2. config nvim
  3. 安裝 pyenv (dbt 會需要 pip, python)
  4. 安裝 dbt
  5. 安裝 command-line duckdb

Optional

  1. 讀 Main features 那一段落的參考資料
  2. https://www.startdataengineering.com/post/dbt-data-build-tool-tutorial/

lesson 2 - play in the example dbt project

玩一下 jaffle_shop 這個 example project 。

lesson 3 - start dbt from scratch

  1. dbt init -> 生成 ~/.dbt/profiles.ymlproject_name directory
  2. duckdb 登入 database duckdb dbt.duckdb
  3. 登入 duckdb 之後,嘗試幾個指令: 參考
    • select * from duckdb_tables;
    • select * from duckdb_views;
    • select * from duckdb_schemas;
    • decribe [TABLE] 顯示 TABLE 的定義
    • .read [SQL_CMD_FILE_NAME]
    • .exit

Example content of ~/.dbt/profiles.yml

duck:
  outputs:
    dev:
      type: duckdb
      path: /Users/laurencechen/analytics/duck/dbt.duckdb  
  target: dev

lesson 4 - import data

dimension data

  1. dbt seed
  2. dbt seed --full-refresh

facts data

  1. source
  2. useful SQL command: CREATE TABLE, DROP TABLE, CREATE SCHEMA

TODO

補充其它有效的 load data into duckdb 方式

lesson 5 - basic SQL & modeling through dbt (source & ref)

  1. The concept of primary key
  2. 3 types of table relationships: 1-to-1, 1-to-m, m-to-m
  3. 寫 dbt 的 model: 使用 ref, source
  4. 利用 dbt docs 來看 DAG
  5. 如何利用 SQL views 來做建模

lesson 6 - metabase

  1. 與 dbt 整合
  2. Basic skill: dimension and measure
  3. Metabase specific high-level semantic: segments and metrics
  4. Core func 1: Visualization
  5. Core func 2: Dashboard
  6. Optional: Automation

command to execute metabase

java -jar metabase.jar

lesson 7 - basic SQL & reporting

  1. 4 種 SQL 的基礎語法: select *, select columns, where, join explaining join
  2. group by, having, order by, aggregation functions
  3. 1Keydata SQL tutorial
  4. distinct 的常見 4 種用法: distinct, distinct on, is distinct from, distinct in aggregation fun
  5. Query Process Steps
    1. Getting Data (From, Join)
    2. Row Filter (Where)
    3. Grouping (Group by)
    4. Aggregate function
    5. Group Filter (Having)
    6. Window Function
    7. SELECT
    8. Distinct
    9. Union
    10. Order by
    11. Offset
    12. Limit/Fetch/Top

lesson 8 - intermediate SQL

  1. NULL, IS NULL, IS NOT NULL, COALESCE
  2. UNION, UNION ALL
  3. GROUPING SET, ROLLUP, CUBE
  4. There are 3 places to put filtering expressions:
    • case/end inside aggregation function
    • where
    • having
  5. Pivot table created by SQL: using case end inside aggregation function with group by

lesson 9 - advanced SQL

  1. Conceptual, Logical, And Physical Data Models
  2. lateral join
  3. window function: running sum, delta, rank
    • over
      • running => with order by, the default window frame is range between unbounded preceding and current row
      • moving => without order by, the default window frame is rows between unbounded preceding and unbounded following
    • aggregate functions, ranking functions, analytic functions
  4. date spine & generate_series
  5. ARRAY_AGG with GROUP BY
CREATE TABLE bar AS SELECT * FROM ( VALUES                                      
  (1, 2, 3),                                                                    
  (1, 2, 4),                                                                    
  (1, 2, 5),                                                                    
  (2, 2, 3),                                                                    
  (2, 2, 4),                                                                    
  (2, 3, 5)                                                                     
) AS t(x,y,z);                                                                  
                                                                                
select x, (array_agg(y))[3]  from bar group by x;                               
=>                                                                              
  x │ array_agg                                                                
──┼───────────                                                    
  2 │   {2,2,2} ->     2                                                       
  1 │   {2,2,3} ->     3     

lesson 10 - test & snapshot

  1. dbt test & dbt test --select
  2. 四種 tests: unique, not_null, accepted_values, relationships
  3. dbt snapshot

lesson 11 - jinja & macro

  1. jinja
    • if
    • set
    • for
  2. install plugin -> dbt deps (dbt-labs/dbt_utils, dbt-labs/codegen)
  3. write your own macro
  4. manage database UDF by dbt

lesson 12 - hook & operation & dbt project checklist

  1. dbt hooks
  2. dbt on-run-start, on-run-end
  3. dbt run-operation
  4. dbt project checkList

lesson 13 - Data Review

lesson 14 - EL & CDC

  1. Singer
  2. Meltano
  3. Real Time Analytics & stream processing
    • CDC: change data capture

Appendix - Analytic Engineering toolkit

  1. SQL
  2. Jinja/Macro
  3. dbt test
  4. Metabase - visualization & dashboard
@humorless
Copy link
Author

SQL windows function semantic summary

截圖 2023-08-14 上午12 24 56

@humorless
Copy link
Author

humorless commented Aug 17, 2023

SQL lateral join

What is it?

A LATERAL join is like a SQL foreach loop, in which the SQL server will iterate over each row in a result set and evaluate a subquery (correlated subquery) using that row as a parameter.

Syntax & Semantic

LATERAL [subquery | derived table]

Typical use cases

  1. Top N per group
  2. Define a local variable for select section to use
  3. Explode JSON

Example 1: Referencing variable

  • input
SELECT
  *
FROM
  generate_series(1, 4) AS t1
  CROSS JOIN LATERAL generate_series(1, t1.t1) AS t2;
  • output
 t1 │ t2 
────┼────
  1 │  1
  2 │  1
  2 │  2
  3 │  1
  3 │  2
  3 │  3
  4 │  1
  4 │  2
  4 │  3
  4 │  4
(10 rows)

Example 2: explode JSON

CREATE TABLE mytab (
   id bigint PRIMARY KEY,
   data jsonb
);
 
INSERT INTO mytab VALUES
   (1, '{ "key": ["five", "six"] }'),
   (2, '{ "key": ["pick", "up", "sticks"] }');
SELECT mytab.id, j.elem
FROM mytab
   CROSS JOIN LATERAL
      jsonb_array_elements_text(
         mytab.data -> 'key'
      ) AS j(elem);
 
 id |  elem  
----+--------
  1 | five
  1 | six
  2 | pick
  2 | up
  2 | sticks
(5 rows)

@humorless
Copy link
Author

humorless commented Aug 17, 2023

Using Postgres

Install and Config

  1. Download Postgres
  2. psql 設定檔
    • .psqlrc: config file for psql
    • .pgpass : 無密碼登入
  3. My .psqlrc
\set PROMPT1 '%~%x%# '
\x auto
4. Format of `.pgpass`
`host:port:db_name:user_name:password`

--- \set ECHO_HIDDEN false
\set ON_ERROR_STOP on
\set ON_ERROR_ROLLBACK interactive
\set HISTFILE ~/.psql_history-:DBNAME

\set VERBOSITY verbose

\pset null '¤'
\pset linestyle 'unicode'

-- \pset unicode_border_linestyle single
-- \pset unicode_column_linestyle single
-- \pset unicode_header_linestyle double

set intervalstyle to 'postgres_verbose';

-- \setenv LESS '-iMFXSx4R'

Basic shell command

  1. 生成一個 postgres 資料庫 -> createdb crm_db
  2. psql 登入 database crm_db psql -d crm_db or psql -h localhost -d crm_db -U $USERNAME

Commands after connecting to Postgres server

  1. 連線後改變 postgres schema: set search_path to dbt_develop;

Load data into Postgres

  • Using script to prepare the table
#!/usr/bin/env bash                                                                                             
PGOPTIONS="--search_path=dbt_develop"                                                                           
export PGOPTIONS                                                                                                
                                                                                                                
psql -d crm_db -c "DROP TABLE IF EXISTS raw_data_table_name;"                                                            
psql -d crm_db -c "CREATE TABLE raw_data_table_name(                                                                    
    id integer,                                                                            
    name text,
    created_at timestamp with time zone,
    updated_at timestamp with time zone                                            
);"
  • Using psql command to copy
psql -d crm_db -c "\copy raw_data_table_name FROM '/abs/path/to/the/source/file' DELIMITER ',' CSV HEADER"

Miscellaneous

  1. Postgres Don't do this
  2. pgloader
  3. Move Big Query to local Postgres to save your money

@humorless
Copy link
Author

humorless commented Aug 17, 2023

Configure Neovim for SQL/yaml

安裝 neovim 的插件管理 plug

在 terminal 先執行

sh -c 'curl -fLo "${XDG_DATA_HOME:-$HOME/.local/share}"/nvim/site/autoload/plug.vim --create-dirs \
       https://raw.githubusercontent.com/junegunn/vim-plug/master/plug.vim'

neovim config file

再把下列的 config 貼到 ~/.config/nvim/init.vim

" default yaml setup
autocmd FileType yaml setlocal ts=2 sts=2 sw=2 expandtab indentkeys-=0# indentkeys-=<:> foldmethod=indent nofoldenable

" Begin the plugin section
call plug#begin()
" Specify your required plugins here

Plug 'liuchengxu/vim-better-default'
call plug#end()

function! Sqlfmt()
 "`:call Sqlfmt()` to format the current sql file
 " The sqlfmt requires the installation of sqlparse
 " `pip install sqlparse`
 !sqlformat --reindent --keywords upper --identifiers lower % -o %
 :e
endfunction

function! Yamlfmt()
 "`:call Yamlfmt()` to format the current yaml file
 " pip install yamlfix
 !yamlfix %
 :e
endfunction

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment