@ilhnctn
Last active September 11, 2022 14:46

SQL with PostgreSQL

Agenda

  • Local env setup
    • Bonus: CloudSQL methods
  • Configuration: most useful configs for balanced performance
  • Basics of query dynamics:
    • How a Query Works
    • JOINs
    • CTEs
    • Transactions
    • Functions
    • Window Functions
    • Aggregations
    • Data Generation
    • Constraints
  • Useful Queries For Daily Business
    • Some of the main 4-Cast queries
  • Performance Hints: How to explain the Queries
    • Indexes
    • Statistics, Vacuum, Analyze, Re-Index
  • Maintenance:
    • Backup & Restore
    • Storage: All About Sizes (see: link)
    • Upgrades
    • Migrations
  • Bonus:
    • Transactions
    • Isolation Levels
    • psql Tips
    • Columnar

Local env setup

➜ docker run --name demo_server -e POSTGRES_PASSWORD=secret -e POSTGRES_DB=demo -e POSTGRES_USER=admin -p 54455:5432 -d postgis/postgis:14-3.2-alpine

# overwrite `application_name` to enable transparent monitoring
➜ docker exec -it demo_server psql -U admin -d 'dbname=demo application_name=ilhan_cli_app'

# OR
➜ docker exec -e PGAPPNAME=ilhan_cli_app -it demo_server psql -U admin -d demo

CloudSQL

Get cloud proxy binaries from link

# Expose database to local access
PROJECT_ID=... \
REGION=... \
INSTANCE_NAME=... \
./cloud_sql_proxy -instances=${PROJECT_ID:-myproject}:${REGION:-myregion}:${INSTANCE_NAME:-myinstance}=tcp:54552
  • Use the clone functionality to get an exact copy of the current database
  • Run experiments or heavy queries (e.g. over-yearly reporting queries) against the clone db
  • Delete clones as soon as you finish your experiment

Configuration

Use SHOW all; to get all available configurations.

Important values:

  • effective_cache_size
  • max_connections
  • maintenance_work_mem
  • statement_timeout
  • log_min_duration_statement
  • max_locks_per_transaction

Recommendations

  • Use ~/.psqlrc for personalized values
  • Use SET LOCAL for transaction-scoped setting changes
  • SET ... is only effective for the current session
  • Database Connections
    • Keep max_connections low
    • Use pg_stat_* for overviews. PS: Google CloudSQL's query-insights also uses pg_stat_activity
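The scoping rules above can be sketched in a psql session (the timeout values are arbitrary, picked for illustration):

```sql
-- Inspect a single setting (SHOW all; lists everything)
SHOW statement_timeout;

-- Session scope: lasts until the connection closes
SET statement_timeout = '30s';

-- Transaction scope: reverted automatically at COMMIT or ROLLBACK
BEGIN;
SET LOCAL statement_timeout = '5min';
-- ... run a heavy report here ...
COMMIT;

SHOW statement_timeout;  -- back to the session value: 30s
```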

SQL Query Dynamics

  • Query Execution Order: top-down => FROM/JOIN -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> DISTINCT [ON] -> LIMIT
  • When more than one table reference is listed in the FROM clause, the tables are cross-joined (the Cartesian product)
  • Table Functions can be handy for query re-use
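The execution order above explains why aggregate filters belong in HAVING rather than WHERE. A small sketch (table and column names are made up):

```sql
-- Hypothetical orders table
CREATE TABLE orders(customer_id int, amount numeric);

-- WHERE runs before grouping: it filters individual rows.
-- HAVING runs after grouping: it filters whole groups.
SELECT customer_id, sum(amount) AS total
FROM orders
WHERE amount > 0              -- row filter (pre-aggregation)
GROUP BY customer_id
HAVING sum(amount) > 100      -- group filter (post-aggregation)
ORDER BY total DESC           -- ordering happens near the end, on the SELECT output
LIMIT 10;
```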

JOINs

JOIN clauses nest left-to-right by default; parentheses can be used to control the JOIN order.

  • CROSS JOIN: for every possible combination of rows from T1 and T2, produces a row consisting of all columns of T1 followed by all columns of T2. T1 (N rows), T2 (M rows) => N * M rows.
  • FROM T1 CROSS JOIN T2 is equivalent to FROM T1 INNER JOIN T2 ON TRUE. It is also equivalent to FROM T1, T2
  • The words INNER and OUTER are optional in all forms.
  • INNER is the default.
  • LEFT, RIGHT and FULL imply an OUTER JOIN.
  • T1 JOIN T2 USING (a, b) is equivalent to .. ON T1.a = T2.a AND T1.b = T2.b. This is handy when the joined column names are the same.
  • NATURAL is a shorthand form of USING: it forms a USING list consisting of all column names that appear in both input tables.
  • USING is safe for schema changes; NATURAL risks silently picking up newly added columns with matching names
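A small sketch of the NATURAL pitfall (tables invented for illustration): NATURAL joins on every shared column name, so a coincidental name match changes the result.

```sql
CREATE TABLE employees(dept_id int, name text);
CREATE TABLE departments(dept_id int, name text);

-- USING joins exactly the listed columns
SELECT * FROM employees JOIN departments USING (dept_id);

-- NATURAL joins ALL shared column names: here dept_id AND name,
-- which almost certainly returns nothing useful
SELECT * FROM employees NATURAL JOIN departments;
```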

Query Generics

  • q1 UNION [ALL] q2, q1 INTERSECT [ALL] q2 and q1 EXCEPT [ALL] q2 can be used to combine result of two queries
  • Ordering (like ORDER BY sort_expression1 [ASC | DESC] [NULLS { FIRST | LAST }]) happens at the end of data fetching and is one of the most expensive operations
  • VALUES (VALUES (1, 'one'), (2, 'two'), (3, 'three')) can be used to generate intermediate tables to ease queries
  • RETURNING * gives control over the output of data-modifying statements (INSERT, UPDATE, DELETE)
  • Use range data types when applicable (see link)
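A minimal sketch combining two of the points above (the items table is made up):

```sql
-- VALUES as a free-standing, inline table
SELECT * FROM (VALUES (1, 'one'), (2, 'two'), (3, 'three')) AS t(num, name);

-- RETURNING exposes the rows a DML statement actually touched
CREATE TABLE items(id serial PRIMARY KEY, label text);
INSERT INTO items(label) VALUES ('a'), ('b') RETURNING *;
UPDATE items SET label = upper(label) RETURNING id, label;
```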

demo=# create table api(jdoc jsonb);
CREATE TABLE
demo=# CREATE INDEX idxgin ON api USING GIN (jdoc);
CREATE INDEX

CTEs

CTEs can split complex queries into multiple small portions

  • A CTE can be declared RECURSIVE to reference its own output. TIP: despite the keyword, evaluation is iterative, not truly recursive.
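A minimal recursive CTE sketch, counting down from a starting value:

```sql
-- Each iteration produces rows from the previous iteration's rows,
-- until the recursive term returns no rows
WITH RECURSIVE countdown(n) AS (
    SELECT 5              -- non-recursive (seed) term
    UNION ALL
    SELECT n - 1          -- recursive term
    FROM countdown
    WHERE n > 1
)
SELECT n FROM countdown;  -- 5, 4, 3, 2, 1
```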

Performance: Indexes

  • B-Tree: use for operators like <, <=, =, >=, > together with BETWEEN, IN, IS NULL or IS NOT NULL
  • Hash: use for the = operator only. e.g. UUID lookups
  • GiST: a framework combining many index strategies. Use for geometric data, full-text search, ...
  • GIN: inverted index. Use for text search, jsonb containment, ranking, ...
  • BRIN: best for large tables whose values correlate with physical row order (timestamps, sequential integers, etc.)
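A sketch of declaring the index types above (table and column names are made up):

```sql
CREATE TABLE events(
    id uuid,
    created_at timestamptz,
    body text,
    tags jsonb
);

CREATE INDEX idx_btree ON events (created_at);              -- default: B-Tree
CREATE INDEX idx_hash  ON events USING HASH (id);           -- equality lookups only
CREATE INDEX idx_gin   ON events USING GIN (tags);          -- jsonb containment / text search
CREATE INDEX idx_brin  ON events USING BRIN (created_at);   -- huge, naturally ordered tables
```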

Maintenance: Backup & Restore

  • SQL Dump & Restore

    • Exporting: man pg_dump (also see pg_dumpall)
    • Importing: man pg_restore (or psql {DB_NAME} < dump.file). Don't forget to ANALYZE after imports!
  • File system level backup

  • Continuous archiving => use cloud :)

Bonus

psql

psql [-h host -p port -d dbname -U user] OR psql 'dbname=.. user=.. application_name=..' OR PGAPPNAME=.. PGPASSWORD=.. PGHOST=.. psql

  • \? -> list all available meta-commands (like help)
  • \d[+] -> show table definition
  • \di -> list indexes \di <schema>.<tablename> for table specific
  • \dp -> list access privileges on tables, views, functions etc
  • \g -> equivalent to ;
  • \sf -> show function definition script
  • \sv -> show view definition script
  • \q -> quit
  • \i <filename>.sql -> execute sql file from the session
  • \o <filename> -> send all query results to specified file
  • \copy ... -> perform sql copy for data ingestion/extraction
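A sketch of \copy for CSV in and out (file path and table name are made up); unlike server-side COPY, \copy reads and writes files on the client machine:

```sql
-- Export query results to a local CSV file
\copy (SELECT * FROM items) TO 'items.csv' WITH (FORMAT csv, HEADER)

-- Ingest the CSV back into a table
\copy items FROM 'items.csv' WITH (FORMAT csv, HEADER)
```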

Columnar

Relational databases traditionally write data to disk in a row-based approach.

Row Based: saved as tuples of rows on disk. Complex queries need a scan over whole rows: select value from data where name like 'fra%'; will first fetch all the table data into memory, due to the storage structure underneath.

| Name  | City    | ... | Some Value |
|-------|---------|-----|------------|
| Frank | Berlin  | ... | 123        |
| ...   | ...     | ... | ...        |
| Franz | Potsdam | ... | 321        |

Column Based: each column is saved to disk as a block, so compression is highly effective. Also, select value from data where name like 'fra%'; will read only the name and value columns.

| Name   | City     | ... | Some Value |
|--------|----------|-----|------------|
| Franz  | Potsdam  | ... | 125        |
| Frank  | col2_val | ... | 222        |
| Henrik | col2_val | ... | 121        |
| ...    | ...      | ... | ...        |
| Zeynep | Bitlis   | XYZ | 111        |

Example

Potential Candidates

  • ClickHouse: fully open source analytics database capable of petabyte-scale queries in seconds.
  • Citus Data: fully open source PostgreSQL extension with distributed capability.

Take-aways

  • Delegate heavy-lifting to database server
  • Minimize network latency, computing time & costs
  • Reduce complexity
  • Rethink your data design! Use jsonb, generated-columns and other features
  • Use columnar data storage for long-term queries (like over-yearly calculations)
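One of the "rethink your data design" ideas above, sketched with a made-up table: a generated column extracts a jsonb field once at write time, so it can be indexed and queried like a plain column.

```sql
CREATE TABLE shipments(
    payload jsonb,
    -- computed at INSERT/UPDATE time, always consistent with payload
    country text GENERATED ALWAYS AS (payload ->> 'country') STORED
);

CREATE INDEX idx_shipments_country ON shipments (country);

INSERT INTO shipments(payload) VALUES ('{"country": "DE", "amount": 42}');
SELECT country FROM shipments WHERE country = 'DE';
```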

ilhnctn commented Sep 11, 2022

HELPER QUERIES

Basic JOIN Examples

demo=# CREATE TABLE t1(num integer, name text);
demo=# INSERT INTO T1(num, name)VALUES(1, 'a'),(2, 'b'),(3, 'c');
demo=# CREATE TABLE t2(num integer, name text);
demo=# INSERT INTO T2(num, name) VALUES (1, 'xxx'),(3, 'yyy'),(5, 'zzz');
...
-- demo=# INSERT INTO T1(num, name) SELECT n, random()::text FROM generate_series(1, 10000) n;
-- demo=# INSERT INTO T2(num, name) SELECT n, random()::text FROM generate_series(1, 10000) n;

demo=# SELECT * FROM t1 CROSS JOIN t2;

demo=# SELECT * FROM t1 INNER JOIN t2 USING (num);

demo=# SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num;

demo=# SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num AND t2.name = 'xxx'; -- vs JOIN

Table Functions

CREATE TABLE foo (fooid int, foosubid int, fooname text);

CREATE FUNCTION getfoo(int) RETURNS SETOF foo AS $$
    SELECT * FROM foo WHERE fooid = $1;
$$ LANGUAGE SQL;

SELECT * FROM getfoo(1) AS t1;

SELECT * FROM foo
    WHERE foosubid IN (
                        SELECT foosubid
                        FROM getfoo(foo.fooid) z
                        WHERE z.fooid = foo.fooid
                      );

CREATE VIEW vw_getfoo AS SELECT * FROM getfoo(1);

SELECT * FROM vw_getfoo;

Columnar DB Example (Citus Data)

Citus is a fully open source PostgreSQL extension, maintained by Microsoft, for distributed PG.

docker run --name columnar_db -e POSTGRES_PASSWORD=secret -e POSTGRES_DB=demo -e POSTGRES_USER=admin -p 1111:5432 citusdata/citus:11.0

docker exec -it columnar_db psql -U admin -d demo
-- Table/schema definition
CREATE TABLE IF NOT EXISTS data_table_normal(
    repo_name text not null, 
    commit_datetime timestamp not null, 
    description text, 
    payload jsonb);

CREATE TABLE IF NOT EXISTS data_table_columnar(
    repo_name text not null, 
    commit_datetime timestamp not null, 
    description text, 
    payload jsonb) USING columnar;


-- Payload generation. Generates 43,214,321 rows.
INSERT INTO data_table_normal (repo_name, commit_datetime, description, payload)
SELECT 'repository/repo_' || i::text AS repo_name,
        commit_datetime,
        'rand description for ' || i::text AS description,
        jsonb_build_object( 'firstName', md5(random()::text), 'lastName', md5(random()::text))::JSONB
FROM generate_series(0, 10000) i,
     generate_series('2022-09-07'::TIMESTAMP, '2022-09-22'::TIMESTAMP, '5 minute'::INTERVAL) AS commit_datetime;

INSERT 0 43214321
Time: 147564.170 ms (02:27.564)

...

-- Columnar table

INSERT 0 43214321
Time: 199620.979 ms (03:19.621)

...
demo=# select count(*) from data_table_normal;
  count   
----------
 43214321
(1 row)

Time: 1378.463 ms (00:01.378)
demo=# SELECT pg_size_pretty( pg_total_relation_size('data_table_normal'));
 pg_size_pretty 
----------------
 7854 MB
(1 row)

Time: 0.352 ms
demo=# select count(*) from data_table_columnar;
  count   
----------
 43214321
(1 row)

Time: 2103.933 ms (00:02.104)
demo=# SELECT pg_size_pretty( pg_total_relation_size('data_table_columnar'));
 pg_size_pretty 
----------------
 1542 MB
(1 row)

Time: 0.434 ms
demo=# \d+
                                    List of relations
 Schema |        Name         | Type  | Owner | Persistence | Access method |  Size   | Description
--------+---------------------+-------+-------+-------------+---------------+---------+-------------
 public | citus_tables        | view  | admin | permanent   |               | 0 bytes |
 public | data_table_columnar | table | admin | permanent   | columnar      | 1542 MB |
 public | data_table_normal   | table | admin | permanent   | heap          | 7854 MB |
(3 rows)

demo=# explain analyze select payload from data_table_normal;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on data_table_normal  (cost=0.00..1437128.56 rows=43214356 width=102) (actual time=22.704..55457.831 rows=43214321 loops=1)
 Planning Time: 0.134 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.613 ms, Inlining 2.260 ms, Optimization 12.979 ms, Emission 7.326 ms, Total 23.178 ms
 Execution Time: 107863.805 ms
(7 rows)

Time: 107870.451 ms (01:47.870)
demo=# 

demo=# explain analyze select payload from data_table_columnar;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Custom Scan (ColumnarScan) on data_table_columnar  (cost=0.00..49155.68 rows=43214321 width=32) (actual time=19.421..61603.608 rows=43214321 loops=1)
   Columnar Projected Columns: payload
 Planning Time: 0.827 ms
 Execution Time: 113544.684 ms
(4 rows)

Time: 113545.865 ms (01:53.546)

WHERE Columnar Shines

Even though everything in both tables is almost identical (only the value contents differ; sizes & types are the same), the normal table answers in 1321 milliseconds while columnar takes only 20!
A factor of roughly 66.

demo=# explain (analyze, verbose, settings) select payload from data_table_normal where commit_datetime BETWEEN '2022-07-08' AND '2022-07-12';
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..1276074.83 rows=1 width=102) (actual time=1316.101..1320.049 rows=0 loops=1)
   Output: payload
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.data_table_normal  (cost=0.00..1275074.73 rows=1 width=102) (actual time=1278.873..1278.875 rows=0 loops=3)
         Output: payload
         Filter: ((data_table_normal.commit_datetime >= '2022-07-08 00:00:00'::timestamp without time zone) AND (data_table_normal.commit_datetime <= '2022-07-12 00:00:00'::timestamp without time zone))
         Rows Removed by Filter: 14404774
         Worker 0:  actual time=1262.761..1262.763 rows=0 loops=1
           JIT:
             Functions: 4
             Options: Inlining true, Optimization true, Expressions true, Deforming true
             Timing: Generation 0.441 ms, Inlining 35.237 ms, Optimization 17.584 ms, Emission 11.027 ms, Total 64.290 ms
         Worker 1:  actual time=1258.086..1258.088 rows=0 loops=1
           JIT:
             Functions: 4
             Options: Inlining true, Optimization true, Expressions true, Deforming true
             Timing: Generation 0.413 ms, Inlining 33.450 ms, Optimization 17.026 ms, Emission 10.646 ms, Total 61.536 ms
 Planning Time: 0.100 ms
 JIT:
   Functions: 12
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.792 ms, Inlining 71.094 ms, Optimization 52.565 ms, Emission 31.588 ms, Total 157.038 ms
 Execution Time: 1321.065 ms
(24 rows)

Time: 1321.777 ms (00:01.322)
demo=# explain (analyze, verbose, settings) select payload from data_table_columnar where commit_datetime BETWEEN '2022-07-08' AND '2022-07-12';
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Custom Scan (ColumnarScan) on public.data_table_columnar  (cost=0.00..491.56 rows=216072 width=32) (actual time=19.127..19.129 rows=0 loops=1)
   Output: payload
   Filter: ((data_table_columnar.commit_datetime >= '2022-07-08 00:00:00'::timestamp without time zone) AND (data_table_columnar.commit_datetime <= '2022-07-12 00:00:00'::timestamp without time zone))
   Columnar Projected Columns: commit_datetime, payload
   Columnar Chunk Group Filters: ((commit_datetime >= '2022-07-08 00:00:00'::timestamp without time zone) AND (commit_datetime <= '2022-07-12 00:00:00'::timestamp without time zone))
   Columnar Chunk Groups Removed by Filter: 4322
 Planning Time: 1.034 ms
 Execution Time: 19.149 ms
(8 rows)

Time: 20.810 ms
demo=# 

B-tree Indexes

  • Default index type.
  • The mechanism behind constraints and database internals (uniqueness of schema/table/function relations, etc.)

Below is a demonstration of a query with and without an index.
Look carefully at the query plan of the second (covering) index!

-- Create table with 10,001,000 rows
CREATE TABLE t_demo_2 AS
          SELECT id, val, md5(random()::text) AS other
          FROM generate_series(1, 1000) AS id,
               generate_series(1000, 11000) AS val;

-- W/O index
explain analyze select id from t_demo_2 where id BETWEEN 1987 AND 2022;

-- Create plain index
create INDEX idx ON t_demo_2 (id);
ANALYZE t_demo_2;
explain analyze select id, val from t_demo_2 where id BETWEEN 1987 AND 2022;

...

DROP INDEX idx;

-- Index including extra value
create INDEX idx ON t_demo_2 (id) INCLUDE (val);
ANALYZE t_demo_2;
explain analyze select id, val from t_demo_2 where id BETWEEN 1987 AND 2022;

BRIN Index

CREATE TABLE t_demo AS
          SELECT * FROM generate_series(1, 10000000) AS id;

-- query w/o index
explain analyze verbose  SELECT * FROM t_demo WHERE id between 32 and 120;
...
Time: 222.151 ms

-- Create index (default: b-tree)
CREATE INDEX idx_id ON t_demo (id);
ANALYZE t_demo; -- important!!

-- Run query with a predicate similar to previous one
explain analyze verbose  SELECT * FROM t_demo WHERE id between 12 and 100;
...
Time: 2.673 ms

-- Get index size
SELECT pg_size_pretty( pg_total_relation_size('idx_id'));

DROP INDEX idx_id;


CREATE INDEX idx_brn_id ON t_demo USING BRIN (id);
ANALYZE t_demo; -- important!!

-- Run query with a predicate similar to previous one
explain analyze verbose  SELECT * FROM t_demo WHERE id between 42 and 140;
...
Time: 5.512 ms

-- Get index size
SELECT pg_size_pretty( pg_total_relation_size('idx_brn_id'));


ilhnctn commented Sep 11, 2022

Also see helper-queries
