Querying Data with SQL from PostgreSQL
  1. DDL: Data Definition Language
  2. DCL: Data Control Language
  3. DQL: Data Query Language
  4. DML: Data Manipulation Language
  5. DTL: Data Transaction Language

1. DDL: Data Definition Language

DDL statements are the first step in every database creation, as they define the objects that will contain and manage the data and server-side code.

The STATEMENTS involved in DDL are:

  • CREATE
  • ALTER
  • DROP

The OBJECTS involved in DDL are:

  • DATABASE
  • SCHEMA
  • TABLE
  • VIEW
  • INDEX
  • TYPE
  • PROCEDURE
  • FUNCTION
  • ROLE
  • TRIGGER
  • RULE
  • SEQUENCE

Each object has its own syntax for CREATE and ALTER, but DROP typically just takes the name of the object you're dropping.

-- example DDL statements
CREATE DATABASE DDL;

CREATE TABLE Orders (
  OrderID INT PRIMARY KEY,
  OrderDate DATE NOT NULL );

ALTER TABLE Orders ADD Customer VARCHAR(10);

DROP DATABASE DDL;

2. DCL: Data Control Language

Module overview

What are concurrency conflicts? They occur when two or more processes or resource consumers compete for a shared resource.

  • Example
  • ACID Properties
  • Isolation implementation Paradigms
    • Optimistic
    • Pessimistic

Concurrency Conflicts

Imagine a shared tip jar in a restaurant with two servers, Ruth and Jim. Either of them may:

  • place money in
  • use it to break large notes
  • place IOU notes in, in order to use the jar money to give customers exact change

As long as Ruth and Jim take turns using the tip jar, there will be no concurrency issues. The impact, however, is a potential bottleneck: if hundreds of servers used the tip jar for the above purposes, negotiating who gets the jar at any given moment would take a lot of time. And the moment two people use the jar for different purposes at the same time, concurrency conflicts could cause other people's attempts to fail.

A database needs to be able to handle concurrency issues to avoid such conflicts.

Database Transaction

Definition:

"A transaction symbolizes a unit of work performed within a database management system against a database, and treated in a coherent and reliable way independent of other transactions" wikipedia

ACID Properties Guarantees

RDBMS engines are required to follow a set of desired properties known as ACID.

  • Atomicity: Individual and irreducible series of database operations
  • Consistency: any transaction changes the state of the database from one valid state to another valid state
  • Isolation: determines when and how changes become visible to others
  • Durability: committed changes survive permanently

An example of atomicity is in an ATM withdrawal, which will require three steps: first checking the account balance, debiting the account, and issuing the cash. All of these must be bound as a single transaction where if any one fails, the transaction is cancelled and rolled back in its entirety.
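
As a minimal sketch of this (the Accounts and Withdrawals tables here are hypothetical, not from the notes), the withdrawal could be wrapped in a single transaction:

BEGIN TRANSACTION;

-- 1. Check the account balance (hypothetical Accounts table)
SELECT Balance FROM Accounts WHERE AccountID = 42;

-- 2. Debit the account
UPDATE Accounts SET Balance = Balance - 100
WHERE AccountID = 42 AND Balance >= 100;

-- 3. Record the cash issued (hypothetical Withdrawals table)
INSERT INTO Withdrawals (AccountID, Amount) VALUES (42, 100);

-- If any step fails, ROLLBACK undoes all of them; otherwise:
COMMIT TRANSACTION;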

A transaction has to be consistent (moving from one consistent state to another consistent state). The database engine is responsible for avoiding transitions of the database to an inconsistent state. Inconsistency may relate to primary and foreign keys, and to all other constraints (such as unique, null, and check constraints).

Isolation must ensure that the shared aspects of concurrent transactions are not in conflict. Imagine two transactions:

Transaction #1        Transaction #2
Data Read A           Data Modification B
Data Modification B   Data Modification C
Data Read C           Data Read B

Isolation defines aspects such as:

  • Whether transaction #1 can perform Data Modification B on a resource that has already been modified by transaction #2
  • Whether transaction #2's Data Read B can see the changes made by transaction #1's Data Modification B step
  • Whether transaction #1's Data Read C can see the changes made by transaction #2's Data Modification C step.

Durability guarantees that once the DBE acknowledges a transaction as committed, the transaction will survive any system failure. This is not as obvious as it may seem, since data modifications happen in volatile system RAM; if those changes are not persisted to a non-volatile medium such as disk, a crash would lose them.

Isolation Paradigms

To handle concurrency conflicts, DBEs use one of the following two paradigms:

  • Pessimistic: assumes there will be conflicts, and takes measures to prevent them in the first place
  • Optimistic: assumes that conflicts will be rare, so instead of preventing them it checks a transaction's validity before allowing it to commit.

For example, in a pessimistic paradigm the DBE would place a lock on the tip jar for the duration of the transaction. In an optimistic paradigm, we take a snapshot of the jar, check the consistency of the transaction, and only proceed if it will work according to the snapshot. The change is then applied against the tip jar, and rolled back if the jar's state no longer matches the state the snapshot was taken from, undoing the work that was attempted.

Pessimistic:

  • Uses locks
  • Introduces waits (blocks)
  • Uses a single copy of data

Optimistic:

  • Uses multiple snapshots
  • May result in aborts / retries
  • Costly in memory and disk
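
To make the pessimistic approach concrete, PostgreSQL can take explicit row locks with SELECT ... FOR UPDATE. A minimal sketch, assuming a hypothetical TipJar table; a concurrent transaction requesting the same lock will block (wait) until the first one commits:

BEGIN TRANSACTION;

-- Lock the tip jar row; other transactions asking to lock it will wait
SELECT Amount FROM TipJar WHERE JarID = 1 FOR UPDATE;

UPDATE TipJar SET Amount = Amount + 5 WHERE JarID = 1;

COMMIT TRANSACTION; -- releases the lock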

Understanding ANSI Concurrency Phenomena

Formal definitions and implementations as set by ANSI (American National Standards Institute) in the ANSI/ISO SQL-92 standard:

  1. Lost Update*
  2. Dirty Read
  3. Non-repeatable Read
  4. Phantom Rows

* not part of the ANSI/ISO SQL-92 standard

1. Lost Update

Here is a simplified theoretical lost update scenario. This fundamental conflict between two writers cannot occur in PostgreSQL or other mainstream RDBMS. Lost update is therefore not a concern, but this is the idea:

Transaction A          Transaction B
Begins                 Begins
Modifies Resource X    [...]
[...]                  Modifies Resource X
Ends                   Ends
  1. The two transactions begin executing concurrently: Transaction A starts, and then Transaction B starts.
  2. Transaction A modifies resource X.
  3. Then Transaction B also modifies resource X, overwriting the changes made to resource X by Transaction A.
  4. Transaction A commits.
  5. Transaction B commits.

Transaction A will have lost its update.

2. Dirty Read

A dirty read occurs when a transaction reads a resource that has been modified by a concurrent transaction before it commits, and the modifying transaction then decides to roll back its changes.

Transaction A                        Transaction B
Begins                               Begins
Modifies Resource X                  [...]
[...]                                Reads Resource X
[...]                                Commits
Rolls back (reverting Resource X)    [...]

Transaction A, by rolling back at the end, reverts resource X to its previous state, but only after Transaction B had already read it and potentially acted on it.

Since changes made by a transaction should not persist in the database until the transaction commits, the state of Resource X which Transaction B read and potentially acted upon never really existed in the database.

3. Non-repeatable Read

The non-repeatable read phenomenon occurs when a transaction is unable to read a consistent version of a resource multiple times due to concurrent changes made by other transactions.

Transaction A          Transaction B
Begins                 Begins
[...]                  Reads Resource X
Modifies Resource X    < does some more work ... >
Commits                Reads Resource X
[...]                  Commits

The state of resource X on Transaction B's second read is different from what it was when previously read. Hence: non-repeatable.

4. Phantom Rows

Phantom rows are a subtler variant of the non-repeatable read phenomenon.

Transaction A    Transaction B
Begins           Begins
[...]            Reads rows 1 to 10; rows 1 to 3 exist
Inserts row 4    [...]
Commits          Reads rows 1 to 10; rows 1 to 4 exist
[...]            Commits

Transaction A and B begin, and Transaction B reads a range of rows using a filter on key values 1 to 10; however, at this point in time only rows 1 to 3 actually exist in the table and are returned.

Transaction A then inserts row 4 and commits. Now, if Transaction B performs the same read as it did before, rows 1 through 4 are returned.

Note that this is not a non-repeatable read, because rows 1 through 3 were returned both times consistently; but row number 4 seems to have appeared out of nowhere. Hence: Phantom Row.

How ANSI Isolation Levels Solve Concurrency Phenomena

Now that we know the concurrency phenomena that may occur (Lost Update, Dirty Read, Non-repeatable Read, Phantom Rows), we can see what measures the ANSI SQL standard provides for dealing with them (both in general and specifically by PostgreSQL).

  1. READ UNCOMMITTED
  2. READ COMMITTED
  3. REPEATABLE READ
  4. SERIALIZABLE

These are used in transactions with the following syntax:

BEGIN TRANSACTION ISOLATION LEVEL [ READ COMMITTED | REPEATABLE READ | SERIALIZABLE ];

-- [...]

COMMIT TRANSACTION;

Note that the more restrictive the isolation level, the higher the penalty in terms of concurrency and performance. This is due to maintaining snapshots through entire transactions.

1. READ UNCOMMITTED

Most RDBMSs other than PostgreSQL support the least restrictive isolation level, known as READ UNCOMMITTED. In this isolation level, all concurrency phenomena are permitted (including dirty reads, non-repeatable reads, and phantom rows).

Although PostgreSQL does accept the syntax for setting READ UNCOMMITTED, it is treated exactly the same way as READ COMMITTED; for all practical purposes, it doesn't exist in PostgreSQL at all.

This implies that dirty reads can never occur in PostgreSQL.

2. READ COMMITTED

Read committed is the default isolation level in PostgreSQL. It prevents dirty reads, but non-repeatable reads and phantom rows can still occur.

PostgreSQL uses MVCC (Multi-Version Concurrency Control), an optimistic isolation implementation paradigm, and read committed is implemented using data snapshots.

Transaction A          Transaction B
Begins                 Begins
📷 Reads Resource X    [...]
[...]                  Modifies Resource X
📷 Reads Resource X    Commits

Transaction A takes a snapshot (📷) on each read of resource X in order to isolate it from concurrent transactions. The snapshot only lasts for the duration of the reading statement.

Shortly after or concurrently, Transaction B modifies the same resource and commits.

Transaction A reads the same resource again and takes a fresh snapshot of it. The snapshot now contains the modified state of resource X, and this constitutes a non-repeatable read, since the transaction reads a different state of resource X from one read to the next.
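
A minimal way to observe this, assuming a hypothetical Accounts table whose row 1 starts with a Balance of 100, is to interleave two psql sessions:

-- Session A:
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT Balance FROM Accounts WHERE AccountID = 1;  -- returns 100

-- Session B (concurrently, autocommitted):
--   UPDATE Accounts SET Balance = 200 WHERE AccountID = 1;

-- Session A again:
SELECT Balance FROM Accounts WHERE AccountID = 1;  -- now returns 200: a non-repeatable read
COMMIT TRANSACTION;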

3. REPEATABLE READ

Repeatable read prevents both dirty reads and non-repeatable reads, but the standard still allows phantom rows at this level. PostgreSQL goes beyond the standard here and does not allow phantom rows with repeatable read either.

Transaction A             Transaction B
Begins                    Begins
📷 Reads Resource X       [...]
[...]                     Modifies Resource X
Reads Resource X again    Commits

In this case, on the second read of X, Transaction A uses the same snapshot as when it first read the data.
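
Re-running the earlier sketch under REPEATABLE READ (same hypothetical Accounts table), Transaction A keeps seeing its original snapshot:

-- Session A:
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT Balance FROM Accounts WHERE AccountID = 1;  -- returns 100

-- Session B (concurrently, autocommitted):
--   UPDATE Accounts SET Balance = 200 WHERE AccountID = 1;

-- Session A again:
SELECT Balance FROM Accounts WHERE AccountID = 1;  -- still returns 100: same snapshot
COMMIT TRANSACTION;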

4. SERIALIZABLE

The most restrictive isolation level, effectively emulating serial transaction execution and eliminating any concurrency in favour of high isolation.

Serializable eliminates dirty reads, non-repeatable reads, and phantom rows.
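
A caveat worth knowing: when PostgreSQL cannot order concurrent SERIALIZABLE transactions serially, it aborts one of them with SQLSTATE 40001 (serialization_failure), and the application is expected to retry. A sketch:

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Transaction body: reads and writes that overlap with a concurrent transaction
COMMIT TRANSACTION;
-- may fail with:
-- ERROR:  could not serialize access due to read/write dependencies among transactions
-- (SQLSTATE 40001) -- the client should simply retry the whole transaction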

Using Transaction Control Language

  1. Starting a Transaction Block (BEGIN/START, COMMIT/ROLLBACK)
  2. SET TRANSACTION
  3. Transaction Save Points
  4. DDL Statements

1. Starting a Transaction Block

BEGIN
BEGIN WORK
BEGIN TRANSACTION
START TRANSACTION

While all are equivalent, BEGIN TRANSACTION is the standard form common to many RDBMSs.

We can also define the isolation level for the transaction as part of the begin transaction statement.

Terminating a Transaction Block

COMMIT
COMMIT WORK
COMMIT TRANSACTION
---
ROLLBACK
ROLLBACK WORK
ROLLBACK TRANSACTION

Again, [COMMIT | ROLLBACK] TRANSACTION is the most explicit form and the one common to other RDBMSs.

BEGIN TRANSACTION;
  -- Transaction body
COMMIT TRANSACTION;

You cannot nest transactions in PostgreSQL. If you try to start a transaction inside the scope of another, it will be ignored and elicit a warning message.
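
For example, attempting to nest a transaction elicits the warning and the inner BEGIN is ignored:

BEGIN TRANSACTION;
BEGIN TRANSACTION;  -- WARNING:  there is already a transaction in progress
COMMIT TRANSACTION;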

2. SET TRANSACTION

You can set the isolation level for a transaction as:

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- TRANSACTION BODY
COMMIT TRANSACTION;
---
-- is equivalent to:
---
BEGIN TRANSACTION;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- TRANSACTION BODY
COMMIT TRANSACTION;

You can also preface all transactions in a session to follow a certain isolation level by moving the SET statement outside the transactions and specifying SESSION:

SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;
-- TRANSACTION BODY
COMMIT TRANSACTION;

Be aware that the server can also be configured with a different default isolation level, which the people crafting transactions may not know about.
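
You can check the isolation level currently in effect with either of the following:

SHOW transaction_isolation;
-- or
SELECT current_setting('transaction_isolation');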

3. Transaction Save Points

You can divide a transaction into smaller chunks providing partial rollback points using the SAVEPOINT statement, followed by a unique savepoint name.

BEGIN TRANSACTION;
-- <Do Some Work>
SAVEPOINT MySavePoint1;
-- <Do Some More Work>
-- If some condition warrants it, undo the work done since the savepoint:
ROLLBACK TO SAVEPOINT MySavePoint1;

COMMIT TRANSACTION;

Notice that a condition decides whether we roll back to the SAVEPOINT. You may be better off making smaller transactions rather than using savepoints, since conditional rollbacks add complexity that is usually better avoided.
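
A runnable sketch, assuming a hypothetical Customers table with CustomerID as its primary key: a failed statement can be undone with ROLLBACK TO SAVEPOINT without losing the rest of the transaction's work:

BEGIN TRANSACTION;
INSERT INTO Customers (CustomerID) VALUES (1);
SAVEPOINT MySavePoint1;
INSERT INTO Customers (CustomerID) VALUES (1);  -- fails: duplicate key violates the primary key
ROLLBACK TO SAVEPOINT MySavePoint1;             -- undoes only the failed insert
COMMIT TRANSACTION;                             -- the first insert persists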

4. DDL Statements

DDL statements are fully transactional. Therefore, DDL statements like CREATE TABLE occurring within a transaction will be committed or rolled back much like an INSERT or DELETE statement.

We can think of DDL statements as an insert only on system tables. For example:

BEGIN TRANSACTION;
CREATE TABLE T1 (Col1 INT PRIMARY KEY);
INSERT INTO T1 VALUES (1);
ROLLBACK TRANSACTION;
SELECT * FROM T1;  -- ERROR:  relation "t1" does not exist

Since the transaction was rolled back, the effects of both the CREATE TABLE and the INSERT are negated, so the final SELECT fails.

Inserting Data with the INSERT Statement

  1. INSERT Syntax Overview
  2. Inserting Literal Values
  3. Inserting Default Value
  4. Inserting a Query Result Set
  5. Inserting Data Through Views
  6. Handling Insertion Conflicts
  7. Returning Data from INSERT

Preamble

  • KEYWORDS
  • Variables, Expressions, Aliases
  • [ Optional Clause ]
  • { Mandatory Clause }
  • Option 1 | Option 2 (inside clauses)
  • Repeat = [, ...]

1. INSERT Syntax Overview

-- It's advisable to include the column names in INSERT statements for explicitness and readability, and to avoid potential future bugs if the table structure ever changes
INSERT INTO Table [ AS Alias ] [ ( Column_Name [, ...] ) ]
[ OVERRIDING { SYSTEM | USER } VALUE ] -- used to explicitly handle system-generated values such as identity columns and sequences
{ DEFAULT VALUES | VALUES ({ Expression | DEFAULT } [, ...])
[, ...] | Query } -- default values, explicit values, or a query that supplies the data to be inserted
[ ON CONFLICT [ Conflict_target ] Conflict_Action ] -- defines which action to take in case our data conflicts with existing table constraints
[ RETURNING * | Output_Expression [ [ AS ] Output_Alias ] [, ...] ] -- makes the INSERT not only insert the data, but also return data to either the client application or a nesting statement, similar to a SELECT.
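
As a hedged sketch of several of these clauses together (the Emails table and its columns are hypothetical), an upsert that counts repeated inserts and returns the resulting row:

-- Assumes: CREATE TABLE Emails (Email TEXT PRIMARY KEY, Hits INT NOT NULL DEFAULT 1);
INSERT INTO Emails AS E (Email)
VALUES ('a@example.com')
ON CONFLICT (Email) DO UPDATE SET Hits = E.Hits + 1  -- on duplicate key, increment instead
RETURNING E.Email, E.Hits;                           -- returns the inserted or updated row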

2. Inserting Literal Values

The VALUES row constructor can be used to generate datasets for inserts. The VALUES keyword is followed by parentheses with comma-separated values for the different column value expressions.

We can also have it create additional rows by adding further sets of parentheses, each separated by a comma.

All rows must have the same number of expressions.

This is the same as using a SELECT statement with literal expressions and without a FROM clause. Both are valid forms for inserting literal values.

INSERT INTO Table [ AS Alias ] [ ( Column_Name [, ...] )]
VALUES ( Expression [, ...] ) [, ...]

INSERT INTO Table [ AS Alias ] [ ( Column_Name [, ...] )]
SELECT Expression [, ...]

Examples of Inserting Literal Values

Example of inserts using the VALUES keyword:

DROP TABLE IF EXISTS T;

CREATE TABLE T (
               Identity_Column_ALWS  INT NOT NULL  GENERATED ALWAYS AS IDENTITY,
               Identity_Column_DFLT  INT NOT NULL  GENERATED BY DEFAULT AS IDENTITY,
               Default_Column        INT NOT NULL  DEFAULT(0),
               Unique_Column         INT NULL      UNIQUE,
               Constraint_Column     INT NULL      CHECK (Constraint_Column > 0)
               );

-- Inserting Literals

INSERT INTO T (Default_Column, Unique_Column, Constraint_Column)
VALUES (1, 1, 1),
       (2, 2, 2); -- INSERT 0 2

SELECT * FROM T;

result:

 identity_column_alws | identity_column_dflt | default_column | unique_column | constraint_column 
----------------------+----------------------+----------------+---------------+-------------------
                    1 |                    1 |              1 |             1 |                 1
                    2 |                    2 |              2 |             2 |                 2
(2 rows)

Example of an insert using the SELECT keyword:

INSERT INTO  T (Default_Column, Unique_Column, Constraint_Column)
SELECT 3, 3, 3;

SELECT * FROM T;

result:

 identity_column_alws | identity_column_dflt | default_column | unique_column | constraint_column 
----------------------+----------------------+----------------+---------------+-------------------
                    1 |                    1 |              1 |             1 |                 1
                    2 |                    2 |              2 |             2 |                 2
                    3 |                    3 |              3 |             3 |                 3
(3 rows)

3. Inserting Default Value

To insert default values means that we do not explicitly state the actual value as part of the INSERT statement. Instead, we instruct the database engine to populate that value for us.

In PostgreSQL we have two types of default values:

  1. Explicitly assigned with a default constraint, which is assigned to a column; and
  2. Identity or Sequence columns, which can be used to generate sequential identifiers for rows (often abused as surrogate keys).

Both kinds can be triggered by excluding the columns completely from the insert column list:

INSERT INTO Table [ AS Alias ] [ ( Column_Name [, ...] ) ]

For identity columns, the INSERT syntax allows us to state the desired behaviour when there is a conflict, i.e. when an explicit value is provided for a column that has an identity property.

  • Stating OVERRIDING SYSTEM VALUE will give preference to the value in the statement; and
  • stating OVERRIDING USER VALUE will simply ignore it and keep using the system-generated value.

INSERT INTO Table [ AS Alias ] [ ( Column_Name [, ...] ) ]
[ OVERRIDING { SYSTEM | USER } VALUE ]

Another way to trigger default constraints is to use the DEFAULT VALUES clause, for cases where you want the whole row to consist of defaults, or to use the keyword DEFAULT as part of the VALUES clause.

INSERT INTO Table [ AS Alias ] [ ( Column_Name [, ...] ) ]
[ OVERRIDING { SYSTEM | USER } VALUE ]
DEFAULT VALUES | VALUES ( DEFAULT [, ... ] )
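
As a minimal sketch using the table T created earlier, an entire row of defaults can be inserted at once; the identity columns generate values, Default_Column gets 0, and the nullable columns get NULL:

INSERT INTO T DEFAULT VALUES;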

Examples of inserting default values

To trigger the default constraint for the Default_Column we can either exclude it from the insert column list or explicitly use the DEFAULT keyword.

-- Inserting Defaults

INSERT INTO T (Unique_Column, Constraint_Column)
VALUES (4, 4);

INSERT INTO T  (Default_Column, Unique_Column, Constraint_Column)
VALUES (DEFAULT, 5, 5);

SELECT * FROM T;

Result:

 identity_column_alws | identity_column_dflt | default_column | unique_column | constraint_column 
----------------------+----------------------+----------------+---------------+-------------------
                    1 |                    1 |              1 |             1 |                 1
                    2 |                    2 |              2 |             2 |                 2
                    3 |                    3 |              3 |             3 |                 3
                    4 |                    4 |              0 |             4 |                 4
                    5 |                    5 |              0 |             5 |                 5
(5 rows)

We can explicitly override the Identity_Column_DFLT by providing a value.

INSERT INTO T (Identity_Column_DFLT, Unique_Column, Constraint_Column)
VALUES (12, 6, 6);

SELECT * FROM T;

result:

 identity_column_alws | identity_column_dflt | default_column | unique_column | constraint_column 
----------------------+----------------------+----------------+---------------+-------------------
                    1 |                    1 |              1 |             1 |                 1
                    2 |                    2 |              2 |             2 |                 2
                    3 |                    3 |              3 |             3 |                 3
                    4 |                    4 |              0 |             4 |                 4
                    5 |                    5 |              0 |             5 |                 5
                    6 |                   12 |              0 |             6 |                 6
(6 rows)

Trying to do the same with Identity_Column_ALWS, which is defined with GENERATED ALWAYS AS IDENTITY, will elicit an error:

INSERT INTO T (Identity_Column_ALWS, Unique_Column, Constraint_Column)
VALUES (12, 7, 7);

-- ERROR:  cannot insert a non-DEFAULT value into column "identity_column_alws"
-- DETAIL:  Column "identity_column_alws" is an identity column defined as GENERATED ALWAYS.
-- HINT:  Use OVERRIDING SYSTEM VALUE to override.