Skip to content

Instantly share code, notes, and snippets.

@logston
Last active March 26, 2024 21:01
Show Gist options
  • Save logston/ff26d122688478b15695d1ac103c2bde to your computer and use it in GitHub Desktop.
Save logston/ff26d122688478b15695d1ac103c2bde to your computer and use it in GitHub Desktop.

Warm Ups

How can we look around the data?

  • \l
  • \dt
  • \d

What does the users table look like?

SELECT * FROM users LIMIT 1;   
                   id                  |   city    |      name      |            address             | credit_card
---------------------------------------+-----------+----------------+--------------------------------+--------------
  ae147ae1-47ae-4800-8000-000000000022 | amsterdam | James Martinez | 53159 Samantha Streets Apt. 69 |  9215735868

And rides?

SELECT * FROM rides LIMIT 1;                                                                                                 
                   id                  |   city    | vehicle_city |               rider_id               |              vehicle_id              |         start_address          |        end_address        |     start_time      |      end_time       | revenue
---------------------------------------+-----------+--------------+--------------------------------------+--------------------------------------+--------------------------------+---------------------------+---------------------+---------------------+----------
  ab020c49-ba5e-4800-8000-00000000014e | amsterdam | amsterdam    | bd70a3d7-0a3d-4000-8000-000000000025 | aaaaaaaa-aaaa-4800-8000-00000000000a | 77033 Melissa Streets Suite 78 | 54177 Larson Fall Apt. 12 | 2018-12-06 03:04:05 | 2018-12-08 02:04:05 |   33.00

How many rides in week to date?

SELECT count(*)
FROM rides
WHERE start_time >= now() - INTERVAL '1 week';

Full speed

Aggregation

What cities have the most users?

SELECT city, count(*)
FROM users
GROUP BY city
ORDER BY count(*) DESC;

What cities have a threshold of users?

SELECT city, count(*)
FROM users
GROUP BY city
HAVING count(*) > 6
ORDER BY count(*) DESC;

Sub-queries

What users have used the 0_i_information_perform promo code?

SELECT name
FROM users
WHERE id IN (
    SELECT user_id
    FROM user_promo_codes
    WHERE code = '0_i_information_perform'
);

Joins

Can I get the same information but with joins?

SELECT u.name
FROM users u
LEFT JOIN user_promo_codes c ON u.id = c.user_id
WHERE c.code = '0_i_information_perform';

How efficient is this?

https://www.cockroachlabs.com/docs/stable/explain-analyze

EXPLAIN SELECT u.name
FROM users u
LEFT JOIN user_promo_codes c ON u.id = c.user_id
WHERE c.code = '0_i_information_perform';
EXPLAIN ANALYZE SELECT u.name
FROM users u
LEFT JOIN user_promo_codes c ON u.id = c.user_id
WHERE c.code = '0_i_information_perform';
EXPLAIN ANALYZE (VERBOSE, TYPES, DISTSQL) SELECT u.name
FROM users u
LEFT JOIN user_promo_codes c ON u.id = c.user_id
WHERE c.code = '0_i_information_perform';

Compared with sub-query?

Not too different! Let's not be too afraid of sub-queries unless we see real performance hit or hard to reason about.

Hard Question

Get a list of names of people who:

  • Take rides in a city with both bikes and scooters.
  • Have only ever ridden on one mode of transit.

Could be used to offer them a discount to try the other type of vehicle.

What are the cities?

SELECT DISTINCT city
FROM vehicles
WHERE type IN ('bike', 'scooter')
GROUP BY city, type
-- Needs to have both types of vehicle.
HAVING count(type) = 2;

What people in those cities have only ridden via one vehicle type?

SELECT r.rider_id, min(v.type)
FROM rides r
LEFT JOIN vehicles v ON r.vehicle_id = v.id
WHERE r.city
    IN (
        SELECT DISTINCT city
        FROM vehicles
        WHERE type IN ('bike', 'scooter')
        GROUP BY city, type
        -- Needs to have both types of vehicle.
        HAVING count(type) = 2
    )
    AND v.type IN ('bike', 'scooter')
GROUP BY r.rider_id
-- If there is only on distinct type, we know
-- the rider has only tried one mode of transit.
HAVING count(DISTINCT v.type) = 1

Add in names...

SELECT u.name, u.city, r2.type
FROM users u
LEFT JOIN (
    SELECT r.rider_id, min(v.type) as type
    FROM rides r
    LEFT JOIN vehicles v ON r.vehicle_id = v.id
    WHERE r.city
        IN (
            SELECT DISTINCT city
            FROM vehicles
            WHERE type IN ('bike', 'scooter')
            GROUP BY city, type
            -- Needs to have both types of vehicle.
            HAVING count(type) = 2
        )
        AND v.type IN ('bike', 'scooter')
    GROUP BY r.rider_id
    -- If there is only on distinct type, we know
    -- the rider has only tried one mode of transit.
    HAVING count(DISTINCT v.type) = 1
) r2 ON r2.rider_id = u.id
WHERE r2.type IS NOT NULL

Views

CREATE VIEW bike_scooter_cities AS
SELECT DISTINCT city
FROM vehicles
WHERE type IN ('bike', 'scooter')
GROUP BY city, type
HAVING count(type) = 2
SELECT * FROM bike_scooter_cities;
\d
CREATE VIEW solo_type_riders AS
SELECT r.rider_id, min(v.type) as type
FROM rides r
LEFT JOIN vehicles v ON r.vehicle_id = v.id
WHERE r.city
    IN (
        SELECT *
        FROM bike_scooter_cities
    )
    AND v.type IN ('bike', 'scooter')
GROUP BY r.rider_id
HAVING count(DISTINCT v.type) = 1
CREATE VIEW offer_opportunities AS
SELECT u.name, u.city, r.type
FROM users u
LEFT JOIN solo_type_riders r ON r.rider_id = u.id
WHERE r.type IS NOT NULL
EXPLAIN SELECT * FROM offer_opportunities;

The plan is the same! How do we make this more efficient?

CREATE MATERIALIZED VIEW offer_opportunities_mat AS
SELECT u.name, u.city, r2.type
FROM users u
LEFT JOIN (
    SELECT r.rider_id, min(v.type) as type
    FROM rides r
    LEFT JOIN vehicles v ON r.vehicle_id = v.id
    WHERE r.city
        IN (
            SELECT DISTINCT city
            FROM vehicles
            WHERE type IN ('bike', 'scooter')
            GROUP BY city, type
            -- Needs to have both types of vehicle.
            HAVING count(type) = 2
        )
        AND v.type IN ('bike', 'scooter')
    GROUP BY r.rider_id
    -- If there is only on distinct type, we know
    -- the rider has only tried one mode of transit.
    HAVING count(DISTINCT v.type) = 1
) r2 ON r2.rider_id = u.id
WHERE r2.type IS NOT NULL
EXPLAIN SELECT * FROM offer_opportunities_mat;

So much better! Review the execution times for proof.

CTEs (Common Table Expressions)

WITH r2 AS (
    SELECT r.rider_id, min(v.type) as type
    FROM rides r
    LEFT JOIN vehicles v ON r.vehicle_id = v.id
    WHERE r.city
        IN (
            SELECT DISTINCT city
            FROM vehicles
            WHERE type IN ('bike', 'scooter')
            GROUP BY city, type
            -- Needs to have both types of vehicle.
            HAVING count(type) = 2
        )
        AND v.type IN ('bike', 'scooter')
    GROUP BY r.rider_id
    HAVING count(DISTINCT v.type) = 1
)
SELECT u.name, u.city, r2.type
FROM users u
LEFT JOIN r2 ON r2.rider_id = u.id
WHERE r2.type IS NOT NULL

CTEs are great! And let you execute a query once and then use the results all over your main query. However:

  • PostgreSQL (and potentially other DBs) will have trouble moving easy filter params into the CTE.
  • Also, CTEs are materialized in memory. A whole extra table is created on the fly in memory. For big tables, this can be very memory consuming.
  • The same indices that you have for your main tables, can not be used on the CTE tables. Thus the queries might be very slow.

See https://medium.com/@hakibenita/be-careful-with-cte-in-postgresql-fca5e24d2119 for more details.

Transactions

paul@pdl-advaned-sql-4491.g95.gcp-us-west2.cockroachlabs.cloud:26257/cluster-4491/movr> BEGIN;
BEGIN

Time: 23ms total (execution 24ms / network 23ms)

paul@pdl-advaned-sql-4491.g95.gcp-us-west2.cockroachlabs.cloud:26257/cluster-4491/movr  OPEN> DELETE FROM vehicle_location_histories;
DELETE 1009

Time: 40ms total (execution 18ms / network 23ms)

paul@pdl-advaned-sql-4491.g95.gcp-us-west2.cockroachlabs.cloud:26257/cluster-4491/movr  OPEN> SELECT * FROM vehicle_location_histories;
  city | ride_id | timestamp | lat | long
-------+---------+-----------+-----+-------
(0 rows)

Time: 29ms total (execution 4ms / network 24ms)

paul@pdl-advaned-sql-4491.g95.gcp-us-west2.cockroachlabs.cloud:26257/cluster-4491/movr  OPEN> ROLLBACK;
ROLLBACK

Time: 109ms total (execution 6ms / network 102ms)

paul@pdl-advaned-sql-4491.g95.gcp-us-west2.cockroachlabs.cloud:26257/cluster-4491/movr> SELECT count(*) FROM vehicle_location_histories;
  count
---------
   1009
(1 row)

Time: 54ms total (execution 4ms / network 50ms)

paul@pdl-advaned-sql-4491.g95.gcp-us-west2.cockroachlabs.cloud:26257/cluster-4491/movr>

If you wanted to save that deletion, you would want to run COMMIT instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment