Skip to content

Instantly share code, notes, and snippets.

@simonw
Last active April 16, 2023 03:51
Show Gist options
  • Star 4 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save simonw/1aa4050f3f7d92b048ae414a40cddbd5 to your computer and use it in GitHub Desktop.
Save simonw/1aa4050f3f7d92b048ae414a40cddbd5 to your computer and use it in GitHub Desktop.

SQLite Snapshot Restoration

For background on this ChatGPT transcript, see sqlite-history: tracking changes to SQLite tables using triggers

April 7, 2023 at 2:13:48 PM

system: You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Knowledge cutoff: 2021-09 Current date: 2023-04-07

user: Suggest options for a database table design to help track changes made to a table as efficiently, with the ability to take "snapshots" of the table that can later be restored. There should be two tables: the content_table and a content_table_shadow which contains the information used to restore things

The database is SQLite.

assistant: To efficiently track changes made to a table and take "snapshots" that can later be restored, you can use a combination of triggers and a shadow table. Here's a suggested table design and SQLite setup:

  1. Create the content_table with a primary key to uniquely identify each record. Add a version column to store the version of each record.
CREATE TABLE content_table (
    id INTEGER PRIMARY KEY,
    content TEXT,
    version INTEGER DEFAULT 1
);
  1. Create the content_table_shadow table to store historical snapshots. This table should have the same columns as the content_table, with an additional column to store the timestamp of the snapshot.
CREATE TABLE content_table_shadow (
    id INTEGER,
    content TEXT,
    version INTEGER,
    snapshot_timestamp DATETIME
);
  1. Create triggers to maintain the content_table_shadow table. Whenever a row is updated or deleted in the content_table, insert the old version into the content_table_shadow table with a timestamp.
-- Trigger for UPDATE
CREATE TRIGGER content_table_update_trigger
AFTER UPDATE ON content_table
FOR EACH ROW
BEGIN
    INSERT INTO content_table_shadow (id, content, version, snapshot_timestamp)
    VALUES (old.id, old.content, old.version, datetime('now'));
END;

-- Trigger for DELETE
CREATE TRIGGER content_table_delete_trigger
AFTER DELETE ON content_table
FOR EACH ROW
BEGIN
    INSERT INTO content_table_shadow (id, content, version, snapshot_timestamp)
    VALUES (old.id, old.content, old.version, datetime('now'));
END;

With this setup, you can efficiently track changes in content_table and store snapshots in content_table_shadow. To restore a snapshot, you can query the content_table_shadow for a specific timestamp and update the content_table with the corresponding data.

To take a snapshot, just update the content in the content_table. The triggers will automatically create a snapshot in the content_table_shadow.

To restore a snapshot, use the following query:

WITH latest_snapshot AS (
    SELECT id, content, version, MAX(snapshot_timestamp) as snapshot_timestamp
    FROM content_table_shadow
    WHERE snapshot_timestamp <= 'your_desired_timestamp'
    GROUP BY id
)
UPDATE content_table
SET content = latest_snapshot.content,
    version = latest_snapshot.version
FROM latest_snapshot
WHERE content_table.id = latest_snapshot.id;

Replace 'your_desired_timestamp' with the timestamp of the snapshot you want to restore.

user: When I revert I don't want to write extra records into the content_table_shadow table

assistant: I understand your concern. In that case, you can use a temporary flag to disable triggers when you revert the content_table. Here's the updated trigger and revert process:

  1. Create a temporary table to store the flag that disables the trigger:
CREATE TEMP TABLE disable_trigger (
    is_disabled INTEGER DEFAULT 0
);
  1. Update the triggers to check the disable_trigger flag before inserting into the content_table_shadow:
-- Trigger for UPDATE
CREATE TRIGGER content_table_update_trigger
AFTER UPDATE ON content_table
FOR EACH ROW
WHEN (SELECT is_disabled FROM disable_trigger) = 0
BEGIN
    INSERT INTO content_table_shadow (id, content, version, snapshot_timestamp)
    VALUES (old.id, old.content, old.version, datetime('now'));
END;

-- Trigger for DELETE
CREATE TRIGGER content_table_delete_trigger
AFTER DELETE ON content_table
FOR EACH ROW
WHEN (SELECT is_disabled FROM disable_trigger) = 0
BEGIN
    INSERT INTO content_table_shadow (id, content, version, snapshot_timestamp)
    VALUES (old.id, old.content, old.version, datetime('now'));
END;
  1. To restore a snapshot, first disable the triggers by updating the disable_trigger flag, then update the content_table with the snapshot data, and finally re-enable the triggers:
-- Disable triggers
UPDATE disable_trigger SET is_disabled = 1;

-- Restore snapshot
WITH latest_snapshot AS (
    SELECT id, content, version, MAX(snapshot_timestamp) as snapshot_timestamp
    FROM content_table_shadow
    WHERE snapshot_timestamp <= 'your_desired_timestamp'
    GROUP BY id
)
UPDATE content_table
SET content = (SELECT content FROM latest_snapshot WHERE latest_snapshot.id = content_table.id),
    version = (SELECT version FROM latest_snapshot WHERE latest_snapshot.id = content_table.id)
WHERE id IN (SELECT id FROM latest_snapshot);

-- Enable triggers
UPDATE disable_trigger SET is_disabled = 0;

Replace 'your_desired_timestamp' with the timestamp of the snapshot you want to restore. This process will prevent extra records from being written to the content_table_shadow table during the revert process.

user: I want a new design. I don't want to use triggers all the time - I want to specifically run a SQL query to create a snapshot, where the snapshot itself gets an ID that is a foreign key to a new snapshots table (with a name and date column)

assistant: In this design, we'll use a dedicated snapshots table to store snapshot metadata and adjust the content_table_shadow to link to the snapshot via a foreign key. You'll create snapshots by manually running SQL queries.

  1. Create the content_table:
CREATE TABLE content_table (
    id INTEGER PRIMARY KEY,
    content TEXT,
    version INTEGER DEFAULT 1
);
  1. Create the snapshots table to store snapshot metadata:
CREATE TABLE snapshots (
    snapshot_id INTEGER PRIMARY KEY,
    snapshot_name TEXT,
    snapshot_date DATETIME
);
  1. Update the content_table_shadow table to include a foreign key for the snapshot_id:
CREATE TABLE content_table_shadow

**user**: I want a new design. I don't want to use triggers all the time - I want to specifically run a SQL query to create a snapshot, where the snapshot itself gets an ID that is a foreign key to a new snapshots table (with a name and date column)

No version column

**assistant**: Based on your requirements, I suggest the following table design and SQL queries:

1. Create the content_table:

```sql
CREATE TABLE content_table (
    id INTEGER PRIMARY KEY,
    content TEXT
);
  1. Create the content_table_shadow table to store historical snapshots. Add a snapshot_id column as a foreign key to the snapshots table.
CREATE TABLE content_table_shadow (
    id INTEGER,
    content TEXT,
    snapshot_id INTEGER,
    FOREIGN KEY (snapshot_id) REFERENCES snapshots (id)
);
  1. Create a snapshots table to store snapshot metadata, such as snapshot name and timestamp.
CREATE TABLE snapshots (
    id INTEGER PRIMARY KEY,
    name TEXT,
    snapshot_timestamp DATETIME
);
  1. To create a snapshot, first insert a new row into the snapshots table with the desired name and timestamp, then insert the current state of the content_table into the content_table_shadow table with the new snapshot_id.
-- Insert new snapshot metadata
INSERT INTO snapshots (name, snapshot_timestamp)
VALUES ('your_snapshot_name', datetime('now'));

-- Get the new snapshot ID
SELECT last_insert_rowid() INTO your_snapshot_id;

-- Create the snapshot in content_table_shadow
INSERT INTO content_table_shadow (id, content, snapshot_id)
SELECT id, content, your_snapshot_id
FROM content_table;

Replace 'your_snapshot_name' with the desired name for the snapshot, and store the result of the last_insert_rowid() function in a variable (e.g., your_snapshot_id) to use as the snapshot_id.

  1. To restore a snapshot, query the content_table_shadow table for the specific snapshot_id and update the content_table with the corresponding data.
-- Restore snapshot
WITH snapshot_data AS (
    SELECT id, content
    FROM content_table_shadow
    WHERE snapshot_id = your_snapshot_id
)
UPDATE content_table
SET content = (SELECT content FROM snapshot_data WHERE snapshot_data.id = content_table.id)
WHERE id IN (SELECT id FROM snapshot_data);

Replace 'your_snapshot_id' with the ID of the snapshot you want to restore. This design allows you to create and restore snapshots using SQL queries without using triggers.

user: How can I avoid storing the same exact row with many different copies?

assistant: To avoid storing multiple copies of the same row in the content_table_shadow table, you can modify the snapshot creation query to only store rows that have changed since the last snapshot. You can achieve this by comparing the content of each row in the content_table with the most recent snapshot for that row in the content_table_shadow table.

  1. Update the snapshot creation query to only insert rows that have changed:
-- Insert new snapshot metadata
INSERT INTO snapshots (name, snapshot_timestamp)
VALUES ('your_snapshot_name', datetime('now'));

-- Get the new snapshot ID
SELECT last_insert_rowid() INTO your_snapshot_id;

-- Create the snapshot in content_table_shadow with only changed rows
INSERT INTO content_table_shadow (id, content, snapshot_id)
SELECT ct.id, ct.content, your_snapshot_id
FROM content_table ct
WHERE NOT EXISTS (
    SELECT 1
    FROM content_table_shadow cts
    JOIN (
        SELECT id, MAX(snapshot_id) as last_snapshot_id
        FROM content_table_shadow
        WHERE snapshot_id < your_snapshot_id
        GROUP BY id
    ) last_snapshots
    ON cts.id = last_snapshots.id AND cts.snapshot_id = last_snapshots.last_snapshot_id
    WHERE cts.id = ct.id AND cts.content = ct.content
);

Replace 'your_snapshot_name' with the desired name for the snapshot, and store the result of the last_insert_rowid() function in a variable (e.g., your_snapshot_id) to use as the snapshot_id.

This updated query will compare the content of each row in the content_table with the most recent snapshot of that row in the content_table_shadow table. If the content is the same, the row will not be inserted into the content_table_shadow table for the new snapshot, avoiding duplicate copies.

The restore process remains the same as in the previous response.

user: How about a design that uses MD5 hashes of row content somehow?

assistant: To use MD5 hashes for tracking changes and avoid duplicate copies of the same row, you can add a content_hash column to both the content_table and content_table_shadow. Here's an updated design:

  1. Create the content_table with an additional content_hash column:
CREATE TABLE content_table (
    id INTEGER PRIMARY KEY,
    content TEXT,
    content_hash TEXT
);
  1. Create the content_table_shadow table with the content_hash column:
CREATE TABLE content_table_shadow (
    id INTEGER,
    content TEXT,
    content_hash TEXT,
    snapshot_id INTEGER,
    FOREIGN KEY (snapshot_id) REFERENCES snapshots (id)
);
  1. Create the snapshots table as before:
CREATE TABLE snapshots (
    id INTEGER PRIMARY KEY,
    name TEXT,
    snapshot_timestamp DATETIME
);
  1. When inserting or updating a row in the content_table, calculate the MD5 hash of the content:
-- INSERT
INSERT INTO content_table (id, content, content_hash)
VALUES (1, 'your_content', lower(hex(md5('your_content'))));

-- UPDATE
UPDATE content_table
SET content = 'new_content',
    content_hash = lower(hex(md5('new_content')))
WHERE id = 1;

Replace 'your_content' and 'new_content' with the actual content you want to insert or update.

  1. Update the snapshot creation query to only insert rows with changed content_hash values:
-- Insert new snapshot metadata
INSERT INTO snapshots (name, snapshot_timestamp)
VALUES ('your_snapshot_name', datetime('now'));

-- Get the new snapshot ID
SELECT last_insert_rowid() INTO your_snapshot_id;

-- Create the snapshot in content_table_shadow with only changed rows
INSERT INTO content_table_shadow (id, content, content_hash, snapshot_id)
SELECT ct.id, ct.content, ct.content_hash, your_snapshot_id
FROM content_table ct
WHERE NOT EXISTS (
    SELECT 1
    FROM content_table_shadow cts
    JOIN (
        SELECT id, MAX(snapshot_id) as last_snapshot_id
        FROM content_table_shadow
        WHERE snapshot_id < your_snapshot_id
        GROUP BY id
    ) last_snapshots
    ON cts.id = last_snapshots.id AND cts.snapshot_id = last_snapshots.last_snapshot_id
    WHERE cts.id = ct.id AND cts.content_hash = ct.content_hash
);

Replace 'your_snapshot_name' with the desired name for the snapshot, and store the result of the last_insert_rowid() function in a variable (e.g., your_snapshot_id) to use as the snapshot_id.

  1. The restore process remains the same as in the previous responses:
-- Restore snapshot
WITH snapshot_data AS (
    SELECT id, content
    FROM content_table_shadow
    WHERE snapshot_id = your_snapshot_id
)
UPDATE content_table
SET content = (SELECT content FROM snapshot_data WHERE snapshot_data.id = content_table.id),
    content_hash = (SELECT content_hash FROM snapshot_data WHERE snapshot_data.id = content_table.id)
WHERE id IN (SELECT id FROM snapshot_data);

Replace 'your_snapshot_id' with the ID of the snapshot you want to restore.

This design uses MD5 hashes to track changes in the content, and only inserts rows with different content_hash values into the content_table_shadow table during snapshot creation.

user: pros and cons of these different designs, as a table

assistant: Here's a comparison of the different designs presented earlier as a table:

Design Aspect Without Triggers With Triggers With Hashes
Complexity Low Medium Medium
Snapshot Creation Control Manual Automatic Manual
Storage Efficiency Medium Medium High
Performance Impact Low High Medium
Ease of Snapshot Restoration High High High
Granular Change Detection No Yes Yes
Requires Additional Columns in Tables No Yes (version) Yes (hash)

Pros:

  • Without Triggers
    • Simple design and easy to implement
    • Low performance impact during regular operations
    • Complete control over snapshot creation through manual SQL queries
  • With Triggers
    • Automatic snapshots whenever data is updated or deleted
    • Granular change detection with version tracking
    • No need to manually create snapshots, reducing human errors
  • With Hashes
    • Granular change detection based on content hash
    • Storage-efficient due to hashing and avoiding duplicate rows
    • Complete control over snapshot creation through manual SQL queries

Cons:

  • Without Triggers
    • No granular change detection
    • Requires manual snapshot creation, which can be prone to human errors
  • With Triggers
    • Performance impact due to triggers firing on every update and delete operation
    • More complex design
    • Requires additional columns (version) in content_table
  • With Hashes
    • Requires manual snapshot creation, which can be prone to human errors
    • Requires additional columns (content_hash) in both content_table and content_table_shadow
    • Performance impact due to hashing during snapshot creation

Each design has its own advantages and trade-offs. Choose the one that best suits your requirements, based on factors such as performance, storage efficiency, and complexity.

user: I want a design which optimizes for using as little space as possible, while still allowing me to selectively record a snapshot of a table that I can roll back to later on

assistant: To optimize for using as little space as possible, you can utilize a design that combines the hash-based approach with delta encoding. This design will only store the differences between snapshots, further reducing the storage required.

  1. Create the content_table with an additional content_hash column:
CREATE TABLE content_table (
    id INTEGER PRIMARY KEY,
    content TEXT,
    content_hash TEXT
);
  1. Create the content_table_shadow table with additional columns for content_hash and a delta column:
CREATE TABLE content_table_shadow (
    id INTEGER,
    content TEXT,
    content_hash TEXT,
    delta TEXT,
    snapshot_id INTEGER,
    FOREIGN KEY (snapshot_id) REFERENCES snapshots (id)
);
  1. Create the snapshots table as before:
CREATE TABLE snapshots (
    id INTEGER PRIMARY KEY,
    name TEXT,
    snapshot_timestamp DATETIME
);
  1. When inserting or updating a row in the content_table, calculate the MD5 hash of the content:
-- INSERT
INSERT INTO content_table (id, content, content_hash)
VALUES (1, 'your_content', lower(hex(md5('your_content'))));

-- UPDATE
UPDATE content_table
SET content = 'new_content',
    content_hash = lower(hex(md5('new_content')))
WHERE id = 1;

Replace 'your_content' and 'new_content' with the actual content you want to insert or update.

  1. Update the snapshot creation query to store delta values in the content_table_shadow for rows with changed content_hash values:
-- Insert new snapshot metadata
INSERT INTO snapshots (name, snapshot_timestamp)
VALUES ('your_snapshot_name', datetime('now'));

-- Get the new snapshot ID
SELECT last_insert_rowid() INTO your_snapshot_id;

-- Create the snapshot in content_table_shadow with only changed rows
INSERT INTO content_table_shadow (id, content, content_hash, delta, snapshot_id)
SELECT ct.id, ct.content, ct.content_hash, ct.content || '->' || cts.content, your_snapshot_id
FROM content_table ct
JOIN content_table_shadow cts ON cts.id = ct.id
WHERE NOT EXISTS (
    SELECT 1
    FROM content_table_shadow cts
    JOIN (
        SELECT id, MAX(snapshot_id) as last_snapshot_id
        FROM content_table_shadow
        WHERE snapshot_id < your_snapshot_id
        GROUP BY id
    ) last_snapshots
    ON cts.id = last_snapshots.id AND cts.snapshot_id = last_snapshots.last_snapshot_id
    WHERE cts.id = ct.id AND cts.content_hash = ct.content_hash
);

Replace 'your_snapshot_name' with the desired name for the snapshot, and store the result of the last_insert_rowid() function in a variable (e.g., your_snapshot_id) to use as the snapshot_id.

  1. To restore a snapshot, you will need to reconstruct the content from the delta values:
-- Function to reconstruct content from delta (to be implemented in your programming language)
function reconstruct_content(deltas) {
    // Implement content reconstruction from deltas
}

-- Query to fetch deltas for each row
SELECT id, GROUP_CONCAT(delta, ',') as deltas
FROM content_table_shadow
WHERE snapshot_id <= your_snapshot_id
GROUP BY id;

-- Reconstruct content for each row and update content_table
UPDATE content_table
SET content = reconstructed_content,
    content_hash = lower(hex(md5(reconstructed_content)))
WHERE id = row_id;

Replace 'your_snapshot_id' with the ID of the snapshot you want to restore. The reconstruction function should be implemented in your programming language, as SQLite does not natively support this functionality.

This design uses MD5 hashes to

user: I want a design that doesn't use hashing at all, stores snapshots only when told to and uses as little space as possible for those snapshots, taking into account the fact that many snapshots will share a lot of identical rows

assistant: To create a design that doesn't use hashing, only stores snapshots when told to, and minimizes space usage, you can utilize the concept of reference counting.

  1. Create the content_table:
CREATE TABLE content_table (
    id INTEGER PRIMARY KEY,
    content TEXT
);
  1. Create the content_table_shadow table to store historical snapshots with an additional reference_count column:
CREATE TABLE content_table_shadow (
    id INTEGER,
    content TEXT,
    snapshot_id INTEGER,
    reference_count INTEGER,
    FOREIGN KEY (snapshot_id) REFERENCES snapshots (id)
);
  1. Create the snapshots table:
CREATE TABLE snapshots (
    id INTEGER PRIMARY KEY,
    name TEXT,
    snapshot_timestamp DATETIME
);
  1. When creating a snapshot, increment the reference_count for existing rows in content_table_shadow and only insert new rows when they don't exist in any snapshot:
-- Insert new snapshot metadata
INSERT INTO snapshots (name, snapshot_timestamp)
VALUES ('your_snapshot_name', datetime('now'));

-- Get the new snapshot ID
SELECT last_insert_rowid() INTO your_snapshot_id;

-- Increment reference_count for existing rows in content_table_shadow
UPDATE content_table_shadow
SET reference_count = reference_count + 1
WHERE snapshot_id = (
    SELECT MAX(snapshot_id)
    FROM content_table_shadow
);

-- Insert new rows into content_table_shadow
INSERT INTO content_table_shadow (id, content, snapshot_id, reference_count)
SELECT ct.id, ct.content, your_snapshot_id, 1
FROM content_table ct
WHERE NOT EXISTS (
    SELECT 1
    FROM content_table_shadow cts
    WHERE cts.id = ct.id AND cts.content = ct.content
);

Replace 'your_snapshot_name' with the desired name for the snapshot, and store the result of the last_insert_rowid() function in a variable (e.g., your_snapshot_id) to use as the snapshot_id.

  1. To restore a snapshot, update the content_table with the data from the content_table_shadow:
-- Restore snapshot
UPDATE content_table
SET content = (
    SELECT content
    FROM content_table_shadow cts
    WHERE cts.id = content_table.id AND cts.snapshot_id = (
        SELECT MAX(snapshot_id)
        FROM content_table_shadow
        WHERE id = content_table.id AND snapshot_id <= your_snapshot_id
    )
);

Replace 'your_snapshot_id' with the ID of the snapshot you want to restore.

This design utilizes reference counting to minimize space usage by avoiding duplicate copies of identical rows in content_table_shadow. Snapshots are only created when you execute the SQL queries, giving you full control over when they are generated.

user: Show two alternative designs for this

assistant: Here are two alternative designs based on your requirements:

Alternative Design 1: Timestamp-based Snapshots

  1. Create the content_table:
CREATE TABLE content_table (
    id INTEGER PRIMARY KEY,
    content TEXT
);
  1. Create the content_table_shadow table to store historical snapshots with an additional timestamp column:
CREATE TABLE content_table_shadow (
    id INTEGER,
    content TEXT,
    snapshot_id INTEGER,
    snapshot_timestamp DATETIME,
    FOREIGN KEY (snapshot_id) REFERENCES snapshots (id)
);
  1. Create the snapshots table:
CREATE TABLE snapshots (
    id INTEGER PRIMARY KEY,
    name TEXT,
    snapshot_timestamp DATETIME
);
  1. When creating a snapshot, insert new rows into content_table_shadow only if the content has changed since the last snapshot:
-- Insert new snapshot metadata
INSERT INTO snapshots (name, snapshot_timestamp)
VALUES ('your_snapshot_name', datetime('now'));

-- Get the new snapshot ID
SELECT last_insert_rowid() INTO your_snapshot_id;

-- Insert new rows into content_table_shadow
INSERT INTO content_table_shadow (id, content, snapshot_id, snapshot_timestamp)
SELECT ct.id, ct.content, your_snapshot_id, datetime('now')
FROM content_table ct
WHERE NOT EXISTS (
    SELECT 1
    FROM content_table_shadow cts
    WHERE cts.id = ct.id AND cts.content = ct.content AND cts.snapshot_timestamp = (
        SELECT MAX(snapshot_timestamp)
        FROM content_table_shadow
        WHERE id = ct.id
    )
);

Replace 'your_snapshot_name' with the desired name for the snapshot, and store the result of the last_insert_rowid() function in a variable (e.g., your_snapshot_id) to use as the snapshot_id.

  1. To restore a snapshot, update the content_table with the data from the content_table_shadow:
-- Restore snapshot
UPDATE content_table
SET content = (
    SELECT content
    FROM content_table_shadow cts
    WHERE cts.id = content_table.id AND cts.snapshot_id = (
        SELECT MAX(snapshot_id)
        FROM content_table_shadow
        WHERE id = content_table.id AND snapshot_id <= your_snapshot_id
    )
);

Replace 'your_snapshot_id' with the ID of the snapshot you want to restore.

Alternative Design 2: Using a Snapshot-Content Junction Table

  1. Create the content_table:
CREATE TABLE content_table (
    id INTEGER PRIMARY KEY,
    content TEXT
);
  1. Create the content_table_shadow table to store historical data:
CREATE TABLE content_table_shadow (
    id INTEGER PRIMARY KEY,
    content TEXT
);
  1. Create the snapshots table:
CREATE TABLE snapshots (
    id INTEGER PRIMARY KEY,
    name TEXT,
    snapshot_timestamp DATETIME
);
  1. Create a snapshot_content junction table:
CREATE TABLE snapshot_content (
    snapshot_id INTEGER,
    content_shadow_id INTEGER,
    FOREIGN KEY (snapshot_id) REFERENCES snapshots (id),
    FOREIGN KEY (content_shadow_id) REFERENCES content_table_shadow (id)
);
  1. When creating a snapshot, insert new rows into content_table_shadow only if the content is not already present, and link the snapshot with the content_table_shadow using the snapshot_content junction table:
-- Insert new snapshot metadata
INSERT INTO snapshots (name, snapshot_timestamp)
VALUES ('your_snapshot_name', datetime('now'));

-- Get the new snapshot ID
SELECT last_insert_rowid() INTO your_snapshot_id;

-- Insert new rows into content_table_shadow and link with the snapshot
INSERT INTO snapshot_content (snapshot_id, content_shadow_id)
SELECT your_snapshot_id, COALESCE

**user**: Content table should have "title" and "body" - consider a design with four tables:

content - id, title, body
content_shadow - id, content_id, title, body
snapshots - id, name, created
snapshot_to_shadow - snapshot_id, shadow_id

The content_shadow table should avoid storing duplicate copies of identical rows, but should record new row copies if the row has changed

**assistant**: Here's a design based on your requirements with four tables:

1. Create the content table with columns id, title, and body:

```sql
CREATE TABLE content (
    id INTEGER PRIMARY KEY,
    title TEXT,
    body TEXT
);
  1. Create the content_shadow table with columns id, content_id, title, and body:
CREATE TABLE content_shadow (
    id INTEGER PRIMARY KEY,
    content_id INTEGER,
    title TEXT,
    body TEXT,
    FOREIGN KEY (content_id) REFERENCES content (id)
);
  1. Create the snapshots table with columns id, name, and created:
CREATE TABLE snapshots (
    id INTEGER PRIMARY KEY,
    name TEXT,
    created DATETIME
);
  1. Create the snapshot_to_shadow junction table with columns snapshot_id and shadow_id:
CREATE TABLE snapshot_to_shadow (
    snapshot_id INTEGER,
    shadow_id INTEGER,
    FOREIGN KEY (snapshot_id) REFERENCES snapshots (id),
    FOREIGN KEY (shadow_id) REFERENCES content_shadow (id)
);
  1. When creating a snapshot, insert new rows into content_shadow only if the title and body have changed since the last snapshot, and link the snapshot with content_shadow using the snapshot_to_shadow junction table:
-- Insert new snapshot metadata
INSERT INTO snapshots (name, created)
VALUES ('your_snapshot_name', datetime('now'));

-- Get the new snapshot ID
SELECT last_insert_rowid() INTO your_snapshot_id;

-- Insert new rows into content_shadow and link with the snapshot
WITH new_shadow_rows AS (
    INSERT INTO content_shadow (content_id, title, body)
    SELECT ct.id, ct.title, ct.body
    FROM content ct
    WHERE NOT EXISTS (
        SELECT 1
        FROM content_shadow cs
        WHERE cs.content_id = ct.id AND cs.title = ct.title AND cs.body = ct.body
    )
    RETURNING id, content_id
)
INSERT INTO snapshot_to_shadow (snapshot_id, shadow_id)
SELECT your_snapshot_id, nsr.id
FROM new_shadow_rows nsr;

Replace 'your_snapshot_name' with the desired name for the snapshot, and store the result of the last_insert_rowid() function in a variable (e.g., your_snapshot_id) to use as the snapshot_id.

  1. To restore a snapshot, update the content table with the data from the content_shadow table:
-- Restore snapshot
UPDATE content
SET title = (
    SELECT cs.title
    FROM content_shadow cs
    JOIN snapshot_to_shadow sts ON sts.shadow_id = cs.id
    WHERE cs.content_id = content.id AND sts.snapshot_id = (
        SELECT MAX(snapshot_id)
        FROM snapshot_to_shadow
        WHERE snapshot_id <= your_snapshot_id
    )
),
body = (
    SELECT cs.body
    FROM content_shadow cs
    JOIN snapshot_to_shadow sts ON sts.shadow_id = cs.id
    WHERE cs.content_id = content.id AND sts.snapshot_id = (
        SELECT MAX(snapshot_id)
        FROM snapshot_to_shadow
        WHERE snapshot_id <= your_snapshot_id
    )
);

Replace 'your_snapshot_id' with the ID of the snapshot you want to restore.

This design minimizes space usage by avoiding duplicate copies of identical rows in content_shadow, while still recording new copies of rows if their content has changed. The snapshot_to_shadow junction table associates snapshots with the corresponding content_shadow entries.

user: To speed this design up, add a checksum column to the content_shadow table - this will be a short hash - if the hash is a match it still needs to compare the title and body columns, but if the hash is a miss it can save time by assuming that row has definitely changed

assistant: You can add a checksum column to the content_shadow table to speed up the design. The checksum column can store a short hash of the concatenated title and body columns. This allows a faster comparison while still ensuring that the row is checked if the hash matches.

  1. Modify the content_shadow table to include a checksum column:
CREATE TABLE content_shadow (
    id INTEGER PRIMARY KEY,
    content_id INTEGER,
    title TEXT,
    body TEXT,
    checksum TEXT,
    FOREIGN KEY (content_id) REFERENCES content (id)
);
  1. Update the snapshot creation query to compute the checksum using a short hash function, such as SQLite's built-in substr(sha1()) function:
-- Insert new snapshot metadata
INSERT INTO snapshots (name, created)
VALUES ('your_snapshot_name', datetime('now'));

-- Get the new snapshot ID
SELECT last_insert_rowid() INTO your_snapshot_id;

-- Insert new rows into content_shadow and link with the snapshot
WITH new_shadow_rows AS (
    INSERT INTO content_shadow (content_id, title, body, checksum)
    SELECT ct.id, ct.title, ct.body, substr(sha1(ct.title || ct.body), 1, 8)
    FROM content ct
    WHERE NOT EXISTS (
        SELECT 1
        FROM content_shadow cs
        WHERE cs.content_id = ct.id AND cs.checksum = substr(sha1(ct.title || ct.body), 1, 8) AND cs.title = ct.title AND cs.body = ct.body
    )
    RETURNING id, content_id
)
INSERT INTO snapshot_to_shadow (snapshot_id, shadow_id)
SELECT your_snapshot_id, nsr.id
FROM new_shadow_rows nsr;
  1. The restore snapshot query remains the same:
-- Restore snapshot
UPDATE content
SET title = (
    SELECT cs.title
    FROM content_shadow cs
    JOIN snapshot_to_shadow sts ON sts.shadow_id = cs.id
    WHERE cs.content_id = content.id AND sts.snapshot_id = (
        SELECT MAX(snapshot_id)
        FROM snapshot_to_shadow
        WHERE snapshot_id <= your_snapshot_id
    )
),
body = (
    SELECT cs.body
    FROM content_shadow cs
    JOIN snapshot_to_shadow sts ON sts.shadow_id = cs.id
    WHERE cs.content_id = content.id AND sts.snapshot_id = (
        SELECT MAX(snapshot_id)
        FROM snapshot_to_shadow
        WHERE snapshot_id <= your_snapshot_id
    )
);

By adding the checksum column and using a short hash function, the design speeds up the comparison of rows. The hash function quickly identifies rows that have likely changed, allowing the query to skip further comparisons for rows with non-matching hashes. The title and body columns are still compared for rows with matching hashes to ensure data accuracy.

user: Will adding a compound index to shadow on content_id, checksum make this system faster at writing snapshots?

assistant: Yes, adding a compound index on content_id and checksum columns in the content_shadow table can help to speed up writing snapshots. The index will optimize the search for existing rows in the content_shadow table while creating a snapshot.

To create the compound index, use the following SQL statement:

CREATE INDEX idx_content_shadow_content_id_checksum ON content_shadow (content_id, checksum);

This index will improve the performance of the NOT EXISTS subquery in the snapshot creation process, as it will make it faster to find rows with matching content_id and checksum values. The database engine will be able to use the index to quickly filter out non-matching rows, reducing the number of rows that need to be compared on the title and body columns.

user: If snapshots are stored in terms of the snapshot (id, name, created) table and the snapshot_to_shadow (snapshot_id, shadow_id) tables what's a design that can save space if two snapshots have almost entirely the same rows in them?

assistant: To save space when two snapshots have almost entirely the same rows in them, you can implement a design that uses delta encoding. In this design, only the differences between consecutive snapshots will be stored instead of storing complete copies of rows for each snapshot. Here's how you can modify the existing design to implement delta encoding:

  1. Modify the content_shadow table to include a previous_shadow_id column:
CREATE TABLE content_shadow (
    id INTEGER PRIMARY KEY,
    content_id INTEGER,
    title TEXT,
    body TEXT,
    checksum TEXT,
    previous_shadow_id INTEGER NULL,
    FOREIGN KEY (content_id) REFERENCES content (id),
    FOREIGN KEY (previous_shadow_id) REFERENCES content_shadow (id)
);
  1. Update the snapshot creation query to store only the differences between consecutive snapshots:
-- Insert new snapshot metadata
INSERT INTO snapshots (name, created)
VALUES ('your_snapshot_name', datetime('now'));

-- Get the new snapshot ID
SELECT last_insert_rowid() INTO your_snapshot_id;

-- Insert new rows into content_shadow and link with the snapshot
WITH new_shadow_rows AS (
    INSERT INTO content_shadow (content_id, title, body, checksum, previous_shadow_id)
    SELECT ct.id, ct.title, ct.body, substr(sha1(ct.title || ct.body), 1, 8),
        (
            SELECT cs.id
            FROM content_shadow cs
            JOIN snapshot_to_shadow sts ON sts.shadow_id = cs.id
            WHERE cs.content_id = ct.id AND sts.snapshot_id = (
                SELECT MAX(snapshot_id)
                FROM snapshot_to_shadow
                WHERE snapshot_id <= your_snapshot_id
            )
        )
    FROM content ct
    WHERE NOT EXISTS (
        SELECT 1
        FROM content_shadow cs
        WHERE cs.content_id = ct.id AND cs.checksum = substr(sha1(ct.title || ct.body), 1, 8) AND cs.title = ct.title AND cs.body = ct.body
    )
    RETURNING id, content_id
)
INSERT INTO snapshot_to_shadow (snapshot_id, shadow_id)
SELECT your_snapshot_id, nsr.id
FROM new_shadow_rows nsr;
  1. Modify the restore snapshot query to reconstruct the content table using the delta-encoded content_shadow table:
WITH RECURSIVE
    content_restore (id, title, body, previous_shadow_id) AS (
        SELECT cs.content_id, cs.title, cs.body, cs.previous_shadow_id
        FROM content_shadow cs
        JOIN snapshot_to_shadow sts ON sts.shadow_id = cs.id
        WHERE sts.snapshot_id = your_snapshot_id
    UNION ALL
        SELECT cr.id, COALESCE(cs.title, cr.title), COALESCE(cs.body, cr.body), cs.previous_shadow_id
        FROM content_restore cr
        JOIN content_shadow cs ON cs.id = cr.previous_shadow_id
    )
UPDATE content
SET title = (SELECT title FROM content_restore WHERE id = content.id AND previous_shadow_id IS NULL),
    body = (SELECT body FROM content_restore WHERE id = content.id AND previous_shadow_id IS NULL);

This design saves space by storing only the differences between consecutive snapshots. The content_shadow table now includes a previous_shadow_id column that references the previous version of a row. The snapshot creation query now inserts only the changed rows into content_shadow and links them to the previous version of the row. The restore snapshot query recursively reconstructs the content table using the delta-encoded content_shadow table.

@simonw
Copy link
Author

simonw commented Apr 13, 2023

I used this initial discussion to help me come up with a design for my https://github.com/simonw/sqlite-history library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment