Diary

Simple database migrations

DB migrations without any adapter or ORM layer needed

Following a simplified version of the approach described in the series by K. Scott Allen: http://blog.codinghorror.com/get-your-database-under-version-control/

With rollbacks added in.


9 Feb 2016

How (and if) to use commit/file hashes

On the surface of it, it seems to be a great idea. Make sure your local files are in sync with the ones committed to the database. Store the file hashes in the database together with the filenames.

Certainly for the non-transactional code files, it makes sense at the very least as an optimization: no need to re-create every stored procedure every time you run (a set of) migrations.

But with the transactional schema files, does it make sense? If the hashes don't match for an already-committed transaction ... what then? They shouldn't ever change. Rails migrations don't enforce this. It's just understood. You don't do it. It's a case where storing too much data ties you up in knots of useless states that will never occur in practice.

A more significant issue with transactional files, however, is when an uncommitted transaction comes in with a timestamp earlier than the current version of the database. I think you need at least a warning to the user in this case. It's easy to imagine this happening with a bunch of devs working on the same repo in different timezones, for instance.

This suggests that the database should track transactional files as well -- at least by name, if not by hash of their contents.

So the migration process -- let's call it "sync" rather than "do" -- looks like this (a rough code sketch follows the list).

1. Gather the current state of files and database:
  1.1. Grab names and hashes of all local "transaction" files and "stored" files.
  1.2. Grab names and hashes of latest files as stored in database.
  1.3. Grab current version (timestamp) of database.

2. Compare local files against the database:
  2.1. Compare the two transaction file lists (matched by name), sorting files into
         - committed (matching) files
         - uncommitted (unmatched in db) files after the current version of database
         - uncommitted (unmatched in db) files before the current version
         - missing (unmatched local) files          
  2.2. Compare the two stored file lists (matched by name and hash), sorting into
         - identical (matching) files
         - updated (identical name, different hash) files
         - new (unmatched in db) files
         - missing (unmatched local) files
         
3. From these lists, compile 
  - current version of database
  - transaction files to execute (uncommitted after current version)
  - stored files to execute (updated and new)
  - warnings for the other lists

4. Prompt the user to proceed or cancel, reporting any warnings

5. Execute
  5.1. Each transaction file in order + update database version
  5.2. Each stored file in order (see note below)
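
For step 2.1, here is a rough sketch of the comparison in Python (the function and structure are my own, not part of any existing implementation; it assumes filenames begin with a 14-digit timestamp as in the conventions below):

def classify_transaction_files(local, in_db, current_version):
    """Step 2.1: sort local transaction files against those recorded in the database.

    local and in_db are dicts of {filename: hash}; current_version is the
    timestamp string of the database's current version.
    """
    committed = sorted(f for f in local if f in in_db)
    missing = sorted(f for f in in_db if f not in local)
    uncommitted = [f for f in local if f not in in_db]
    # Uncommitted files dated after the current version are the ones to execute;
    # uncommitted files dated at or before it should produce a warning.
    to_run = sorted(f for f in uncommitted if f[:14] > current_version)
    warn = sorted(f for f in uncommitted if f[:14] <= current_version)
    return {'committed': committed, 'uncommitted_after': to_run,
            'uncommitted_before': warn, 'missing': missing}

The stored-file comparison in step 2.2 is the same idea, except files are matched by name and hash, so an identical name with a different hash lands in the "updated" list.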

Note (on step 5.2): Alternatively, the stored files could be updated on each migration (before updating the database version). However, besides being inefficient, I think it is unclear which "version" (timestamp) the stored files belong to if there is more than one migration to execute. Running them only once at the end, on the other hand, leaves open the bad possibility that all the migrations are run and the database version updated, but the stored-file update fails. What then? Perhaps the best solution is to run the stored-file update within the final transaction-file update, i.e. before the final database-version update. In other words, consider that the current state of the stored files depends on the state of the database after the last new migration to be run.

This issue is significant for another reason: reverting versions, AKA "undo". Consider this scenario: you apply two migrations along with changes to a stored procedure. Later you want to roll back the second of the two migrations. What state should the stored procedure be left in after the rollback: the prior one, or the changed one? The answer is that there is no way of knowing for sure. And how would you even go back to the prior version? Store the text in the database too? Revert to a local git commit?

It starts to get ... ugh...

I think what I am coming to is that Mr. Allen's suggestion to manage stored procedures and other code separately from schema transactions does not really work for me. I do not really need a repository of this code separate from what the database stores for me. I do not have hundreds of stored procedures to maintain. Thus, there is no reason not to make changes to stored code together with any other schema transactions. In fact, it makes a lot of sense to do it this way, because you usually want to update the schema and code together, and also stored procedures and triggers together, etc. Databases are massive bundles of dependencies; better not to make it harder to follow those dependencies.

This also makes both sync and undo much simpler.

The main difference from the initial implementation (below), then, is that the database records the transaction files (and does not simply rely on a current version to determine sync state).

Note also that it is possible to have an immutable version table. Each record represents a transaction -- including revert transactions.

CREATE TABLE `_version_` (
  id INT(11) AUTO_INCREMENT PRIMARY KEY,
  version CHAR(14),
  description TEXT NULL,
  filename VARCHAR(255),
  filehash CHAR(40),
  revert BIT DEFAULT 0,
  timestamp DATETIME
)
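
To illustrate the append-only idea: applying a migration and reverting one both just insert a row, differing only in the revert flag. A sketch in Python, assuming a MySQL DB-API driver with %s placeholders (the function name is my own):

import hashlib
from datetime import datetime, timezone

def record_version(conn, filename, contents, revert=False, description=None):
    # The version is the leading 14-digit timestamp of the filename,
    # e.g. '20160203042354-do-add-table-foo.sql'.
    version = filename[:14]
    filehash = hashlib.sha1(contents.encode('utf-8')).hexdigest()  # fits CHAR(40)
    conn.cursor().execute(
        "INSERT INTO _version_"
        " (version, description, filename, filehash, revert, timestamp)"
        " VALUES (%s, %s, %s, %s, %s, %s)",
        (version, description, filename, filehash, int(revert),
         datetime.now(timezone.utc)))
    conn.commit()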

Getting the last committed version is slightly more complex but still not too bad. "Select the highest version from those records where the last record per version is not reverted".

SELECT MAX(version) FROM (
  SELECT version, revert
    FROM _version_ AS a INNER JOIN (
      SELECT MAX(id) AS id FROM _version_ GROUP BY version
    ) AS b ON a.id = b.id
) AS last
  WHERE last.revert = 0 ;
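
If the tool were written in Python, reading that back is just a matter of running the query above; a small sketch:

def last_committed_version(conn):
    # Returns the highest non-reverted version, or None at baseline.
    cur = conn.cursor()
    cur.execute(
        "SELECT MAX(version) FROM ("
        "  SELECT version, revert FROM _version_ AS a INNER JOIN ("
        "    SELECT MAX(id) AS id FROM _version_ GROUP BY version"
        "  ) AS b ON a.id = b.id"
        ") AS last WHERE last.revert = 0")
    row = cur.fetchone()
    return row[0] if row else None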

5 Feb 2016

Following a simplified version of the approach described in the series by K. Scott Allen, linked here: http://blog.codinghorror.com/get-your-database-under-version-control/

With rollbacks added in.

Conventions

  1. Schema changes

     - 20160202063423-do-baseline.sql
     - 20160203042354-do-add-table-foo-or-some-description-of-change.sql
     - 20160203042354-undo.sql
    
  • Sequenced by timestamp
  • Timestamp is the version key in the dbase
  • <timestamp>-{do-}<description> is a schema change routine
  • <timestamp>-undo{-<description>} is a rollback routine (optional), timestamp matching the original (a parsing sketch follows this list).
  2. Code (Views/Stored Procedures/Triggers)

     - 001-some-trigger.sql
     - 002-another-trigger.sql
     - 003-some-view.sql
    

    etc

  • All code in one folder
  • Sequenced by number - whatever order necessary to deal with dependencies
  • Code should be defensive, i.e. include DROP ... IF EXISTS statements.
  • Git commit hash is the version key in the database
  3. Version table

     CREATE TABLE `_db_version_` (
       id INT(11) AUTO_INCREMENT PRIMARY KEY,
       version CHAR(14),
       description TEXT NULL,
       commit CHAR(40)
     )
    
  • version corresponds to the timestamp on the migration
  • commit corresponds to the git commit of the code
  • last record by id is the current version
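
As a sketch of how the schema-change filename convention (item 1 above) might be parsed, in Python (the regex and function are my own invention):

import re

def parse_migration_filename(filename):
    # Returns (version, action, description) per the conventions above, or None.
    # Handles <timestamp>-do-<desc>.sql, <timestamp>-<desc>.sql,
    # and <timestamp>-undo[-<desc>].sql.
    m = re.match(r'^(\d{14})-(.+)\.sql$', filename)
    if m is None:
        return None
    version, rest = m.group(1), m.group(2)
    if rest == 'undo' or rest.startswith('undo-'):
        return version, 'undo', rest[5:] or None
    if rest.startswith('do-'):
        rest = rest[3:]
    return version, 'do', rest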

API

migrate init

Create versions table if not exists

migrate check

Check current schema and code versions and compare to filesystem + git state

migrate do

Check for new schema and code and then apply each one in sequence (schema first, then code), updating the version table.

Note that if there are uncommitted changes, it should abort.
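
A rough sketch of how migrate do might go, in Python; everything here (directory names, the run_sql helper, use of a DB-API connection with %s placeholders) is an assumption, not a finished design:

import os
import subprocess

def migrate_do(conn, run_sql, schema_dir='schema', code_dir='code'):
    # Abort if the git working tree has uncommitted changes.
    status = subprocess.run(['git', 'status', '--porcelain'],
                            capture_output=True, text=True, check=True)
    if status.stdout.strip():
        raise SystemExit('uncommitted changes; aborting')

    cur = conn.cursor()
    cur.execute("SELECT version FROM _db_version_ ORDER BY id DESC LIMIT 1")
    row = cur.fetchone()
    current = row[0] if row else ''

    # Pending schema migrations: "do" files with a timestamp after the current version.
    pending = sorted(f for f in os.listdir(schema_dir)
                     if f.endswith('.sql') and '-undo' not in f and f[:14] > current)

    head = subprocess.run(['git', 'rev-parse', 'HEAD'],
                          capture_output=True, text=True, check=True).stdout.strip()

    for name in pending:
        with open(os.path.join(schema_dir, name)) as fh:
            run_sql(fh.read())
        cur.execute("INSERT INTO _db_version_ (version, description, `commit`)"
                    " VALUES (%s, %s, %s)", (name[:14], name[15:-4], head))
        conn.commit()

    # Re-apply all code files in numbered order; they are written defensively,
    # so re-running them is safe.
    for name in sorted(os.listdir(code_dir)):
        if name.endswith('.sql'):
            with open(os.path.join(code_dir, name)) as fh:
                run_sql(fh.read())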

migrate undo

Check the current schema version and run the corresponding "undo" migration file, if it exists. Find the last commit (the max-1 version record), check it out from version control, apply the code migrations, and go back to the current branch. Pop (delete?) the latest version (the last record) in the versions table.

I considered having options for doing "partial" migrations, e.g. only do schema but not code, or interactively pick which schema migrations to apply, but that adds an order of magnitude more complexity to versioning and much else.

Config

Initial implementation has no config, but down the road I think we'd want a single .json config file that basically specifies the conventions listed above.

One consideration is how to specify the database client and connection details. Every client deals with this differently. My thought is to just specify the command in the .json config file in a standardized way that externalizes all configuration. For example, have it be a simple command that accepts the SQL on stdin. So typically you'll want to write a wrapper script that pulls in your connection details from secrets files and feeds stdin into the db client program.
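
As a sketch of that idea (every field name here is hypothetical), the config could name the command, and the tool would pipe each migration file to it on stdin:

import json
import subprocess

# Example migrate.json (hypothetical field names):
# {
#   "schema_dir": "schema",
#   "code_dir": "code",
#   "db_command": "./bin/run-sql"
# }

def run_sql(sql_text, config_path='migrate.json'):
    # Feed the SQL on stdin to whatever command the config names.
    with open(config_path) as fh:
        config = json.load(fh)
    subprocess.run(config['db_command'], input=sql_text,
                   text=True, shell=True, check=True)

The wrapper script named by db_command is then the one place that pulls connection details out of secrets files and invokes the actual client (mysql, psql, etc.).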
