Diary

Simple database migrations

DB migrations without any adapter or ORM layer needed

Following a simplified version of the approach described in the series by K. Scott Allen: http://blog.codinghorror.com/get-your-database-under-version-control/

With rollbacks added in.


9 Feb 2016

How (and if) to use commit/file hashes

On the surface of it, it seems to be a great idea. Make sure your local files are in sync with the ones committed to the database. Store the file hashes in the database together with the filenames.

Certainly for the non-transactional code files, it makes sense at the very least as an optimization: no need to re-create every stored procedure every time you run (a set of) migrations.

But with the transactional schema files, does it make sense? If the hashes don't match for an already-committed transaction ... what then? They shouldn't ever change. Rails migrations don't enforce this. It's just understood. You don't do it. It's a case where storing too much data ties you up in knots of useless states that will never occur in practice.

A more significant issue with transactional files, however, is when an uncommitted transaction comes in with a timestamp earlier than the current version of the database. I think you need at least a warning to the user in this case. It's easy to imagine this happening with a bunch of devs working on the same repo in different timezones, for instance.

This suggests that the database should track transactional files as well -- at least by name, if not by hash of their contents.

So the migration process -- let's call it "sync" rather than "do" -- looks like this (a rough code sketch follows the list).

1. Gather the current state of files and database:
  1.1. Grab names and hashes of all local "transaction" files and "stored" files.
  1.2. Grab names and hashes of latest files as stored in database.
  1.3. Grab current version (timestamp) of database.

2. Compare local files against the database:
  2.1. Compare the two transaction file lists (matched by name), sorting files into
         - committed (matching) files
         - uncommitted (unmatched in db) files after the current version of database
         - uncommitted (unmatched in db) files before the current version
         - missing (unmatched local) files          
  2.2. Compare the two stored file lists (matched by name and hash), sorting into
         - identical (matching) files
         - updated (identical name, different hash) files
         - new (unmatched in db) files
         - missing (unmatched local) files
         
3. From these lists, compile 
  - current version of database
  - transaction files to execute (uncommitted after current version)
  - stored files to execute (updated and new)
  - warnings for the other lists

4. Prompt the user to proceed or cancel, reporting any warnings

5. Execute
  5.1. Each transaction file in order + update database version
  5.2. Each stored file in order (see note below)
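
For step 2.1, here is a rough sketch of the comparison in Python (the function and structure are my own, not part of any existing implementation; it assumes filenames begin with a 14-digit timestamp as in the conventions below):

def classify_transaction_files(local, in_db, current_version):
    """Step 2.1: sort local transaction files against those recorded in the database.

    local and in_db are dicts of {filename: hash}; current_version is the
    timestamp string of the database's current version.
    """
    committed = sorted(f for f in local if f in in_db)
    missing = sorted(f for f in in_db if f not in local)
    uncommitted = [f for f in local if f not in in_db]
    # Uncommitted files dated after the current version are the ones to execute;
    # uncommitted files dated at or before it should produce a warning.
    to_run = sorted(f for f in uncommitted if f[:14] > current_version)
    warn = sorted(f for f in uncommitted if f[:14] <= current_version)
    return {'committed': committed, 'uncommitted_after': to_run,
            'uncommitted_before': warn, 'missing': missing}

The stored-file comparison in step 2.2 is the same idea, except files are matched by name and hash, so an identical name with a different hash lands in the "updated" list.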

Note (on step 5.2): Alternatively, the stored files could be updated on each migration (before updating the database version). However, besides being inefficient, I think it is unclear which "version" (timestamp) the stored files belong to if there is more than one migration to execute. Running them only once at the end, on the other hand, leaves open the bad possibility that all the migrations are run and the database version updated, but the stored-file update fails. What then? Perhaps the best solution is to run the stored-file update within the final transaction-file update, i.e. before the final database-version update. In other words, consider that the current state of the stored files depends on the state of the database after the last new migration to be run.

This issue is significant for another reason: reverting versions, AKA "undo". Consider this scenario: you apply two migrations along with changes to a stored procedure. Later you want to roll back the second of the two migrations. What state should the stored procedure be left in after the rollback: the prior one, or the changed one? The answer is that there is no way of knowing for sure. And how would you even go back to the prior version? Store the text in the database too? Revert to a local git commit?

It starts to get ... ugh...

I think what I am coming to is that Mr. Allen's suggestion to manage stored procedures and other code separately from schema transactions does not really work for me. I do not really need a repository of this code separate from what the database stores for me. I do not have hundreds of stored procedures to maintain. Thus, there is no reason not to make changes to stored code together with any other schema transactions. In fact, it makes a lot of sense to do it this way, because you usually want to update the schema and code together, and also stored procedures and triggers together, etc. Databases are massive bundles of dependencies; better not to make it harder to follow those dependencies.

This also makes both sync and undo much simpler.

The main difference from the initial implementation (below), then, is that the database records the transaction files (and does not simply rely on a current version to determine sync state).

Note also that it is possible to have an immutable version table. Each record represents a transaction -- including revert transactions.

CREATE TABLE `_version_` (
  id INT(11) AUTO_INCREMENT PRIMARY KEY,
  version CHAR(14),
  description TEXT NULL,
  filename VARCHAR(255),
  filehash CHAR(40),
  revert BIT DEFAULT 0,
  timestamp DATETIME
)
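
To illustrate the append-only idea: applying a migration and reverting one both just insert a row, differing only in the revert flag. A sketch in Python, assuming a MySQL DB-API driver with %s placeholders (the function name is my own):

import hashlib
from datetime import datetime, timezone

def record_version(conn, filename, contents, revert=False, description=None):
    # The version is the leading 14-digit timestamp of the filename,
    # e.g. '20160203042354-do-add-table-foo.sql'.
    version = filename[:14]
    filehash = hashlib.sha1(contents.encode('utf-8')).hexdigest()  # fits CHAR(40)
    conn.cursor().execute(
        "INSERT INTO _version_"
        " (version, description, filename, filehash, revert, timestamp)"
        " VALUES (%s, %s, %s, %s, %s, %s)",
        (version, description, filename, filehash, int(revert),
         datetime.now(timezone.utc)))
    conn.commit()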

Getting the last committed version is slightly more complex but still not too bad. "Select the highest version from those records where the last record per version is not reverted".

SELECT MAX(version) FROM (
  SELECT version, revert
    FROM _version_ AS a INNER JOIN (
      SELECT MAX(id) AS id FROM _version_ GROUP BY version
    ) AS b ON a.id = b.id
) AS last
  WHERE last.revert = 0 ;
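
If the tool were written in Python, reading that back is just a matter of running the query above; a small sketch:

def last_committed_version(conn):
    # Returns the highest non-reverted version, or None at baseline.
    cur = conn.cursor()
    cur.execute(
        "SELECT MAX(version) FROM ("
        "  SELECT version, revert FROM _version_ AS a INNER JOIN ("
        "    SELECT MAX(id) AS id FROM _version_ GROUP BY version"
        "  ) AS b ON a.id = b.id"
        ") AS last WHERE last.revert = 0")
    row = cur.fetchone()
    return row[0] if row else None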

5 Feb 2016

Following a simplified version of the approach described in the series by K. Scott Allen, linked here: http://blog.codinghorror.com/get-your-database-under-version-control/

With rollbacks added in.

Conventions

  1. Schema changes

     - 20160202063423-do-baseline.sql
     - 20160203042354-do-add-table-foo-or-some-description-of-change.sql
     - 20160203042354-undo.sql
    
  • Sequenced by timestamp
  • Timestamp is the version key in the dbase
  • <timestamp>-{do-}<description> is a schema change routine
  • <timestamp>-undo{-<description>} is a rollback routine (optional), timestamp matching the original (a parsing sketch follows this list).
  2. Code (Views/Stored Procedures/Triggers)

     - 001-some-trigger.sql
     - 002-another-trigger.sql
     - 003-some-view.sql
    

    etc

  • All code in one folder
  • Sequenced by number - whatever order necessary to deal with dependencies
  • Code should be defensive, i.e. include DROP ... IF EXISTS statements.
  • Git commit hash is the version key in the database
  3. Version table

     CREATE TABLE `_db_version_` (
       id INT(11) AUTO_INCREMENT PRIMARY KEY,
       version CHAR(14),
       description TEXT NULL,
       commit CHAR(40)
     )
    
  • version corresponds to the timestamp on the migration
  • commit corresponds to the git commit of the code
  • last record by id is the current version
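
As a sketch of how the schema-change filename convention (item 1 above) might be parsed, in Python (the regex and function are my own invention):

import re

def parse_migration_filename(filename):
    # Returns (version, action, description) per the conventions above, or None.
    # Handles <timestamp>-do-<desc>.sql, <timestamp>-<desc>.sql,
    # and <timestamp>-undo[-<desc>].sql.
    m = re.match(r'^(\d{14})-(.+)\.sql$', filename)
    if m is None:
        return None
    version, rest = m.group(1), m.group(2)
    if rest == 'undo' or rest.startswith('undo-'):
        return version, 'undo', rest[5:] or None
    if rest.startswith('do-'):
        rest = rest[3:]
    return version, 'do', rest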

API

migrate init

Create versions table if not exists

migrate check

Check current schema and code versions and compare to filesystem + git state

migrate do

Check for new schema and code and then apply each one in sequence (schema first, then code), updating the version table.

Note that if there are uncommitted changes, it should abort.
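
A rough sketch of how migrate do might go, in Python; everything here (directory names, the run_sql helper, use of a DB-API connection with %s placeholders) is an assumption, not a finished design:

import os
import subprocess

def migrate_do(conn, run_sql, schema_dir='schema', code_dir='code'):
    # Abort if the git working tree has uncommitted changes.
    status = subprocess.run(['git', 'status', '--porcelain'],
                            capture_output=True, text=True, check=True)
    if status.stdout.strip():
        raise SystemExit('uncommitted changes; aborting')

    cur = conn.cursor()
    cur.execute("SELECT version FROM _db_version_ ORDER BY id DESC LIMIT 1")
    row = cur.fetchone()
    current = row[0] if row else ''

    # Pending schema migrations: "do" files with a timestamp after the current version.
    pending = sorted(f for f in os.listdir(schema_dir)
                     if f.endswith('.sql') and '-undo' not in f and f[:14] > current)

    head = subprocess.run(['git', 'rev-parse', 'HEAD'],
                          capture_output=True, text=True, check=True).stdout.strip()

    for name in pending:
        with open(os.path.join(schema_dir, name)) as fh:
            run_sql(fh.read())
        cur.execute("INSERT INTO _db_version_ (version, description, `commit`)"
                    " VALUES (%s, %s, %s)", (name[:14], name[15:-4], head))
        conn.commit()

    # Re-apply all code files in numbered order; they are written defensively,
    # so re-running them is safe.
    for name in sorted(os.listdir(code_dir)):
        if name.endswith('.sql'):
            with open(os.path.join(code_dir, name)) as fh:
                run_sql(fh.read())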

migrate undo

Check the current schema version and run the corresponding "undo" migration file, if it exists. Find the last commit (the max-1 version record), check it out from version control, apply the code migrations, and go back to the current branch. Pop (delete?) the latest version (the last record) in the versions table.

I considered having options for doing "partial" migrations, e.g. only do schema but not code, or interactively pick which schema migrations to apply, but that adds an order of magnitude more complexity to versioning and much else.

Config

Initial implementation has no config, but down the road I think we'd want a single .json config file that basically specifies the conventions listed above.

One consideration is how to specify the database client and connection details. Every client deals with this differently. My thought is to just specify the command in the .json config file in a standardized way that externalizes all configuration. For example, have it be a simple command that accepts the SQL on stdin. So typically you'll want to write a wrapper script that pulls in your connection details from secrets files and feeds stdin into the db client program.
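
As a sketch of that idea (every field name here is hypothetical), the config could name the command, and the tool would pipe each migration file to it on stdin:

import json
import subprocess

# Example migrate.json (hypothetical field names):
# {
#   "schema_dir": "schema",
#   "code_dir": "code",
#   "db_command": "./bin/run-sql"
# }

def run_sql(sql_text, config_path='migrate.json'):
    # Feed the SQL on stdin to whatever command the config names.
    with open(config_path) as fh:
        config = json.load(fh)
    subprocess.run(config['db_command'], input=sql_text,
                   text=True, shell=True, check=True)

The wrapper script named by db_command is then the one place that pulls connection details out of secrets files and invokes the actual client (mysql, psql, etc.).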
