@wbbradley
Last active December 16, 2015 15:28
Performing an extractClass data migration in Django with South

Refactoring and Data Migration in Django

Often we need to change the way a piece of data is modelled on our backend. Recently I was faced with refactoring a large model into component pieces and needed a cheat sheet for the best practices to use when migrating live data on our production server to a new schema.

This is an explanation of how to perform an extractClass refactoring from within the context of a Django models.Model. For this tutorial, I'll assume you are familiar with schemamigration using South.

The basic purpose of this type of refactoring is usually one of the following:

  • to DRY up your data model, such that a similar schema can be reused in multiple places
  • to implement a One to Many relationship on data that is currently One to One, and tightly coupled to an existing Model

Let's say we have a model:

from django.db import models

class FootballMatch(models.Model):
    home_team_name = models.CharField(max_length=128)
    home_team_coach = models.CharField(max_length=128)
    home_team_city = models.CharField(max_length=128)
    away_team_name = models.CharField(max_length=128)
    away_team_coach = models.CharField(max_length=128)
    away_team_city = models.CharField(max_length=128)

But we decide to change the schema for our team, and add a team mascot. Thus we want to convert our code to:

class FootballMatch(models.Model):
    home_team_name = models.CharField(max_length=128)
    home_team_coach = models.CharField(max_length=128)
    home_team_city = models.CharField(max_length=128)
    home_team_mascot = models.CharField(max_length=128)
    away_team_name = models.CharField(max_length=128)
    away_team_coach = models.CharField(max_length=128)
    away_team_city = models.CharField(max_length=128)
    away_team_mascot = models.CharField(max_length=128)

But now we're starting to notice a lot of redundancy, so we decide to perform an extractClass refactoring. We hope to end up with this:

class FootballTeam(models.Model):
    name = models.CharField(unique=True, max_length=128)
    coach = models.CharField(max_length=128)
    city = models.CharField(max_length=128)
    mascot = models.CharField(max_length=128) 
                   
class FootballMatch(models.Model):
    home_team = models.ForeignKey(FootballTeam, related_name='home_matches')
    away_team = models.ForeignKey(FootballTeam, related_name='away_matches')

We punch this into our models.py and, because we're familiar with South, we run python manage.py schemamigration football --auto. Alas, we are greeted by the following message:

(env)[~/src/football]$ python manage.py schemamigration football --auto
 + Added model football.FootballTeam
 ? The field 'FootballMatch.home_team_city' does not have a default specified, yet is NOT NULL.
 ? Since you are removing this field, you MUST specify a default
 ? value to use for existing rows. Would you like to:
 ?  1. Quit now, and add a default to the field in models.py
 ?  2. Specify a one-off value to use for existing columns now
 ?  3. Disable the backwards migration by raising an exception.
 ? Please select a choice: 

None of these choices is adequate for our purpose. What we really want is a data migration, because we don't want to lose our data while migrating the schema. The schema migration does not understand that we'd like to move our team information from the FootballMatch model into the FootballTeam model, so we need to teach it.

Step-by-step Migration Strategy

The procedure I'm about to describe follows a pattern you can apply to other data migrations. It is a three-step process.

  1. Create an expanded schema which encompasses the new model
  2. Migrate the existing data to the new model
  3. Eliminate the old data model elements
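
The three steps above can be sketched outside of Django with nothing but the standard library's sqlite3 module. This is only an illustration of the pattern, not what South generates: the table and column names are made up for the demo, the city column is left out for brevity, and step 3 rebuilds the table rather than dropping columns in place.

```python
import sqlite3


def run_three_step_demo():
    """Expand the schema, move the data, then drop the old columns."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Starting point: one wide table with the team data inlined.
    cur.execute("""
        CREATE TABLE football_match (
            id INTEGER PRIMARY KEY,
            home_team_name TEXT NOT NULL,
            home_team_coach TEXT NOT NULL,
            away_team_name TEXT NOT NULL,
            away_team_coach TEXT NOT NULL
        )
    """)
    cur.executemany(
        "INSERT INTO football_match "
        "(home_team_name, home_team_coach, away_team_name, away_team_coach) "
        "VALUES (?, ?, ?, ?)",
        [("Lions", "Ann", "Bears", "Bob"), ("Bears", "Bob", "Lions", "Ann")],
    )

    # Step 1: expanded schema. New table plus nullable foreign keys.
    cur.execute("""
        CREATE TABLE football_team (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE,
            coach TEXT NOT NULL
        )
    """)
    for side in ("home", "away"):
        cur.execute(
            "ALTER TABLE football_match ADD COLUMN {0}_team_id "
            "INTEGER REFERENCES football_team(id)".format(side)
        )

    # Step 2: data migration. `side` only ever comes from the fixed tuple,
    # so formatting it into the SQL text is safe here.
    for side in ("home", "away"):
        cur.execute(
            "INSERT OR IGNORE INTO football_team (name, coach) "
            "SELECT {0}_team_name, {0}_team_coach FROM football_match".format(side)
        )
        cur.execute(
            "UPDATE football_match SET {0}_team_id = "
            "(SELECT id FROM football_team WHERE name = {0}_team_name)".format(side)
        )

    # Step 3: eliminate the old columns by rebuilding the table.
    cur.execute("""
        CREATE TABLE football_match_new (
            id INTEGER PRIMARY KEY,
            home_team_id INTEGER NOT NULL REFERENCES football_team(id),
            away_team_id INTEGER NOT NULL REFERENCES football_team(id)
        )
    """)
    cur.execute(
        "INSERT INTO football_match_new "
        "SELECT id, home_team_id, away_team_id FROM football_match"
    )
    cur.execute("DROP TABLE football_match")
    cur.execute("ALTER TABLE football_match_new RENAME TO football_match")
    conn.commit()

    teams = cur.execute(
        "SELECT name, coach FROM football_team ORDER BY name"
    ).fetchall()
    matches = cur.execute(
        "SELECT h.name, a.name FROM football_match m "
        "JOIN football_team h ON h.id = m.home_team_id "
        "JOIN football_team a ON a.id = m.away_team_id ORDER BY m.id"
    ).fetchall()
    conn.close()
    return teams, matches


teams, matches = run_three_step_demo()
```

The point of the intermediate state is that both the old columns and the new foreign keys exist at the same time, so the data can be copied across before anything is destroyed.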

Let's back up to our original model, and create our expanded schema.

class FootballTeam(models.Model):
    name = models.CharField(unique=True, max_length=128)
    coach = models.CharField(max_length=128)
    city = models.CharField(max_length=128)
    mascot = models.CharField(max_length=128)
        
class FootballMatch(models.Model):
    home_team_name = models.CharField(max_length=128)
    home_team_coach = models.CharField(max_length=128)
    home_team_city = models.CharField(max_length=128)
    away_team_name = models.CharField(max_length=128)
    away_team_coach = models.CharField(max_length=128)
    away_team_city = models.CharField(max_length=128)
    home_team = models.ForeignKey(FootballTeam, related_name='home_matches', blank=True, null=True)
    away_team = models.ForeignKey(FootballTeam, related_name='away_matches', blank=True, null=True)

And run our first schema migration.

(env)[~/src/football]$ python manage.py schemamigration football --auto
...
(env)[~/src/football]$ python manage.py migrate
...

If you look closely at the new intermediate schema, you'll notice that I made name unique in the new FootballTeam model. This was a choice based on the semantic meaning of this refactoring: I don't want duplicate teams in my FootballTeam table. For this demonstration, I'm assuming that every match involving a given team carries identical home_team_* (or away_team_*) values for that team.

Let's write a data migration that will migrate the home_team_* and away_team_* data into associated FootballTeams.

(env)[~/src/football]$ python manage.py datamigration football make_teams
Created 0003_make_teams.py.

Now, in order to construct our forward migration, let's alter this new migration file.

Replace the forwards and backwards routines in your data-migration script (mine is called 0003_make_teams.py) as follows:

class Migration(DataMigration):

    def forwards(self, orm):
        # Copy each match's inlined team columns into FootballTeam rows,
        # then point the match at those rows.
        for match in orm.FootballMatch.objects.all():
            # get_or_create keeps a single row per team name.
            team, created = orm.FootballTeam.objects.get_or_create(name=match.home_team_name)
            if created:
                print("Added team '{}'".format(team.name))
            team.coach = match.home_team_coach
            team.city = match.home_team_city
            team.save()
            match.home_team = team

            team, created = orm.FootballTeam.objects.get_or_create(name=match.away_team_name)
            if created:
                print("Added team '{}'".format(team.name))
            team.coach = match.away_team_coach
            team.city = match.away_team_city
            team.save()
            match.away_team = team

            match.save()

    def backwards(self, orm):                  
        for match in orm.FootballMatch.objects.all():
            home_team = match.home_team
            if home_team:
                match.home_team_name = home_team.name
                match.home_team_coach = home_team.coach
                match.home_team_city = home_team.city
                 
            away_team = match.away_team 
            if away_team:
                match.away_team_name = away_team.name 
                match.away_team_coach = away_team.coach
                match.away_team_city = away_team.city

            match.home_team = None
            match.away_team = None 
            match.save()

    models = {
        ...
    }
    
    ...

It's worth noting here that if the incoming data is inconsistent (i.e., different matches list a different coach or city for the same team), you'll get somewhat arbitrary behavior: each match overwrites the team's coach and city, so the last match processed wins.
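
One way to catch that inconsistency before running the migration is to scan the match rows for teams that appear with more than one (coach, city) pair. The sketch below works on plain dictionaries, and find_conflicting_teams is a hypothetical helper of my own naming. In a real project the rows could presumably come from something like FootballMatch.objects.values(), but treat that as an assumption.

```python
from collections import defaultdict


def find_conflicting_teams(rows):
    """Map each team name to its (coach, city) pairs when there is more than one."""
    seen = defaultdict(set)
    for row in rows:
        for side in ("home", "away"):
            name = row[side + "_team_name"]
            seen[name].add((row[side + "_team_coach"], row[side + "_team_city"]))
    # Keep only the teams whose details disagree across matches.
    return {name: details for name, details in seen.items() if len(details) > 1}


# Sample data: the Lions are listed with two different coaches.
rows = [
    {
        "home_team_name": "Lions", "home_team_coach": "Ann", "home_team_city": "Springfield",
        "away_team_name": "Bears", "away_team_coach": "Bob", "away_team_city": "Shelbyville",
    },
    {
        "home_team_name": "Bears", "home_team_coach": "Bob", "home_team_city": "Shelbyville",
        "away_team_name": "Lions", "away_team_coach": "Zed", "away_team_city": "Springfield",
    },
]

conflicts = find_conflicting_teams(rows)
```

If the result is non-empty, you can decide which value should win (or clean up the data) before the migration makes that choice for you.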

At this point, we have crafted our forwards and backwards data-migration routines, so running python manage.py migrate applies the forward migration. To go backward, we name the app and the target revision. In my example, the migration before my data migration was 0002, so python manage.py migrate football 0002 would get us back there.

The last step in this refactoring exercise will get us to the originally desired clean version of our data model. That is, we will finalize the removal of the redundant data structures in the FootballMatch.

class FootballTeam(models.Model):
    name = models.CharField(unique=True, max_length=128)
    coach = models.CharField(max_length=128)
    city = models.CharField(max_length=128)
    mascot = models.CharField(max_length=128)
        
class FootballMatch(models.Model):
    home_team = models.ForeignKey(FootballTeam, related_name='home_matches', blank=True, null=True)
    away_team = models.ForeignKey(FootballTeam, related_name='away_matches', blank=True, null=True)

And now create our final schema migration:

(env)[~/src/football]$ python manage.py schemamigration football --auto
 ? The field 'FootballMatch.home_team_city' does not have a default specified, yet is NOT NULL.
 ? Since you are removing this field, you MUST specify a default
 ? value to use for existing rows. Would you like to:
 ?  1. Quit now, and add a default to the field in models.py
 ?  2. Specify a one-off value to use for existing columns now
 ?  3. Disable the backwards migration by raising an exception.
 ? Please select a choice: 

This doesn't look good, but it's OK. Select option #2 and enter a temporary string such as 'temp' each time South asks for a default. These values are only used when migrating backwards, where the backwards data migration then replaces them with data from the FootballTeam models.
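
To see why the placeholder is harmless, here is a small simulation of the two backward steps using plain dictionaries instead of the database. It is a sketch under my own assumptions: undo_column_removal and undo_data_migration are hypothetical stand-ins for what the backwards schema migration and the backwards routine shown earlier do.

```python
PLACEHOLDER = "temp"  # the one-off default typed at the South prompt
FIELDS = ("name", "coach", "city")
OLD_COLUMNS = [side + "_team_" + field for side in ("home", "away") for field in FIELDS]


def undo_column_removal(match):
    """Mimic the backwards schema migration: re-add each column with the placeholder."""
    restored = dict(match)
    for column in OLD_COLUMNS:
        restored[column] = PLACEHOLDER
    return restored


def undo_data_migration(match, teams):
    """Mimic the backwards data migration: copy team data back, clear the foreign keys."""
    restored = dict(match)
    for side in ("home", "away"):
        team = teams.get(restored[side + "_team"])
        if team:
            for field in FIELDS:
                restored[side + "_team_" + field] = team[field]
        restored[side + "_team"] = None
    return restored


teams = {
    1: {"name": "Lions", "coach": "Ann", "city": "Springfield"},
    2: {"name": "Bears", "coach": "Bob", "city": "Shelbyville"},
}
match = {"home_team": 1, "away_team": 2}

after_schema = undo_column_removal(match)
after_data = undo_data_migration(after_schema, teams)
```

Note that a match whose foreign key is null would keep the placeholder, since there is no team to copy from.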

Conclusion

Understanding this three-step process to data migration is necessary in order to adapt your code to changing feature requirements. Please let me know about any errors or mistakes I have made!

Will Bradley
@wbbradley
