Skip to content

Instantly share code, notes, and snippets.

@joehoyle
Created June 24, 2021 07:25
Show Gist options
  • Save joehoyle/971d4a02160071dc377bbc3bebaf8f8c to your computer and use it in GitHub Desktop.
Save joehoyle/971d4a02160071dc377bbc3bebaf8f8c to your computer and use it in GitHub Desktop.

WordPress Database Migration to UTF8 (Altis)

Moving a WordPress database to "all UTF8" can be problematic. Without taking note of the specific scenario, you might find yourself struggling with messed up encoding (on not-ascii chars). If you find yourself manually replacing messed up chars in a database, stop! You should never need to do this. Only in the worst cases of "split brain encoding" might you need to do something like that.

When moving a WordPress database to a new charset, it's not uncommon for there to be a mixture of latin and utf8 in different settings. This is because WordPress' default encoding used to be latin1 a long time ago. To confuse things furthere, there can also be a mismatch between WordPress' understanding of the encoding for the database, and MySQL's.

DB_CHARSET

You must find the value of DB_CHARSET in WordPress. This is typically in the wp-config.php, however it might not be. WordPress has a default value, which can vary depending if you are using a DB drop-in plugin too. The best way to check this is with WP CLI via wp shell and echo DB_CHARSET;.

Database

The MySQL database has it's own encoding value (as well as all tables, but for this guide we'll assume all tables match the database level encoding value). You can run a query like SELECT default_character_set_name FROM information_schema.SCHEMATA WHERE schema_name = "wordpress" to determine what the charset in the database is.

From that, you should map to one of the following scenarios.

Of most interesting, and common point of confusion is scenario 3. In this case, you have WordPress sending UTF8 data into a database that MySQL understands to be latin1. Any database export from MySQL will then treat that data as latin1 on export, which can mangle data. The most direct way to fix this is to export the database from the MySQL server with no default char encoding set in the mysqldump file, and then re-import it into a new utf8 database. That looks something like this:

mysqldump -h $hostname -u wordpress --skip-set-charset --default-character-set=latin1 wordpress > backup.sql

Next, we re-import the database backup (that has skipped setting the latin1 charset in the database file). This assumes that the default charset on your MySQL server is UTF8.

mysql -h $hostname -u wordpress wordpress < backup.sql

Warning on "split brain encoding"

If at any point the value of DB_CHARSET on the WordPress side changes from latin1 to utf8 without first correcting the database, you'll be in a world of hurt that's virtually impossible to fix. The issue here will be that some data (say posts after you messed up those charset values) will be encoded as utf8 and some will be latin1, and there's no true way to distinguish them. Converting from latin1 to utf8 will now break everything that was already utf8, and vice versa.

The only tool you then have at your disposal is messy migration scripts running things like mb_convert_encoding trying to guess which data needs to be converted and which doesn't.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment