Skip to content

Instantly share code, notes, and snippets.

@karthicraghupathi
Created June 12, 2024 19:13
Show Gist options
  • Save karthicraghupathi/5cfd5e7fc1cd194d6c1d610d6ff40505 to your computer and use it in GitHub Desktop.
Save karthicraghupathi/5cfd5e7fc1cd194d6c1d610d6ff40505 to your computer and use it in GitHub Desktop.
UTF8 in MariaDB / MySQL

References

Why?

MySQL’s utf8 charset only implements some parts of the whole UTF-8 encoding and it only uses 1 to 3 bytes for encoding. So when you try to use some characters occupying 4 bytes, those characters cannot be stored and there are some warnings.

In MySQL, utf8 is an alias for utf8mb3.

If you want to use more UTF-8 encoding characters, you could use MySQL’s utf8mb4. For the Basic Multilingual Plane (BMP) characters, utf8mb4 and utf8mb3 have identical storage characteristics: same code values, same encoding, same length. For a supplementary character, utf8mb4 using 4 bytes to store it could store more when utf8mb3 cannot at all.

Existing Databases & Tables

Determine Current Character Set & Collation

Run this query:

SELECT
	s.SCHEMA_NAME,
	s.DEFAULT_CHARACTER_SET_NAME,
	s.DEFAULT_COLLATION_NAME,
	t.TABLE_NAME,
	t.TABLE_COLLATION,
	ccsa.CHARACTER_SET_NAME,
	c.COLUMN_NAME,
	c.CHARACTER_SET_NAME,
	c.COLLATION_NAME
FROM
	information_schema.SCHEMATA s
LEFT JOIN information_schema.TABLES t ON
	s.SCHEMA_NAME = t.TABLE_SCHEMA
LEFT JOIN information_schema.COLLATION_CHARACTER_SET_APPLICABILITY ccsa ON
	t.TABLE_COLLATION = ccsa.COLLATION_NAME
LEFT JOIN information_schema.COLUMNS c ON
	s.SCHEMA_NAME = c.TABLE_SCHEMA
	AND t.TABLE_NAME = c.TABLE_NAME
WHERE
	s.schema_name = 'database_name';

Changing Character Set & Collation

Run these queries after creating a backup of the database and at a time when the traffic is minimal. Depending on the table size, the convert query takes a while to run.

-- database
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

-- for each table (also converts columns within table)
-- USE
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- for each table (does NOT convert existing columns within table)
-- do NOT use
ALTER TABLE table_name DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

Future Databases & Tables

Add the following in the appropriate sections in my.cnf:

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

This will ensure all future databases and tables created will the utfmb4 character set and utfmb4_unicode_ci collation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment