Skip to content

Instantly share code, notes, and snippets.

@jonasraoni
Created July 21, 2021 13:39
Show Gist options
  • Save jonasraoni/3e6913b32b934f388abab3eba77a5ed7 to your computer and use it in GitHub Desktop.
Save jonasraoni/3e6913b32b934f388abab3eba77a5ed7 to your computer and use it in GitHub Desktop.
Detect and convert bad encoded UTF-8 data in MySQL
-- Convert
SELECT CONVERT(CAST(CONVERT('São Paulo' USING latin1) AS BINARY) USING utf8) -- São Paulo
-- Our lovely friends celebrate ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö \o/
-- Poor man's bad encoding detection (in my case there was LATIN1 data mixed with UTF8 in the database and I just wanted to find when things started to get mixed up)
-- When trying to convert, normally the amount of characters decrease...
SELECT field
FROM table
WHERE LENGTH(CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8)) < LENGTH(field)
-- And if an specific character fails, MySQL replaces it by "????"
AND (
-- So the output shouldn't have the interrogation
LOCATE('?', CONVERT(CAST(CONVERT(field USING latin1) AS BINARY) USING utf8)) = 0
-- Or the original text should have it as well
OR LOCATE('?', field) <> 0
)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment