WARNING: You must copy the commands from the raw
version of this document. The markdown converter doesn’t like it…
Do you have a file with weird encoding errors? Perhaps instead of ’ you have â€™. The cause is a mix-up between latin1 (usually, and especially if MySQL is involved, the Windows cp1252 encoding) and utf8.
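To see where the junk comes from, here is a small Python sketch (Python is just my choice for illustration; the fix itself only needs vim). It encodes ’ as utf8 and then wrongly decodes those bytes as cp1252.

```python
# Encode a right single quote as UTF-8, then misread the bytes as cp1252.
original = "\u2019"              # the ’ character
raw = original.encode("utf-8")   # three bytes: E2 80 99
garbled = raw.decode("cp1252")   # each byte is turned into its own character
print(garbled)                   # â€™
```

One character becomes three, which is why the broken text is always longer than the original.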
Many solutions drop bytes. This approach has the advantage that you can see every change as it is made, should you wish.
It requires the file to be encoded in ISO 8859-1.
See the UTF-8 Character Debug Table
This may not even work properly, especially if things have been mis-coded multiple times.
Requires vim. Not necessarily the best way to fix this, but it works well for both interactive and instant fixes.
If your web browser isn’t displaying utf8 characters correctly, this won’t work. If your terminal dislikes these characters, use the interactive version.
Oh, and use your actual filename instead of FILE.
vim -b FILE -c "set nobin | %s/\v([ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ])/&/ge | %s/\v([��������������������������������])/&/ge | %s/\v�/€/ge | %s/\v�/‚/ge | %s/\v�/ƒ/ge | %s/\v�/„/ge | %s/\v�/…/ge | %s/\v�/†/ge | %s/\v�/‡/ge | %s/\v�/ˆ/ge | %s/\v�/‰/ge | %s/\v�/Š/ge | %s/\v�/‹/ge | %s/\v�/Œ/ge | %s/\v�/Ž/ge | %s/\v�/‘/ge | %s/\v�/’/ge | %s/\v�/“/ge | %s/\v�/”/ge | %s/\v�/•/ge | %s/\v�/–/ge | %s/\v�/—/ge | %s/\v�/˜/ge | %s/\v�/™/ge | %s/\v�/š/ge | %s/\v�/›/ge | %s/\v�/œ/ge | %s/\v�/ž/ge | %s/\v�/Ÿ/ge | w ++enc=utf8 ++nobin"
Open the file and check for characters �����
/\v[�����]
If you find any, you should restart and use the manual approach.
The bulk of the command is regular expression replacement of invalid characters. The main difficulty with direct replacement is that the invalid characters tend to eat the following character, too. This solution prevents that. I have no idea why it works.
Open the file in binary mode, in vim.
vim -b FILE
or
:e ++bin FILE
Then, set as not-binary
:set nobin
Check for an invalid character with 8g8 in normal mode.
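If you would rather check from outside vim, something like the Python sketch below does a similar job. It is only an approximation: `first_invalid_utf8` is a name I made up, and it reports the first bad byte in the whole data rather than the next one after a cursor.

```python
from typing import Optional

def first_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start  # offset where decoding failed
    return None

# A lone 0x92 byte (the cp1252 right single quote) is not valid UTF-8.
print(first_invalid_utf8(b"it\x92s"))
```

To run it on a real file, read the file with `open(path, "rb")` so that Python does no decoding of its own first.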
This should fix the broken recognisable characters
:%s/\v([ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ])/&/g
Then run the following to separate the values 80 to 9F from the following character
:%s/\v([��������������������������������])/&/g
The final step is to replace the (now isolated) 80 to 9F.
:%s/\v�/€/ge | %s/\v�/‚/ge | %s/\v�/ƒ/ge | %s/\v�/„/ge | %s/\v�/…/ge | %s/\v�/†/ge | %s/\v�/‡/ge | %s/\v�/ˆ/ge | %s/\v�/‰/ge | %s/\v�/Š/ge | %s/\v�/‹/ge | %s/\v�/Œ/ge | %s/\v�/Ž/ge | %s/\v�/‘/ge | %s/\v�/’/ge | %s/\v�/“/ge | %s/\v�/”/ge | %s/\v�/•/ge | %s/\v�/–/ge | %s/\v�/—/ge | %s/\v�/˜/ge | %s/\v�/™/ge | %s/\v�/š/ge | %s/\v�/›/ge | %s/\v�/œ/ge | %s/\v�/ž/ge | %s/\v�/Ÿ/ge
This doesn’t include 81, 8D, 8F, 90 or 9D, which should not have any representation as single characters.
�����
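You don't have to take that list of five on trust. A quick Python sketch, assuming Python's built-in cp1252 codec (which refuses to decode the unassigned bytes), walks 80 to 9F and collects the ones with no character:

```python
# Try every byte from 0x80 to 0x9F and record those cp1252 cannot decode.
undefined = []
for b in range(0x80, 0xA0):
    try:
        bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(f"{b:02X}")

print(undefined)  # ['81', '8D', '8F', '90', '9D']
```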
Also check that the â characters are all meant to be there… That’s how some encodings fail (see below).
:w ++enc=utf8 ++nobin
Should now be fixed.
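For comparison, here is roughly what I believe the net effect of the steps is, as a Python sketch. This is my reading, not a description of what vim does internally: it assumes every byte in the file is a single-byte character, and `cp1252_bytes_to_utf8` is a hypothetical helper name. Bytes are decoded as cp1252 and the result is written out as utf8. Keeping the five unassigned bytes as the code points with the same numbers is my own choice, made so that nothing gets dropped.

```python
def cp1252_bytes_to_utf8(data: bytes) -> bytes:
    """Decode each byte as cp1252 and re-encode the result as UTF-8."""
    chars = []
    for b in data:
        try:
            chars.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # 81, 8D, 8F, 90, 9D have no cp1252 character; keep the
            # code point with the same number rather than dropping it.
            chars.append(chr(b))
    return "".join(chars).encode("utf-8")

fixed = cp1252_bytes_to_utf8(b"it\x92s caf\xe9")
print(fixed.decode("utf-8"))  # it’s café
```

Unlike the vim version, this gives you no chance to look at each change, so keep a copy of the original file.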
Some conversions appear to write € as EUR (and ™ as (TM)). So… Search for any ‘EUR’ strings.
:%s/\vâEUR\~/‘/gc
:%s/\vâEUR\(TM\)/’/gc
:%s/\vâEUR"/—/gc
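The three substitutions above can also be written as a lookup table. This Python sketch is only a non-interactive equivalent: it skips the confirmation that the `c` flag gives you in vim, and `REPLACEMENTS` and `fix_translit` are names I invented for the example.

```python
# Transliterated junk on the left, the intended character on the right.
REPLACEMENTS = {
    "âEUR~": "\u2018",     # left single quote
    "âEUR(TM)": "\u2019",  # right single quote
    'âEUR"': "\u2014",     # long dash
}

def fix_translit(text: str) -> str:
    """Replace each known junk sequence with its intended character."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    return text

print(fix_translit("itâEUR(TM)s"))  # it’s
```

Because it replaces everything without asking, check the result for any genuine ‘EUR’ text that happened to follow an â.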
Asterisms (U+2042 ⁂)… problematic
:%s/\vâ�,/⁂/gc