dbaynard/Fix-encoding.md

## Fix-encoding.md

      
    Fix encoding issues (e.g. â€™ ~ ’)

WARNING: You must copy the commands from the raw version of this document. The markdown converter doesn’t like it…
Do you have a file with weird encoding errors? Perhaps instead of ’ you have â€™. The cause is a mix-up between latin1 (usually the Windows cp1252 encoding, especially if MySQL is involved) and UTF-8.
Many solutions drop bytes. This one has the advantage that you can see every change as it is made, should you wish.
It requires the file to be encoded in ISO 8859-1.
See the UTF-8 Character Debug Table
This may not even work properly, especially if things have been mis-coded multiple times.
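
The mix-up described above can be reproduced outside vim. The following is a minimal Python sketch, not part of the vim workflow: it assumes the damage came from UTF-8 bytes being read as cp1252, and the variable names are illustrative only.

```python
# Reproduce the classic mojibake: UTF-8 bytes read back as cp1252.
original = "’"                      # U+2019, encoded in UTF-8 as E2 80 99

# Each of the three bytes is misread as its own cp1252 character.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)                      # â€™

# Reversing the two steps recovers the original character.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)         # True
```

A plain round trip like this only works when every garbled character maps back to a cp1252 byte, which is one reason a file with mixed or repeated damage needs a more careful approach.
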
### TLDR single command

Requires vim. Not necessarily the best way to fix this, but it works well for both interactive and instant fixes.
If your web browser isn’t displaying utf8 characters correctly, this won’t work. If your terminal dislikes these characters, use the interactive version.
Oh, and use your actual filename instead of FILE.
vim -b FILE -c "set nobin | %s/\v([ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ])/&/ge | %s/\v([��������������������������������])/&/ge | %s/\v�/€/ge | %s/\v�/‚/ge | %s/\v�/ƒ/ge | %s/\v�/„/ge | %s/\v�/…/ge | %s/\v�/†/ge | %s/\v�/‡/ge | %s/\v�/ˆ/ge | %s/\v�/‰/ge | %s/\v�/Š/ge | %s/\v�/‹/ge | %s/\v�/Œ/ge | %s/\v�/Ž/ge | %s/\v�/‘/ge | %s/\v�/’/ge | %s/\v�/“/ge | %s/\v�/”/ge | %s/\v�/•/ge | %s/\v�/–/ge | %s/\v�/—/ge | %s/\v�/˜/ge | %s/\v�/™/ge | %s/\v�/š/ge | %s/\v�/›/ge | %s/\v�/œ/ge | %s/\v�/ž/ge | %s/\v�/Ÿ/ge | w ++enc=utf8 ++nobin"

Open the file and check for any of the bytes the command cannot map (81, 8D, 8F, 90 or 9D):
/\v[�����]

If you find any, you should restart and use the manual approach.
### How it works (and interactive implementation)

The bulk is regular expression replacement of invalid characters. The main difficulty with direct replacement is that an invalid character tends to swallow the character that follows it, too. This solution prevents that. I have no idea why it works.
#### Load file

Open the file in binary mode, in vim.
vim -b FILE

or
:e ++bin FILE

Then turn binary mode off:
:set nobin

#### Find and replace all invalid characters

Check for an invalid character with 8g8 in normal mode.
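
In vim, 8g8 jumps to an illegal UTF-8 byte sequence at or after the cursor. For a quick check outside the editor, a rough Python equivalent is sketched below. The function name `first_invalid_utf8` and the sample bytes are made up for illustration.

```python
def first_invalid_utf8(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start            # index of the offending byte
    return None

# 0xE9 is 'é' in latin1, but on its own it is not valid UTF-8.
sample = b"caf\xe9 ok"
print(first_invalid_utf8(sample))                    # 3
print(first_invalid_utf8("café".encode("utf-8")))    # None
```

To scan a real file, pass it the bytes from `open(path, "rb").read()`.
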
This should fix the broken but still recognisable characters:
:%s/\v([ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ])/&/g

Then run the following to separate each of the bytes 80 to 9F from the character that follows it:
:%s/\v([��������������������������������])/&/g

The final step is to replace the (now isolated) bytes 80 to 9F with the characters they stand for in cp1252:
:%s/\v�/€/ge | %s/\v�/‚/ge | %s/\v�/ƒ/ge | %s/\v�/„/ge | %s/\v�/…/ge | %s/\v�/†/ge | %s/\v�/‡/ge | %s/\v�/ˆ/ge | %s/\v�/‰/ge | %s/\v�/Š/ge | %s/\v�/‹/ge | %s/\v�/Œ/ge | %s/\v�/Ž/ge | %s/\v�/‘/ge | %s/\v�/’/ge | %s/\v�/“/ge | %s/\v�/”/ge | %s/\v�/•/ge | %s/\v�/–/ge | %s/\v�/—/ge | %s/\v�/˜/ge | %s/\v�/™/ge | %s/\v�/š/ge | %s/\v�/›/ge | %s/\v�/œ/ge | %s/\v�/ž/ge | %s/\v�/Ÿ/ge

This doesn’t include the bytes 81, 8D, 8F, 90 or 9D, which cp1252 leaves undefined, so they have no representation as single characters.

Also check that the â characters are all meant to be there… That’s how some encodings fail (see below).
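
The list of replacements for 80 to 9F, and the five bytes it skips, can be cross-checked against Python's cp1252 codec. This is only a sketch for verification. It relies on Python's strict cp1252 decoder, which rejects the undefined bytes, and the names `mapped` and `unmapped` are illustrative.

```python
# Classify every byte from 80 to 9F by whether cp1252 assigns it a character.
mapped, unmapped = {}, []
for byte in range(0x80, 0xA0):
    try:
        mapped[byte] = bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        unmapped.append(byte)       # cp1252 leaves this byte undefined

print(" ".join(f"{b:02X}" for b in unmapped))   # 81 8D 8F 90 9D
print(len(mapped))                              # 27
print(mapped[0x80], mapped[0x99])               # € ™
```

The 27 mapped bytes correspond to the 27 substitutions in the final step.
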
#### Save file

:w ++enc=utf8 ++nobin

The file should now be fixed.
### Multiple miscodings

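If the text has been miscoded more than once, the steps above will miss some sequences. Search for any ‘EUR’ strings and confirm each replacement: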
:%s/\vâEUR\~/‘/gc
:%s/\vâEUR\(TM\)/’/gc
:%s/\vâEUR"/—/gc
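
One plausible way such strings arise is a second pass that flattens the already garbled characters to ASCII. The Python sketch below illustrates that guess. The `ASCII_FALLBACK` table is an assumption inferred from the three search patterns above, not documented behaviour of any particular tool, and `miscode_twice` is a hypothetical helper.

```python
# Hypothetical ASCII fallbacks, inferred from the search patterns above.
ASCII_FALLBACK = {"€": "EUR", "™": "(TM)", "˜": "~", "”": '"'}

def miscode_twice(text: str) -> str:
    """Garble UTF-8 as cp1252, then flatten some characters to ASCII."""
    garbled = text.encode("utf-8").decode("cp1252")
    return "".join(ASCII_FALLBACK.get(ch, ch) for ch in garbled)

for ch in "‘’—":
    print(ch, "->", miscode_twice(ch))
# ‘ -> âEUR~
# ’ -> âEUR(TM)
# — -> âEUR"
```

If the damage in your file follows a different fallback table, the search patterns will need adjusting to match.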

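Asterisms (U+2042 ⁂) are problematic, because the middle byte of their UTF-8 encoding (E2 81 82) is 81, which has no cp1252 character: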
:%s/\vâ�,/⁂/gc