@OldskoolOrion
Last active March 11, 2017 12:32
PHP function to 'normalize' data, mangled before by some process (e.g. treating UTF-8 as Windows1252).
<?php
/*
URL : https://github.com/OldskoolOrion/normalize_to_utf8_chars
Function for normalizing input data that was previously mangled by:
1) Encoding Problem : Treating UTF-8 Bytes as Windows-1252 or ISO-8859-1
2) Encoding Problem : Incorrect Double Mis-Conversion
3) Encoding Problem : ISO-8859-1 vs Windows-1252
These problems should of course be tackled at their source (copying/pasting
from Microsoft Word comes to mind), but often you get data handed to you
without having any control over that source, and you still have to fix the
issue in the data.
The function does not deal with every possibility, but it is pretty extensive
and has grown out of real-world issues that I actually had to deal with.
I wrote it in a clear, readable form, optimized for a line length of 120 chars.
In a smaller viewing window (e.g. 80 chars) the most important information is
still visible without scrolling, and I tried to keep my usual coding style
intact to keep everything readable. Where the two clashed, information
prevailed over style.
Take notice of the special remarks below the array (marked by [nr]).
For these remarks I basically break the Unicode standard, for reasons the
situation usually dictates. You might want to take a different approach.
That is up to you and the situation you want to use (part of) this code for.
The array is kept in order of each character's Windows-1252 hex code, for the
simple reason that the ASCII tables I used to look up information are ordered
that way.
Written in PHP, since that gives me the flexibility to deploy it in a
quick-n-dirty shell script that prepares and parses existing data before
loading it into a db, as well as in an online service where an existing mangled
data stream needs fixing before further processing / storage in a database
(UTF-8).
Again : this should NOT be your solution - this is for fixing 'what-already-is'.
Author: OldskoolOrion (H.Coenen)
*/
function normalize_to_utf8_chars($string) {     // Nr. | Unicode | Win1252 | Expected | Actually | UTF8 Bytes
                                                //----------------------------------------------------------------------
  $search=array(chr(0xE2).chr(0x82).chr(0xAC),  // 001 | U+20AC  | 0x80    | €        | â‚¬      | %E2 %82 %AC
                chr(0xE2).chr(0x80).chr(0x9A),  // 002 | U+201A  | 0x82    | ‚        | â€š      | %E2 %80 %9A
                chr(0xC6).chr(0x92),            // 003 | U+0192  | 0x83    | ƒ        | Æ’       | %C6 %92
                chr(0xE2).chr(0x80).chr(0x9E),  // 004 | U+201E  | 0x84    | „        | â€ž      | %E2 %80 %9E
                chr(0xE2).chr(0x80).chr(0xA6),  // 005 | U+2026  | 0x85    | …        | â€¦      | %E2 %80 %A6
                chr(0xE2).chr(0x80).chr(0xA0),  // 006 | U+2020  | 0x86    | †        | â€       | %E2 %80 %A0
                chr(0xE2).chr(0x80).chr(0xA1),  // 007 | U+2021  | 0x87    | ‡        | â€¡      | %E2 %80 %A1
                chr(0xCB).chr(0x86),            // 008 | U+02C6  | 0x88    | ˆ        | Ë†       | %CB %86
                chr(0xE2).chr(0x80).chr(0xB0),  // 009 | U+2030  | 0x89    | ‰        | â€°      | %E2 %80 %B0
                chr(0xC5).chr(0xA0),            // 010 | U+0160  | 0x8A    | Š        | Å        | %C5 %A0
                chr(0xE2).chr(0x80).chr(0xB9),  // 011 | U+2039  | 0x8B    | ‹        | â€¹      | %E2 %80 %B9
                chr(0xC5).chr(0x92),            // 012 | U+0152  | 0x8C    | Œ        | Å’       | %C5 %92
                chr(0xC5).chr(0xBD),            // 013 | U+017D  | 0x8E    | Ž        | Å½       | %C5 %BD
                chr(0xE2).chr(0x80).chr(0x98),  // 014 | U+2018  | 0x91    | ‘        | â€˜      | %E2 %80 %98
                chr(0xE2).chr(0x80).chr(0x99),  // 015 | U+2019  | 0x92    | ’        | â€™      | %E2 %80 %99
                chr(0xE2).chr(0x80).chr(0x9C),  // 016 | U+201C  | 0x93    | “        | â€œ      | %E2 %80 %9C
                chr(0xE2).chr(0x80).chr(0x9D),  // 017 | U+201D  | 0x94    | ”        | â€       | %E2 %80 %9D
                chr(0xE2).chr(0x80).chr(0xA2),  // 018 | U+2022  | 0x95    | •        | â€¢      | %E2 %80 %A2
                chr(0xE2).chr(0x80).chr(0x93),  // 019 | U+2013  | 0x96    | –        | â€“      | %E2 %80 %93 (see: [1])
                chr(0xE2).chr(0x80).chr(0x94),  // 020 | U+2014  | 0x97    | —        | â€”      | %E2 %80 %94 (see: [2])
                chr(0xCB).chr(0x9C),            // 021 | U+02DC  | 0x98    | ˜        | Ëœ       | %CB %9C
                chr(0xE2).chr(0x84).chr(0xA2),  // 022 | U+2122  | 0x99    | ™        | â„¢      | %E2 %84 %A2
                chr(0xC5).chr(0xA1),            // 023 | U+0161  | 0x9A    | š        | Å¡       | %C5 %A1
                chr(0xE2).chr(0x80).chr(0xBA),  // 024 | U+203A  | 0x9B    | ›        | â€º      | %E2 %80 %BA
                chr(0xC5).chr(0x93),            // 025 | U+0153  | 0x9C    | œ        | Å“       | %C5 %93
                chr(0xC5).chr(0xBE),            // 026 | U+017E  | 0x9E    | ž        | Å¾       | %C5 %BE
                chr(0xC5).chr(0xB8),            // 027 | U+0178  | 0x9F    | Ÿ        | Å¸       | %C5 %B8
                chr(0xC2).chr(0xA0),            // 028 | U+00A0  | 0xA0    |          | Â        | %C2 %A0 (see: [3])
                chr(0xC2).chr(0xA1),            // 029 | U+00A1  | 0xA1    | ¡        | Â¡       | %C2 %A1
                chr(0xC2).chr(0xA2),            // 030 | U+00A2  | 0xA2    | ¢        | Â¢       | %C2 %A2
                chr(0xC2).chr(0xA3),            // 031 | U+00A3  | 0xA3    | £        | Â£       | %C2 %A3
                chr(0xC2).chr(0xA4),            // 032 | U+00A4  | 0xA4    | ¤        | Â¤       | %C2 %A4
                chr(0xC2).chr(0xA5),            // 033 | U+00A5  | 0xA5    | ¥        | Â¥       | %C2 %A5
                chr(0xC2).chr(0xA6),            // 034 | U+00A6  | 0xA6    | ¦        | Â¦       | %C2 %A6
                chr(0xC2).chr(0xA7),            // 035 | U+00A7  | 0xA7    | §        | Â§       | %C2 %A7
                chr(0xC2).chr(0xA8),            // 036 | U+00A8  | 0xA8    | ¨        | Â¨       | %C2 %A8
                chr(0xC2).chr(0xA9),            // 037 | U+00A9  | 0xA9    | ©        | Â©       | %C2 %A9
                chr(0xC2).chr(0xAA),            // 038 | U+00AA  | 0xAA    | ª        | Âª       | %C2 %AA
                chr(0xC2).chr(0xAB),            // 039 | U+00AB  | 0xAB    | «        | Â«       | %C2 %AB
                chr(0xC2).chr(0xAC),            // 040 | U+00AC  | 0xAC    | ¬        | Â¬       | %C2 %AC
                chr(0xC2).chr(0xAD),            // 041 | U+00AD  | 0xAD    |          | Â        | %C2 %AD (see: [4])
                chr(0xC2).chr(0xAE),            // 042 | U+00AE  | 0xAE    | ®        | Â®       | %C2 %AE
                chr(0xC2).chr(0xAF),            // 043 | U+00AF  | 0xAF    | ¯        | Â¯       | %C2 %AF
                chr(0xC2).chr(0xB0),            // 044 | U+00B0  | 0xB0    | °        | Â°       | %C2 %B0
                chr(0xC2).chr(0xB1),            // 045 | U+00B1  | 0xB1    | ±        | Â±       | %C2 %B1
                chr(0xC2).chr(0xB2),            // 046 | U+00B2  | 0xB2    | ²        | Â²       | %C2 %B2
                chr(0xC2).chr(0xB3),            // 047 | U+00B3  | 0xB3    | ³        | Â³       | %C2 %B3
                chr(0xC2).chr(0xB4),            // 048 | U+00B4  | 0xB4    | ´        | Â´       | %C2 %B4
                chr(0xC2).chr(0xB5),            // 049 | U+00B5  | 0xB5    | µ        | Âµ       | %C2 %B5
                chr(0xC2).chr(0xB6),            // 050 | U+00B6  | 0xB6    | ¶        | Â¶       | %C2 %B6
                chr(0xC2).chr(0xB7),            // 051 | U+00B7  | 0xB7    | ·        | Â·       | %C2 %B7
                chr(0xC2).chr(0xB8),            // 052 | U+00B8  | 0xB8    | ¸        | Â¸       | %C2 %B8
                chr(0xC2).chr(0xB9),            // 053 | U+00B9  | 0xB9    | ¹        | Â¹       | %C2 %B9
                chr(0xC2).chr(0xBA),            // 054 | U+00BA  | 0xBA    | º        | Âº       | %C2 %BA
                chr(0xC2).chr(0xBB),            // 055 | U+00BB  | 0xBB    | »        | Â»       | %C2 %BB
                chr(0xC2).chr(0xBC),            // 056 | U+00BC  | 0xBC    | ¼        | Â¼       | %C2 %BC
                chr(0xC2).chr(0xBD),            // 057 | U+00BD  | 0xBD    | ½        | Â½       | %C2 %BD
                chr(0xC2).chr(0xBE),            // 058 | U+00BE  | 0xBE    | ¾        | Â¾       | %C2 %BE
                chr(0xC2).chr(0xBF),            // 059 | U+00BF  | 0xBF    | ¿        | Â¿       | %C2 %BF
                chr(0xC3).chr(0x80),            // 060 | U+00C0  | 0xC0    | À        | Ã€       | %C3 %80
                chr(0xC3).chr(0x81),            // 061 | U+00C1  | 0xC1    | Á        | Ã        | %C3 %81
                chr(0xC3).chr(0x82),            // 062 | U+00C2  | 0xC2    | Â        | Ã‚       | %C3 %82
                chr(0xC3).chr(0x83),            // 063 | U+00C3  | 0xC3    | Ã        | Ãƒ       | %C3 %83
                chr(0xC3).chr(0x84),            // 064 | U+00C4  | 0xC4    | Ä        | Ã„       | %C3 %84
                chr(0xC3).chr(0x85),            // 065 | U+00C5  | 0xC5    | Å        | Ã…       | %C3 %85
                chr(0xC3).chr(0x86),            // 066 | U+00C6  | 0xC6    | Æ        | Ã†       | %C3 %86
                chr(0xC3).chr(0x87),            // 067 | U+00C7  | 0xC7    | Ç        | Ã‡       | %C3 %87
                chr(0xC3).chr(0x88),            // 068 | U+00C8  | 0xC8    | È        | Ãˆ       | %C3 %88
                chr(0xC3).chr(0x89),            // 069 | U+00C9  | 0xC9    | É        | Ã‰       | %C3 %89
                chr(0xC3).chr(0x8A),            // 070 | U+00CA  | 0xCA    | Ê        | ÃŠ       | %C3 %8A
                chr(0xC3).chr(0x8B),            // 071 | U+00CB  | 0xCB    | Ë        | Ã‹       | %C3 %8B
                chr(0xC3).chr(0x8C),            // 072 | U+00CC  | 0xCC    | Ì        | ÃŒ       | %C3 %8C
                chr(0xC3).chr(0x8D),            // 073 | U+00CD  | 0xCD    | Í        | Ã        | %C3 %8D
                chr(0xC3).chr(0x8E),            // 074 | U+00CE  | 0xCE    | Î        | ÃŽ       | %C3 %8E
                chr(0xC3).chr(0x8F),            // 075 | U+00CF  | 0xCF    | Ï        | Ã        | %C3 %8F
                chr(0xC3).chr(0x90),            // 076 | U+00D0  | 0xD0    | Ð        | Ã        | %C3 %90
                chr(0xC3).chr(0x91),            // 077 | U+00D1  | 0xD1    | Ñ        | Ã‘       | %C3 %91
                chr(0xC3).chr(0x92),            // 078 | U+00D2  | 0xD2    | Ò        | Ã’       | %C3 %92
                chr(0xC3).chr(0x93),            // 079 | U+00D3  | 0xD3    | Ó        | Ã“       | %C3 %93
                chr(0xC3).chr(0x94),            // 080 | U+00D4  | 0xD4    | Ô        | Ã”       | %C3 %94
                chr(0xC3).chr(0x95),            // 081 | U+00D5  | 0xD5    | Õ        | Ã•       | %C3 %95
                chr(0xC3).chr(0x96),            // 082 | U+00D6  | 0xD6    | Ö        | Ã–       | %C3 %96
                chr(0xC3).chr(0x97),            // 083 | U+00D7  | 0xD7    | ×        | Ã—       | %C3 %97
                chr(0xC3).chr(0x98),            // 084 | U+00D8  | 0xD8    | Ø        | Ã˜       | %C3 %98
                chr(0xC3).chr(0x99),            // 085 | U+00D9  | 0xD9    | Ù        | Ã™       | %C3 %99
                chr(0xC3).chr(0x9A),            // 086 | U+00DA  | 0xDA    | Ú        | Ãš       | %C3 %9A
                chr(0xC3).chr(0x9B),            // 087 | U+00DB  | 0xDB    | Û        | Ã›       | %C3 %9B
                chr(0xC3).chr(0x9C),            // 088 | U+00DC  | 0xDC    | Ü        | Ãœ       | %C3 %9C
                chr(0xC3).chr(0x9D),            // 089 | U+00DD  | 0xDD    | Ý        | Ã        | %C3 %9D
                chr(0xC3).chr(0x9E),            // 090 | U+00DE  | 0xDE    | Þ        | Ãž       | %C3 %9E
                chr(0xC3).chr(0x9F),            // 091 | U+00DF  | 0xDF    | ß        | ÃŸ       | %C3 %9F
                chr(0xC3).chr(0xA0),            // 092 | U+00E0  | 0xE0    | à        | Ã        | %C3 %A0
                chr(0xC3).chr(0xA1),            // 093 | U+00E1  | 0xE1    | á        | Ã¡       | %C3 %A1
                chr(0xC3).chr(0xA2),            // 094 | U+00E2  | 0xE2    | â        | Ã¢       | %C3 %A2
                chr(0xC3).chr(0xA3),            // 095 | U+00E3  | 0xE3    | ã        | Ã£       | %C3 %A3
                chr(0xC3).chr(0xA4),            // 096 | U+00E4  | 0xE4    | ä        | Ã¤       | %C3 %A4
                chr(0xC3).chr(0xA5),            // 097 | U+00E5  | 0xE5    | å        | Ã¥       | %C3 %A5
                chr(0xC3).chr(0xA6),            // 098 | U+00E6  | 0xE6    | æ        | Ã¦       | %C3 %A6
                chr(0xC3).chr(0xA7),            // 099 | U+00E7  | 0xE7    | ç        | Ã§       | %C3 %A7
                chr(0xC3).chr(0xA8),            // 100 | U+00E8  | 0xE8    | è        | Ã¨       | %C3 %A8
                chr(0xC3).chr(0xA9),            // 101 | U+00E9  | 0xE9    | é        | Ã©       | %C3 %A9
                chr(0xC3).chr(0xAA),            // 102 | U+00EA  | 0xEA    | ê        | Ãª       | %C3 %AA
                chr(0xC3).chr(0xAB),            // 103 | U+00EB  | 0xEB    | ë        | Ã«       | %C3 %AB
                chr(0xC3).chr(0xAC),            // 104 | U+00EC  | 0xEC    | ì        | Ã¬       | %C3 %AC
                chr(0xC3).chr(0xAD),            // 105 | U+00ED  | 0xED    | í        | Ã        | %C3 %AD
                chr(0xC3).chr(0xAE),            // 106 | U+00EE  | 0xEE    | î        | Ã®       | %C3 %AE
                chr(0xC3).chr(0xAF),            // 107 | U+00EF  | 0xEF    | ï        | Ã¯       | %C3 %AF
                chr(0xC3).chr(0xB0),            // 108 | U+00F0  | 0xF0    | ð        | Ã°       | %C3 %B0
                chr(0xC3).chr(0xB1),            // 109 | U+00F1  | 0xF1    | ñ        | Ã±       | %C3 %B1
                chr(0xC3).chr(0xB2),            // 110 | U+00F2  | 0xF2    | ò        | Ã²       | %C3 %B2
                chr(0xC3).chr(0xB3),            // 111 | U+00F3  | 0xF3    | ó        | Ã³       | %C3 %B3
                chr(0xC3).chr(0xB4),            // 112 | U+00F4  | 0xF4    | ô        | Ã´       | %C3 %B4
                chr(0xC3).chr(0xB5),            // 113 | U+00F5  | 0xF5    | õ        | Ãµ       | %C3 %B5
                chr(0xC3).chr(0xB6),            // 114 | U+00F6  | 0xF6    | ö        | Ã¶       | %C3 %B6
                chr(0xC3).chr(0xB7),            // 115 | U+00F7  | 0xF7    | ÷        | Ã·       | %C3 %B7
                chr(0xC3).chr(0xB8),            // 116 | U+00F8  | 0xF8    | ø        | Ã¸       | %C3 %B8
                chr(0xC3).chr(0xB9),            // 117 | U+00F9  | 0xF9    | ù        | Ã¹       | %C3 %B9
                chr(0xC3).chr(0xBA),            // 118 | U+00FA  | 0xFA    | ú        | Ãº       | %C3 %BA
                chr(0xC3).chr(0xBB),            // 119 | U+00FB  | 0xFB    | û        | Ã»       | %C3 %BB
                chr(0xC3).chr(0xBC),            // 120 | U+00FC  | 0xFC    | ü        | Ã¼       | %C3 %BC
                chr(0xC3).chr(0xBD),            // 121 | U+00FD  | 0xFD    | ý        | Ã½       | %C3 %BD
                chr(0xC3).chr(0xBE),            // 122 | U+00FE  | 0xFE    | þ        | Ã¾       | %C3 %BE
                chr(0xC3).chr(0xBF));           // 123 | U+00FF  | 0xFF    | ÿ        | Ã¿       | %C3 %BF
// [1] : Unicode dictates 'En dash'. Replaced by space, hyphen-minus, space (' - ').
// [2] : Unicode dictates 'Em dash'. Replaced by space, hyphen-minus, space (' - ').
// [3] : Unicode dictates 'Non-breaking space'. Replaced by a single space (' ').
// [4] : Unicode dictates 'Soft hyphen'. Replaced by a single space (' ').
// See https://github.com/OldskoolOrion/normalize_to_utf8_chars for a more verbose explanation.
$replace = array('€', '‚', 'ƒ', '„', '…', '†', '‡', 'ˆ', '‰', 'Š', '‹', 'Œ', 'Ž', '‘', '’', '“', '”', '•', ' - ',
' - ', '˜', '™', 'š', '›', 'œ', 'ž', 'Ÿ', ' ', '¡', '¢', '£', '¤', '¥', '¦', '§', '¨', '©', 'ª',
'«', '¬', ' ', '®', '¯', '°', '±', '²', '³', '´', 'µ', '¶', '·', '¸', '¹', 'º', '»', '¼', '½',
'¾', '¿', 'À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð',
'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', '×', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'Þ', 'ß', 'à', 'á', 'â', 'ã',
'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö',
'÷', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'þ', 'ÿ');
return str_replace($search, $replace, $string);
}
?>
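
A minimal sketch of the first two problems named in the header comment, assuming the data was double-encoded: UTF-8 bytes were read as Windows-1252 and then saved again as UTF-8. The helper name `fix_double_encoded_sample` and its three-entry map are hypothetical and not part of the gist. Hex escapes are used so that the byte values do not depend on the encoding the PHP file itself is saved in, which is something the quoted literals in `$replace` do depend on.

```php
<?php
// Hypothetical helper (not part of the gist): repairs a few double-encoded
// sequences. Keys are the mojibake as it is stored in UTF-8; values are the
// UTF-8 bytes of the character that was originally intended.
function fix_double_encoded_sample(string $text): string
{
    $map = [
        "\xC3\xA2\xE2\x80\x9A\xC2\xAC" => "\xE2\x82\xAC", // mojibake for U+20AC (euro sign)
        "\xC3\x83\xC2\xA9"             => "\xC3\xA9",     // mojibake for U+00E9 (e with acute)
        "\xC3\x83\xC2\xBC"             => "\xC3\xBC",     // mojibake for U+00FC (u with diaeresis)
    ];

    // strtr() with an array tries the longest keys first and never searches
    // text it has already replaced, so one repair cannot trigger another.
    return strtr($text, $map);
}

// Double-encoded "cafe 5 EUR" with e-acute and the euro sign.
echo fix_double_encoded_sample("caf\xC3\x83\xC2\xA9 5 \xC3\xA2\xE2\x80\x9A\xC2\xAC"), "\n";
```

For the sample input the script prints `café 5 €`. A complete map would need one entry per row of the table above; the three entries here only illustrate the pattern.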
@OldskoolOrion
Author

Should look pretty to even OCD-inflicted programmers. ;-)
