@FilipDominec
Created June 28, 2022 15:06
Searches for the charset conversion that turns a known correct string into a given wrongly encoded one
#!/usr/bin/python3
#-*- coding: utf-8 -*-
# Searches for the charset conversion that turns a known correct string into a given wrongly encoded one
# Public domain, written by Filip Dominec 2022
# EXAMPLES:
#wrong, correct = "╪ konstrukЯnб ¤eчenб", "ě konstrukční řešení"
#wrong, correct = "sloučeninovĂ˝ch", "sloučeninových"
#wrong, correct = "pøípravu slouèeninových polovodièù", "přípravu sloučeninových polovodičů"
#wrong, correct = "ý", "ý"
#wrong, correct = "Pro přípravu sloučeninových polovodičů vyuľívá jako zdrojové materiály", "Pro přípravu sloučeninových polovodičů využívá jako zdrojové materiály"
#wrong, correct = "à", "ů"
#wrong, correct = "v∞m╪r", "výměr"
#wrong, correct = "slouèeninových", "sloučeninových"
#wrong, correct = "vyuľívá","využívá"
#wrong, correct = "vyu¾ívá", "využívá"
wrong, correct = "M╪²ení", "Měření"
import os
## Try all encodings (big table!)
def encodinglist(): # https://stackoverflow.com/questions/1728376/get-a-list-of-all-the-encodings-python-can-encode-to
    r=[]
    # Every module in the "encodings" package is a candidate codec name
    for i in os.listdir(os.path.split(__import__("encodings").__file__)[0]):
        name=os.path.splitext(i)[0]
        try:
            "".encode(name)  # fails for modules that are not text encodings
        except Exception:
            pass
        else:
            if name not in ("idna", "punycode"):
                r.append(name.replace("_","-"))
    r.sort()
    return r
enclist = encodinglist()
## Narrow list of likely encodings
#enclist = ['ascii', 'utf8', 'latin-1']
#win_encs = [f'Windows-125{n}' for n in range(8)]
#iso_encs = [f'ISO-8859-{n}' for n in range(1,10) ]
#enclist = enclist + win_encs + iso_encs
possible_froms = []
possible_tos = []
possible_solutions = []
enclen = max(len(c) for c in enclist)
enclist_aligned = [f"{enc:{enclen}} " for enc in enclist]
print("REALLY ENCODED: \\  BUT INTERPRETED AS:")
# Print the column headers vertically, one character of each encoding name per row
for ll in ("".join(j) for j in zip(*enclist_aligned)):
    print(" "*enclen + " " + ll)
# f = encoding the text was really stored in (row), t = encoding it was misread as (column)
for f,a in zip(enclist, enclist_aligned):
    print(a, end="")
    for t in enclist:
        try:
            # Undo the misinterpretation: back to bytes via t, then decode properly via f
            co = wrong.encode(t,"ignore").decode(f,"ignore")
            if co == correct:
                #print(f,t)
                print("X", end="")
                possible_froms.append(f)
                possible_tos.append(t)
                possible_solutions.append((f,t))
            else:
                print("·" if f!=t else " ", end="")
                #print(co, end="")
                #if "ý" in co: print(f,t)
        except Exception:
            print("E",end="")
    print()
#for f,t in possible_solutions:
print(f"Conclusion: when '{correct}' is encoded as:\n\t{set(possible_froms)}\nbut (mis)interpreted as:\n\t{set(possible_tos)},\n it may appear as '{wrong}'")
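The search above can also be wrapped in a function that returns the matching pairs instead of printing the full table. The following is only a sketch: the helper name `find_conversions` and the short candidate list are hypothetical, and it uses strict error handling where the script uses `"ignore"`, so it may report fewer pairs than the table does.

```python
def find_conversions(wrong, correct, encodings):
    """Return (real, misread) codec pairs for which
    wrong.encode(misread).decode(real) == correct."""
    pairs = []
    for real in encodings:
        for misread in encodings:
            try:
                if wrong.encode(misread).decode(real) == correct:
                    pairs.append((real, misread))
            except (UnicodeError, LookupError):
                # The string cannot be represented in this codec, or the codec is unknown
                continue
    return pairs

# Hypothetical narrow list of likely encodings for Czech text
candidates = ["ascii", "utf-8", "latin-1", "cp437", "cp852", "cp1250", "iso8859-2"]
pairs = find_conversions("M╪²ení", "Měření", candidates)
print(pairs)
```

Restricting the candidates this way keeps the search fast, at the cost of missing conversions that involve a codec left out of the list.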
FilipDominec commented Jun 28, 2022

Example output:

REALLY ENCODED: \  BUT INTERPRETED AS:
                   abbccccccccccccccccccccccccccccccccccccccccceeeeggghhiiiiiiiiiiiiiiiiiiiiiijkkkklmmmmmmmmmmmpprssstuuuuuuuuuu
                   siihppppppppppppppppppppppppppppppppppppppppuuuubbbpzssssssssssssssssssssssoooozaaaaaaaaaaaaatahhhinttttttttt
                   cgga0111111111111124457778888888888888888999cccc12k- oooooooooooooooooooooohiii1tccccccccccclcwiiisifffffffff
                   i55r3001122222222272302375555556666666677345----83 r 2222222888888888888888a8880i-----------mp-fff-c---------
                   i hm7022455555555534700750256780123456945290jjjk01 o 0000000888888888888888b---4nacccfgilrrto1uttt6o111333788
                     ka 6650012345678                          iipr32 m 2222222555555555555555 rtu8-reryarcaoous5n---2d666222  -
                     sp                                        ss  0  a 2222222999999999999999     1anorreetmmr 4ijjj0e -- --  s
                     c                                         -x     n ----------------------      btaiseliaak  ciii - bl bl  i
                     s                                         20     8 jjjjjjk111111123456789      ietlikannni  osss e ee ee  g
                                                               02       ppppppr 013456              cuil  n2 is  d -x s         
                                                               01        -----                       rai  d  ah  e 20 c         
                                                               43        1223e                       onc     n   - 02 a         
                                                                           0 x                                   e 01 p         
                                                                           0 t                                   s 43 e         
                                                                           4                                     c              
                                                                                                                 a              
                                                                                                                 p              
                                                                                                                 e              
                                                                                                                                
ascii               ············································································································
big5               · ···········································································································
big5hkscs          ·· ··········································································································
charmap            ··· ·········································································································
cp037              ···· ········································································································
cp1006             ····· ·······································································································
cp1026             ······ ······································································································
cp1125             ······· ·····································································································
cp1140             ········ ····································································································
cp1250             ········· ···································································································
cp1251             ·········· ··································································································
cp1252             ··········· ·································································································
cp1253             ············ ································································································
cp1254             ············· ·······························································································
cp1255             ·············· ······························································································
cp1256             ··············· ·····························································································
cp1257             ················ ····························································································
cp1258             ················· ···························································································
cp273              ·················· ··························································································
cp424              ··················· ·························································································
cp437              ···················· ························································································
cp500              ····················· ·······················································································
cp720              ······················ ······················································································
cp737              ······················· ·····················································································
cp775              ························ ····················································································
cp850              ························· ···················································································
cp852              ····················X····· ····XXX··X········································································
cp855              ··························· ·················································································
cp856              ···························· ················································································
cp857              ····························· ···············································································
cp858              ······························ ··············································································
cp860              ······························· ·············································································
cp861              ································ ············································································
cp862              ································· ···········································································
cp863              ·································· ··········································································
cp864              ··································· ·········································································
cp865              ···································· ········································································
cp866              ····································· ·······································································
cp869              ······································ ······································································
cp874              ······································· ·····································································
cp875              ········································ ····································································
cp932              ········································· ···································································
cp949              ·········································· ··································································
cp950              ··········································· ·································································
euc-jis-2004       ············································ ································································
euc-jisx0213       ············································· ·······························································
euc-jp             ·············································· ······························································
euc-kr             ··············································· ·····························································
gb18030            ················································ ····························································
gb2312             ················································· ···························································
gbk                ·················································· ··························································
hp-roman8          ··················································· ·························································
hz                 ···················································· ························································
iso2022-jp         ····················································· ·······················································
iso2022-jp-1       ······················································ ······················································
iso2022-jp-2       ······················································· ·····················································
iso2022-jp-2004    ························································ ····················································
iso2022-jp-3       ························································· ···················································
iso2022-jp-ext     ·························································· ··················································
iso2022-kr         ··························································· ·················································
iso8859-1          ···························································· ················································
iso8859-10         ····························································· ···············································
iso8859-11         ······························································ ··············································
iso8859-13         ······························································· ·············································
iso8859-14         ································································ ············································
iso8859-15         ································································· ···········································
iso8859-16         ·································································· ··········································
iso8859-2          ··································································· ·········································
iso8859-3          ···································································· ········································
iso8859-4          ····································································· ·······································
iso8859-5          ······································································ ······································
iso8859-6          ······································································· ·····································
iso8859-7          ········································································ ····································
iso8859-8          ········································································· ···································
iso8859-9          ·········································································· ··································
johab              ··········································································· ·································
koi8-r             ············································································ ································
koi8-t             ············································································· ·······························
koi8-u             ·············································································· ······························
kz1048             ··············································································· ·····························
latin-1            ················································································ ····························
mac-arabic         ················································································· ···························
mac-centeuro       ·················································································· ··························
mac-croatian       ··················································································· ·························
mac-cyrillic       ···················································································· ························
mac-farsi          ····················································································· ·······················
mac-greek          ······················································································ ······················
mac-iceland        ······················································································· ·····················
mac-latin2         ························································································ ····················
mac-roman          ························································································· ···················
mac-romanian       ·························································································· ··················
mac-turkish        ··························································································· ·················
palmos             ···························································································· ················
ptcp154            ····························································································· ···············
raw-unicode-escape ······························································································ ··············
shift-jis          ······························································································· ·············
shift-jis-2004     ································································································ ············
shift-jisx0213     ································································································· ···········
tis-620            ·································································································· ··········
unicode-escape     ··································································································· ·········
utf-16             ···································································································· ········
utf-16-be          ····································································································· ·······
utf-16-le          ······································································································ ······
utf-32             ······································································································· ·····
utf-32-be          ········································································································ ····
utf-32-le          ········································································································· ···
utf-7              ·········································································································· ··
utf-8              ··········································································································· ·
utf-8-sig          ············································································································ 
Conclusion: when 'Měření' is encoded as:
	{'cp852'}
but (mis)interpreted as:
	{'cp437', 'cp860', 'cp862', 'cp861', 'cp865'},
 it may appear as 'M╪²ení'
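
One way to check this conclusion is a round trip through the two codecs: encoding the correct string as cp852 and decoding it as cp437 should reproduce the garbled text, and the reverse conversion should repair it. A minimal sketch using only `str.encode` and `bytes.decode`; the variable names are illustrative.

```python
correct = "Měření"

# Reproduce the damage: bytes written as cp852, then read as cp437
garbled = correct.encode("cp852").decode("cp437")
print(garbled)  # M╪²ení

# Repair it: back to the original bytes via cp437, then decode as cp852
repaired = garbled.encode("cp437").decode("cp852")
print(repaired == correct)
```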
