@FilipDominec
Created October 19, 2021 17:35
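Helps fix garbled diacritics on legacy websites. Uses the chardet module to detect each file's character encoding; accepts multiple files and prints a table with one row per file, showing the detected encoding next to whatever charset the file itself declares.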
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import chardet, pathlib, sys

# Keys are matched against the first 3 characters of the chardet result,
# and searched for as substrings in the file header (see below)
known_enc = {'Win':'Windows-1250', 'ISO':'ISO-8859-2', '1250':'Windows-1250', 'utf':'utf8' }

for fn in sys.argv[1:]:
    data = pathlib.Path(fn).read_bytes()    # read each file only once
    # chardet reports None when it cannot guess, e.g. for an empty file
    found_enc = chardet.detect(data)['encoding'] or 'unknown'
    if found_enc[:3] in known_enc:
        found_enc = known_enc[found_enc[:3]]
    print(f'{fn:20s} auto-detected encoding {found_enc:14s}', end='')
    fileheader = data[:500]    # only the first 500 bytes are searched for a charset declaration
    if b'charset=' in fileheader:
        print(' --> file defines encoding ', end='')
        for k in known_enc:
            if k.encode() in fileheader:
                print(k, end='')
        print()
    else:
        print(' --> file DOES NOT define encoding')
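
The substring search above prints every dictionary key that occurs in the first 500 bytes, so a header declaring `charset=Windows-1250` is reported as `Win1250`, and the match is case-sensitive. A minimal sketch of a stricter check follows, assuming the declaration has the usual `charset=NAME` form found in HTML meta tags. The helper `declared_charset` and the pattern `CHARSET_RE` are hypothetical names, not part of the original script.

```python
import re

# Hypothetical helper, not part of the original script.
# Matches charset=NAME with an optional opening quote, ignoring case.
CHARSET_RE = re.compile(rb'charset=["\']?([\w.:-]+)', re.IGNORECASE)

def declared_charset(header: bytes):
    """Return the charset name declared in the header bytes, or None."""
    match = CHARSET_RE.search(header)
    return match.group(1).decode('ascii') if match else None

if __name__ == '__main__':
    sample = b'<meta http-equiv="Content-Type" content="text/html; charset=windows-1250">'
    print(declared_charset(sample))
```

In the script, `declared_charset(fileheader)` could stand in for the inner loop over `known_enc`, printing the declared name verbatim instead of the matching keys.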