@dmolesUC
Created August 5, 2021 19:36
Grep command to locate possible Windows 1252 <-> UTF-8 encoding problems
# After https://www.i18nqa.com/debug/utf8-debug.html
#
# This only looks for individual suspicious lead bytes, so a file that legitimately
# contains accented characters in valid UTF-8 will show up as a false positive.
# A more sophisticated version would grep for suspicious sequences-of-sequences.
# LC_ALL=C makes grep -P match raw bytes rather than decoded characters
find . -name marc.xml -exec env LC_ALL=C grep -Pl '(\xc2|\xc3|\xc5|\xc6|\xcb|\xe2)' {} \;
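
A rough sketch of the "sequences-of-sequences" idea mentioned in the comments, assuming GNU grep built with PCRE support. The `pattern` variable and the two alternatives in it are illustrative assumptions, not part of the original gist. The first alternative targets two-byte UTF-8 characters that were read as Windows-1252 and re-encoded (for example `é`, bytes `C3 A9`, becomes `C3 83 C2 A9`). The second targets the `â€` prefix that smart quotes and dashes typically turn into. It will not catch every form of double encoding.

```shell
# Hypothetical stricter pattern (an assumption, not from the original gist):
#   \xc3[\x82\x83]\xc2[\x80-\xbf]  a re-encoded "Â" or "Ã" followed by a re-encoded byte
#   \xc3\xa2\xe2\x82\xac           a re-encoded "â€", the usual debris of smart punctuation
pattern='\xc3[\x82\x83]\xc2[\x80-\xbf]|\xc3\xa2\xe2\x82\xac'

# Same traversal as above; only the byte pattern changes
find . -name marc.xml -exec env LC_ALL=C grep -Pl "$pattern" {} \;
```

Because the pattern requires a full double-encoded sequence, a file containing only correctly encoded accented characters should no longer be listed.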