@moraisaugusto
Created September 21, 2021 14:59
Find invalid UTF-8 characters in a large file
# Assuming your locale is set to UTF-8 (check the output of `locale`), this prints every line that contains an invalid UTF-8 sequence:
grep -axv '.*' file.txt
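# Options (see the grep man page):
# -a, --text: processes the file as text; essential, because otherwise grep treats a file containing invalid (non-UTF-8) byte sequences as binary and stops printing the matching lines
# -v, --invert-match: inverts the match, so only the lines that do NOT match are printed
# -x, --line-regexp, with the pattern '.*': matches only a complete line made up entirely of valid UTF-8 characters; an invalid byte is not matched by '.'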
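
The command above can be tried on a small sample before running it on a large file. The following is a minimal sketch, not part of the original gist: the sample data, the variable names (`sample`, `utf8_locale`, `bad_lines`, `bad_count`) and the locale detection are assumptions made for illustration. It assumes GNU grep and that at least one UTF-8 locale (for example `C.UTF-8`) is installed; `-n` is added so grep reports line numbers, and `-c` so it reports a count.

```shell
# Build a three-line sample: two valid UTF-8 lines and one with a stray 0xFF byte.
# Octal escapes are used so this works with any POSIX printf.
sample=$(mktemp)
printf 'plain ascii\ncaf\303\251\nbroken \377 byte\n' > "$sample"

# Pick an installed UTF-8 locale (assumption: fall back to C.UTF-8 if none is listed).
utf8_locale=$(locale -a 2>/dev/null | grep -i -m1 -E 'utf-?8')
: "${utf8_locale:=C.UTF-8}"

# Same idea as the command above, with -n to prefix each hit with its line number.
# cut keeps only the line number and drops the (invalid) line content.
bad_lines=$(LC_ALL="$utf8_locale" grep -naxv '.*' "$sample" | LC_ALL=C cut -d: -f1)

# -c counts the offending lines instead of printing them.
bad_count=$(LC_ALL="$utf8_locale" grep -caxv '.*' "$sample")

echo "lines with invalid UTF-8: $bad_lines (count: $bad_count)"
rm -f "$sample"
```

With this sample, only the third line should be reported. Setting `LC_ALL` on the grep invocation itself avoids depending on the locale of the surrounding shell; in a non-UTF-8 locale such as `C`, `.` matches any byte and the check finds nothing.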