@jennyd
Created September 11, 2014 10:36
Remove lines containing invalid UTF8 byte sequences from files
#!/opt/mawk/bin/mawk -f
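##
# Skip lines with invalid UTF-8 byte sequences.
# Adapted from http://unix.stackexchange.com/questions/6516/filtering-invalid-utf8 by @rgarner
#
# A line is kept only if it consists entirely of well-formed sequences:
#   1 byte:  00-7F (ASCII)
#   2 bytes: C2-DF, then one continuation byte (80-BF)
#   3 bytes: E0-EF lead; E0 A0-BF rejects overlong forms, ED 80-9F rejects surrogates
#   4 bytes: F0-F4 lead; F0 90-BF rejects overlong forms, F4 80-8F caps at U+10FFFF
#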
$0 !~ /^(([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF])|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF][\x80-\xBF])))*$/ {
    next
}
{ print }
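
To sanity-check what the filter above drops, it can help to compare it against an independent validator. The following is a minimal sketch, not the gist's method: it assumes `iconv` is installed, and `keep_valid_utf8_lines` is a hypothetical helper name. It starts one `iconv` process per line, so it is only suitable for small samples.

```shell
#!/bin/sh
# Hypothetical helper (not part of the gist): copy stdin to stdout,
# keeping only the lines that iconv accepts as valid UTF-8.
keep_valid_utf8_lines() {
    # "|| [ -n "$line" ]" also handles a final line with no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # Converting UTF-8 to UTF-8 fails with a non-zero exit status
        # when the input contains an invalid byte sequence.
        if printf '%s' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            printf '%s\n' "$line"
        fi
    done
}

# Sample input: the middle line contains the byte 0xFF (octal 377),
# which never occurs in well-formed UTF-8, so it should be dropped.
printf 'ok\nbad\377line\nalso ok\n' | keep_valid_utf8_lines
```

Running the same sample through the mawk script and comparing the two outputs with `diff` gives a quick cross-check of the regular expression.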
#!/bin/bash
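##
# Remove, in place, every line containing an invalid UTF-8 byte sequence
# from the given files. A "<file>.utf8-cleaned" marker is created after each
# successful run so that the file is skipped next time.
# Expects exclude_lines_containing_invalid_utf8_bytes.awk in the current directory.
set -e
if [ $# -eq 0 ]; then
    echo "Usage: $0 <files>"
    exit 1
fi
for file in "$@"
do
    if [ -f "$file.utf8-cleaned" ]; then
        # These have already been cleaned
        echo "Skipping $file"
    elif [[ "$file" =~ \.utf8-cleaned$ ]]; then
        # Don't try to clean the marker files themselves
        continue
    else
        echo "$file"
        # Quote every expansion so that names containing spaces survive.
        /opt/mawk/bin/mawk -f exclude_lines_containing_invalid_utf8_bytes.awk "$file" > "$file.tmp" \
            && mv "$file.tmp" "$file" \
            && touch "$file.utf8-cleaned"
    fi
done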
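
The skip logic in the loop above can be pulled out and tried on its own before touching real data. This is a minimal sketch under stated assumptions: `needs_cleaning` is a hypothetical helper name, not something the gist defines, and it only reports the decision without modifying any file.

```shell
#!/bin/sh
# Hypothetical helper (not part of the gist): succeed (exit status 0)
# only if the loop above would run the filter on the given path.
needs_cleaning() {
    case "$1" in
        *.utf8-cleaned) return 1 ;;   # marker files are never cleaned
    esac
    # A file is cleaned only while its marker file does not exist yet.
    [ ! -f "$1.utf8-cleaned" ]
}

# Dry run: list what would be processed among the arguments, if any.
for f in "$@"; do
    if needs_cleaning "$f"; then
        echo "would clean: $f"
    else
        echo "would skip:  $f"
    fi
done
```

Because a file is skipped whenever its marker exists, re-cleaning a file that has changed since the last run requires deleting its `.utf8-cleaned` marker first.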