@saurabhvyas
Last active February 21, 2018 08:19
This script cleans up UTF-8 text files by skipping every invalid byte sequence.
#!/bin/bash
# Remove invalid UTF-8 byte sequences from every .txt file in input_dir
# and write the cleaned copies to output_dir.
input_dir='/media/saurabh/New Volume2/hardik_dataset/final (another copy)'
output_dir='/media/saurabh/New Volume2/hardik_dataset/output'
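# create the output folder if it does not exist yet
mkdir -p "$output_dir"
# iterate over each .txt file in the input folder
for entry in "$input_dir"/*.txt
do
    # skip the literal pattern when no .txt file matches
    [ -e "$entry" ] || continue
    echo "$entry"
    name=$(basename "$entry")
    # -c drops byte sequences that are not valid UTF-8
    iconv -f utf-8 -t utf-8 -c "$entry" > "$output_dir/$name"
done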
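
To see what the `-c` flag in the loop above does, here is a minimal sketch on a throwaway file. The sample content and the variable names `sample`, `status`, and `cleaned` are assumptions made for illustration and are not part of the original script.

```shell
#!/bin/bash
# Hypothetical sample file: "abc", a lone 0xFF byte (never valid in UTF-8), "def".
sample=$(mktemp)
printf 'abc\377def\n' > "$sample"

# Without -c, iconv stops at the invalid byte and exits non-zero,
# which makes it usable as a validity check.
if iconv -f utf-8 -t utf-8 "$sample" > /dev/null 2>&1; then
    status="valid"
else
    status="invalid"
fi
echo "before cleaning: $status"

# With -c the invalid byte is discarded. iconv may still exit non-zero
# after dropping input, so the exit status is ignored here.
cleaned=$(iconv -f utf-8 -t utf-8 -c "$sample" 2>/dev/null || true)
echo "after cleaning: $cleaned"

rm -f "$sample"
```

Running the check without `-c` first is one way to find out which files actually need cleaning before overwriting or copying anything.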