@geekley
Last active January 6, 2022 01:41
Asciify a spell-check dictionary (word list). It selects the words in a .dic file that contain non-ASCII characters and transliterates them into ASCII-only versions. https://github.com/streetsidesoftware/cspell/issues/1060#issuecomment-1006199819
#!/usr/bin/env bash
#License: "Zero-Clause BSD" <https://opensource.org/licenses/0BSD>
# Requires GNU grep (for -P), perl, and the Text::Unidecode module (on Ubuntu, it can be installed with: sudo apt install libtext-unidecode-perl).
# Example usage: asciify-dic $DIC_NAME.dic > $DIC_NAME-asciified.dic
if [[ "$1" == "--help" ]]; then
  echo "Usage: $(basename "$0") INPUT_FILE > OUTPUT_FILE"
  echo "Asciify a .dic file (list of dictionary words)."
  echo ""
  echo "Generates a file with ASCII-only versions of the words that have non-ASCII chars."
  echo "These additional words can be used to make spell-checking accent-insensitive."
  echo "Comment lines beginning with % are left unchanged."
  exit
fi
# Keep only comment lines and lines containing non-ASCII characters.
# "$@" is quoted so that file names with spaces work; with no argument, grep reads stdin.
grep -P '^\%|[^\x00-\x7F]' -- "$@" |
# Transliterate words to ASCII; comment lines are passed through unchanged.
# unidecode() modifies $_ in place when called in void context.
perl -C -MText::Unidecode -pe 'next if /^\s*%/; unidecode($_)' |
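# Remove duplicate words; comment lines are always kept.
# [ \t] is used instead of \s, which is a gawk extension that other awks may not support.
awk '/^[ \t]*%/ || !seen[$0]++'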
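
To see why the final de-duplication stage matters, here is a minimal sketch that runs only that awk step on hand-written input. The sample words and the function name `dedupe_keep_comments` are made up for illustration. The input stands in for what the transliteration step might emit when two accented spellings collapse to the same ASCII form.

```shell
#!/usr/bin/env bash
# Hypothetical helper wrapping the last stage of the pipeline.
dedupe_keep_comments() {
  # Print comment lines (optional blanks, then %) every time they occur;
  # print any other line only the first time it is seen.
  awk '/^[ \t]*%/ || !seen[$0]++'
}

# Assumed sample: pretend "résumé" and "resumé" were both transliterated to "resume".
printf '%s\n' '% section A' 'resume' 'resume' 'cafe' '% section A' 'cafe' |
  dedupe_keep_comments
```

The repeated words are printed once each, while both `% section A` lines survive, because the comment test short-circuits before the `seen` counter is consulted.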