@jeffkayser
Last active February 4, 2024 04:21
Creating English Wordlists with Spell Checker Oriented Word Lists (SCOWL)

Build SCOWL English wordlists

Summary

I needed to generate an English wordlist. SCOWL (Spell Checker Oriented Word Lists) comes with a build script, mk-list, that lets you customize which words are included (see the README). The script below generates one wordlist for each supported size. The output of mk-list seems to be ISO-8859-1 encoded with DOS line endings, so the script converts it to UTF-8 with UNIX line endings.
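The conversion step can be checked in isolation before running the full build. The following is a minimal sketch, assuming `iconv`, `tr`, and `od` are available; `to_utf8_unix` is a hypothetical helper name, and the sample word is made up to stand in for mk-list output.

```shell
#!/bin/bash
# Hypothetical helper (not part of SCOWL): the same two conversion steps
# the build script applies to mk-list output.
to_utf8_unix() {
    iconv -f ISO-8859-1 -t UTF-8 | tr -d '\r'
}

# Made-up sample: "café" in ISO-8859-1 (é is the single byte 0xE9,
# octal 351) followed by a DOS line ending (CR LF).
hex=$(printf 'caf\351\r\n' | to_utf8_unix | od -An -tx1 | tr -d ' \n')

# Print the converted bytes as hex for inspection.
echo "$hex"
```

In the hex dump, the é should appear as the two UTF-8 bytes c3 a9, and no 0d (carriage return) byte should remain.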

Script

#!/bin/bash

# Possessive duplicates ("$word" and "$word's") are stripped by default
# Pass -p as the first arg to retain them
if [ "$1" == "-p" ]; then
    POSSESSIVE=1
else
    POSSESSIVE=
fi

for SIZE in 10 20 35 40 50 55 60 70 80 95
do
    SCOWL_FILE=scowl-words-$SIZE.txt
    perl mk-list english $SIZE |
        iconv -f ISO-8859-1 -t UTF-8 |
        tr -d '\r' |
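        if [[ -z "$POSSESSIVE" ]]; then
            # Strip trailing 's, then collapse the resulting duplicates
            sed -E "s/'s$//" | sort -u
        else
            cat
        fi > "$SCOWL_FILE"
    # Read from stdin so wc prints only the count; tr drops any padding
    SCOWL_WORDS=$(wc -l < "$SCOWL_FILE" | tr -d ' ')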
    echo "Created '$SCOWL_FILE' ($SCOWL_WORDS words)"
done
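To see what the possessive-stripping step does, it can be run on a few sample words. This is a sketch only: `strip_possessives` is a hypothetical name, the input is made up rather than taken from SCOWL, and `LC_ALL=C` is an added assumption to make the sort order predictable across systems.

```shell
#!/bin/bash
# Hypothetical helper: same sed expression as the script, followed by a
# deduplicating sort. LC_ALL=C is an assumption added for stable ordering.
strip_possessives() {
    sed -E "s/'s$//" | LC_ALL=C sort -u
}

# Made-up sample input, one word per line.
result=$(printf "cat\ncat's\ndog\ndog's\nit's\n" | strip_possessives)

printf '%s\n' "$result"
```

Note that the expression removes any trailing `'s`, so a contraction such as "it's" is reduced to "it" as well. Pass -p to the script if those forms need to be preserved.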

Sample output

Word counts for each size, for reference.

With possessive duplicates stripped

./mk-scowl-dict.sh
Created 'scowl-words-10.txt' (3969 words)
Created 'scowl-words-20.txt' (10746 words)
Created 'scowl-words-35.txt' (38351 words)
Created 'scowl-words-40.txt' (43394 words)
Created 'scowl-words-50.txt' (70703 words)
Created 'scowl-words-55.txt' (76203 words)
Created 'scowl-words-60.txt' (86102 words)
Created 'scowl-words-70.txt' (126304 words)
Created 'scowl-words-80.txt' (273607 words)
Created 'scowl-words-95.txt' (501583 words)

With possessive duplicates retained

./mk-scowl-dict.sh -p
Created 'scowl-words-10.txt' (4405 words)
Created 'scowl-words-20.txt' (12359 words)
Created 'scowl-words-35.txt' (48853 words)
Created 'scowl-words-40.txt' (55962 words)
Created 'scowl-words-50.txt' (98984 words)
Created 'scowl-words-55.txt' (105210 words)
Created 'scowl-words-60.txt' (119259 words)
Created 'scowl-words-70.txt' (161369 words)
Created 'scowl-words-80.txt' (333866 words)
Created 'scowl-words-95.txt' (644673 words)
@Bunch0fAtoms

Works, thanks!
