@lxndrv
Last active March 19, 2022 03:05
Ngram in bash with awk
cat << 'EOF' > ngram.sh
# Helper: count identical lines on stdin and list the most frequent first
function soniq(){
sort | uniq -c | sort -nr
}
#Usage: cat file.txt | ngram 3 | soniq | head -20
function ngram(){
local NGRAM_COUNT=${1:-1}
sed -r 's/[^[:alnum:]]/ /g' | awk '{print tolower($0)}' \
| sed -e 's/ /\n/g' | sed '/^$/d' \
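| awk -v nn="$NGRAM_COUNT" '{
    # Shift the window of the last nn words, then store the current word.
    for (i = nn; i > 1; i--) arr[i] = arr[i-1]
    arr[1] = $0
    # Print only once the window is full, oldest word first.
    if (NR >= nn) {
      line = arr[nn]
      for (i = nn - 1; i >= 1; i--) line = line " " arr[i]
      print line
    }
  }'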
}
EOF
source ngram.sh
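
To sanity-check the idea on a tiny input, here is a stand-alone sketch of the same pipeline: tokenize, slide a window of n words, then count. It is not the gist's code. `ngram_sketch` and `count_desc` are hypothetical names chosen for illustration, and the tokenizer is swapped for `tr`, which is assumed to be good enough for ASCII text only.

```shell
#!/usr/bin/env bash
# Hypothetical stand-alone sketch: tokenize, slide a window, count.

# Count identical lines on stdin, most frequent first.
count_desc() {
  sort | uniq -c | sort -nr
}

# Print every n-gram (n = $1, default 1) of the lowercased
# alphanumeric words read from stdin, one n-gram per line.
ngram_sketch() {
  local n="${1:-1}"
  tr -cs '[:alnum:]' '\n' | tr '[:upper:]' '[:lower:]' | awk -v n="$n" '
    NF == 0 { next }                 # skip the empty token a leading separator can produce
    {
      win[++count] = $0              # append the current word
      if (count < n) next            # window not full yet
      line = win[count - n + 1]      # oldest word in the window
      for (i = count - n + 2; i <= count; i++) line = line " " win[i]
      print line
      delete win[count - n + 1]      # drop the word that just left the window
    }'
}

printf 'The cat sat. The cat ran.\n' | ngram_sketch 2 | count_desc
```

On that sentence the first line of output should report a count of 2 for `the cat`; the other three bigrams each occur once. Emitting an n-gram only when the window is full means an input with fewer than n words produces no output at all.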