@albert-decatur
Last active August 29, 2015 14:24

# identifying stopwords using ngrams and vowels

  • we have: street names in Richmond, VA
  • we want to: match street names to Confederate generals and Civil Rights leaders
  • first we must: remove small pesky elements of names that are irrelevant to the match
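Once a stopword list exists, the removal step itself is small. Here is a minimal sketch, assuming a hypothetical file with one stopword per line; the function name `strip_stopwords` and that file layout are illustrative and not part of the original gist.

```shell
# strip_stopwords STOPFILE
# Reads street names on stdin and prints each name with stopwords removed.
# STOPFILE is a hypothetical, non-empty file with one stopword per line.
# Matching is case-insensitive; the original casing of kept words is preserved.
strip_stopwords() {
  awk '
    # first file: load the stopword list into a lookup table
    NR == FNR { stop[tolower($1)] = 1; next }
    # stdin: rebuild each name from the words that are not stopwords
    {
      out = ""
      for (i = 1; i <= NF; i++)
        if (!(tolower($i) in stop))
          out = (out == "" ? $i : out " " $i)
      print out
    }
  ' "$1" -
}
```

For example, `strip_stopwords stopwords.txt < names.txt` would print the cleaned names, where both file names are placeholders.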

the solution: ngrams to build a stopword list!

prerequisites

# name of input plain text file - in this case, list of TIGER/Line road names in Richmond, VA
in=tiger.csv
# set max string length to be considered a stopword  
n=6
# identify unigrams from the input file
ngrams "$in" 1 |\
# get two column TSV of frequency, unigram
sortfreq |\
# strip out header
sed '1d' |\
# keep any unigram whose length is <= n, OR that contains no lowercase vowels (the vowel test is case-sensitive)
awk -v n="$n" '{ if (length($2) <= n || $2 ~ /^[^aeiou]*$/) print $0 }' |\
# get top 40 most frequent
head -n 40 |\
# now use ggplot to make a graph of our results
# the graph is 6in in height and 9in wide, with size 20 title font!
plotbars stopword count "Potential Stopwords Identified Using ngrams\nTIGER/Line Road Names, Richmond, VA" 20 6 9
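The `ngrams`, `sortfreq`, and `plotbars` commands above are helper scripts that are not defined in this gist. As a rough sketch, assuming `ngrams ... 1` emits one word per line and `sortfreq` produces a frequency table sorted in descending order, the counting and filtering steps could be approximated with standard tools only. The function name `stopword_candidates`, its arguments, and the case folding are assumptions for illustration; the plotting step is left out.

```shell
# stopword_candidates [MAX_LEN] [TOP]
# Hypothetical stand-in for the ngrams | sortfreq | awk | head steps.
# Reads text on stdin and prints "count<TAB>word", most frequent first.
#   MAX_LEN  longest word that still counts as a candidate (default 6)
#   TOP      number of candidates to keep (default 40)
stopword_candidates() {
  max_len="${1:-6}"
  top="${2:-40}"
  tr -cs '[:alpha:]' '\n' |        # split into one word per line
    tr '[:upper:]' '[:lower:]' |   # fold case (an assumption, not in the gist)
    sed '/^$/d' |                  # drop empty lines
    sort | uniq -c |               # count each distinct word
    sort -k1,1nr -k2,2 |           # highest count first, ties alphabetical
    awk -v n="$max_len" '
      # same test as the gist: short words, or words with no vowels
      length($2) <= n || $2 ~ /^[^aeiou]*$/ { print $1 "\t" $2 }
    ' |
    head -n "$top"
}
```

Calling `stopword_candidates 6 40 < tiger.csv` would mirror the settings used above. Folding case before counting means variants like "St" and "ST" are tallied together, which may or may not match what the original helpers do.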

results!
