@jeroenjanssens
Created April 25, 2014 02:17
Get top N words from STDIN using Bash, Python, and R. All three scripts produce the same output, but R scales very badly w.r.t. input size. What am I doing wrong?
#!/usr/bin/env python
import re
import sys
from collections import Counter

num_words = int(sys.argv[1])
text = sys.stdin.read().lower()
# findall, unlike re.split, never yields empty strings at the text boundaries
words = re.findall(r'\w+', text)
cnt = Counter(words)
for word, count in cnt.most_common(num_words):
    # parenthesized form runs under both Python 2 and Python 3
    print("%8d %s" % (count, word))
#!/usr/bin/env Rscript
num.words <- as.integer(commandArgs(trailingOnly = TRUE))
f <- file("stdin")
input.lines <- readLines(f)
close(f)
full.text <- tolower(paste(input.lines, collapse = " "))
splits <- gregexpr("\\w+", full.text)
words.all <- (regmatches(full.text, splits)[[1]])
# table() names the word column after the variable, i.e. words.all
words.unique <- as.data.frame(table(words.all))
words.sorted <- words.unique[order(-words.unique$Freq),]
dummy <- mapply(function(w, c) {
  cat(sprintf("%8d %s\n", c, w))
}, head(words.sorted$words.all, num.words), head(words.sorted$Freq, num.words))
#!/usr/bin/env bash
NUM_WORDS="$1"
tr '[:upper:]' '[:lower:]' |
grep -oE '\w+' |
sort |
uniq -c |
sort -nr |
head -n "$NUM_WORDS"
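
To put numbers on the "scales very badly" observation, a small timing harness can help. The following is only a sketch, not part of the original gist: `make_input`, `time_command`, and `report` are hypothetical helper names, the input is synthetic (a fixed vocabulary repeated), and the sizes are arbitrary. It assumes Python 3.5+ for `subprocess.run`.

```python
import subprocess
import sys
import time


def make_input(num_words, vocab_size=1000):
    """Build deterministic synthetic text: w0, w1, ... cycling through the vocabulary."""
    return " ".join("w%d" % (i % vocab_size) for i in range(num_words)) + "\n"


def time_command(cmd, text):
    """Run cmd (a list of arguments) with text on STDIN; return (seconds, stdout)."""
    start = time.perf_counter()
    result = subprocess.run(cmd, input=text.encode("utf-8"),
                            stdout=subprocess.PIPE, check=True)
    elapsed = time.perf_counter() - start
    return elapsed, result.stdout.decode("utf-8")


def report(cmd, sizes=(10000, 20000, 40000, 80000)):
    """Print wall-clock time of cmd for inputs of doubling size."""
    for size in sizes:
        elapsed, _ = time_command(cmd, make_input(size))
        print("%8d words  %.3f s" % (size, elapsed))
```

Calling, for example, `report(["./top-words.R", "10"])` (the file name is an assumption; use whatever the scripts are saved as) prints one line per input size. If the time roughly doubles when the input doubles, the script is scaling linearly; a much larger jump points at the kind of problem described above. Note that the measured time includes interpreter start-up.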
@leondutoit

Stumbled upon this gist via a tweet. Interesting... and the following snippet might be something worth considering for R. Although, if you are intent on using only the standard library then this is probably not suitable since it relies on dplyr.

#!/usr/bin/Rscript --vanilla
num_words <- as.integer(commandArgs(trailingOnly = TRUE))
suppressMessages(library(dplyr))
input_lines <- readLines("stdin", warn = FALSE)
tbl_df(data.frame(words = tolower(unlist(strsplit(input_lines, " "))))) %>%
  group_by(words) %>%
  summarise(occurrence = n()) %>%
  arrange(desc(occurrence)) %>%
  do(head(., num_words))

No idea how it scales in comparison to your example, but worth a shot :) Usage: cat something | ./file_name <num_words>