public
Last active

Knuth/McIlroy word count task using Haskell

  • Download Gist
WordCount.hs
Haskell
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
import qualified System
import qualified Data.List as List
import qualified Data.Char as Char
import qualified Data.HashMap.Strict as HashMap
 
main :: IO ()
main = do
[arg] <- System.getArgs
text <- getContents
let n = read arg
putStr $ createReport n text
 
createReport :: Int -> String -> String
createReport n text =
unlines $
map (\(w, count) -> show count ++ " " ++ w) $
take n $
List.sortBy (flip compare) $
map (\(w, count) -> (count, w)) $
HashMap.toList $
HashMap.fromListWith (+) $
map (\w -> (w, 1)) $
words $
map Char.toLower $
map (\c -> if Char.isLetter c then c else '\n') $
text
WordCount.lhs
Literate Haskell

The task: read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.

Import the standard libraries we use.

Note that I prefer to import modules as qualified, to avoid namespace pollution and confusion. I would rather see "Char.isLetter" in code rather than "isLetter".

import qualified System
import qualified Data.List as List
import qualified Data.Char as Char
import qualified Data.HashMap.Strict as HashMap

The entry point for the command line program involves the IO monad.

main :: IO ()
main = do

Get the command line arguments. For simplicity in this non-production code, we do no error checking for the correct number of arguments. If we do not give one argument, the program will exit with a pattern match failure.

     [arg] <- System.getArgs

Bind "text" to the entire contents of stdin as a String.

     text <- getContents

No error checking here in converting the String "arg" to an Int.

     let n = read arg

Do all the real work, then print out the report.

     putStr $ createReport n text

Generate a report for the n most frequently occurring words in the text.

Read the code backwards in order to see the data flow.

createReport :: Int -> String -> String
createReport n text =

Convert the list of Strings into a single String with a newline terminator for each String.

     unlines $

For each pair, generate a String that is the count followed by a space followed by the word.

     map (\(count, w) -> show count ++ " " ++ w) $

Take only the first n pairs of the list.

     take n $

Sort the list of pairs descending by the count for each word (and secondarily, descending by the word).

     List.sortBy (flip compare) $

Flip the pairs in the list.

     map (\(w, count) -> (count, w)) $

Extract a key/value pair list from the hash map.

     HashMap.toList $

Create a hash map from the list of pairs, with a combining function for each unique word that is just the addition operator on the value. The final result therefore is a hash map in which each word is mapped to its total count.

     HashMap.fromListWith (+) $

For each word, create a pair of it with the initial count 1.

     map (\w -> (w, 1)) $

Split up the text, on white space, into a list of words.

     words $

Make all the text lowercase.

     map Char.toLower $

Replace all non-letters with a newline.

     map (\c -> if Char.isLetter c then c else '\n') $
     text

The function for flipping pairs is called swap.

Please sign in to comment on this gist.

Something went wrong with that request. Please try again.