Skip to content

Instantly share code, notes, and snippets.

@dimchansky
Created April 4, 2015 10:54
Show Gist options
  • Save dimchansky/21210df6a90ace2b4e70 to your computer and use it in GitHub Desktop.
Save dimchansky/21210df6a90ace2b4e70 to your computer and use it in GitHub Desktop.
Counts the number of times each word appears in a (non-unicode) file and outputs result to other file sorting by frequencies and then by word (asc).
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TupleSections #-}
module Main where
import qualified Data.ByteString.Lazy as S
import qualified Data.ByteString.Lazy.Builder as SB
import qualified Data.ByteString.Lazy.Char8 as C
import Data.Foldable (foldMap)
import qualified Data.HashMap.Strict as HM
import Data.List (sortBy)
import Data.Monoid ((<>))
import Data.Word8 (isAlpha, toLower)
main :: IO ()
main =
S.writeFile "out.txt"
. SB.toLazyByteString
. foldMap outLineBuilder
. sortBy sndThenFst
. countFrequencies
. splitWords =<< S.readFile "inp.txt"
where outLineBuilder (w,f) = SB.intDec f <> "\t" <> SB.lazyByteString w <> "\n"
sndThenFst (x1, y1) (x2, y2) = compare y1 y2 <> compare x1 x2
countFrequencies = HM.toList . HM.fromListWith (+) . map (,1)
makeLowAlphaOrNL w = if isAlpha w then toLower w else 10
splitWords = filter (not . S.null) . C.lines . S.map makeLowAlphaOrNL
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment