@svdberg
Created March 9, 2015 20:45
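{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Digest.Pure.MD5 (md5)
import System.Environment (getArgs)
import qualified Data.ByteString.Lazy as LBS
import qualified Data.ByteString.Char8 as C

-- Read the file named by the first argument, replace every field that
-- looks like an e-mail address with its quoted MD5 hash, and write the
-- result to temp.txt.
main :: IO ()
main = do
  args <- getArgs
  let filename = head args
  tsvData <- C.readFile filename
  let lns = C.lines tsvData
      mdLines = map splitByTabAndMd5 lns
  C.writeFile "temp.txt" $ C.unlines mdLines

-- Note: C.words splits on any whitespace, not only on tabs, and every
-- field is written back followed by a single space.
splitByTabAndMd5 :: C.ByteString -> C.ByteString
splitByTabAndMd5 s = C.concat $ map (flip C.snoc ' ' . md5IfEmail) $ C.words s

-- Hash a field only if it contains an '@'.
md5IfEmail :: C.ByteString -> C.ByteString
md5IfEmail s = if C.isInfixOf "@" s then emailToMD5 s else s

-- Hex MD5 of the field without its surrounding quotes, wrapped in quotes again.
emailToMD5 :: C.ByteString -> C.ByteString
emailToMD5 s = C.concat ["\"", C.pack $ show $ md5 $ LBS.fromStrict $ stripQuotes s, "\""]

-- Drop the first and last character (assumed to be the surrounding quotes).
stripQuotes :: C.ByteString -> C.ByteString
stripQuotes s = C.take l $ C.drop 1 s
  where
    l = C.length s - 2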
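
The name splitByTabAndMd5 suggests tab-separated fields, while the code uses C.words, which splits on any whitespace. If strict tab handling is what was intended (an assumption, the gist does not say), a minimal sketch could look like the following. The helper name mapTabFields is hypothetical and not part of the gist; it takes the per-field transformation as an argument, so md5IfEmail could be passed in.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Char8 as C

-- | Hypothetical helper: apply a transformation to every tab-separated
-- field of one line and join the results with tabs again, so empty
-- fields and fields containing spaces keep their position.
mapTabFields :: (C.ByteString -> C.ByteString) -> C.ByteString -> C.ByteString
mapTabFields f = C.intercalate "\t" . map f . C.split '\t'

main :: IO ()
main =
  -- C.reverse stands in for md5IfEmail to keep the sketch dependency-free.
  C.putStrLn (mapTabFields C.reverse "ab\tc d\t\tef")
```

Compared with C.words plus a trailing space, this keeps the output a valid TSV line, at the cost of no longer collapsing runs of whitespace.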
svdberg commented Mar 9, 2015

File: MPA_users_all.txt: Little-endian UTF-16 Unicode text, with CRLF line terminators
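
Since the input is reported as UTF-16LE with CRLF line endings, the byte-oriented Data.ByteString.Char8 functions above may not split it as intended: every ASCII character is followed by a zero byte. One possible approach, sketched here under the assumption that the text package is available (the gist does not use it), is to decode the bytes first. The name utf16leLines is hypothetical.

```haskell
module Main where

import qualified Data.ByteString as BS
import Data.Maybe (fromMaybe)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import System.Environment (getArgs)

-- | Hypothetical helper: decode UTF-16LE bytes, drop a leading
-- byte-order mark if present, and split into lines with the '\r'
-- of each CRLF terminator removed.
-- Note: TE.decodeUtf16LE throws an exception on invalid input.
utf16leLines :: BS.ByteString -> [T.Text]
utf16leLines =
  map (T.dropWhileEnd (== '\r')) . T.lines . dropBom . TE.decodeUtf16LE
  where
    dropBom t = fromMaybe t (T.stripPrefix (T.singleton '\xFEFF') t)

main :: IO ()
main = do
  args <- getArgs
  case args of
    [path] -> do
      raw <- BS.readFile path
      -- Only report the line count; further processing would go here.
      print (length (utf16leLines raw))
    _ -> putStrLn "usage: pass the path of a UTF-16LE file"
```

The decoded lines are Text values, so the e-mail fields would need TE.encodeUtf8 (or a similar conversion) before being handed to md5.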
