@takaki
Created October 3, 2012 04:50
HXT, Text Node, Shift-JIS
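import Codec.Binary.UTF8.String (decode)
import Codec.Text.IConv (convert)
import Data.List (isInfixOf)
import Text.XML.HXT.Core
import qualified Data.ByteString.Lazy as BSL

main :: IO ()
main = do
  -- Read the Shift-JIS (CP932) encoded HTML file as raw bytes.
  cs <- BSL.readFile "4731398C.html"
  -- Transcode the bytes to UTF-8, then decode them into a Unicode String.
  let u8s  = convert "CP932" "UTF-8" cs
      html = decode (BSL.unpack u8s)
      doc  = readString [withParseHTML yes, withWarnings no] html
  -- Collect every text node below the root that contains the character "日".
  nodes <- runX $ doc //> hasText (isInfixOf "日") >>> getText
  mapM_ putStrLn nodes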