@rodrigoalviani
Forked from apeyroux/crawler haskell
Created January 5, 2016 10:04
-- | URL docs: http://hackage.haskell.org/package/url-2.1.3/docs/Network-URL.html
module Page where

import Network.URL
import Network.Curl
import Text.XML.HXT.Core
import Text.HandsomeSoup
-- | A crawled page: its title, raw HTML body, parsed URL and crawl statistics.
data Page = Page
  { title   :: String
  , content :: String
  , url     :: Maybe URL
  , stat    :: Stat
  } deriving Show

-- | Crawl statistics; dwltime holds the download time.
data Stat = Stat
  { dwltime :: String
  } deriving Show
-- | Things that can be ranked in an index.
class Indexable i where
  rank      :: i -> Integer
  backlinks :: [i] -> Integer
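
-- A hypothetical instance, not part of the original gist: one way Page could
-- satisfy Indexable. The constant rank and the length-based backlink count
-- are placeholder assumptions for illustration only.
instance Indexable Page where
  rank _    = 0
  backlinks = fromIntegral . length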
-- | Extract the text of the first <title> element from an HTML document,
-- or the empty string when no title is found.
html2title :: String -> IO String
html2title h = (runX $ doc >>> css "title" /> getText) >>= return . getTitle
  where
    doc        = readString [withParseHTML yes, withWarnings no] h
    getTitle t = if null t then "" else head t
-- | Download a URL (following redirects) and build a Page from the response.
-- The download time is not measured yet, hence the placeholder Stat "0".
crawlurl :: String -> IO Page
crawlurl u = do
  r <- (curlGetResponse_ u [CurlFollowLocation True] :: IO CurlResponse)
  t <- html2title (respBody r)
  return $ Page t (respBody r) (importURL u) (Stat "0")