@nicolaferraro
Created May 5, 2015 14:38
CrawlerService
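package it.eng.scala.crawl

import org.jsoup.Jsoup
// JavaConverters (explicit .asScala) replaces the deprecated implicit JavaConversions.
import scala.collection.JavaConverters._

object CrawlerService {

  val AbsolutePrefix = "http://en.wikipedia.org/wiki/"
  val RelativePrefix = "/wiki/"
  val AbsolutePath = "http://en.wikipedia.org"

  // Fetches `address` and returns (source page, absolute target URL) pairs
  // for every anchor on the page whose href points under /wiki/.
  def scanLinks(address: String): List[(String, String)] =
    Jsoup
      .connect(address)
      .get()
      .select("a[href]")
      .asScala.toList
      .map(e => (address, e.attr("href")))
      // Keep only wiki links, whether absolute or site-relative.
      .filter { case (_, href) => href.startsWith(AbsolutePrefix) || href.startsWith(RelativePrefix) }
      // Expand site-relative links to absolute URLs; leave absolute ones untouched.
      .map {
        case (source, href) if href.startsWith(RelativePrefix) => (source, AbsolutePath + href)
        case pair => pair
      }
}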
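
The filter-and-expand step inside `scanLinks` can be tried without any network access by pulling it out into a pure function. The sketch below is only an illustration: `LinkNormalizer`, `normalize` and `normalizeAll` are hypothetical names that do not appear in the gist, and the three prefix constants are copied from `CrawlerService` so the block stands alone.

```scala
// Hypothetical helper object, not part of the original gist.
object LinkNormalizer {
  // Constants copied from CrawlerService so this sketch is self-contained.
  val AbsolutePrefix = "http://en.wikipedia.org/wiki/"
  val RelativePrefix = "/wiki/"
  val AbsolutePath = "http://en.wikipedia.org"

  // Returns the absolute URL for a wiki link, or None for any other href.
  def normalize(href: String): Option[String] =
    if (href.startsWith(AbsolutePrefix)) Some(href)
    else if (href.startsWith(RelativePrefix)) Some(AbsolutePath + href)
    else None

  // Pairs each surviving link with the page it was found on,
  // mirroring the (source, target) tuples that scanLinks returns.
  def normalizeAll(address: String, hrefs: List[String]): List[(String, String)] =
    hrefs.flatMap(href => normalize(href).map(target => (address, target)))
}
```

Calling `LinkNormalizer.normalizeAll(page, hrefs)` with a hand-written list of href strings should give the same pairs that `scanLinks` would produce for a page containing those links, assuming the prefixes are kept in sync with `CrawlerService`.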