@azugxi7374
Last active August 29, 2015 13:56
Web scraping from Scala with jsoup

Use jsoup, a Java library.
http://jsoup.org/

// build.sbt
libraryDependencies += "org.jsoup" % "jsoup" % "1.7.3"

// Scala source
import org.jsoup._
import collection.JavaConverters._ // provides .asScala on Java collections

/////////////////////////////
// Fetch HTML from a URL
val urlstr = "https://gist.github.com/ixxa/9192435"
val doc = Jsoup.connect(urlstr).get

////////////////////////////
// Parse and rewrite the HTML document
// head and body elements
val (head, body) = (doc.head, doc.body)

// Contents
val textContent = body.text // text only, tags stripped
val htmlContent = body.html // innerHTML

// Various selectors (.head throws if nothing matches)
val tag1 = doc.select(".hogeclass#hogeid").asScala.head // class and ID
val tag2 = doc.select("""link[rel^=style]""").asScala.head // tag name and attribute prefix

// Rewriting
tag1.html("hogehoge")
tag2.attr("href", "hoge.css")

//////////////////////////////
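// This also works for resources other than .html
// cssURL is assumed to be a String holding the stylesheet's URL (not defined in this snippet)
val cssdoc = Jsoup.connect(cssURL).get.body.text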
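
The last line above runs the response through the HTML parser and then calls `body.text`, which normalizes whitespace, so a stylesheet comes back with its line breaks collapsed. If the untouched response body is what you want, one possible alternative is to skip the parser. The sketch below is not from the original gist: `RawFetch` and `fetchRaw` are hypothetical names, and it assumes the jsoup `Connection` methods `ignoreContentType`, `execute`, and `Response.body`.

```scala
import org.jsoup.Jsoup

object RawFetch {
  // Hypothetical helper (not part of the original gist): returns the
  // response body as-is, without running it through the HTML parser.
  def fetchRaw(url: String): String =
    Jsoup.connect(url)
      .ignoreContentType(true) // also accept content types jsoup would otherwise refuse
      .execute()               // perform the request and return a Connection.Response
      .body()                  // raw response body as a String
}
```

With this in place, `val css = RawFetch.fetchRaw(cssURL)` would keep the stylesheet's original formatting, again assuming `cssURL` holds a valid URL string.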

Reading the Cookbook on the official site covers most of what you need.
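
To try the selector and rewrite calls without a network connection, the same steps can be run against an inline string via `Jsoup.parse`. This is a minimal sketch: the sample markup, the `JsoupOfflineDemo` object, and the `div.missing` selector are invented for illustration. It also shows `headOption` as one way to avoid the exception that `.asScala.head` throws when a selector matches nothing.

```scala
import org.jsoup.Jsoup
import scala.collection.JavaConverters._

object JsoupOfflineDemo {
  // Hypothetical sample markup, invented for this illustration.
  val sampleHtml: String =
    """<html>
      |<head><link rel="stylesheet" href="old.css"></head>
      |<body><div class="hogeclass" id="hogeid">before <b>bold</b></div></body>
      |</html>""".stripMargin

  // Returns (text before rewrite, text after rewrite, new href, lookup of a missing element).
  def demo(): (String, String, String, Option[String]) = {
    val doc = Jsoup.parse(sampleHtml)

    // Same selectors as in the snippet above; both match the sample markup.
    val div = doc.select(".hogeclass#hogeid").asScala.head
    val link = doc.select("link[rel^=style]").asScala.head

    val before = div.text()        // text with tags stripped
    div.html("hogehoge")           // replace the inner HTML
    link.attr("href", "hoge.css")  // overwrite an attribute

    // headOption yields None instead of throwing when nothing matches.
    val missing = doc.select("div.missing").asScala.headOption.map(_.text())

    (before, div.text(), link.attr("href"), missing)
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

On Scala 2.13 and later, `scala.jdk.CollectionConverters._` is the non-deprecated replacement for `JavaConverters`; the older import is kept here to match the snippet above.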
