@lusis
Created September 27, 2011 15:50
Example of using Goose from JRuby
# Run under JRuby; the 'java' library provides include_package.
require 'rubygems'
require 'java'
require 'chronic'

# Load every jar under lib/jars so the Goose classes are on the classpath.
libs = []
libs << "lib/jars/*.jar"
libs.each do |lib|
  Dir[lib].each do |jar|
    puts "loading #{jar}"
    require jar
  end
end

# Expose the Goose Java packages as Ruby modules.
module Maverick
  include_package "com.gravity.goose"
end

module MaverickExtractors
  include_package "com.gravity.goose.extractors"
end

# Custom publish-date extractor: reads the date text from the page
# and parses it with Chronic.
class MyRubyDateExtractor < MaverickExtractors::PublishDateExtractor
  def extract(rawdoc)
    pub_date = rawdoc.select("div[class=submitted]").text
    Chronic.parse(pub_date).to_java
  end
end

# Configure Goose: local storage path, no image fetching, custom date extractor.
@config = Maverick::Configuration.new
@config.local_storage_path = "./tmp"
@config.enable_image_fetching = false
@config.publish_date_extractor = MyRubyDateExtractor.new

url = "http://www.hollyscoop.com/paris-hilton/britney-shows-us-her-assets.html"
@goose = Maverick::Goose.new(@config)
@article = @goose.extract_content(url)

# Print a short summary of the extracted article.
disp = <<EOT
Article title: #{@article.title}
Article pubdate: #{@article.publish_date}
Article tags: #{@article.meta_keywords}
Article:
------------------------------------------
#{@article.cleaned_article_text}
EOT
puts disp
lib/jars/
|-- akka-actor-1.1.3.jar
|-- akka-typed-actor-1.1.3.jar
|-- commons-codec-1.4.jar
|-- commons-io-2.0.1.jar
|-- commons-lang-2.6.jar
|-- commons-logging-1.1.1.jar
|-- goose-2.1.0.jar
|-- httpclient-4.1.2.jar
|-- httpcore-4.1.2.jar
|-- jsoup-1.5.2.jar
|-- log4j-1.2.16.jar
|-- scala-library-2.9.0-1.jar
|-- slf4j-api-1.6.1.jar
`-- slf4j-log4j12-1.6.1.jar
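Since the script simply requires whatever matches lib/jars/*.jar, a missing jar only shows up later as a Java class-loading error. A minimal sketch of a pre-flight check follows, assuming you want to fail early instead. The `missing_jars` helper and the `REQUIRED_JARS` constant are hypothetical names that are not part of the original gist, and the list is deliberately a subset of the jars shown above.

```ruby
# Hypothetical helper, not part of the original gist: reports which of the
# expected jar files are absent from a directory.
REQUIRED_JARS = %w[
  goose-2.1.0.jar
  jsoup-1.5.2.jar
  scala-library-2.9.0-1.jar
].freeze

# Returns the names from `required` that have no matching file in `dir`.
def missing_jars(dir, required = REQUIRED_JARS)
  present = Dir[File.join(dir, "*.jar")].map { |path| File.basename(path) }
  required - present
end

missing = missing_jars("lib/jars")
if missing.empty?
  puts "all required jars found"
else
  warn "missing jars: #{missing.join(', ')}"
end
```

The check only compares file names, so it will not notice a jar that is present but is the wrong version of a dependency under a different name.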
@erraggy

erraggy commented Sep 27, 2011

So glad you got this hooked up @lusis!

@grangier

How did you build the goose-2.1.0.jar?

@lusis
Author

lusis commented Dec 26, 2011

I'm assuming you have Maven installed here.

Check out the goose source tree and run mvn clean package. This will leave two jar files in the target directory: one is the sources, the other is the jar you want.

Since I originally wrote this, it's possible some of the dependencies have changed. I just did a quick check and the versions look the same. You can run mvn dependency:tree | grep compile to see what jars you'll need. If you ran the build, they'll all have been downloaded.

The best way to just grab them all is to run mvn dependency:copy-dependencies. This will shove them all in target/dependency for you.
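To compare what Maven reports against the contents of lib/jars, the compile-scoped entries from mvn dependency:tree can be turned into jar file names. The sketch below assumes the usual groupId:artifactId:type:version:scope coordinate layout (with an optional classifier before the version). The `compile_jars` helper is a hypothetical name, and `SAMPLE_TREE` is made-up text for illustration, not real output from the goose build.

```ruby
# Made-up sample standing in for `mvn dependency:tree` output (illustration only).
SAMPLE_TREE = <<~'TREE'
  [INFO] com.gravity:goose:jar:2.1.0
  [INFO] +- org.jsoup:jsoup:jar:1.5.2:compile
  [INFO] +- commons-io:commons-io:jar:2.0.1:compile
  [INFO] \- junit:junit:jar:4.8.1:test
TREE

# Hypothetical helper: returns the jar file names for compile-scoped
# dependencies found in dependency:tree output.
def compile_jars(tree_output)
  jars = tree_output.each_line.map do |line|
    # The coordinate is the token with several colon-separated fields.
    coords = line.split.find { |token| token.count(":") >= 3 }
    parts = coords.to_s.split(":")
    next unless parts.size >= 5 && parts[2] == "jar" && parts.last == "compile"

    artifact   = parts[1]
    version    = parts[-2]
    classifier = parts.size == 6 ? "-#{parts[3]}" : ""
    "#{artifact}-#{version}#{classifier}.jar"
  end
  jars.compact
end

puts compile_jars(SAMPLE_TREE)
```

In practice you would feed it the captured command output, for example `compile_jars(File.read("tree.txt"))` after redirecting mvn dependency:tree to a file (the file name here is just an example).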

@grangier

@lusis thank you very much for your detailed answer. Everything works as expected. Very useful gist!
