@drnic
Created October 16, 2009 00:40
A quick app built from scratch and deployed to Heroku in 50 minutes (including Twitter-based bug fixing): http://page-stripper.heroku.com/
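require 'open-uri' # lets Kernel#open fetch URLs (use URI.open on Ruby 3.0+)
require 'nokogiri'

class HomeController < ApplicationController
  def index
    if (@target_url = params["url"]) && !@target_url.blank?
      # Prepend a scheme when the user omitted one; accept both http and https.
      @target_url = @target_url =~ %r{\Ahttps?://} ? @target_url : "http://#{@target_url}"
      filter_if_length_less_than = 40
      @page = open(@target_url).read
      doc = Nokogiri::HTML.parse(@page)
      # Candidate content nodes: headings, paragraphs, and comment blocks.
      content = doc.search("h1,p,.comment")
      # Drop nodes with fewer than 40 word characters.
      content = content.reject { |node| node.text.gsub(/\W/,'').strip.length < filter_if_length_less_than }
      # Drop nodes nested inside <noscript> or <li>.
      content = content.reject { |node| (%w[noscript li] & node.ancestors.map { |e| e.name }).length > 0 }
      # Split the remaining text on newlines and wrap each line in its own <p>.
      @contents = content.map { |e| e.text }.join("\n").split(/\n+/).map { |e| "<p>#{e.strip}</p>" }.join
    end
  rescue Exception => exception
    # Any failure (bad URL, network error, parse error) ends up in a flash notice.
    log_error(exception) if logger
    erase_results if performed?
    flash.now[:notice] = "Bad bad things happened without cause"
  end
end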
%h1 Extract real content from any page
- if flash[:notice]
  .flash
    = flash[:notice]
%p
  Using url:
  = @target_url
%form{:action => '/', :method => 'GET'}
  %fieldset
    %ol
      %li
        %label{:for => 'url'} URL
        %input{:id => 'url', :name => 'url', :style => 'width: 30em'}
    %input{:type => 'submit', :value => 'Strip'}
.contents
  = @contents
Feature: Strip content
  In order to send real content to KIM
  As an imindi user
  I want to pull out real content from any web page

  Scenario: Parse http://www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html
    Given I am on the home page
    When I fill in "URL" with "http://www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html"
    When I press "Strip"
    Then I should see "Using url:"
    And I should see "http://www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html"
    And I should not see "Bad bad things happened without cause"

  Scenario: Parse www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html
    Given I am on the home page
    When I fill in "URL" with "www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html"
    When I press "Strip"
    Then I should see "Using url:"
    And I should see "http://www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html"
    And I should not see "Bad bad things happened without cause"
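
The two scenarios above exercise the controller's scheme handling and its final paragraph-wrapping step. As a rough sketch, those two steps can be pulled out into plain Ruby and checked without Rails or a network call. The helper names `normalize_url` and `wrap_paragraphs` are hypothetical and do not appear in the gist, and accepting `https://` as well as `http://` is an assumption of this sketch.

```ruby
# Hypothetical helper (not part of the gist): returns the URL with a scheme,
# prepending "http://" when the user left it out. Returns nil for blank input.
def normalize_url(raw)
  url = raw.to_s.strip
  return nil if url.empty?
  url =~ %r{\Ahttps?://} ? url : "http://#{url}"
end

# Hypothetical helper (not part of the gist): mirrors the controller's last
# step, splitting the extracted text on newlines and wrapping each line in <p>.
def wrap_paragraphs(texts)
  texts.join("\n").split(/\n+/).map { |line| "<p>#{line.strip}</p>" }.join
end

puts normalize_url("www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html")
puts wrap_paragraphs(["First paragraph.\n\nSecond paragraph.", "  Third.  "])
```

Keeping these steps as small pure functions would let the scheme-less case be covered by a fast unit test, leaving the Cucumber scenarios to cover the full request.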