drnic (owner)

gist: 211438
public
Description:
A quick app built from scratch and deployed to Heroku in 50 minutes (including Twitter-based bug fixing): http://page-stripper.heroku.com/
Public Clone URL: git://gist.github.com/211438.git
app/controllers/home_controller.rb
require 'open-uri' # Kernel#open only fetches URLs once open-uri is loaded

class HomeController < ApplicationController
  def index
    if (@target_url = params["url"]) && !@target_url.blank?
      # \A anchors at the start of the string (^ also matches after a newline); https is accepted too
      @target_url = @target_url =~ %r{\Ahttps?://} ? @target_url : "http://#{@target_url}"
      filter_if_length_less_than = 40
      @page = open(@target_url).read
      doc = Nokogiri::HTML.parse(@page)
      content = doc.search("h1,p,.comment")
      content = content.reject { |node| node.text.gsub(/\W/,'').strip.length < filter_if_length_less_than }
      content = content.reject { |node| (%w[noscript li] & node.ancestors.map { |e| e.name }).length > 0 }
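      # Escape the extracted text so markup-like characters in the page cannot inject HTML into the view
      @contents = content.map { |e| e.text }.join("\n").split(/\n+/).map { |e| "<p>#{ERB::Util.html_escape(e.strip)}</p>" }.join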
    end
  rescue Exception => exception
    log_error(exception) if logger
    erase_results if performed?
    flash.now[:notice] = "Bad bad things happened without cause"
  end
  
end
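
Two steps of the controller above, the URL normalization and the "long enough to be real content" filter, can be tried without Rails or a network connection. The following is a minimal sketch in plain Ruby; the names `normalize_url`, `substantial?` and `MIN_WORD_CHARS` are hypothetical and do not appear in the gist. The sketch accepts both `http://` and `https://` prefixes and anchors the match at the start of the string.

```ruby
# Hypothetical helpers (not part of the gist) mirroring two steps of
# HomeController#index.

# Same threshold as filter_if_length_less_than in the controller.
MIN_WORD_CHARS = 40

# Returns nil for blank input; otherwise prefixes "http://" unless the
# string already starts with an http:// or https:// scheme.
def normalize_url(raw)
  url = raw.to_s.strip
  return nil if url.empty?
  url =~ %r{\Ahttps?://} ? url : "http://#{url}"
end

# True when the text has at least `min` word characters (letters, digits,
# underscore) once everything else is stripped out, i.e. the node would
# survive the controller's first reject.
def substantial?(text, min = MIN_WORD_CHARS)
  text.gsub(/\W/, '').length >= min
end

puts normalize_url("www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html")
puts substantial?("Read more")   # short navigation-style text
puts substantial?("x" * 40)      # exactly at the threshold
```

Counting only word characters means that whitespace and punctuation do not help a short fragment pass the filter, which is presumably why the controller strips `\W` before measuring the length.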
 
app/views/home/index.html.haml
%h1 Extract real content from any page
 
- if flash[:notice]
  .flash
    = flash[:notice]
 
%p
  Using url:
  = @target_url
 
%form{:action => '/', :method => 'GET'}
  %fieldset
    %ol
      %li
        %label{:for => 'url'} URL
        %input{:id => 'url', :name => 'url', :style => 'width: 30em'}
        %input{:type => 'submit', :value => 'Strip'}
 
.contents
  = @contents
features/strip_content.feature
Feature: Strip content
  In order to send real content to KIM
  As an imindi user
  I want to pull out real content from any web page
 
  Scenario: Parse http://www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html
    Given I am on the home page
    When I fill in "URL" with "http://www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html"
    When I press "Strip"
    Then I should see "Using url:"
    And I should see "http://www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html"
    And I should not see "Bad bad things happened without cause"
  
  Scenario: Parse www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html
    Given I am on the home page
    When I fill in "URL" with "www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html"
    When I press "Strip"
    Then I should see "Using url:"
    And I should see "http://www.cnn.com/2009/US/10/15/arizona.sweat.lodge/index.html"
    And I should not see "Bad bad things happened without cause"