jrochkind/Explanation.md

## Explanation.md

      
Rails has a handy `truncate` helper (which is actually mostly a method added to `String`), but it warns you that it's not safe to use on HTML source: it'll cut off end tags and such.

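
To make the problem concrete, here is a minimal sketch of what character-based truncation does to markup. It uses plain `String#[]` as a stand-in for the Rails helper (Rails is not loaded here), and the sample `html` string is made up for illustration.

```ruby
# Made-up sample markup, 30 characters long.
html = "<p>Hello <b>bold world</b></p>"

# Cutting by character count knows nothing about tags.
naive = html[0, 15]

puts naive
# prints: <p>Hello <b>bol
# Both <p> and <b> are left open, so the rest of the page inherits them.
```

Any approach that counts raw characters has this problem, which is why the code below counts only displayed text and rebuilds the tag structure around it.
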
What if you want an HTML-safe one? There are a variety of suggested solutions you can google, none of which were quite robust or powerful enough for me.

So I started with my favorite, by Andrea Singh, using Nokogiri.
But:

- I modified it to not monkey-patch Nokogiri, making it a static method instead (sadly making already confusing code yet more confusing, but I didn't want to monkey-patch Nokogiri).
- I made it smarter about putting the mark of omission inside the tag whose text ended up truncated, instead of at the end of the source. This is also not perfect, but works well enough for most common use cases.
- I made it handle a separator option like Rails' `:separator` (spelled `:seperator` in this code). Again, far from perfectly: it will often break at a tag boundary instead of the actual best separator, but in ways that should be good enough for most common use cases (tag boundaries are usually good breaking points too).
- I made the top-level invocation method a Rails helper that handles both HTML-safe truncation and ordinary truncation: if the string is marked html-safe, it uses HTML-safe truncation and returns a string that's still html-safe; otherwise it falls through to the ordinary Rails `truncate`.
- I added some tests (my tests run at the Rails helper method level, because that was convenient for me).
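
The separator handling from the list above can be sketched on a plain string, leaving Nokogiri out of it. This is only an illustration of the text-node step: `truncate_text` is a hypothetical name, not part of the gist, and it mirrors the `rindex` fallback used in the code further down.

```ruby
# Hypothetical helper: truncate plain text to max_length characters,
# preferring to break at the last separator before the cut point.
def truncate_text(text, max_length, omission = "…", separator = nil)
  return text if text.length <= max_length

  # Leave room for the omission marker, never going below zero.
  endpoint = [0, max_length - omission.length].max

  # Fall back to the hard cut when no separator is found before it.
  endpoint = text.rindex(separator, endpoint) || endpoint if separator

  text[0, endpoint] + omission
end

puts truncate_text("67 901234", 5, "…", " ")   # prints: 67…
puts truncate_text("678901234", 5, "…", " ")   # prints: 6789…
```

Because each text node is handled on its own, a separator that sits in an earlier node is never considered, which is why the real code sometimes breaks at a tag boundary instead.
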


See the tests to see what it does and doesn't do. It's not perfect, and there are a variety of different implementation or API choices that could be made, but it's good enough for me, and if others have use cases like mine, possibly better than anything else easily findable on the net.

If there's a lot of interest, I could turn this into an actual gem.

Although ultimately, for use in Rails, what I think should really happen is for this functionality to be added to the Rails HTML `sanitize` helper. Times when you want to sanitize overlap extensively with times when you want to truncate (since both normally deal with HTML as 'input' to your program), and both require an HTML parse. Better to do the HTML parse just once for both functions simultaneously than to do it once for sanitizing and again for truncating. (At the time of writing, Rails sanitize doesn't use Nokogiri, but its own weird HTML parser.)

  
## nokogiri_util.rb
# Nothing in here assumes Rails

require 'nokogiri'

module Util
  # An HTML-safe truncation using nokogiri, based off of:
  # http://blog.madebydna.com/all/code/2010/06/04/ruby-helper-to-cleanly-truncate-html.html
  #
  # but without monkey-patching, and behavior more consistent with Rails
  # truncate.
  #
  # It's hard to get all the edge-cases right, we probably mis-calculate slightly
  # on edge cases, and we aren't always able to strictly respect :seperator, sometimes
  # breaking on tag boundaries instead. But this should be good enough for actual use
  # cases, where those types of incorrect results are still good enough.
  #
  # ruby 1.9 only, in 1.8.7 non-ascii won't be handled quite right.
  #
  # Pass in a Nokogiri node, probably created with Nokogiri::HTML::DocumentFragment.parse(string)
  #
  # Might want to check length of your string to see if, even with HTML tags, it's
  # still under limit, before parsing as nokogiri and passing in here -- for efficiency.
  #
  # Get back a Nokogiri node, call #inner_html on it to go back to a string
  # (and you probably want to call .html_safe on the string you get back for use
  # in rails view)
  def self.nokogiri_truncate(node, max_length, omission = '…', seperator = nil)

    if node.kind_of?(::Nokogiri::XML::Text)
      if node.content.length > max_length
        allowable_endpoint = [0, max_length - omission.length].max
        if seperator
          allowable_endpoint = (node.content.rindex(seperator, allowable_endpoint) || allowable_endpoint)
        end

        ::Nokogiri::XML::Text.new(node.content.slice(0, allowable_endpoint) + omission, node.parent)
      else
        node.dup
      end
    else # DocumentFragment or Element
      return node if node.inner_text.length <= max_length

      truncated_node = node.dup
      truncated_node.children.remove
      remaining_length = max_length

      node.children.each do |child|
        if remaining_length == 0
          truncated_node.add_child ::Nokogiri::XML::Text.new(omission, truncated_node)
          break
        elsif remaining_length < 0
          break
        end
        truncated_node.add_child nokogiri_truncate(child, remaining_length, omission, seperator)
        # can end up less than 0 if the child was truncated to fit, that's
        # fine:
        remaining_length = remaining_length - child.inner_text.length

      end
      truncated_node
    end

  end
end
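
The length-budget recursion above can be easier to follow without Nokogiri in the way. The sketch below is not the gist's code. It assumes a toy tree where a `String` is a text node and a `[tag, children]` array is an element (a `nil` tag stands for the fragment root), and `text_length`, `truncate_tree` and `render` are hypothetical names. Separator handling is left out.

```ruby
# Total displayed text length of a node (tags do not count).
def text_length(node)
  node.is_a?(String) ? node.length : node[1].sum { |child| text_length(child) }
end

# Same budget logic as nokogiri_truncate, on the toy tree.
def truncate_tree(node, max_length, omission = "…")
  if node.is_a?(String)
    return node if node.length <= max_length
    return node[0, [0, max_length - omission.length].max] + omission
  end
  return node if text_length(node) <= max_length

  tag, children = node
  kept = []
  remaining = max_length
  children.each do |child|
    if remaining == 0
      kept << omission   # budget used up exactly at a tag boundary
      break
    elsif remaining < 0
      break              # an earlier child was already cut short
    end
    kept << truncate_tree(child, remaining, omission)
    remaining -= text_length(child)
  end
  [tag, kept]
end

# Serialize the toy tree back to an HTML-like string.
def render(node)
  return node if node.is_a?(String)
  tag, children = node
  inner = children.map { |child| render(child) }.join
  tag ? "<#{tag}>#{inner}</#{tag}>" : inner
end

tree = [nil, ["123456", ["p", [["b", ["78901234567"]], "890"]]]]
puts render(truncate_tree(tree, 10))
# prints: 123456<p><b>789…</b></p>
```

The budget is charged with the original child's full text length, so it goes negative once a child has been cut; that negative value is what stops later siblings from being copied.
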

## rails_helper.rb
# The Util module is defined in nokogiri_util.rb above.
require 'nokogiri_util'

module SomeHelper

  # Like the Rails truncate helper, and taking the same options (except that
  # the separator option is spelled :seperator here), but html_safe.
  #
  # If input string is NOT marked html_safe?, simply passes to rails truncate helper.
  # If a string IS marked html_safe?, uses nokogiri to parse it, and truncate
  # actual displayed text to max_length, while keeping html structure valid.
  #
  # Default omission marker is a unicode ellipsis, unlike Rails' three periods.
  #
  # :length option will also default to 280, what we think is a good
  # length for abstract/snippet display, unlike the Rails default of 30.
  def special_truncate(str, options = {})
    # Non-mutating merge, so the caller's options hash is left untouched.
    options = options.reverse_merge(:omission => "…", :length => 280)

    # works for non-html of course, but for html a quick check
    # to avoid expensive nokogiri parse if the whole string, even
    # with tags, is still less than max length.
    return str if str.length < options[:length]

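    if str.html_safe?
      noko = Nokogiri::HTML::DocumentFragment.parse(str)
      Util.nokogiri_truncate(noko, options[:length], options[:omission], options[:seperator]).inner_html.html_safe
    else
      # Rails spells its option :separator, so translate our :seperator key;
      # otherwise Rails would silently ignore it.
      rails_options = options.except(:seperator)
      rails_options[:separator] = options[:seperator] if options[:seperator]
      truncate(str, rails_options)
    end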
  end

end

## special_truncate_test.rb
# encoding: UTF-8

require 'test_helper'

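# NOTE: the class line is missing from the source; the class name and the
# ActionView::TestCase superclass below are assumptions.
class SpecialTruncateTest < ActionView::TestCase
  include SomeHelper

  def test_truncate_basic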
    # Basic test
    output = special_truncate("12345678901234567890", :length => 10)
    assert_equal "123456789…", output
  end

  def test_truncate_tags
    # With tags
    html_input = "123456<p><b>78901234567</b>890</p>".html_safe
    html_output = special_truncate(html_input, :length => 10)
    assert html_output.html_safe?, "truncated html_safe? is still html_safe?"
    assert_equal "123456<p><b>789…</b></p>", html_output
  end

  def test_truncate_tag_boundary
    # With break on tag boundary. Yes, there's an error not accounting
    # for length of omission marker in this particular edge case,
    # hard to fix, good enough for now.
    html_input = "<p>1234567890<b>123456</b>7890</p>".html_safe
    html_output = special_truncate(html_input, :length => 10)
    assert_equal "<p>1234567890…</p>", html_output
  end

  def test_truncate_boundary_edge_case
    html_input = "12345<p>6789<b>0123456</b>7890</p>".html_safe
    html_output = special_truncate(html_input, :length => 10)
    # yeah, a weird ellipsis in a <b> of its own, so it goes.
    assert_equal "12345<p>6789<b>…</b></p>", html_output
  end

  def test_truncate_another_edge_case
    html_input = "12345<p>67890<b>123456</b>7890</p>".html_safe
    html_output = special_truncate(html_input, :length => 10)
    assert_equal "12345<p>67890…</p>", html_output
  end

  def test_truncate_html_with_seperator
    html_input = "12345<p>67 901234<b></p>".html_safe
    html_output = special_truncate(html_input, :length => 10, :seperator => ' ')
    assert_equal "12345<p>67…</p>", html_output
  end

  def test_truncate_html_with_seperator_unavailable
    html_input = "12345<p>678901234<b></p>".html_safe
    html_output = special_truncate(html_input, :length => 10, :seperator => ' ')
    assert_equal "12345<p>6789…</p>", html_output
  end

  def test_truncate_html_with_boundary_seperator
    # known edge case we don't handle, sorry. If this test
    # fails, that could be a good thing if you've fixed the edge case!
    html_input = "12345<p>6 8<b>90123456</b>7890</p>".html_safe
    html_output = special_truncate(html_input, :length => 10, :seperator => ' ')
    assert_equal "12345<p>6 8<b>9…</b></p>", html_output
  end

end