@rklemme
Created June 13, 2012 09:33
Stripping HTML with a regexp: an alternative to https://github.com/6ftDan/regex-is-evil/blob/master/strip_html.rb
HTML_TAG_REPLACEMENTS = {
  'br' => "\n",
}
HTML_QUOTE_REPLACEMENTS = {
  'quot' => '"',
  'amp' => '&',
}
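# Removes HTML tags from str and decodes HTML entities.
# tag maps tag names to replacement text; quot maps entity names to characters.
# Tags and entities missing from these tables are removed.
def strip_html(str, tag = HTML_TAG_REPLACEMENTS, quot = HTML_QUOTE_REPLACEMENTS)
  str.gsub %r{
    # first alternative: remove tags
    <
    (?:
      (?:(\w+) # tag name
        # optional attributes
        (?:
          \s+
          \w+ # attr name
          =
          (?:"[^"]*"|'[^']*') # attr value
        )*
        \s*/? # optional self-closing slash, whitespace allowed before it
      )
    |
      # alternative: closing tag
      (/\w+)
    )
    >
  |
    # second alternative: replace HTML entities
    &
    (\w+)
    ;
  }x do
    # $1 is an opening tag name, $2 a closing tag with its slash, $3 an entity name
    tg = $1 || $2
    if tg
      tag[tg].to_s   # nil for unknown tags becomes "", which removes the tag
    else
      quot[$3].to_s  # nil for unknown entities becomes "", which removes the entity
    end
  end
end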
# Run strip_html on every command line argument and print the result.
ARGV.each do |arg|
  printf "\nTest %p\n", arg
  p strip_html(arg)
end
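
The entity branch of strip_html looks up the captured name in a hash, so any entity missing from the table (for example `&nbsp;`) is deleted from the output. If that is not wanted, one possible variation is to fall back to the matched text. The sketch below shows only that idea in isolation; `ENTITY_TABLE` and `decode_entities` are hypothetical names that are not part of the gist.

```ruby
# Hypothetical helper, not part of the gist: decode only the entities
# listed in the table and leave every other entity untouched.
ENTITY_TABLE = { 'quot' => '"', 'amp' => '&' }.freeze

def decode_entities(str, table = ENTITY_TABLE)
  # String#gsub with a block substitutes the block's return value for each match.
  str.gsub(/&(\w+);/) do |match|
    # Hash#fetch with a default returns the original text for unknown names.
    table.fetch(Regexp.last_match(1), match)
  end
end

p decode_entities('a &amp; b &nbsp; &quot;c&quot;')
```

Keeping unknown entities makes the output less clean, but it avoids silently losing characters the table does not know about.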