@pedrobachiega
Created May 14, 2012 02:46
Regex to extract links from HTML ( http://rubular.com/r/ESweX4uBlb )
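require "test/unit"

# Exercises a regex that extracts <a> links (href and inner content) from HTML.
class HtmlLinkTagsRegex < Test::Unit::TestCase
  # Group 1 captures the href value; group 2 captures the tag's inner content.
  # /i ignores case; /m lets "." match newlines, so a tag may span several lines.
  def regex
    /<a.+?href=["']([^"']+)["'].*?>(.+?)<\/a>/im
  end

  def test_extract_links
    html = <<-html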
<p>Do you know <A title="Bacon Ipsum" href="http://baconipsum.com/"
target="_blank">Bacon Ipsum</A> - from <a HREF='http://pedrobachiega.com' >pedrobachiega.com</a></p>
<p><a title="Bacon Ipsum" href="http://baconipsum.com/" target="_blank"><img src="http://baconipsum.com/wp-content/uploads/2011/06/bacon-ipsum-banner1.jpg" /></a></p>
<p>Hamburger beef bresaola pig tongue, pork chop sirloin tail pork belly shankle short loin pork. Pork loin ball tip pork meatloaf strip steak. <a href="http://wiki.answers.com/Q/Is_bacon_pork_or_beef">Bacon pork</a> loin pastrami, sirloin biltong ham hock spare ribs ground round hamburger shoulder tail pork chop. Speck pork belly bresaola t-bone. Swine prosciutto short ribs, tail pastrami leberkas shankle.</p>
<p><a href='https://en.wikipedia.org/wiki/Spare_ribs' target="_blank">Spare
ribs</a> kielbasa shank, frankfurter meatball tenderloin short loin salami beef ribs. Pastrami strip steak pork chop short ribs hamburger, speck chicken biltong tri-tip jerky meatloaf venison spare ribs pork loin corned beef. Tri-tip bresaola cow tail ball tip, filet mignon ham sirloin short loin beef ribs meatball. Ball tip pork belly beef ribs, flank turducken bacon ham shank jowl cow short ribs venison shoulder bresaola chicken. Spare ribs strip steak shankle kielbasa tri-tip. Ham hock jowl pancetta, turducken biltong prosciutto venison ball tip pork chop filet mignon fatback spare ribs corned beef pork loin.</p>
<p>Chicken ham drumstick, <a href="http://www.foodnetwork.com/recipes/emeril-live/boudin-sausage-recipe/index.html"
target="_blank">boudin sausage</a> shankle fatback jerky prosciutto short ribs ground round andouille chuck shoulder sirloin. Filet mignon andouille shankle pork loin, fatback short loin brisket. Turkey pork loin turducken, ball tip frankfurter shoulder brisket rump sirloin meatball sausage. Brisket meatball meatloaf andouille, spare ribs salami jowl pig drumstick corned beef speck ham hock tri-tip. Ground round shankle ham prosciutto, strip steak ball tip venison shank.</p>
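    html
    links = html.scan(regex)
    # Each match is an [href, inner_content] pair.
    links.each_with_index do |link, i|
      puts "#{i} - #{link}"
    end
    # The sample HTML above contains six anchors.
    assert_equal 6, links.size
    assert_equal "http://baconipsum.com/", links.first[0]
  end
end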
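
One caveat with the pattern above: because `.+?` can cross tag boundaries (and, with /m, newlines), a match may start at an `<a>` that has no href and borrow the href of a later tag. The sketch below shows one possible stricter variant. `STRICT_LINK_REGEX` and `extract_links` are hypothetical names introduced here, not part of the original gist, and regex-based extraction remains an approximation compared with a real HTML parser.

```ruby
# Hypothetical stricter variant of the gist's pattern (an assumption, not the author's regex).
# "\s" after "<a" skips tags such as <abbr>; "[^>]*" keeps attribute matching inside one tag.
STRICT_LINK_REGEX = /<a\s[^>]*?href=["']([^"']+)["'][^>]*>(.+?)<\/a>/im

# Returns an array of [href, inner_content] pairs found in the given HTML string.
def extract_links(html)
  html.scan(STRICT_LINK_REGEX)
end

sample = %q{<a name="top">Top</a> <link href="style.css"> <a class="ext" href='http://example.com'>Example</a>}

extract_links(sample).each_with_index do |(href, text), i|
  puts "#{i} - #{href} (#{text})"
end
```

With this sample, the original pattern would begin matching at the `<a name="top">` anchor and capture `style.css` from the `<link>` tag, whereas the stricter variant only accepts an href that sits inside the same `<a ...>` opening tag.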