@O-I
Last active August 19, 2023 01:02
[TIx 3] Extracting links from text with URI::extract

I have a simple Rails app that collects all the tweets I favorite on Twitter so I can sort and search through them at my leisure. Many of those favorites contain links I'd like to refer to, so I wrote a helper method to convert them to clickable anchor tags. It looked like this:

# app/helpers/favorites_helper.rb
module FavoritesHelper

  # snip
  
  def text_to_true_link(tweet_text)
    urls = tweet_text.scan(/https*:\/\/t.co\/\w+/)
    urls.each do |url|
      tweet_text.gsub!(url, "<a href=#{url} target='_blank'>#{url}</a>")
    end
    tweet_text.html_safe
  end
end

The text_to_true_link method

  1. takes raw tweet_text as a string,
  2. scans through it looking for Twitter shortlinks with a regex (which should have used ? in place of * there, and should have escaped the . in t.co),
  3. stores the link text in an array called urls,
  4. substitutes each link with an anchor tag for that link, and
  5. returns the newly formatted tweet_text with clickable links.
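
To make the regex caveat in step 2 concrete, here is a small sketch comparing the original pattern with a tightened one. The name `strict` and the sample strings are illustrative assumptions, not part of the original helper.

```ruby
# Original pattern from the helper: "s*" allows any number of "s",
# and the unescaped "." matches any character.
loose = /https*:\/\/t.co\/\w+/

# Tightened variant (an assumption about what was intended):
# optional single "s" and a literal dot.
strict = /https?:\/\/t\.co\/\w+/

samples = [
  "http://t.co/abc123",
  "https://t.co/abc123",
  "httpsss://t.co/abc123", # not a real scheme
  "https://tXco/abc123"    # not the t.co host
]

samples.each do |sample|
  puts "#{sample}: loose=#{sample.match?(loose)} strict=#{sample.match?(strict)}"
end
```

Both patterns accept the two genuine shortlinks; only the loose one also accepts the last two samples.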

I thought this was a pretty clever hack, but while looking at my oldest tweets, I realized that they had links that predated the standard t.co shortlink and consequently were not being converted into clickable links. So I did what you'd expect an inexperienced developer to do: I started looking for a Goldilocks regex, neither too complex nor too liberal, that would be adequate for my URI matching purposes.

While doing this, I stumbled upon a Stack Overflow answer that mentioned URI::regexp, and a comment on that answer mentioned URI::extract. What does URI::extract do? Why, exactly what I want: it extracts URIs from text.

At first, I tried urls = URI.extract(tweet_text), which seemed to work. On further inspection, though, it was also capturing any word that ended in a colon, presumably because such a word looks like a URI scheme. For example:

tweet_text = "Kleisli: common monads in Ruby https://github.com/txus/kleisli"
urls = URI.extract(tweet_text) # => ["Kleisli:", "https://github.com/txus/kleisli"]

Looking more closely at the documentation, I found that URI::extract takes an optional second argument that limits URI matches to a specific set of schemes:

tweet_text = "Kleisli: common monads in Ruby https://github.com/txus/kleisli"
urls = URI.extract(tweet_text, %w(http https)) # => ["https://github.com/txus/kleisli"]

This led me to my current adequate implementation:

# app/helpers/favorites_helper.rb
module FavoritesHelper

  # snip
  
  def text_to_true_link(tweet_text)
    urls = URI.extract(tweet_text, %w(http https))
    urls.each do |url|
      tweet_text.gsub!(url, "<a href=#{url} target='_blank'>#{url}</a>")
    end
    tweet_text.html_safe
  end
end

Normally, I think I do a good job of checking (or knowing) whether Ruby has a method that does what I want before I try to implement my own solution. Thinking more deeply about why I missed URI::extract, I realized that while I have a pretty good command of Ruby's core libraries, I haven't spent nearly as much time exploring Ruby's standard libraries. I'd like to dig into more of the latter from here on out.

Questions I still have:

  1. Is there a better way to replace embedded links in text with their clickable counterparts?
  2. How does a large site like Twitter or Facebook implement this?
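
On the first question, one possible improvement is to HTML-escape everything in the tweet and build the anchors in a single pass, since calling html_safe on raw text lets any markup in the tweet through unescaped. The sketch below is only an illustration under stated assumptions: the method name `escape_and_link` is hypothetical, it runs outside Rails (in a helper you would still call html_safe on the result), and it keeps URI.extract even though recent Ruby documentation marks that method as obsolete.

```ruby
require 'cgi'
require 'uri'

# Hypothetical helper: escapes all text and links only http(s) URLs.
def escape_and_link(text)
  urls = URI.extract(text, %w[http https]).uniq
  return CGI.escapeHTML(text) if urls.empty?

  # Longest URLs first, so a URL that is a prefix of another
  # does not match in place of the longer one.
  pattern = Regexp.union(urls.sort_by { |url| -url.length })

  # Splitting on a capture group keeps the matches in the result:
  # even indexes are plain text, odd indexes are URLs.
  parts = text.split(/(#{pattern})/)

  parts.each_with_index.map do |part, index|
    escaped = CGI.escapeHTML(part)
    index.odd? ? %(<a href="#{escaped}" target="_blank">#{escaped}</a>) : escaped
  end.join
end

puts escape_and_link('<b>bold</b> see https://example.com/page now')
```

Because each URL is replaced exactly once while the string is being rebuilt, this approach never rewrites an anchor it has already produced.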
@bgschiller

Thanks for this! It should solve my problem exactly. Heads up: I think there's a bug that appears if the same URL appears twice in the text you're escaping:

2.5.3 :008 > text_to_true_link('a url https://example.com repeated again https://example.com')
 => "a url <a href=<a href=https://example.com target='_blank'>https://example.com</a> target='_blank'><a href=https://example.com target='_blank'>https://example.com</a></a> repeated again <a href=<a href=https://example.com target='_blank'>https://example.com</a> target='_blank'><a href=https://example.com target='_blank'>https://example.com</a></a>" 

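This happens because urls contains the URL twice, so the second gsub! pass rewrites the anchors created by the first. It can be fixed by adding a .uniq after the call to URI.extract.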
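
A minimal sketch of the .uniq fix, written as a standalone method so it runs outside Rails. The name `link_each_url_once` is hypothetical, html_safe is omitted because it comes from Rails, and the input is copied rather than mutated.

```ruby
require 'uri'

# Hypothetical standalone variant of text_to_true_link with the .uniq fix.
def link_each_url_once(tweet_text)
  linked = tweet_text.dup # work on a copy instead of mutating the argument
  urls = URI.extract(tweet_text, %w[http https]).uniq # each distinct URL once
  urls.each do |url|
    linked.gsub!(url, "<a href=#{url} target='_blank'>#{url}</a>")
  end
  linked
end

puts link_each_url_once('a url https://example.com repeated again https://example.com')
```

Deduplicating does not help when one URL is a prefix of another, such as https://example.com and https://example.com/docs: the substitution for the shorter URL would still rewrite part of the longer one.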