@ohaal
Last active October 6, 2015 06:38
Regular expression to grab (most) URLs from any string, pre gTLD craze
# Regex to grab (most) URLs from any string (most new gTLDs are not matched, but can be added manually)
# This will never be perfect (see [1] and [2]), but it does the job well enough.
# TLDs are matched as 2 to 4 letters; only two longer exceptions are currently added (museum and travel).
#
# Capture groups:
# 1. Full URL
# 2. The protocol
# 3. Hostname+path including `www*.`
# 4. Hostname+path excluding `www*.`
# TODO: Capture group for path
#
# [1] http://en.wikipedia.org/wiki/Top-level_domain
# [2] http://newgtlds.icann.org/en/program-status/application-results/strings-1200utc-13jun12-en
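/
(                                   # 1. full URL
  \b
  (?:(https?|ftp):\/\/)?            # 2. protocol (optional)
  (                                 # 3. hostname+path including www*.
    (?:www\d{0,3}\.)?
    (                               # 4. hostname+path excluding www*.
      [a-z0-9.-]+\.
      (?:[a-z]{2,4}|museum|travel)
      (?:\/[^\/\s]+)*
    )
  )
  \b
)
/ix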
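#
# Usage sketch (not part of the original gist): a minimal Python port of the
# pattern above, assuming re.VERBOSE and re.IGNORECASE are acceptable stand-ins
# for the /x and /i flags. The names URL_RE and find_urls are hypothetical.

```python
import re

# Same pattern as above; re.VERBOSE mirrors /x and re.IGNORECASE mirrors /i.
# Forward slashes need no escaping here because Python has no regex delimiters.
URL_RE = re.compile(
    r"""
    (                                   # 1. full URL
      \b
      (?:(https?|ftp)://)?              # 2. protocol (optional)
      (                                 # 3. hostname+path including www*.
        (?:www\d{0,3}\.)?
        (                               # 4. hostname+path excluding www*.
          [a-z0-9.-]+\.
          (?:[a-z]{2,4}|museum|travel)
          (?:/[^/\s]+)*
        )
      )
      \b
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_urls(text):
    """Return one (full, protocol, with_www, without_www) tuple per match."""
    return [m.groups() for m in URL_RE.finditer(text)]


if __name__ == "__main__":
    sample = "Docs at https://www.example.com/a/b and ftp://files.example.org/pub"
    for full, protocol, with_www, without_www in find_urls(sample):
        print(full, protocol, with_www, without_www, sep=" | ")
```

# The protocol group is None when the input has no scheme (e.g. "example.com"),
# and hosts ending in a TLD longer than four letters other than museum or
# travel are not matched at all.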