@EliasRanz
Last active January 2, 2019 21:50
Link utility that extracts URLs from a given string of text. Used by my Twitch moderator bot.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkUtil {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(?:(?:https?|ftp):\\/\\/)?(?:\\S+(?::\\S*)?@)?(?:(?!10(?:\\.\\d{1,3}){3})(?!127(?:\\.\\d{1,3}){3})(?!169\\.254(?:\\.\\d{1,3}){2})(?!192\\.168(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\x{00a1}-\\x{ffff}0-9]+-?)*[a-z\\x{00a1}-\\x{ffff}0-9]+)(?:\\.(?:[a-z\\x{00a1}-\\x{ffff}0-9]+-?)*[a-z\\x{00a1}-\\x{ffff}0-9]+)*(?:\\.(?:[a-z\\x{00a1}-\\x{ffff}]{2,})))(?::\\d{2,5})?(?:\\/[^\\s]*)?",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

    public static List<String> extractUrls(String text) {
        List<String> containedUrls = new ArrayList<>();
        Matcher urlMatcher = URL_PATTERN.matcher(text);
        // Collect every match; group() returns the full matched substring.
        while (urlMatcher.find()) {
            containedUrls.add(urlMatcher.group());
        }
        // Round-trip through a HashSet to drop duplicate URLs (order is not preserved).
        return new ArrayList<>(new HashSet<>(containedUrls));
    }
}
EliasRanz commented Jan 2, 2019

Test cases: https://regex101.com/r/nvCTnh/2/
Regex courtesy of diegoperini's entry on https://mathiasbynens.be/demo/url-regex

The new ArrayList<>(new HashSet<>(containedUrls)) wrapper removes duplicate matches, so each URL is processed only once downstream. Because HashSet is unordered, the returned URLs are not guaranteed to appear in the order they occur in the text.

Usage example:

List<String> urls = LinkUtil.extractUrls(text);
assert !urls.isEmpty();
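
If the order of the links matters (for example, when a bot wants to act on the first link in a message), one possible variation is to de-duplicate through a LinkedHashSet, which keeps first-seen order. The sketch below is not part of the gist: the class name UrlDedupSketch, the method extractDistinct, and the simplified https?://\S+ pattern are assumptions chosen to keep the example short, and the pattern is far less strict than the regex used above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original LinkUtil.
class UrlDedupSketch {
    // Simplified stand-in pattern (assumption): matches http(s) URLs up to the next whitespace.
    // It is NOT the full regex from the gist.
    private static final Pattern SIMPLE_URL =
            Pattern.compile("https?://\\S+", Pattern.CASE_INSENSITIVE);

    // Returns each distinct URL once, in the order it first appears in the text.
    static List<String> extractDistinct(String text) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher matcher = SIMPLE_URL.matcher(text);
        while (matcher.find()) {
            seen.add(matcher.group()); // add() ignores values already in the set
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        String chat = "check https://a.example/x and https://b.example then https://a.example/x again";
        // prints [https://a.example/x, https://b.example]
        System.out.println(extractDistinct(chat));
    }
}
```

Swapping HashSet for LinkedHashSet costs slightly more memory per entry, which is unlikely to matter for chat-sized input.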
