@JonasCz
Last active May 20, 2021 04:58
Email and link / URL extraction using Jsoup
package jsouptest;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSoupTest {
    public static void main(String[] args) throws IOException {
        // Fetch and parse the page.
        Document doc = Jsoup.connect("http://stackoverflow.com/questions/15893655/magento-ecomdev-phpunit-customer-fixtures-are-not-being-loaded/16668990#16668990").get();

        // Find email-like strings in the page text; the Set drops duplicates.
        Pattern p = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9.-]+");
        Matcher matcher = p.matcher(doc.text());
        Set<String> emails = new HashSet<>();
        while (matcher.find()) {
            emails.add(matcher.group());
        }

        // Collect the href value of every anchor that has one (relative URLs stay relative).
        Set<String> links = new HashSet<>();
        Elements elements = doc.select("a[href]");
        for (Element e : elements) {
            links.add(e.attr("href"));
        }

        System.out.println(emails);
        System.out.println(links);
    }
}
@rba73touro

System.out.println calls the toString() method of emails and links, which prints every element of the set on a single line inside square brackets.
To print one entry per line instead, use emails.forEach(System.out::println); (or emails.stream().forEach(System.out::println);).
A HashSet already takes care of duplicates: add() returns false when the element is already present and leaves the set unchanged.
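
For anyone who wants to check this behaviour, here is a minimal sketch using only the JDK. The class name SetPrintingDemo, the helper onePerLine, and the sample addresses are made up for illustration, and LinkedHashSet is swapped in for HashSet only so that the printed order is predictable.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class SetPrintingDemo {
    // Hypothetical helper: joins the elements with newlines, one entry per line.
    static String onePerLine(Set<String> values) {
        return String.join("\n", values);
    }

    public static void main(String[] args) {
        // LinkedHashSet keeps insertion order; a plain HashSet gives no order guarantee.
        Set<String> emails = new LinkedHashSet<>();

        boolean addedFirst = emails.add("a@example.com");   // true: new element
        boolean addedAgain = emails.add("a@example.com");   // false: duplicate, set unchanged
        emails.add("b@example.org");

        System.out.println(addedFirst + " " + addedAgain);  // prints: true false
        System.out.println(emails);                         // prints: [a@example.com, b@example.org]

        // One entry per line, two equivalent ways.
        emails.forEach(System.out::println);
        System.out.println(onePerLine(emails));
    }
}
```

So println on a set is fine for a quick look, and forEach is mostly a matter of output layout.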

@karenworld

Thanks for sharing this wonderful tutorial.

@SMann278

How do you deal with stray leading and trailing characters? Using this code, I picked up the following "email address": administrationcarla_charles@nymc.edu914.594.2590
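
One possible partial answer, as a sketch rather than a definitive fix: the trailing digits are matched because the last character class in the gist's pattern accepts digits and dots, so it runs on into the phone number. Requiring the domain to end in a letters-only label stops the match at "edu". The class name StricterEmailRegex and the tightened pattern below are assumptions for illustration, not part of the gist. The leading "administration" cannot be removed by a regex, since those letters are valid in an address; it presumably comes from neighbouring elements whose text was joined without whitespace, so scanning each element's own text separately (for example with jsoup's Element.ownText()) instead of doc.text() may help there.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StricterEmailRegex {
    // Hypothetical tightened pattern: one or more dot-terminated labels,
    // followed by a final label made of at least two letters (no digits, no dots).
    static final Pattern EMAIL = Pattern.compile(
            "[a-zA-Z0-9_.+-]+@(?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{2,}");

    // Returns every distinct match in the text, in order of first appearance.
    static Set<String> extract(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Glued text from the comment above: the trailing phone number is dropped,
        // but the leading word is still attached.
        System.out.println(extract("administrationcarla_charles@nymc.edu914.594.2590"));
        // prints: [administrationcarla_charles@nymc.edu]

        // With whitespace between the pieces, only the address is matched.
        System.out.println(extract("administration carla_charles@nymc.edu 914.594.2590"));
        // prints: [carla_charles@nymc.edu]
    }
}
```

The trade-off is that a letters-only final label rejects anything that does not end in an alphabetic top-level domain, which seems acceptable for scraping ordinary pages.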
