Email Scraper

Simple Email Scraper

This gist contains two files: scraper.sh and scraper.py.

scraper.sh is useful for web pages where email addresses are visible in the rendered text. scraper.py matches email addresses anywhere in the raw HTML, so it catches more, but it is also more expensive to run.

Usage

./scraper.sh http://example.com

Or

./scraper.py http://example.com

LICENSE.txt

DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

0. You just DO WHAT THE FUCK YOU WANT TO.

scraper.py

#!/usr/bin/env python3
# This program is free software. It comes without any warranty, to
# the extent permitted by applicable law. You can redistribute it
# and/or modify it under the terms of the Do What The Fuck You Want
# To Public License, Version 2, a copy of which is provided in the
# file LICENSE.txt.
# Enclose the fetch-and-match lines below in a loop to scrape multiple pages of a site.
# As written, this scrapes a single page and pulls out its email addresses.
import re
import sys
import urllib.request

# Fetch the page and decode it to text, ignoring undecodable bytes.
url = urllib.request.urlopen(sys.argv[1])
response = url.read().decode('utf-8', errors='ignore')

# Rough email pattern: local part, "@", domain ending in a 1-4 letter TLD.
regex = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
emails = regex.findall(response)

# Write the de-duplicated addresses, one per line.
with open('emails.csv', 'w+') as email_file:
    email_file.write('\n'.join(set(emails)))
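
The comment in scraper.py suggests wrapping the fetch-and-match step in a loop to cover multiple pages. A minimal sketch of that idea, assuming you simply pass several page URLs on the command line (the script name and URLs below are only illustrative):

#!/usr/bin/env python3
# Sketch: loop the single-page scrape over every URL given as an argument.
# Hypothetical usage: ./scrape_many.py http://example.com/page1 http://example.com/page2
import re
import sys
import urllib.request

regex = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
emails = set()
for page in sys.argv[1:]:
    # Fetch each page and collect every address it contains.
    response = urllib.request.urlopen(page).read().decode('utf-8', errors='ignore')
    emails.update(regex.findall(response))

with open('emails.csv', 'w+') as email_file:
    email_file.write('\n'.join(emails))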

scraper.sh

#!/bin/bash
# This program is free software. It comes without any warranty, to
# the extent permitted by applicable law. You can redistribute it
# and/or modify it under the terms of the Do What The Fuck You Want
# To Public License, Version 2, a copy of which is provided in the
# file LICENSE.txt.
# Enclose the pipeline below in a loop to scrape multiple pages of a site.
# As written, it scrapes one page and appends the matches to emails.csv.
# Strip HTML tags, leading whitespace, and &nbsp; entities, then pull out
# anything shaped like an email address.
curl -s "$1" | sed 's/<[^>]*>//g' | sed -e 's/^[ \t]*//' | sed 's/&nbsp;/ /g' | grep -ow "[[:alnum:]_.-]\+@[[:alnum:]_.-]\+" >> emails.csv