
@sengupta
Created January 16, 2012 18:46
Email Scraper

Simple Email Scraper

This contains two files, scraper.sh and scraper.py.

scraper.sh works for web pages where email addresses are visible in the rendered text: it strips HTML tags before matching. scraper.py matches against the raw HTML, so it also catches addresses that appear only in the markup (for example inside mailto: links), at the cost of being somewhat more expensive.

Usage

./scraper.sh http://example.com

Or

./scraper.py http://example.com
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. You just DO WHAT THE FUCK YOU WANT TO.
#!/usr/bin/python
# This program is free software. It comes without any warranty, to
# the extent permitted by applicable law. You can redistribute it
# and/or modify it under the terms of the Do What The Fuck You Want
# To Public License, Version 2, a copy of which is provided in the
# file LICENSE.txt.
# Enclose the line below in a loop to have it scrape over multiple pages of a site.
# This line currently scrapes one page to pull out emails.
import re
import sys
import urllib
url = urllib.urlopen(sys.argv[1])
response = url.read()
regex = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
emails = regex.findall(response)
with open('emails.csv', 'w+') as email_file:
    email_file.write('\n'.join(set(emails)))
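The header comment suggests enclosing the fetch in a loop to scrape multiple pages of a site. A minimal sketch of what that could look like (written for Python 3, where `urllib.urlopen` no longer exists; the `?page=N` query string is a hypothetical pagination scheme — adapt it to the target site):

```python
import re
import sys
import urllib.request

EMAIL_RE = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

def extract_emails(html):
    # Pull out every email-like string from a chunk of HTML.
    return set(EMAIL_RE.findall(html))

def scrape_page(url):
    # Fetch one page and return the addresses found in its raw HTML.
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode('utf-8', errors='replace')
    return extract_emails(html)

def scrape_site(base_url, pages):
    # '?page=N' is a hypothetical pagination scheme, not something
    # the original script assumes about any particular site.
    emails = set()
    for n in range(1, pages + 1):
        emails |= scrape_page('%s?page=%d' % (base_url, n))
    return emails

if __name__ == '__main__' and len(sys.argv) > 1:
    found = scrape_site(sys.argv[1], pages=3)
    with open('emails.csv', 'w') as email_file:
        email_file.write('\n'.join(sorted(found)))
```

The regex is the same one the original script uses, so the two produce identical matches on a single page.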
#!/bin/bash
# This program is free software. It comes without any warranty, to
# the extent permitted by applicable law. You can redistribute it
# and/or modify it under the terms of the Do What The Fuck You Want
# To Public License, Version 2, a copy of which is provided in the
# file LICENSE.txt.
# Enclose the line below in a loop to have it scrape over multiple pages of a site.
# This line currently scrapes one page to pull out emails.
curl -s "$1" | sed 's/<[^>]*>//g' | sed -e 's/^[ \t]*//' | sed 's/&nbsp;//g' | grep -o "[[:alnum:]_.-]\+@[[:alnum:]_.-]\+" | sort -u >> emails.csv
@stefanpejcic

stefanpejcic commented Dec 19, 2019

For Python 3:

import re
import sys
import urllib.request

with urllib.request.urlopen(sys.argv[1]) as url:
    # read() returns bytes in Python 3; decode before matching a str pattern.
    response = url.read().decode('utf-8', errors='replace')

regex = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

emails = regex.findall(response)
with open('emails.csv', 'w+') as email_file: 
    email_file.write('\n'.join(set(emails)))
