dideler/example.md

## example.md

      
The program below takes one or more plain text files as input. It works with both Python 2 and Python 3.
Let's say we have two files that may contain email addresses:

file_a.txt

foo bar
ok ideler.dennis@gmail.com sup
 hey...user+123@example.com,wyd
hello world!

RESCHEDULE 2'OCLOCK WITH JEFF@AMAZON.COM FOR TOMORROW@3pm


file_b.html

<html>
<body>
  <ul>
    <li><span class=pl-c>Dennis Ideler &lt;ideler.dennis@gmail.com&gt;</span></li>
    <li><span class=pl-c>Jane Doe &lt;jdoe@example.com&gt;</span></li>
  </ul>
</body>
</html>

To extract the email addresses, download the Python program and execute it on the command line with our files as input.
$ python extract_emails_from_text.py file_a.txt file_b.html
ideler.dennis@gmail.com
user+123@example.com
jeff@amazon.com
ideler.dennis@gmail.com
jdoe@example.com

Voila, it prints every email address it found, in lowercase, because the script lowercases its input before matching. Let's also remove the duplicates and sort the email addresses alphabetically.
$ python extract_emails_from_text.py file_a.txt file_b.html | sort | uniq
ideler.dennis@gmail.com
jdoe@example.com
jeff@amazon.com
user+123@example.com

Looks good! Now let's save the results to a file.
$ python extract_emails_from_text.py file_a.txt file_b.html | sort | uniq > emails.txt

P.S. The above commands for sorting and deduplicating are specific to shells on a Unix-like machine (e.g. Linux or macOS). If you're using Windows, you can use PowerShell. For example:
python extract_emails_from_text.py file_a.txt file_b.html | sort -unique
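
If neither shell is available, the sort-and-deduplicate step can also be sketched in Python itself. This is a minimal illustration rather than part of the original program: the helper name `unique_sorted` is hypothetical, and the input list simply repeats the addresses printed in the first run above. Python sorts by code point, which may differ from a locale-aware `sort` on some inputs.

```python
def unique_sorted(emails):
    """Return the distinct addresses in alphabetical order.

    Hypothetical helper, not part of extract_emails_from_text.py.
    """
    # set() drops duplicates; sorted() orders what is left.
    return sorted(set(emails))


# The addresses printed by the first run above, duplicates included.
found = [
    "ideler.dennis@gmail.com",
    "user+123@example.com",
    "jeff@amazon.com",
    "ideler.dennis@gmail.com",
    "jdoe@example.com",
]

for email in unique_sorted(found):
    print(email)
```

For this input the loop prints the same four addresses as the `sort | uniq` pipeline.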

  
## extract_emails_from_text.py
#!/usr/bin/env python
#
# Extracts email addresses from one or more plain text files.
#
# Notes:
# - Does not save to file (pipe the output to a file if you want it saved).
# - Does not check for duplicates (which can easily be done in the terminal).
#
# (c) 2013  Dennis Ideler <ideler.dennis@gmail.com>

from optparse import OptionParser
import os.path
import re

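# Raw strings keep regex escapes such as \. and \s from being treated as
# (invalid) string escapes, which newer Python 3 versions warn about.
regex = re.compile((r"([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
                    r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                    r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))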

def file_to_str(filename):
    """Returns the contents of filename as a string."""
    with open(filename) as f:
        return f.read().lower() # Case is lowered to prevent regex mismatches.

def get_emails(s):
    """Returns an iterator of matched emails found in string s."""
    # Skipping matches that start with '//' because the regular expression
    # mistakenly matches patterns like 'http://foo@bar.com' as '//foo@bar.com'.
    # findall returns one tuple per match; element 0 is the full address.
    return (email[0] for email in regex.findall(s) if not email[0].startswith('//'))

if __name__ == '__main__':
    parser = OptionParser(usage="Usage: python %prog [FILE]...")
    # No options added yet. Add them here if you ever need them.
    options, args = parser.parse_args()

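    if not args:
        parser.print_usage()
        # The exit() builtin comes from the site module and is meant for
        # interactive use; raising SystemExit works everywhere.
        raise SystemExit(1)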

    for arg in args:
        if os.path.isfile(arg):
            for email in get_emails(file_to_str(arg)):
                print(email)
        else:
            print('"{}" is not a file.'.format(arg))
            parser.print_usage()
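
The pattern in the script also accepts addresses spelled out with " at " and " dot ", which the example files never exercise. The sketch below illustrates that behaviour under a few assumptions: it copies the pattern into a standalone snippet (as raw strings), and the names `EMAIL_RE`, `normalize`, and `get_normalized_emails` are hypothetical additions, not part of the original program.

```python
import re

# Same pattern as in extract_emails_from_text.py, split by component.
EMAIL_RE = re.compile(
    r"([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*"  # local part
    r"(@|\sat\s)"                                          # '@' or ' at '
    r"(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|\sdot\s))+"    # labels + '.' or ' dot '
    r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"                    # final label
)


def normalize(match_text):
    """Hypothetical helper: rewrite ' at ' and ' dot ' as '@' and '.'."""
    text = re.sub(r"\sat\s", "@", match_text)
    return re.sub(r"\sdot\s", ".", text)


def get_normalized_emails(s):
    """Return normalized matches, skipping the '//' URL false positives."""
    return [
        normalize(m[0])
        for m in EMAIL_RE.findall(s.lower())
        if not m[0].startswith("//")
    ]


sample = "Write to jane at example dot com or JDOE@example.com."
print(get_normalized_emails(sample))
# → ['jane@example.com', 'jdoe@example.com']
```

One trade-off of the spelled-out form: ordinary phrases such as "look at example.com" also fit the pattern, so results from free-form prose are worth reviewing by hand.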