A Python script for extracting email addresses from text files. You can pass it multiple files. It prints the email addresses to stdout, one address per line. For ease of use, remove the .py extension and place it in your $PATH (e.g. /usr/local/bin/) to run it like a built-in command.

The program below takes one or more plain text files as input. It works with Python 2 and Python 3.

Let's say we have two files that may contain email addresses:

  1. file_a.txt
foo bar
ok ideler.dennis@gmail.com sup
 hey...user+123@example.com,wyd
hello world!

RESCHEDULE 2'OCLOCK WITH JEFF@AMAZON.COM FOR TOMORROW@3pm
  2. file_b.html
<html>
<body>
  <ul>
    <li><span class=pl-c>Dennis Ideler &lt;ideler.dennis@gmail.com&gt;</span></li>
    <li><span class=pl-c>Jane Doe &lt;jdoe@example.com&gt;</span></li>
  </ul>
</body>
</html>

To extract the email addresses, download the Python program and execute it on the command line with our files as input.

$ python extract_emails_from_text.py file_a.txt file_b.html
ideler.dennis@gmail.com
user+123@example.com
jeff@amazon.com
ideler.dennis@gmail.com
jdoe@example.com

Voila, it prints all found email addresses. Let's also remove the duplicates and sort the email addresses alphabetically.

$ python extract_emails_from_text.py file_a.txt file_b.html | sort | uniq
ideler.dennis@gmail.com
jdoe@example.com
jeff@amazon.com
user+123@example.com

Looks good! Now let's save the results to a file.

$ python extract_emails_from_text.py file_a.txt file_b.html | sort | uniq > emails.txt

P.S. The sorting and deduplicating commands above are specific to shells on a UNIX-based machine (e.g. Linux or macOS). If you're using Windows, you can use PowerShell. For example:

python extract_emails_from_text.py file_a.txt file_b.html | sort -unique
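
The sort-and-deduplicate step can also be done in Python itself, which behaves the same on every platform. This is a minimal sketch, assuming the addresses have already been collected into a list; the `unique_sorted` helper and the `found` list are illustrative and not part of the script.

```python
def unique_sorted(addresses):
    """Return the distinct addresses in alphabetical order."""
    # A set drops duplicates; sorted() returns a new, ordered list.
    return sorted(set(addresses))


# Hypothetical input: the five lines printed by the first command above.
found = [
    "ideler.dennis@gmail.com",
    "user+123@example.com",
    "jeff@amazon.com",
    "ideler.dennis@gmail.com",
    "jdoe@example.com",
]

for address in unique_sorted(found):
    print(address)
```
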

#!/usr/bin/env python
#
# Extracts email addresses from one or more plain text files.
#
# Notes:
# - Does not save to file (pipe the output to a file if you want it saved).
# - Does not check for duplicates (which can easily be done in the terminal).
#
# (c) 2013 Dennis Ideler <ideler.dennis@gmail.com>

from optparse import OptionParser
import os.path
import re
import sys

# Raw strings keep regex escapes such as \s and \. out of Python's string-escape handling.
regex = re.compile((r"([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
                    r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                    r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))


def file_to_str(filename):
    """Returns the contents of filename as a string."""
    with open(filename) as f:
        return f.read().lower()  # Case is lowered to prevent regex mismatches.


def get_emails(s):
    """Returns an iterator of matched emails found in string s."""
    # Skipping matches that start with '//' because the regular expression
    # mistakenly matches patterns like 'http://foo@bar.com' as '//foo@bar.com'.
    return (email[0] for email in re.findall(regex, s) if not email[0].startswith('//'))


if __name__ == '__main__':
    parser = OptionParser(usage="Usage: python %prog [FILE]...")
    # No options added yet. Add them here if you ever need them.
    options, args = parser.parse_args()

    if not args:
        parser.print_usage()
        sys.exit(1)

    for arg in args:
        if os.path.isfile(arg):
            for email in get_emails(file_to_str(arg)):
                print(email)
        else:
            print('"{}" is not a file.'.format(arg))
            parser.print_usage()
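
To see what the pattern accepts, here is a small sketch that runs the same regex and filter on an in-memory string instead of a file. The regex and `get_emails` are copied from the script; the sample text is made up for illustration.

```python
import re

regex = re.compile(r"([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
                   r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                   r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)")


def get_emails(s):
    """Yield the full match (first capture group) of every hit, minus '//' hits."""
    return (email[0] for email in re.findall(regex, s) if not email[0].startswith('//'))


# Illustrative sample text, lowercased the same way file_to_str() does.
sample = "\n".join([
    "ok ideler.dennis@gmail.com sup",    # plain address
    "write to jane at example dot com",  # spelled-out "at" and "dot"
    "see http://foo@bar.com",            # matched as '//foo@bar.com', then filtered out
    "tomorrow@3pm",                      # no dot after the "@", so no match
]).lower()

results = list(get_emails(sample))
for email in results:
    print(email)
```

Only the first two lines of the sample produce output. Note that the `//` filter drops the whole match, so an address embedded in a URL is not reported at all.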

tcgreddy-8553665381 commented Jun 1, 2017

Hello Dideler,

I am looking for an email data extractor that works:

From a selected folder
Between selected dates
Only on emails between the selected From and To addresses

Kindly share a script for this, so I can use it in a Google spreadsheet to track that mail data for my daily use.

kurianbenoy commented Jun 4, 2017

Thank You

wrystal commented Oct 6, 2017

There are a small number of false matches, such as:
online at www.amazon.com
A somewhat crude way to fix it is:

pattern = re.compile("([a-z0-9!#$%&*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
                    "{|}~-]+)*(@)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.))+[a-z0-9]"
                    "(?:[a-z0-9-]*[a-z0-9])?)|([a-z0-9!#$%&*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
                    "{|}~-]+)*(\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\sdot\s))+[a-z0-9]"
                    "(?:[a-z0-9-]*[a-z0-9])?)",re.S)
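
The same idea can be sketched with the pattern built from shared pieces, so the two alternatives are easier to compare. This is an illustration rather than a drop-in replacement: the names `LOCAL`, `LABEL`, `loose`, `strict`, and `find_all` are made up here, and the sample sentence is invented. One thing to watch with the pattern above: a spelled-out match lands in its fourth capture group, so code that reads only the first group sees an empty string. Reading the whole match with `group(0)` sidesteps that.

```python
import re

# Building blocks taken from the gist's regex.
LOCAL = r"[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*"
LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"

# Loose: "@" / " at " and "." / " dot " may be mixed freely (the gist's behaviour).
loose = re.compile(LOCAL + r"(?:@|\sat\s)(?:" + LABEL + r"(?:\.|\sdot\s))+" + LABEL)

# Strict: either "@" with "." or " at " with " dot ", never a mix of the two.
strict = re.compile(
    LOCAL + r"@(?:" + LABEL + r"\.)+" + LABEL
    + "|"
    + LOCAL + r"\sat\s(?:" + LABEL + r"\sdot\s)+" + LABEL
)


def find_all(pattern, text):
    """Return every whole match of pattern in the lowercased text."""
    return [m.group(0) for m in pattern.finditer(text.lower())]


sample = "Shop online at www.amazon.com or mail jdoe@example.com or jane at example dot com"

print(find_all(loose, sample))   # includes the false match
print(find_all(strict, sample))  # false match is gone
```
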

samayia commented Nov 23, 2017

Emails within placeholders should be removed.

e.g. placeholder="your@email.com"

These generally yield dummy, unwanted values.
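
One way to do this, sketched under assumptions, is to strip `placeholder` attributes from the HTML before running the email regex on it. The `strip_placeholders` helper and the sample markup are hypothetical, and the pattern only handles quoted attribute values.

```python
import re

# Matches a quoted placeholder attribute, e.g. placeholder="your@email.com".
PLACEHOLDER_ATTR = re.compile(
    r"""\splaceholder\s*=\s*(?:"[^"]*"|'[^']*')""",
    re.IGNORECASE,
)


def strip_placeholders(html):
    """Return html with every quoted placeholder attribute removed."""
    return PLACEHOLDER_ATTR.sub("", html)


html = '<input type="email" placeholder="your@email.com"> Contact: jdoe@example.com'
cleaned = strip_placeholders(html)
print(cleaned)
```

The cleaned string can then be passed to the extraction step, so only the address in the visible text is left to match.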

glunardi commented Dec 12, 2017

Thanks a bunch, you just saved me 30 minutes! Merci beaucoup!

futzlarson commented Mar 8, 2018

Awesome.

siafsadki commented Mar 12, 2018

so useful... thanks bro :)

Sreevalli535 commented May 21, 2018

Can I extract the From fields and their corresponding To fields with this code?

tweetyoc commented Jun 29, 2018

Is there a way to search for the actual email addresses and have them obfuscated? I am sharing a file with someone externally, so I would like to create another file with the email addresses obfuscated. Any help would be appreciated.
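
One possible approach, as a rough sketch: use `re.sub` with a replacement function that masks each address in place. The pattern below is a simplified one chosen for this example (plain `user@domain.tld` addresses only), not the gist's regex, and the `mask` and `obfuscate` names are made up.

```python
import re

# Simplified pattern (an assumption for this sketch): local part, "@", dotted domain.
EMAIL = re.compile(r"([a-z0-9._%+-]+)@([a-z0-9-]+(?:\.[a-z0-9-]+)+)", re.IGNORECASE)


def mask(match):
    """Keep the first character of the local part and the domain; hide the rest."""
    local, domain = match.group(1), match.group(2)
    return local[0] + "***@" + domain


def obfuscate(text):
    """Return text with every matched address masked."""
    return EMAIL.sub(mask, text)


masked = obfuscate("Contact Jane Doe <jdoe@example.com> or user+123@example.com.")
print(masked)
```

To produce a second file, read the original file's text, pass it through `obfuscate`, and write the result to a new path.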

khris117 commented Sep 20, 2018

I would like to know why you used \sdot\s ?

sujithvemi commented Jan 30, 2019

@dideler
Great work here. Thank you! Just wondering why you didn't use \w (the metacharacter for word characters) in the regex instead of [a-z0-9]?

'\w' includes underscore as well
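
The difference is easy to check. In this small sketch (the variable names are illustrative), `\w` accepts an underscore, and in Python 3 it also accepts non-ASCII letters unless `re.ASCII` is set, while the explicit class accepts neither.

```python
import re

explicit = re.compile(r"[a-z0-9]+")        # the class used in the gist's regex
word = re.compile(r"\w+")                  # Unicode-aware for str patterns in Python 3
word_ascii = re.compile(r"\w+", re.ASCII)  # restricts \w to [a-zA-Z0-9_]

print(explicit.fullmatch("my_host") is None)      # underscore rejected
print(word.fullmatch("my_host") is not None)      # underscore accepted
print(word.fullmatch("café") is not None)         # accented letter accepted
print(word_ascii.fullmatch("café") is None)       # rejected under re.ASCII
```
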

EngrMuhammadUsman commented Feb 5, 2019

Great work 👍

gajus commented May 7, 2020

I have written a module that handles scenarios such as obfuscated emails, emails with tags, unicode characters, etc. https://github.com/gajus/extract-email-address

bryanseah234 commented Sep 2, 2020

I keep getting this error though... does anyone know why?

"
  File "extract_emails_from_text.py", line 29, in file_to_str
    return f.read().lower() # Case is lowered to prevent regex mismatches.
  File "C:\Users\bryan\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 164972: character maps to <undefined>
"

dideler (author) commented Sep 4, 2020

@bryanseah234 the file you're reading has an unexpected encoding. Try setting the appropriate encoding for the file you're reading.

E.g. change open(filename) to open(filename, encoding="utf8") or open(filename, encoding="latin-1").
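
A sketch of that change, assuming you would rather fall back to a second encoding than pick one by hand. The `encodings` parameter and the fallback order are assumptions for illustration. `io.open` is used because it accepts `encoding` in both Python 2 and Python 3, and `latin-1` goes last because it can decode any byte sequence.

```python
import io
import os
import tempfile


def file_to_str(filename, encodings=("utf-8", "latin-1")):
    """Return the lowercased file contents, trying each encoding in turn."""
    for encoding in encodings:
        try:
            with io.open(filename, encoding=encoding) as f:
                return f.read().lower()
        except UnicodeDecodeError:
            continue  # Try the next candidate encoding.
    raise ValueError("could not decode {} with any of {}".format(filename, encodings))


# Demo: byte 0x90 is not valid UTF-8 on its own, so the latin-1 fallback is used.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Jane <JDOE@EXAMPLE.COM> \x90")
text = file_to_str(tmp.name)
os.remove(tmp.name)

print("jdoe@example.com" in text)
```
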

paseman commented Jan 4, 2021

Thanks