The program below takes one or more plain text files as input. It works with both Python 2 and Python 3.
Let's say we have two files that may contain email addresses:
- file_a.txt
foo bar
ok ideler.dennis@gmail.com sup
hey...user+123@example.com,wyd
hello world!
RESCHEDULE 2'OCLOCK WITH JEFF@AMAZON.COM FOR TOMORROW@3pm
- file_b.html
<html>
<body>
<ul>
<li><span class=pl-c>Dennis Ideler <ideler.dennis@gmail.com></span></li>
<li><span class=pl-c>Jane Doe <jdoe@example.com></span></li>
</ul>
</body>
</html>
To extract the email addresses, download the Python program and execute it on the command line with our files as input.
$ python extract_emails_from_text.py file_a.txt file_b.html
ideler.dennis@gmail.com
user+123@example.com
jeff@amazon.com
ideler.dennis@gmail.com
jdoe@example.com
Voila, it prints all found email addresses. Let's also remove the duplicates and sort the email addresses alphabetically.
$ python extract_emails_from_text.py file_a.txt file_b.html | sort | uniq
ideler.dennis@gmail.com
jdoe@example.com
jeff@amazon.com
user+123@example.com
Looks good! Now let's save the results to a file.
$ python extract_emails_from_text.py file_a.txt file_b.html | sort | uniq > emails.txt
P.S. The sorting and deduplicating commands above are specific to shells on Unix-like machines (e.g. Linux or macOS). If you're using Windows, you can use PowerShell instead. For example:
python extract_emails_from_text.py file_a.txt file_b.html | sort -unique
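If you'd rather not depend on the shell at all, the deduplicating and sorting can also be done in Python. The snippet below is only a rough sketch of the idea, not the actual program: the regex is a simplified stand-in that I'm assuming is good enough for the sample files above, and the names `EMAIL_RE`, `extract_emails` and `unique_sorted` are made up for illustration.

```python
import re

# Simplified pattern (an assumption, not the program's real regex):
# a local part made of dot-separated chunks, an "@", and a domain that
# contains at least one dot. Input is expected to be lowercased first.
EMAIL_RE = re.compile(
    r"[a-z0-9_%+-]+(?:\.[a-z0-9_%+-]+)*"   # local part, e.g. user+123
    r"@"
    r"(?:[a-z0-9-]+\.)+[a-z]{2,}"          # domain, e.g. example.com
)


def extract_emails(text):
    """Return every email-like match in text, in order of appearance."""
    return EMAIL_RE.findall(text.lower())


def unique_sorted(emails):
    """Drop duplicates and sort alphabetically, like `sort | uniq`."""
    return sorted(set(emails))
```

Calling `unique_sorted(extract_emails(text))` on the combined contents of both sample files should give the same four addresses as the `sort | uniq` pipeline, on any platform.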
I keep getting this error, though. Does anyone know why?
"
  File "extract_emails_from_text.py", line 29, in file_to_str
    return f.read().lower() # Case is lowered to prevent regex mismatches.
  File "C:\Users\bryan\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 164972: character maps to <undefined>
"
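A likely cause: the traceback shows the file being decoded with cp1252, which is the default Python picks up from the Windows locale when `open()` is called without an encoding, and byte 0x90 has no character assigned in cp1252. One possible workaround is to open the file with an explicit encoding and a lenient error handler. The sketch below is a hypothetical variant of `file_to_str`, not the program's actual code; UTF-8 is an assumption about the input file, and `errors="replace"` swaps any undecodable bytes for a placeholder character instead of raising.

```python
import io


def file_to_str(filename, encoding="utf-8"):
    """Return the lowercased contents of filename as text.

    Hypothetical replacement for the function named in the traceback.
    The utf-8 default is an assumption; pass another encoding if the
    file is known to use one.
    """
    # io.open accepts encoding/errors on both Python 2 and Python 3.
    with io.open(filename, "r", encoding=encoding, errors="replace") as f:
        return f.read().lower()
```

Bytes that can't be decoded end up as U+FFFD in the returned string, which shouldn't affect the email matching unless one of them sits inside an address.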