Created
January 2, 2021 17:02
This function is designed to parse an email message using the built-in Python module email.
######################################################################################
# The re module is part of the standard Python library. It provides regular
# expression matching operations similar to those found in Perl.
######################################################################################
import re as regex

######################################################################################
# The email package is part of the standard Python library. It provides
# tools for parsing, building, and managing email messages.
#
# email.header.decode_header - decodes a message header value without converting the
# character set
#
# ref: https://python.readthedocs.io/en/stable/library/email.html#module-email
######################################################################################
import email
from email.header import decode_header


def expunge_doublespaces(raw_string):
    """
    This function is designed to remove all double spaces
    from the input string.

    :param raw_string: string to clean
    :return: clean string
    """
    # Each pass roughly halves every run of spaces; recurse until no
    # double space remains.
    if '  ' not in raw_string:
        return raw_string
    return expunge_doublespaces(raw_string.replace('  ', ' '))


raw_message = r"""Message-ID: <12626409.1075857596370.JavaMail.evans@thyme>
Date: Tue, 17 Oct 2000 10:36:00 -0700 (PDT)
From: john.arnold@enron.com
To: jenwhite7@zdnetonebox.com
Subject: Re: Hi
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: John Arnold
X-To: "Jennifer White" <jenwhite7@zdnetonebox.com> @ ENRON
X-cc:
X-bcc:
X-Folder: \John_Arnold_Dec2000\Notes Folders\'sent mail
X-Origin: Arnold-J
X-FileName: Jarnold.nsf

So, what is it? And by the way, don't start with the excuses. You're
expected to be a full, gourmet cook.
Kisses, not music, makes cooking a more enjoyable experience.
"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM
To: jarnold@enron.com
cc:
Subject: Hi
I told you I have a long email address.
I've decided what to prepare for dinner tomorrow. I hope you aren't
expecting anything extravagant because my culinary skills haven't been
put to use in a while. My only request is that your stereo works. Music
makes cooking a more enjoyable experience.
Watch the debate if you are home tonight. I want a report tomorrow...
Jen
"""

# Parse the raw text into a Message object, then decode the headers of interest.
# decode_header returns a list of (value, charset) pairs; only the first pair is used.
email_message = email.message_from_string(raw_message)
message_id, message_id_encoding = decode_header(email_message["Message-ID"])[0]
date_sent, date_encoding = decode_header(email_message["Date"])[0]
sender, sender_encoding = decode_header(email_message["From"])[0]
x_sender, x_sender_encoding = decode_header(email_message["X-From"])[0]
recipient, recipient_encoding = decode_header(email_message["To"])[0]
x_recipient, x_recipient_encoding = decode_header(email_message["X-To"])[0]
subject, subject_encoding = decode_header(email_message["Subject"])[0]

for part in email_message.walk():
    content_type = part.get_content_type()
    if content_type == "text/plain":
        email_body = part.get_payload(decode=True).decode()

        # this regex removes:
        # "Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM
        # To: jarnold@enron.com
        # cc:
        # Subject: Hi
        first_cleaning = regex.sub(r"((\W\w+\W).*(\d{2}:\d{2}:\d{2})\s(AM|PM)\n(To:.*)\n(cc:.*)\n(Subject:.*))",
                                   r' ', email_body)

        clean_body = expunge_doublespaces(first_cleaning.replace('\n', ' '))
        print(clean_body)
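
The same steps can also be packaged as a reusable function. The sketch below is one possible variant, not the gist author's method: it assumes the standard library's `email.policy.default`, which returns headers already decoded to `str`, so `decode_header` is not needed. The names `parse_message` and `REPLY_BANNER` are hypothetical. The pattern is adapted from the regex in the script, with an added `^` anchor (an assumption) so that a match cannot begin on the line before the quoted-reply banner.

```python
import email
import re
from email import policy

# Hypothetical pattern adapted from the script's regex. The ^ anchor with
# re.MULTILINE is an added assumption: the match must begin at a line start.
REPLY_BANNER = re.compile(
    r"^\W\w+\W.*\d{2}:\d{2}:\d{2}\s(?:AM|PM)\n(?:To:.*)\n(?:cc:.*)\n(?:Subject:.*)",
    re.MULTILINE,
)


def parse_message(raw_text):
    """Return selected headers and a whitespace-normalized plain-text body."""
    # policy.default yields an EmailMessage whose header values are decoded str
    message = email.message_from_string(raw_text, policy=policy.default)

    # get_body() returns the text/plain part, or None if there is none
    body_part = message.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""

    # Drop the quoted-reply banner, then collapse every whitespace run
    # (newlines included) into a single space
    body = REPLY_BANNER.sub(" ", body)
    clean_body = " ".join(body.split())

    return {
        "message_id": message["Message-ID"],
        "date": message["Date"],
        "sender": message["From"],
        "recipient": message["To"],
        "subject": message["Subject"],
        "body": clean_body,
    }
```

Calling `parse_message(raw_message)` with the sample message from the script returns a dictionary whose `"body"` entry holds the cleaned text. Headers that are absent from the message come back as `None`.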