@johnbumgarner
Created January 2, 2021 17:02
This script parses an email message using Python's built-in email module.
######################################################################################
# The re module is part of the Python standard library and provides regular
# expression matching operations similar to those found in Perl.
######################################################################################
import re as regex
######################################################################################
# The email module is part of the Python standard library and provides
# functions for parsing, constructing, and managing email messages.
# It does not send messages.
#
# email.header.decode_header - decodes a message header value without converting the
# character set
#
# ref: https://python.readthedocs.io/en/stable/library/email.html#module-email
######################################################################################
import email
from email.header import decode_header
def expunge_doublespaces(raw_string):
    """
    This function is designed to remove all double spaces
    from the input string.
    :param raw_string: string to clean
    :return: clean string
    """
    # Base case: no double spaces remain.
    if '  ' not in raw_string:
        return raw_string
    # Replace each double space with a single space, then check again.
    return expunge_doublespaces(raw_string.replace('  ', ' '))
raw_message = r"""Message-ID: <12626409.1075857596370.JavaMail.evans@thyme>
Date: Tue, 17 Oct 2000 10:36:00 -0700 (PDT)
From: john.arnold@enron.com
To: jenwhite7@zdnetonebox.com
Subject: Re: Hi
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: John Arnold
X-To: "Jennifer White" <jenwhite7@zdnetonebox.com> @ ENRON
X-cc:
X-bcc:
X-Folder: \John_Arnold_Dec2000\Notes Folders\'sent mail
X-Origin: Arnold-J
X-FileName: Jarnold.nsf

So, what is it? And by the way, don't start with the excuses. You're
expected to be a full, gourmet cook.
Kisses, not music, makes cooking a more enjoyable experience.
"Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM
To: jarnold@enron.com
cc:
Subject: Hi
I told you I have a long email address.
I've decided what to prepare for dinner tomorrow. I hope you aren't
expecting anything extravagant because my culinary skills haven't been
put to use in a while. My only request is that your stereo works. Music
makes cooking a more enjoyable experience.
Watch the debate if you are home tonight. I want a report tomorrow...
Jen
"""
email_message = email.message_from_string(raw_message)
# decode_header returns a list of (value, charset) pairs; only the first pair
# of each header is used here.
message_id, message_id_encoding = decode_header(email_message["Message-ID"])[0]
date_sent, date_encoding = decode_header(email_message["Date"])[0]
sender, sender_encoding = decode_header(email_message["From"])[0]
x_sender, x_sender_encoding = decode_header(email_message["X-From"])[0]
recipient, recipient_encoding = decode_header(email_message["To"])[0]
x_recipient, x_recipient_encoding = decode_header(email_message["X-To"])[0]
subject, subject_encoding = decode_header(email_message["Subject"])[0]
for part in email_message.walk():
    content_type = part.get_content_type()
    if content_type == "text/plain":
        email_body = part.get_payload(decode=True).decode()
        # this regex removes:
        # "Jennifer White" <jenwhite7@zdnetonebox.com> on 10/17/2000 04:19:20 PM
        # To: jarnold@enron.com
        # cc:
        # Subject: Hi
        first_cleaning = regex.sub(r"((\W\w+\W).*(\d{2}:\d{2}:\d{2})\s(AM|PM)\n(To:.*)\n(cc:.*)\n(Subject:.*))",
                                   r' ', email_body)
        clean_body = expunge_doublespaces(first_cleaning.replace('\n', ' '))
        print(clean_body)
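
The steps above could also be wrapped in a reusable function. The following is a minimal sketch under stated assumptions, not the author's method: it swaps in `email.policy.default` (so `get_body` and `get_content` are available on the parsed message) and collapses all whitespace with a single `re.sub` instead of the recursive helper. The names `parse_email` and `QUOTED_HEADER` are hypothetical, and the pattern is a simplified adaptation of the regex used in the script.

```python
import re
from email import message_from_string
from email.policy import default as default_policy

# Hypothetical pattern (an assumption, adapted from the script's regex): matches a
# quoted-reply banner line ending in a time and AM/PM, followed by To:, cc:,
# and Subject: lines.
QUOTED_HEADER = re.compile(
    r"^.*\d{2}:\d{2}:\d{2}\s(?:AM|PM)\nTo:.*\ncc:.*\nSubject:.*$",
    re.MULTILINE,
)


def parse_email(raw_message):
    """Return selected headers and a whitespace-normalized plain-text body."""
    message = message_from_string(raw_message, policy=default_policy)

    # For a single-part text/plain message, get_body returns the message itself.
    body_part = message.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""

    # Drop the quoted-reply banner, then collapse every whitespace run
    # (including newlines) into a single space.
    body = QUOTED_HEADER.sub(" ", body)
    body = re.sub(r"\s+", " ", body).strip()

    def header(name):
        # Missing headers come back as None; present ones are converted to str.
        value = message[name]
        return str(value) if value is not None else None

    return {
        "message_id": header("Message-ID"),
        "sender": header("From"),
        "recipient": header("To"),
        "subject": header("Subject"),
        "body": body,
    }
```

Calling `parse_email(raw_message)` with the message defined earlier should return a dictionary whose `body` entry holds the cleaned text. The single `re.sub(r"\s+", " ", ...)` call also handles tabs and runs of three or more spaces, which may or may not be desirable depending on the data.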