@benwattsjones
Last active November 14, 2024 10:24
Quick python code to parse mbox files, specifically those used by GMail. Extracts sender, date, plain text contents etc., ignores base64 attachments.
#! /usr/bin/env python3
# ~*~ utf-8 ~*~

import mailbox
import bs4


def get_html_text(html):
    try:
        return bs4.BeautifulSoup(html, 'lxml').body.get_text(' ', strip=True)
    except AttributeError: # message contents empty
        return None


class GmailMboxMessage():
    def __init__(self, email_data):
        if not isinstance(email_data, mailbox.mboxMessage):
            raise TypeError('Variable must be type mailbox.mboxMessage')
        self.email_data = email_data

    def parse_email(self):
        email_labels = self.email_data['X-Gmail-Labels']
        email_date = self.email_data['Date']
        email_from = self.email_data['From']
        email_to = self.email_data['To']
        email_subject = self.email_data['Subject']
        email_text = self.read_email_payload()

    def read_email_payload(self):
        email_payload = self.email_data.get_payload()
        if self.email_data.is_multipart():
            email_messages = list(self._get_email_messages(email_payload))
        else:
            email_messages = [email_payload]
        return [self._read_email_text(msg) for msg in email_messages]

    def _get_email_messages(self, email_payload):
        for msg in email_payload:
            if isinstance(msg, (list,tuple)):
                for submsg in self._get_email_messages(msg):
                    yield submsg
            elif msg.is_multipart():
                for submsg in self._get_email_messages(msg.get_payload()):
                    yield submsg
            else:
                yield msg

    def _read_email_text(self, msg):
        content_type = 'NA' if isinstance(msg, str) else msg.get_content_type()
        encoding = 'NA' if isinstance(msg, str) else msg.get('Content-Transfer-Encoding', 'NA')
        if 'text/plain' in content_type and 'base64' not in encoding:
            msg_text = msg.get_payload()
        elif 'text/html' in content_type and 'base64' not in encoding:
            msg_text = get_html_text(msg.get_payload())
        elif content_type == 'NA':
            msg_text = get_html_text(msg)
        else:
            msg_text = None
        return (content_type, encoding, msg_text)


######################### End of library, example of use below

mbox_obj = mailbox.mbox('path/to/your-mbox-file-from-gmail.mbox')

num_entries = len(mbox_obj)

for idx, email_obj in enumerate(mbox_obj):
    email_data = GmailMboxMessage(email_obj)
    email_data.parse_email()
    print('Parsing email {0} of {1}'.format(idx, num_entries))
@npup

npup commented Mar 29, 2022

row 68, suggestion: print('Parsing email {0} of {1}'.format(idx + 1, num_entries)) (idx fix)
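
An equivalent way to get 1-based progress counts is to pass `start=1` to `enumerate`, so no `+ 1` is needed in the format call. A minimal sketch; `progress_lines` is a hypothetical helper used only to make the output easy to inspect:

```python
def progress_lines(items):
    """Build one progress message per item, counting from 1."""
    total = len(items)
    return [
        'Parsing email {0} of {1}'.format(n, total)
        for n, _ in enumerate(items, start=1)  # start=1 gives 1-based counts
    ]


lines = progress_lines(['first', 'second'])
```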

@tripleee

tripleee commented Jun 5, 2022

The comment # ~*~ utf-8 ~*~ is useless; the default source code encoding for Python 3 is UTF-8 anyway, and if you wanted to communicate this fact to Emacs etc, the proper format uses dashes, not tildes, and a token coding: before the encoding name. See PEP-263.
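
To illustrate the point, here is a small check based on the regular expression given in PEP 263 for recognising an encoding declaration. The helper name `declared_encoding` is made up for this sketch:

```python
import re

# Pattern for a source-encoding declaration, as described in PEP 263.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(line):
    """Return the encoding named in a declaration comment, or None."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None


emacs_style = declared_encoding("# -*- coding: utf-8 -*-")
tilde_style = declared_encoding("# ~*~ utf-8 ~*~")
```

The dashed form with the `coding:` token is recognised; the tilde form has no `coding` token, so it is treated as an ordinary comment.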

@redcay

redcay commented Aug 4, 2022

Thanks! This helped me out. In parse_email(), would it make more sense to assign the email parts to instance variables? E.g.
self.email_from = self.email_data['From']
instead of
email_from = self.email_data['From']
Otherwise, how is a user of this class meant to access these?

@NickCrews

@benwattsjones Brilliant, thanks for posting this!

@redcay yes, that or something like it is needed.

@dtaylor79

Thanks @redcay I added this:

        self.email_from = self.email_data['From']
        email_to = self.email_data['To']
        self.email_to = self.email_data['To']
        email_subject = self.email_data['Subject']
        self.email_subject = self.email_data['Subject']
        email_text = self.read_email_payload() 
        self.email_text = self.read_email_payload()

Then I was able to get the response I was looking for in

    email_data.parse_email()
    print(email_data.email_from)
    print(email_data.email_to)
    print(email_data.email_subject)
    print(email_data.email_text)
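
A trimmed-down sketch of the same idea, without the leftover local variables. `HeaderView` is a hypothetical stand-in for `GmailMboxMessage` that handles headers only and accepts any `email.message.Message`, so it can be tried without an mbox file:

```python
import email


class HeaderView:
    """Hypothetical, headers-only stand-in for GmailMboxMessage."""

    def __init__(self, email_data):
        self.email_data = email_data

    def parse_email(self):
        # Store each header on the instance so callers can read it later.
        # Missing headers come back as None.
        self.email_labels = self.email_data['X-Gmail-Labels']
        self.email_date = self.email_data['Date']
        self.email_from = self.email_data['From']
        self.email_to = self.email_data['To']
        self.email_subject = self.email_data['Subject']
        return self


raw = "From: a@example.com\nTo: b@example.com\nSubject: Hi\n\nbody\n"
view = HeaderView(email.message_from_string(raw)).parse_email()
```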

@jasonm23

jasonm23 commented May 18, 2023

If you want to preserve threads, google uses:

self.email_data['X-GM-THRID'] # Google Conversation Thread ID
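
One possible way to use that header is to bucket messages by thread ID. This is only a sketch: `group_by_thread` is a hypothetical helper, and messages lacking the header end up under the key `None`.

```python
from collections import defaultdict
from email.message import Message


def group_by_thread(messages):
    """Map each X-GM-THRID value to the list of messages carrying it."""
    threads = defaultdict(list)
    for msg in messages:
        threads[msg['X-GM-THRID']].append(msg)  # None if header is absent
    return dict(threads)


def _make(thread_id, subject):
    # Build a tiny message for demonstration purposes only.
    msg = Message()
    if thread_id is not None:
        msg['X-GM-THRID'] = thread_id
    msg['Subject'] = subject
    return msg


grouped = group_by_thread([
    _make('111', 'Hello'),
    _make('222', 'Other'),
    _make('111', 'Re: Hello'),
    _make(None, 'No thread header'),
])
```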

For some messages (when Gmail would store Google Chat messages in the mailbox) there is no email_data['Date'] header, instead a usable date, plus other junk, is stored in email_data['Received']

For example

message["Date"] is None
message["Received"] == "by 10.220.171.77 with SMTP id g13cs116810vcz;        Wed, 9 Mar 2011 17:10:56 -0800 (PST)"

So you could extract with:

import re

# received_str = self.email_data['Received']
# grab the date parser friendly date segment:
# "9 Mar 2011 17:10:56 -0800"

# Example
received_str = "by 10.220.171.77 with SMTP id g13cs116810vcz;        Wed, 9 Mar 2011 17:10:56 -0800 (PST)"
# No leading or trailing \s+ here, so the match is exactly the date segment
pattern = r"\d{1,2}\s+\w{3}\s+\d{4}\s+\d{2}:\d{2}:\d{2}\s+[+-]\d{4}"
match = re.search(pattern, received_str)

if match:
    date = match.group(0)
    print(date)  # Output: 9 Mar 2011 17:10:56 -0800
else:
    print("Date not found.")
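
As an alternative to a regular expression, the standard library's `email.utils.parsedate_to_datetime` can parse the timestamp directly. The sketch below assumes the date is whatever follows the last semicolon of the `Received` header, as in the example above; `date_from_received` is a hypothetical helper name.

```python
from email.utils import parsedate_to_datetime


def date_from_received(received):
    """Return an aware datetime parsed from a Received header, or None."""
    if not received:
        return None
    # Assumption: the timestamp follows the last semicolon of the header.
    _, sep, tail = received.rpartition(";")
    if not sep:
        return None
    try:
        return parsedate_to_datetime(tail.strip())
    except (TypeError, ValueError):  # raised when the text is not a date
        return None


example = ("by 10.220.171.77 with SMTP id g13cs116810vcz;"
           "        Wed, 9 Mar 2011 17:10:56 -0800 (PST)")
parsed = date_from_received(example)
```

This returns a timezone-aware `datetime`, which avoids a second parsing step on the extracted string.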

@SusanPina

SusanPina commented Oct 14, 2023

The provided Python script, "gmail_mbox_parser.py," is a handy tool for parsing Gmail mbox files. It extracts sender, date, and plain text contents while ignoring base64 attachments. Ensure you've installed the required libraries, specify your Mbox file path, and run the script to process your Gmail emails efficiently. If you're wondering where to get help with your essays, this link has you covered. It examines crucial factors like the track record of the service, the calibre of their service, and their pricing. It also includes a list of the website writes essays for you offering top-notch essay-writing services. Therefore, these websites are the best bet if you're having trouble writing an essay, need some advice, or just want a professional to do it. It will make your academic life easier, similar to having a personal essay assistant.

@jasonm23

Oh goody AI generated comments... that say nothing useful

@dileepainivossl

Thanks @redcay I added this:

        self.email_from = self.email_data['From']
        email_to = self.email_data['To']
        self.email_to = self.email_data['To']
        email_subject = self.email_data['Subject']
        self.email_subject = self.email_data['Subject']
        email_text = self.read_email_payload() 
        self.email_text = self.read_email_payload()

Then I was able to get the response I was looking for in

    email_data.parse_email()
    print(email_data.email_from)
    print(email_data.email_to)
    print(email_data.email_subject)
    print(email_data.email_text)

Is there a way to get only the text from this method, just the email body without the other data?

email_text = self.read_email_payload()

@dileepainivossl

dileepainivossl commented Feb 17, 2024

email_text does not contain pure email text; there is noisy data (like css tags). Is there a method to get the pure text?

@tripleee

"Is there a way" yes; but it depends on what exactly you mean. A common solution to extracting just the text from HTML payloads is to run Beautifulsoup on the HTML. If you want to trim off quoted text from earlier messages in a thread, I don't know of any existing libraries for that (but that doesn't mean there aren't any). Similarly, you might want to trim signature blocks (honest-to-RFC signatures start with newline, dash, dash, space, newline; but very few modern signatures adhere to this convention).

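The signature convention mentioned above (a line consisting of dash, dash, space) can be handled with a few lines of plain Python. This is a rough sketch, not a complete solution: `strip_signature` is a hypothetical helper, and cutting at the last delimiter line is a design choice made here, not something the convention mandates.

```python
def strip_signature(text):
    """Drop everything from the last '-- ' delimiter line onwards."""
    lines = text.replace("\r\n", "\n").split("\n")
    # Walk backwards so the last delimiter line wins.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i] == "-- ":
            return "\n".join(lines[:i]).rstrip("\n")
    return text  # no delimiter found; leave the text untouched


body = strip_signature("Hello\nWorld\n-- \nBen\nphone: 555")
```

Signatures that do not use the delimiter are left in place, which matches the caveat that few modern signatures follow the convention.
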
See also:

@scottgigante

For anyone who wants useful output out of the HTML: you'll want

msg.get_payload(decode=True).decode()

This would have saved me a lot of heartburn.
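
To show what `decode=True` changes: without it, `get_payload()` returns the body still in its transfer encoding (quoted-printable or base64), which is one likely source of noisy output. The sketch below uses only the standard library; `decoded_payload` is a hypothetical helper, and falling back to UTF-8 when no charset is declared is an assumption.

```python
from email import message_from_string

RAW = (
    'Content-Type: text/html; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<p>Caf=C3=A9 =3D open</p>\n"
)


def decoded_payload(part):
    """Return the part's body as str, with the transfer encoding undone."""
    raw_bytes = part.get_payload(decode=True)
    if raw_bytes is None:  # multipart containers have no body of their own
        return None
    # Assumption: fall back to UTF-8 when the part declares no charset.
    charset = part.get_content_charset() or "utf-8"
    return raw_bytes.decode(charset, errors="replace")


part = message_from_string(RAW)
still_encoded = part.get_payload()   # quoted-printable text, not yet decoded
text = decoded_payload(part)         # readable HTML
```

The decoded string can then be passed to `get_html_text` from the gist in place of the raw payload.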
