@karlcow
Created September 9, 2014 05:15
emlx to mbox for Mac OS X. The original website has disappeared, so here is the code. http://web.archive.org/web/20130905074537/http://brownjava.org/2007/08/emlx2mboxpy.html

A week or so ago I decided I wanted to get all of my archived mail out of Mac OS X's Mail.app and into a more readable format. I was a bit surprised to find that, since 10.4, Mac OS X has stored its mail in an Apple-invented format that I'm calling "EMLX", because each mail message is stored in its own file ending in ".emlx". A very rough sketch of the file format:

  1. The first line of the file (beginning of the document to the first linefeed) is an ASCII-encoded number representing the size of the actual email message in bytes.
  2. The email itself starts at the first byte after the linefeed and is exactly N bytes long, where N is the number from item 1.
  3. The rest of the file, from the end of the email message to the end of the .emlx file, is an XML-encoded Apple property list (plist) containing metadata about the email message (presumably for Spotlight).
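
The three-part layout above can be sketched in a few lines of Python 3. This is a minimal illustration rather than the script from this post: `parse_emlx` is a hypothetical helper name, and it assumes the trailing plist is something the standard `plistlib` module can read.

```python
import plistlib


def parse_emlx(data):
    """Split raw .emlx bytes into (message_bytes, metadata_dict)."""
    # Part 1: everything up to the first linefeed is the ASCII byte count.
    header, sep, rest = data.partition(b"\n")
    if not sep:
        raise ValueError("no length line found")
    size = int(header.decode("ascii").strip())  # ValueError if not a number
    if size > len(rest):
        raise ValueError("declared size exceeds the data available")
    # Part 2: the email is exactly `size` bytes long.
    message = rest[:size]
    # Part 3: whatever follows is the plist trailer (it may be absent).
    trailer = rest[size:].strip()
    metadata = plistlib.loads(trailer) if trailer else {}
    return message, metadata
```

Reading a file in binary mode and passing its contents to `parse_emlx` keeps the byte count meaningful even when the message contains non-ASCII characters.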

I really don't care for the .emlx file format. The only application that's able to read it is Mail.app. The only reason it exists is that Apple wants messages in Mail.app to be indexable by Spotlight, and Spotlight requires a one-to-one relationship between files and search results.

Anyway, I searched around a bit for an app that would convert my messages from .emlx to mbox. The only thing I could find was this app, which was graphical. It requires you to drag and drop your individual .emlx files from the Finder into the app, which is an absolutely horrific way to make users pass input (I have many tens of thousands of messages, and getting Finder to pass that many files via drag-and-drop without crashing is really difficult). Additionally, the application dropped some of my messages. So I wrote my own quick Python script to convert .emlx files into one giant mbox-formatted file. I've posted the script here:

http://www.brownjava.org/files/emlx2mbox.py
Usage: emlx2mbox.py [mbox file] [emlx files...]

The script never overwrites the mbox file you specify; it only appends to it, so you can pass an existing mail spool (so long as nothing else is writing to it). It doesn't do any mail spool locking, so you may want to lock your mail spool manually first, or work on a copy. Because it appends, you can also use xargs to pass the script thousands or millions of emails across several invocations (e.g. 'find . -name "*.emlx" -print0 | xargs -0 emlx2mbox.py mbox_file'; the -print0/-0 pair keeps paths containing spaces intact).
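
If the lack of locking is a concern, one possible alternative is to let Python's standard `mailbox` module do the appending, since it offers `lock()`/`unlock()` and escapes body lines that begin with "From ". The following is only a sketch under stated assumptions: it targets Python 3, `read_emlx_message` and `append_tree_to_mbox` are hypothetical names, and the separator line is whatever `mailbox` generates by default for raw bytes rather than one built from the message headers.

```python
import mailbox
import os


def read_emlx_message(path):
    """Return the raw email bytes stored in one .emlx file."""
    with open(path, "rb") as f:
        size = int(f.readline().decode("ascii").strip())
        return f.read(size)


def append_tree_to_mbox(root, mbox_path):
    """Append every .emlx file under root to mbox_path; return the count."""
    box = mailbox.mbox(mbox_path)  # created if missing, existing mail is kept
    count = 0
    box.lock()  # hold the mailbox lock while appending
    try:
        for dirpath, _dirs, files in os.walk(root):
            for name in sorted(files):
                if name.endswith(".emlx"):
                    box.add(read_emlx_message(os.path.join(dirpath, name)))
                    count += 1
        box.flush()
    finally:
        box.unlock()
        box.close()
    return count
```

Walking the directory inside the script also sidesteps the xargs step, at the cost of the `mailbox` module scanning the existing mbox file once when it is opened.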

#!/usr/bin/env python
import sys
import re

class InvalidEmlxFileException(Exception):
    pass

class CouldNotConstructFromLineException(Exception):
    pass

class BadFromHeaderException(CouldNotConstructFromLineException):
    pass

class BadDateHeaderException(CouldNotConstructFromLineException):
    pass

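def getHeader(msg, headerName):
    # Return the value of the first matching header, or None if it is absent.
    for msgline in msg.split("\n"):
        if msgline.strip() == "":
            break  # a blank line ends the header section
        ha = msgline.split(":", 1)
        if len(ha) == 2 and ha[0] == headerName:
            return ha[1]
    return None
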
def getFromLine(msg):
    # Build the mbox "From " separator line from the From and Date headers.
    from_header = getHeader(msg, 'From')
    date_header = getHeader(msg, 'Date')
    if from_header is None:
        raise BadFromHeaderException("Could not find 'From' header in msg: " + msg)
    if date_header is None:
        raise BadDateHeaderException("Could not find 'Date' header in msg: " + msg)
    fromline = "From "
    # Prefer the address inside angle brackets, else accept a bare address.
    m = re.match(r"^.*<(.*)>.*$", from_header)
    if m is not None:
        fromline += m.group(1) + " "
    else:
        m = re.match(r"^ *(.*@.*) *$", from_header)
        if m is not None:
            fromline += m.group(1) + " "
        else:
            raise BadFromHeaderException("Couldn't interpret From header: " + from_header)
    # Expects e.g. " Tue, 9 Sep 2014 05:15:00 +0000" (note the leading space).
    m = re.match(r"^ +(...), +(\d+) +(...) +(\d\d\d\d) +(\d\d):(\d\d):(\d\d) +.*$", date_header)
    if m is None:
        raise BadDateHeaderException("Couldn't interpret Date header: " + date_header)
    # Reorder into "Day Mon DD HH:MM:SS YYYY".
    fromline += "%s %s %s %s:%s:%s %s" % (
        m.group(1), m.group(3), m.group(2), m.group(5), m.group(6), m.group(7), m.group(4))
    return fromline

def main():
    # check arguments
    if len(sys.argv) < 3:
        print("usage: %s [mbox file] [emlx files...]" % sys.argv[0])
        sys.exit(1)
    mbox_file = sys.argv[1]
    emlx_files = sys.argv[2:]
    # open mbox file for appending; binary mode keeps the message bytes intact
    mf = open(mbox_file, 'ab')
    # open each emlx file
    for emlx_file in emlx_files:
        try:
            # Read message: the first line is the byte count, then the message
            ef = open(emlx_file, 'rb')
            try:
                size = int(ef.readline().strip())
            except ValueError:
                ef.close()
                raise InvalidEmlxFileException("Couldn't interpret size of Emlx file \"" + emlx_file + "\"")
            msg = ef.read(size)
            ef.close()
            # Construct from line (decoding as Latin-1 never fails on raw bytes)
            try:
                from_line = getFromLine(msg.decode('latin-1'))
            except CouldNotConstructFromLineException:
                from_line = "From unknown Sun Jan 0 00:00:00 1900"
            mf.write(from_line.encode('latin-1'))
            mf.write(b"\n")
            mf.write(msg)
            mf.write(b"\n")
        except InvalidEmlxFileException as e:
            print(e)
    # Close mail spool
    mf.close()

if __name__ == '__main__':
    main()
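
The regex-based getFromLine above only accepts Date headers in one specific shape. As a hedged alternative, the standard `email` package can do the header parsing instead. This sketch assumes Python 3; `build_from_line` and `FALLBACK_SENDER` are hypothetical names, and the epoch fallback date is a choice made for this example, not the behavior of the original script.

```python
import email
import time
from email.utils import parseaddr, parsedate_to_datetime

# Placeholder sender used when no address can be found (an assumption of
# this sketch, mirroring the "unknown" sender in the script above).
FALLBACK_SENDER = "unknown"


def build_from_line(raw_message):
    """Return an mbox "From " separator line for a message given as bytes."""
    headers = email.message_from_bytes(raw_message)
    # parseaddr returns ('', '') when it cannot find an address.
    _name, address = parseaddr(str(headers.get("From", "")))
    sender = address or FALLBACK_SENDER
    try:
        # Keep the clock time as written in the header, ignoring the offset.
        when = parsedate_to_datetime(str(headers.get("Date", ""))).timetuple()
    except (TypeError, ValueError):
        when = time.gmtime(0)  # epoch as a stand-in for a missing/bad date
    return "From %s %s" % (sender, time.asctime(when))
```

The result could be written to the mbox file in place of the getFromLine output, followed by the message bytes and a blank line as in main.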