A week or so ago I decided to get all of my archived mail out of Mac OS X's Mail.app and into a more readable format. I was a bit surprised to find that ever since 10.4, Mac OS X has stored its mail in an Apple-invented format called "EMLX" (well, that's what I'm calling it at least; each mail message is stored in a file that ends in ".emlx"). A very rough sketch of the file format:
- The first line of the file (beginning of the document to the first linefeed) is an ASCII-encoded number representing the size of the actual email message in bytes.
- Starting with the first byte after the linefeed is the email itself, exactly N bytes in size, where N is the number given on the first line.
- From the end of the email message to the end of the .emlx file is an XML-encoded Apple PList containing metadata about the email message (presumably for Spotlight).
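The three parts described above can be split apart in a few lines of Python. This is only a minimal sketch based on my rough description of the format, not code taken from the script below; the function name parse_emlx is made up for illustration, and it assumes the whole file fits in memory and that the trailing PList is something plistlib can read.

```python
import plistlib


def parse_emlx(data: bytes):
    """Split the raw bytes of an .emlx file into (message_bytes, metadata_dict).

    Hypothetical helper: assumes the layout sketched above (length line,
    message, trailing PList) and nothing more.
    """
    # Part 1: everything up to the first linefeed is an ASCII byte count.
    header, sep, rest = data.partition(b"\n")
    if not sep:
        raise ValueError("no length line found")
    size = int(header.decode("ascii").strip())
    if size > len(rest):
        raise ValueError("length line claims more bytes than the file holds")

    # Part 2: the email itself is exactly `size` bytes.
    message = rest[:size]

    # Part 3: whatever is left over is the PList with the metadata.
    trailer = rest[size:].strip()
    metadata = plistlib.loads(trailer) if trailer else {}
    return message, metadata
```

Reading a file with open(path, "rb").read() and passing the result to parse_emlx gives you the raw message, which is the only part an mbox file needs.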
I really don't care for the .emlx file format. The only application that's able to read it is Mail.app. Really, the only reason it exists is that Apple wants messages in Mail.app to be indexable by Spotlight, and Spotlight requires a one-to-one relationship between files and search results.
Anyway, I searched around a bit for an app that would convert my messages from .emlx to mbox. The only thing I could find was this app, which was graphical. It requires you to drag and drop your individual .emlx files from the Finder into the app, which is an absolutely horrific way to make users pass input to your app (I have many tens of thousands of messages, and getting Finder to pass that many files via drag-n-drop without crashing is really difficult). Additionally, the application dropped some of my messages. So I wrote my own quick Python script to convert .emlx files into one giant mbox-formatted file. I've posted the script here:
http://www.brownjava.org/files/emlx2mbox.py
Usage: emlx2mbox.py [mbox file] [emlx files...]
The script never overwrites the mbox file you specify; it only appends to it, so you can pass an existing mail spool (so long as nothing else is writing to it). It doesn't do any sort of mail spool locking, so you may want to lock your mail spool manually first or work on a copy. The appending behavior also means you can use xargs to pass the script thousands or millions of emails if you like (e.g. 'find . -name "*.emlx" -print0 | xargs -0 emlx2mbox.py mbox_file'; the -print0/-0 pair keeps paths that contain spaces intact, which mailbox names can).
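If you do want locking, Python's standard mailbox module can take care of it. The following is a rough sketch of the append-only idea, not what emlx2mbox.py actually does; append_messages is a hypothetical helper, and it assumes each message is already available as raw bytes.

```python
import mailbox


def append_messages(mbox_path, messages):
    """Append raw message bytes to an mbox file, holding a lock while writing.

    Hypothetical helper for illustration; not part of emlx2mbox.py.
    """
    # mailbox.mbox creates the file if it is missing and never truncates it.
    box = mailbox.mbox(mbox_path)
    box.lock()
    try:
        for raw in messages:
            # add() appends to the end of the file, writes a "From " separator
            # line, and escapes body lines that start with "From ".
            box.add(raw)
        box.flush()
    finally:
        box.unlock()
        box.close()
```

Since add() only ever appends, calling this repeatedly on the same file behaves like the script does under xargs: each call leaves the existing messages alone and adds the new ones to the end.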