Background: I am moving away from GMail to my own Haskell-based
server (SMTP receiver/sender, web-based email client, spam filtering,
etc.). All email to
goes through this server
) as of today, and
forwards a
copy of everything to it.
This is a summary/tracking document of my efforts to simply parse email messages in Haskell for the SMTP receiver.
The problem: There are many packages on Hackage capable of parsing some or all of an email message, but almost all of them are incomplete in some way: they are old and use String everywhere, have encoding problems, are not streaming, or have bugs.
Without exception, they are all poorly documented.
- If you're looking to reliably parse emails in Haskell today, you will be disappointed.
- Apparently none of them have been used in a real setting.
- You will have to make correctness vs performance trade-offs.
Ideally, they would all be deprecated, in favor of a package (or one of them) that looks like this:
- Written in pure Haskell.
- Is well documented!
- Has a thorough test suite including full email samples from real servers.
- Uses modern libraries (time, bytestring, text, attoparsec).
- Uses attoparsec for:
  - Fast parsing.
  - Streaming parsing.
- Handles multiparts properly.
- Provides a SAX-style streaming interface, so that:
  - Message parts can be streamed to file, database, or network.
  - We can have conduit and pipes interfaces.
- Uses ByteString for everything, except where another type is appropriate (e.g. a part which is known to have a UTF-8 text encoding can be decoded into Text).
- Has a benchmark suite.
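To make the SAX-style idea in the wish list concrete, here is a minimal sketch of what such an interface could look like. The `MimeEvent` type and the `partSizes` consumer are hypothetical names made up for illustration; they are not taken from any existing package. The consumer only keeps a running byte count, so no part body is ever held in memory as a whole. It tracks flat (non-nested) parts only.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as S8
import Data.List (foldl')

-- | Hypothetical event stream a SAX-style MIME parser could emit.
data MimeEvent
  = PartBegin [(ByteString, ByteString)] -- ^ Headers of the part that starts.
  | PartChunk ByteString                 -- ^ A piece of the part's body.
  | PartEnd                              -- ^ The current part is finished.
  deriving (Eq, Show)

-- | Consumer state: sizes of finished parts (newest first) and the
-- running size of the part currently being read.
data Sizes = Sizes
  { finished :: [Int]
  , current :: !Int
  }

-- | Process one event; body chunks are counted and then dropped.
step :: Sizes -> MimeEvent -> Sizes
step s (PartBegin _) = s {current = 0}
step s (PartChunk c) = s {current = current s + S8.length c}
step s PartEnd = s {finished = current s : finished s, current = 0}

-- | Body size of every part, in order of appearance.
partSizes :: [MimeEvent] -> [Int]
partSizes = reverse . finished . foldl' step (Sizes [] 0)

-- | A hand-written event stream standing in for real parser output.
exampleEvents :: [MimeEvent]
exampleEvents =
  [ PartBegin [(S8.pack "Content-Type", S8.pack "text/plain")]
  , PartChunk (S8.pack "ab")
  , PartChunk (S8.pack "cde")
  , PartEnd
  , PartBegin []
  , PartEnd
  ]
```

A consumer that writes each `PartChunk` to a file handle or a database row would have the same shape as `step`, and the same event type could be yielded from a conduit or a pipe.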
(c) 2006-2009 Galois Inc.
I am currently using this package.
- It can parse everything that I have received so far on my server after a week, from postfix, outlook and gmail servers, and mailing lists (mailman and the kernel), with multiple attachments.
- It didn't handle \n as a line separator in QuotedPrintable. I know, this isn't standard, but I received an email like this from Haskell-Cafe. I patched it.
- It has performance bugs. I found an O(n^2) time complexity bug
in the normalizeCLRF function, which caused my server to spin at
100% for minutes at a time while receiving attachments, causing the
mail to have to be re-sent for days. I
fixed the bug
by changing its output type to a
.
- It has no test suite.
- It was based on
, now it uses Text
in a misguided attempt to add correctness. Unfortunately, the user of the library is forced to shoehorn binary data and non-UTF8 data into and out of a Text
value. I've already had to work around bugs due to this.
- It's not a streaming parser.
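For the quadratic line-ending bug mentioned in the list above, one common way to get linear time is to emit the output through a `Builder` instead of repeatedly concatenating strict values. The sketch below illustrates that general technique and is not necessarily the patch that was applied to the package. The name `normalizeLineEndings` is hypothetical, and it works on ByteString rather than Text.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as S
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Char8 as S8
import qualified Data.ByteString.Lazy as L

-- | Hypothetical helper: rewrite bare CR and bare LF to CRLF in one pass.
-- Each input byte is inspected once and Builder appends are O(1).
normalizeLineEndings :: ByteString -> L.ByteString
normalizeLineEndings = B.toLazyByteString . go
  where
    crlf = B.byteString (S8.pack "\r\n")
    isBreak w = w == 13 || w == 10 -- CR or LF
    -- Copy everything up to the next line break, then handle the break.
    go bs =
      let (line, rest) = S.break isBreak bs
       in B.byteString line <> next rest
    next rest =
      case S.uncons rest of
        Nothing -> mempty
        Just (13, more) -> crlf <> go (dropLeadingLF more) -- CR or CRLF
        Just (_, more) -> crlf <> go more                  -- bare LF
    -- After a CR, swallow the LF that completes an existing CRLF pair.
    dropLeadingLF more =
      case S.uncons more of
        Just (10, more') -> more'
        _ -> more

-- | Mixed line endings in, uniform CRLF out.
example :: L.ByteString
example = normalizeLineEndings (S8.pack "a\nb\r\nc\rd")
```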
> :!cat > in.txt
From: John Doe <>
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="XXXXboundary text"

This is a multipart message in MIME format.

--XXXXboundary text
Content-Type: text/plain

this is the body text

--XXXXboundary text
Content-Type: text/plain;
Content-Disposition: attachment;
 filename="test.txt"

this is the attachment text

--XXXXboundary text--
> import qualified Data.Text.IO as T
> fmap parseMIMEMessage (T.readFile "in.txt")
MIMEValue {mime_val_type = Type {mimeType = Multipart Mixed, mimeParams = [MIMEParam {paramName = "boundary", paramValue = "XXXXboundary text"}]}, mime_val_disp = Nothing, mime_val_content = Multi [MIMEValue {mime_val_type = Type {mimeType = Text "plain", mimeParams = []}, mime_val_disp = Nothing, mime_val_content = Single "this is the body text\r\n", mime_val_headers = [MIMEParam {paramName = "content-type", paramValue = "text/plain"}], mime_val_inc_type = True},MIMEValue {mime_val_type = Type {mimeType = Text "plain", mimeParams = []}, mime_val_disp = Just (Disposition {dispType = DispAttachment, dispParams = [Filename "test.txt"]}), mime_val_content = Single "this is the attachment text\r\n", mime_val_headers = [MIMEParam {paramName = "content-type", paramValue = "text/plain;"},MIMEParam {paramName = "content-disposition", paramValue = "attachment; filename=\"test.txt\""}], mime_val_inc_type = True}], mime_val_headers = [MIMEParam {paramName = "from", paramValue = "John Doe <>"},MIMEParam {paramName = "mime-version", paramValue = "1.0"},MIMEParam {paramName = "content-type", paramValue = "multipart/mixed; boundary=\"XXXXboundary text\""}], mime_val_inc_type = True}
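The quoted-printable problem described earlier (soft line breaks written as `=` followed by a bare `\n` instead of `\r\n`) can be handled by accepting both forms. The following is a minimal sketch written for this document, not the patch applied to the package; `decodeQP` and `hexVal` are hypothetical names. It goes through a list of bytes for clarity rather than speed, and leaves malformed escapes untouched.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as S
import qualified Data.ByteString.Char8 as S8
import Data.Word (Word8)

-- | Value of one ASCII hex digit, if it is one.
hexVal :: Word8 -> Maybe Word8
hexVal w
  | w >= 48 && w <= 57 = Just (w - 48)  -- '0'..'9'
  | w >= 65 && w <= 70 = Just (w - 55)  -- 'A'..'F'
  | w >= 97 && w <= 102 = Just (w - 87) -- 'a'..'f'
  | otherwise = Nothing

-- | Decode quoted-printable, tolerating "=\n" as well as "=\r\n"
-- as a soft line break. Byte 61 is '=', 13 is CR, 10 is LF.
decodeQP :: ByteString -> ByteString
decodeQP = S.pack . go . S.unpack
  where
    go (61 : 13 : 10 : rest) = go rest -- standard soft break
    go (61 : 10 : rest) = go rest      -- non-standard soft break
    go (61 : a : b : rest)
      | Just hi <- hexVal a
      , Just lo <- hexVal b = (hi * 16 + lo) : go rest
    go (w : rest) = w : go rest        -- literal byte or malformed escape
    go [] = []

-- | Both soft-break styles and two escaped bytes in one input.
example :: ByteString
example = decodeQP (S8.pack "caf=C3=A9 so=\nft=\r\nbreak")
```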
Aycan iRiCAN
- It's a streaming parser.
- It does not seem to actually parse messages:
> fmap (parseOnly parseMimeHeaders) (S.readFile "/tmp/gmail.txt")
Right (MimeValue {mvType = Type {mimeType = Text "plain", mimeParams = fromList [("charset","us-ascii")]}, mvDisp = Nothing, mvContent = Multi [], mvHeaders = fromList [], mvIncType = True})
> fmap (parseOnly parseMimeHeaders) (S.readFile "/tmp/gmail-attachment.txt")
Right (MimeValue {mvType = Type {mimeType = Text "plain", mimeParams = fromList [("charset","us-ascii")]}, mvDisp = Nothing, mvContent = Multi [], mvHeaders = fromList [], mvIncType = True})
Same example from above:
> :!cat > in.txt
From: John Doe <>
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="XXXXboundary text"

This is a multipart message in MIME format.

--XXXXboundary text
Content-Type: text/plain

this is the body text

--XXXXboundary text
Content-Type: text/plain;
Content-Disposition: attachment;
 filename="test.txt"

this is the attachment text

--XXXXboundary text--
> fmap (parseOnly parseMimeHeaders) (S.readFile "in.txt")
Left "string"
2014-2018 Kyle Raftogianis
- It's a streaming parser.
- It doesn't actually parse a list of headers, just individual header values from a list of [(CI ByteString, ByteString)].
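The missing step, turning a raw header block into a list of name/value pairs, can be sketched with attoparsec as follows. This is an illustration written for this document, not code from the package above. All names (`headerField`, `headerBlock`, `parseHeaders`) are hypothetical, and the header names are left as plain ByteStrings; wrapping them in `CI` from the case-insensitive package would be a separate step. Folded continuation lines are joined with a single space.

```haskell
import Control.Applicative (many)
import Data.Attoparsec.ByteString.Char8
  ( Parser, char, endOfLine, parseOnly, satisfy, skipWhile, takeTill, takeWhile1 )
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as S8

isBlank :: Char -> Bool
isBlank c = c == ' ' || c == '\t'

isEol :: Char -> Bool
isEol c = c == '\r' || c == '\n'

-- | Rest of the current line, without its terminator (LF or CRLF).
restOfLine :: Parser ByteString
restOfLine = takeTill isEol <* endOfLine

-- | A folded continuation line: it starts with a space or a tab.
continuation :: Parser ByteString
continuation = satisfy isBlank *> skipWhile isBlank *> restOfLine

-- | One header field, with folded lines joined by single spaces.
headerField :: Parser (ByteString, ByteString)
headerField = do
  name <- takeWhile1 (\c -> c > ' ' && c < '\DEL' && c /= ':')
  _ <- char ':'
  skipWhile isBlank
  firstLine <- restOfLine
  more <- many continuation
  pure (name, S8.unwords (firstLine : more))

-- | All header fields, up to and including the blank separator line.
headerBlock :: Parser [(ByteString, ByteString)]
headerBlock = many headerField <* endOfLine

parseHeaders :: ByteString -> Either String [(ByteString, ByteString)]
parseHeaders = parseOnly headerBlock

-- | A small header block with one folded field.
example :: Either String [(ByteString, ByteString)]
example =
  parseHeaders
    (S8.pack "From: a\r\nContent-Type: multipart/mixed;\r\n boundary=\"x\"\r\n\r\nbody")
```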
Peter Simons, Ali Abrar, Gero Kriependorf, Marty Pauley
- It properly parses emails that I received.
- You can parse from a
- It uses the archaic
package, so you have to convert all the times to the modern time
package.
- It uses
everywhere.
- It does not parse the MIME bodies, so one has to manually handle
multipart messages; which I did for a while, before switching to the
package.
- It's using parsec, which is neither streaming nor efficient enough for handling megabytes of traffic.
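For contrast with the non-streaming point above, this is roughly what incremental parsing looks like with attoparsec: the parser is suspended in a `Partial` result whenever it runs out of input and is resumed with `feed` when the next chunk arrives, so chunk boundaries can fall anywhere. The sketch is only an illustration; `runChunks` and `line` are hypothetical names, and a list of chunks stands in for a socket.

```haskell
import Data.Attoparsec.ByteString (IResult (..), Parser, feed, parse)
import qualified Data.Attoparsec.ByteString.Char8 as A8
import Data.ByteString (ByteString)
import qualified Data.ByteString as S
import qualified Data.ByteString.Char8 as S8
import Data.List (foldl')

-- | One line, terminated by LF or CRLF (terminator not included).
line :: Parser ByteString
line = A8.takeTill (\c -> c == '\r' || c == '\n') <* A8.endOfLine

-- | Run a parser over input that arrives in chunks.
-- Empty chunks are skipped because feeding an empty string tells
-- attoparsec that the input has ended; that is done once, at the end.
runChunks :: Parser a -> [ByteString] -> Either String a
runChunks p chunks =
  case feed (foldl' feed (parse p S.empty) nonEmpty) S.empty of
    Done _ r -> Right r
    Fail _ _ msg -> Left msg
    Partial _ -> Left "parser still wants input after end of stream"
  where
    nonEmpty = filter (not . S.null) chunks

-- | The CRLF terminator is split across two chunks on purpose.
example :: Either String ByteString
example = runChunks line (map S8.pack ["he", "llo\r", "\nrest"])
```

In a real receiver the fold over a list would be replaced by a loop that reads from the connection and stops as soon as the result is `Done` or `Fail`.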
Prelude Text.Parsec.Rfc2822 Text.Parsec S> fmap (parse message "") (S.readFile "in.txt")
[ From
[ NameAddr
{ nameAddr_name = Just "John Doe"
, nameAddr_addr = ""
, OptionalField "MIME-Version" " 1.0"
, OptionalField
" multipart/mixed;\r\n boundary=\"XXXXboundary text\""
"This is a multipart message in MIME format.\r\n\r\n--XXXXboundary text\r\nContent-Type: text/html\r\n\r\nthis is the <b>body</b> text\r\n\r\n--XXXXboundary text\r\nContent-Type: text/plain;\r\nContent-Disposition: attachment;\r\n filename=\"test.txt\"\r\n\r\nthis is the attachment text\r\n\r\n--XXXXboundary text--\r\n\n\n")
Ian Lynagh
- String-based.
- Not streaming.
Michal Kawalec
I tested this out on my server for a little while.
- It's based on attoparsec, so streaming.
- Messages (multipart) are yielded as a tree, not streaming; so one cannot write parts to disk/DB in a streaming fashion. All of a 10MB email would have to be loaded into memory.
- It does not handle nested multipart messages; it handles only one level
of nesting. But it's common for messages to be e.g. of this nesting
- Doesn't build; depends on a C library for decoding base64.
, emailBodies =
[ MessageBody
{ emailHeaders =
[ Header
{ headerName = "Content-Type"
, headerContents =
"multipart/alternative; boundary=\"000000000000a2a84f05715ab8ef\""
, emailBodies =
[ TextBody
"--000000000000a2a84f05715ab8ef\r\nContent-Type: text/plain; charset=\"UTF-8\"\r\n\r\nHere's a smaller file\r\n\r\n--000000000000a2a84f05715ab8ef\r\nContent-Type: text/html; charset=\"UTF-8\"\r\n\r\n<div dir=\"ltr\">Here's a smaller file</div>\r\n\r\n--000000000000a2a84f05715ab8ef--\r\n"
Thanks @romanofski, I'll check it out next time I come back to parsing. 👍