tfirdaus/email-regex-rfc6531.md

## email-regex-rfc6531.md

      
Email Regex (RFC 6531)

This gist is an exploration of how to validate email addresses with regex, following the specification in RFC 6531 as closely as possible.

Table of Contents

The Pattern
Background
Breakdown
  Flags and Top Level Groups
  Local Part
    Dot Strings
    Quoted Strings
  Address Literals
    IPv4 Literals
    IPv6 Literals
    General Address Literals
  Domains
Final Commentary

The Pattern

Here is the pattern:
/^(?<localPart>(?<dotString>[0-9a-z!#$%&'*+\-\/=?^_`\{|\}~\u{80}-\u{10FFFF}]+(\.[0-9a-z!#$%&'*+\-\/=?^_`\{|\}~\u{80}-\u{10FFFF}]+)*)|(?<quotedString>"([\x20-\x21\x23-\x5B\x5D-\x7E\u{80}-\u{10FFFF}]|\\[\x20-\x7E])*"))(?<!.{64,})@(?<domainOrAddressLiteral>(?<addressLiteral>\[((?<IPv4>\d{1,3}(\.\d{1,3}){3})|(?<IPv6Full>IPv6:[0-9a-f]{1,4}(:[0-9a-f]{1,4}){7})|(?<IPv6Comp>IPv6:([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,5})?::([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,5})?)|(?<IPv6v4Full>IPv6:[0-9a-f]{1,4}(:[0-9a-f]{1,4}){5}:\d{1,3}(\.\d{1,3}){3})|(?<IPv6v4Comp>IPv6:([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,3})?::([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,3}:)?\d{1,3}(\.\d{1,3}){3})|(?<generalAddressLiteral>[a-z0-9-]*[a-z0-9]:[\x21-\x5A\x5E-\x7E]+))\])|(?<Domain>(?!.{256,})(([0-9a-z\u{80}-\u{10FFFF}]([0-9a-z-\u{80}-\u{10FFFF}]*[0-9a-z\u{80}-\u{10FFFF}])?))(\.([0-9a-z\u{80}-\u{10FFFF}]([0-9a-z-\u{80}-\u{10FFFF}]*[0-9a-z\u{80}-\u{10FFFF}])?))*))$/iu
The pattern is also posted on regex101.com.
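As a usage sketch, the pattern can be assembled from smaller pieces and applied in JavaScript. Everything named here (`parseEmail`, `literalForms`, and the other constants) is illustrative and not part of the gist. In this sketch the hyphen in the atext class is written as `\-` and the Let-dig class before the colon as `[a-z0-9]`. It assumes an engine with named groups, lookbehind, and the `u` flag (for example, a recent Node.js).

```javascript
// Building blocks, read from regex literals via .source to avoid string escaping.
const atom = /[0-9a-z!#$%&'*+\-\/=?^_`\{|\}~\u{80}-\u{10FFFF}]+/u.source;
const quoted = /"([\x20-\x21\x23-\x5B\x5D-\x7E\u{80}-\u{10FFFF}]|\\[\x20-\x7E])*"/u.source;
const ipv4 = /\d{1,3}(\.\d{1,3}){3}/.source;
const hex = /[0-9a-f]{1,4}/.source;
const general = /[a-z0-9-]*[a-z0-9]:[\x21-\x5A\x5E-\x7E]+/.source;
const sub = /[0-9a-z\u{80}-\u{10FFFF}]([0-9a-z-\u{80}-\u{10FFFF}]*[0-9a-z\u{80}-\u{10FFFF}])?/u.source;

// The alternatives allowed between the square brackets of an address literal.
const literalForms = [
  `(?<IPv4>${ipv4})`,
  `(?<IPv6Full>IPv6:${hex}(:${hex}){7})`,
  `(?<IPv6Comp>IPv6:(${hex}(:${hex}){0,5})?::(${hex}(:${hex}){0,5})?)`,
  `(?<IPv6v4Full>IPv6:${hex}(:${hex}){5}:${ipv4})`,
  `(?<IPv6v4Comp>IPv6:(${hex}(:${hex}){0,3})?::(${hex}(:${hex}){0,3}:)?${ipv4})`,
  `(?<generalAddressLiteral>${general})`,
].join("|");

const emailPattern = new RegExp(
  `^(?<localPart>(?<dotString>${atom}(\\.${atom})*)|(?<quotedString>${quoted}))(?<!.{64,})@` +
    `(?<domainOrAddressLiteral>(?<addressLiteral>\\[(${literalForms})\\])|` +
    `(?<Domain>(?!.{256,})${sub}(\\.${sub})*))$`,
  "iu"
);

// Hypothetical helper: returns the two halves of the address, or null if it does not match.
function parseEmail(address) {
  const match = emailPattern.exec(address);
  if (!match) return null;
  const { localPart, Domain, addressLiteral } = match.groups;
  return { localPart, host: Domain ?? addressLiteral };
}

console.log(parseEmail("user%example.com@example.org"));
console.log(parseEmail("postmaster@[123.123.123.123]"));
console.log(parseEmail('"john..doe"@example.org'));
console.log(parseEmail("john..doe@example.org")); // null: empty atom between the dots
```

Reading the named groups from `match.groups` is what makes the long pattern practical to use: the caller does not need to count parentheses to find the local part or the host.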
Background

RFC 6531 is the most recent specification for email addresses. It extends the prior standard RFC 5321 to add support for "internationalized" email addresses.
Let's first look at the rules according to RFC 5321.
   Domain         = sub-domain *("." sub-domain)

   sub-domain     = Let-dig [Ldh-str]

   Let-dig        = ALPHA / DIGIT

   Ldh-str        = *( ALPHA / DIGIT / "-" ) Let-dig

   address-literal  = "[" ( IPv4-address-literal /
                    IPv6-address-literal /
                    General-address-literal ) "]"
                    ; See Section 4.1.3

   Mailbox        = Local-part "@" ( Domain / address-literal )

   Local-part     = Dot-string / Quoted-string
                  ; MAY be case-sensitive

   Dot-string     = Atom *("."  Atom)

   Atom           = 1*atext

   Quoted-string  = DQUOTE *QcontentSMTP DQUOTE

   QcontentSMTP   = qtextSMTP / quoted-pairSMTP

   quoted-pairSMTP  = %d92 %d32-126
                    ; i.e., backslash followed by any ASCII
                    ; graphic (including itself) or SPace

   qtextSMTP      = %d32-33 / %d35-91 / %d93-126
                  ; i.e., within a quoted string, any
                  ; ASCII graphic or space is permitted
                  ; without blackslash-quoting except
                  ; double-quote and the backslash itself.

The options for the middle component of address-literals are defined as follows in RFC 5321, Section 4.1.3
   IPv4-address-literal  = Snum 3("."  Snum)

   IPv6-address-literal  = "IPv6:" IPv6-addr

   General-address-literal  = Standardized-tag ":" 1*dcontent

   Standardized-tag  = Ldh-str
                     ; Standardized-tag MUST be specified in a
                     ; Standards-Track RFC and registered with IANA

   dcontent       = %d33-90 / ; Printable US-ASCII
                  %d94-126 ; excl. "[", "\", "]"

   Snum           = 1*3DIGIT
                  ; representing a decimal integer
                  ; value in the range 0 through 255

   IPv6-addr      = IPv6-full / IPv6-comp / IPv6v4-full / IPv6v4-comp

   IPv6-hex       = 1*4HEXDIG

   IPv6-full      = IPv6-hex 7(":" IPv6-hex)

   IPv6-comp      = [IPv6-hex *5(":" IPv6-hex)] "::"
                  [IPv6-hex *5(":" IPv6-hex)]
                  ; The "::" represents at least 2 16-bit groups of
                  ; zeros.  No more than 6 groups in addition to the
                  ; "::" may be present.

   IPv6v4-full    = IPv6-hex 5(":" IPv6-hex) ":" IPv4-address-literal

   IPv6v4-comp    = [IPv6-hex *3(":" IPv6-hex)] "::"
                  [IPv6-hex *3(":" IPv6-hex) ":"]
                  IPv4-address-literal
                  ; The "::" represents at least 2 16-bit groups of
                  ; zeros.  No more than 4 groups in addition to the
                  ; "::" and IPv4-address-literal may be present.

atext above is defined in RFC 5322 as follows.
   atext           =   ALPHA / DIGIT /    ; Printable US-ASCII
                       "!" / "#" /        ;  characters not including
                       "$" / "%" /        ;  specials.  Used for atoms.
                       "&" / "'" /
                       "*" / "+" /
                       "-" / "/" /
                       "=" / "?" /
                       "^" / "_" /
                       "`" / "{" /
                       "|" / "}" /
                       "~"

However, RFC 6531 makes the following updates to these rules:
   sub-domain   =/  U-label
    ; extend the definition of sub-domain in RFC 5321, Section 4.1.2

   atext   =/  UTF8-non-ascii
    ; extend the implicit definition of atext in
    ; RFC 5321, Section 4.1.2, which ultimately points to
    ; the actual definition in RFC 5322, Section 3.2.3

   qtextSMTP  =/ UTF8-non-ascii
    ; extend the definition of qtextSMTP in RFC 5321, Section 4.1.2

U-labels, as defined in RFC 5890, Section 2.3.2.1, have requirements that would be difficult to enforce via a simple regex pattern (such as the sequence of characters being in Normalization Form C). For the purposes of this regex, I will assume that a U-label is any sequence of the ASCII characters allowed in sub-domains and non-ASCII Unicode characters. Note that the maximum length of the entire domain is 255 characters per RFC 2181, Section 11.
UTF8-non-ascii is defined as follows in RFC:
   UTF-8 characters can be defined in terms of octets using the
   following ABNF [RFC5234], taken from [RFC3629]:

   UTF8-non-ascii  =   UTF8-2 / UTF8-3 / UTF8-4

   UTF8-2          =   <Defined in Section 4 of RFC3629>

   UTF8-3          =   <Defined in Section 4 of RFC3629>

   UTF8-4          =   <Defined in Section 4 of RFC3629>

UTF8-2, UTF8-3, and UTF8-4 are defined in RFC 3629 Section 4 as follows:
   UTF8-octets = *( UTF8-char )
   UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
   UTF8-1      = %x00-7F
   UTF8-2      = %xC2-DF UTF8-tail
   UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
                 %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
   UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
                 %xF4 %x80-8F 2( UTF8-tail )
   UTF8-tail   = %x80-BF

For the regex flavor that we are using, we can treat our characters as Unicode code points and treat any character with a code point of U+0080 or higher as UTF8-non-ascii.
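As a small illustration of that simplification, the sketch below tests the class `[\u{80}-\u{10FFFF}]` on single characters. The name `nonAscii` is made up for this example, and the snippet assumes a JavaScript engine that supports the `u` flag.

```javascript
// Matches exactly one code point at or above U+0080 (illustrative helper name).
const nonAscii = /^[\u{80}-\u{10FFFF}]$/u;

console.log(nonAscii.test("a"));         // false: U+0061 is ASCII
console.log(nonAscii.test("\u{7F}"));    // false: U+007F is the last ASCII code point
console.log(nonAscii.test("\u{E9}"));    // true: U+00E9 (e with acute accent)
console.log(nonAscii.test("\u{434}"));   // true: U+0434 (Cyrillic de)
console.log(nonAscii.test("\u{1F600}")); // true: one code point under the u flag
```

Without the `u` flag the last test would see two UTF-16 code units instead of one code point, which is why the flag matters for this class.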
Breakdown

This pattern is quite a monstrosity, so I will break it down and explain how it works. As I break down each level, I will use the marker ▒▒▒ inside capture groups wherever I omit their contents for brevity.
Flags and Top Level Groups

First, note the flags i and u at the end of the pattern. The i flag makes the pattern case-insensitive, which simplifies some of the classes because there is no need to list both cases. The u flag indicates that characters are parsed as Unicode code points, and it is also required in order to use the \u{Hex codepoint} syntax.
Now, remember the rule Mailbox = Local-part "@" ( Domain / address-literal ) from the RFC specifications. We can reflect this rule in regex as follows.
/^(?<localPart>▒▒▒)(?<!.{64,})@(?<domainOrAddressLiteral>(?<addressLiteral>▒▒▒)|(?<Domain>(?!.{256,})▒▒▒))$/iu
We start with an anchor ^ to the front of the string, and end with an anchor $ to the back of the string. This means that the pattern will only match strings that form an email address with no other text. Remove these anchors if you want to find email addresses inside of text rather than validate a string that is supposed to be an email address.
Next, the pattern has a named capture group (?<localPart>▒▒▒), followed by a negative lookbehind (?<!.{64,}), followed by the character literal @. This means that the pattern will match a local part that appears directly before an at sign, except when the text before the at sign is 64 characters or longer.
Similarly, there is another capture group immediately following the at sign for the mail host portion: (?<domainOrAddressLiteral>▒▒▒). The target mail host can be specified either as an address literal or by domain name, so the domainOrAddressLiteral capture group itself contains two nested capture groups joined by the OR operator |: (?<addressLiteral>▒▒▒) and (?<Domain>(?!.{256,})▒▒▒). The second of these starts with a negative lookahead: (?!.{256,}). Similar to the negative lookbehind for the local part, this prevents matches on domains longer than 255 characters.
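The effect of the anchors and the two length checks can be seen in isolation with a reduced skeleton. This is only a sketch: the character classes below are simplified stand-ins for the real subpatterns, and `skeleton` and `at` are names invented for the example.

```javascript
// Same outer structure as the full pattern, with simplified stand-in classes.
const skeleton =
  /^(?<localPart>[a-z0-9.]+)(?<!.{64,})@(?<Domain>(?!.{256,})[a-z0-9.-]+)$/iu;

// Builds a test address with the given local-part and domain lengths.
const at = (localLength, domainLength) =>
  "a".repeat(localLength) + "@" + "b".repeat(domainLength);

console.log(skeleton.test(at(63, 10)));  // true
console.log(skeleton.test(at(64, 10)));  // false: 64 characters precede the @
console.log(skeleton.test(at(10, 255))); // true
console.log(skeleton.test(at(10, 256))); // false: 256 characters follow the @

const { localPart, Domain } = skeleton.exec("user@example.org").groups;
console.log(localPart, Domain);          // user example.org
```

Because the pattern is anchored with ^, the lookbehind before the @ effectively measures the local part, and the lookahead at the start of the Domain group measures everything up to the end of the string.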
Local Part

The local part is defined in the RFC spec as: Local-part = Dot-string / Quoted-string.
This is reflected in the regex as two nested capture groups joined by the | operator.
(?<localPart>(?<dotString>▒▒▒)|(?<quotedString>▒▒▒))
Dot Strings

The important rules for dot strings are:
   Dot-string     = Atom *("."  Atom)

   Atom           = 1*atext

   atext           =   ALPHA / DIGIT /    ; Printable US-ASCII
                       "!" / "#" /        ;  characters not including
                       "$" / "%" /        ;  specials.  Used for atoms.
                       "&" / "'" /
                       "*" / "+" /
                       "-" / "/" /
                       "=" / "?" /
                       "^" / "_" /
                       "`" / "{" /
                       "|" / "}" /
                       "~"

   atext   =/  UTF8-non-ascii

These can be reflected in regex as follows:
(?<dotString>[0-9a-z!#$%&'*+\-\/=?^_`\{|\}~\u{80}-\u{10FFFF}]+(\.[0-9a-z!#$%&'*+\-\/=?^_`\{|\}~\u{80}-\u{10FFFF}]+)*)
In this part of the regex, the subpattern [0-9a-z!#$%&'*+\-\/=?^_`\{|\}~\u{80}-\u{10FFFF}]+ appears twice. It represents the Atom and is formed from a long character class in [ ] followed by a Kleene plus (+). It matches one or more of the characters in the original definition of atext, plus the non-ASCII Unicode characters added in the internationalization update. The hyphen is escaped as \- so that +, - and / are read as three literal characters rather than as the range from + to /, which would also admit , and . into the class.
The second time this subpattern appears, it is inside a capture group that begins with \. and has a Kleene star operator (*) attached. That means that a period followed by an Atom can repeat any number of times, including zero.
This section will match the local part (the portion before the @) of the following examples:

user%example.com@example.org
user-@example.org
postmaster@[123.123.123.123]
медведь@с-балалайкой.рф
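The dot-string rules can be tried on local parts alone with a sketch like the following. The names `atom` and `dotString` are illustrative, and the anchors ^ and $ are added only so that each local part is tested on its own.

```javascript
// Atom class with the hyphen escaped, so "+", "-" and "/" are three literals
// rather than the range from "+" through "/".
const atom = /[0-9a-z!#$%&'*+\-\/=?^_`\{|\}~\u{80}-\u{10FFFF}]+/u.source;

// Dot-string = Atom *("." Atom), anchored for standalone testing.
const dotString = new RegExp(`^${atom}(\\.${atom})*$`, "iu");

for (const local of ["user%example.com", "user-", "postmaster", "медведь"]) {
  console.log(local, dotString.test(local)); // true for each
}

for (const local of [".user", "user.", "john..doe", "a,b", "a b"]) {
  console.log(local, dotString.test(local)); // false for each
}
```

The rejected cases show the two things the structure enforces: every period must sit between two non-empty atoms, and characters outside atext (such as a comma or a space) are not allowed in an unquoted local part.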

Quoted Strings

Here are the relevant rules for quoted strings:
   Quoted-string  = DQUOTE *QcontentSMTP DQUOTE

   QcontentSMTP   = qtextSMTP / quoted-pairSMTP

   quoted-pairSMTP  = %d92 %d32-126
                    ; i.e., backslash followed by any ASCII
                    ; graphic (including itself) or SPace

   qtextSMTP      = %d32-33 / %d35-91 / %d93-126

   qtextSMTP  =/ UTF8-non-ascii

These are reflected in the regex as follows:
(?<quotedString>"([\x20-\x21\x23-\x5B\x5D-\x7E\u{80}-\u{10FFFF}]|\\[\x20-\x7E])*")
At the deepest level there are two parts joined by the | operator: [\x20-\x21\x23-\x5B\x5D-\x7E\u{80}-\u{10FFFF}], which represents qtextSMTP, and \\[\x20-\x7E], which represents quoted-pairSMTP. Both sit inside a capture group followed by a Kleene star (*), so either type of QcontentSMTP can repeat any number of times between the double quotes.
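A short sketch of how the two alternatives behave, with the subpattern anchored so that it can be tested on its own (the constant name `quotedString` is reused here only for illustration):

```javascript
// Quoted-string = DQUOTE *QcontentSMTP DQUOTE, anchored for standalone testing.
const quotedString =
  /^"([\x20-\x21\x23-\x5B\x5D-\x7E\u{80}-\u{10FFFF}]|\\[\x20-\x7E])*"$/u;

console.log(quotedString.test('" "'));         // true: a single space is qtextSMTP
console.log(quotedString.test('"john..doe"')); // true: consecutive dots are allowed in quotes
console.log(quotedString.test('""'));          // true: the star permits empty content
console.log(quotedString.test('"a\\"b"'));     // true: \" is a quoted pair
console.log(quotedString.test('"a"b"'));       // false: bare double quote inside
console.log(quotedString.test('"a\\"'));       // false: the last quote belongs to the quoted pair
```

Note that the JavaScript string `'"a\\"b"'` contains a single backslash, so the text being tested is `"a\"b"`.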
The quoted-string subpattern will match the local part (the portion before the @) of the following examples:

" "@example.org
"john..doe"@example.org

Address Literals

There are multiple types of address literals, but they all appear inside square brackets ([]). This can be reflected by wrapping a capture group in escaped bracket characters; inside it, there is one nested capture group for each type of address literal, and these are joined by |.
(?<addressLiteral>\[((?<IPv4>▒▒▒)|(?<IPv6Full>▒▒▒)|(?<IPv6Comp>▒▒▒)|(?<IPv6v4Full>▒▒▒)|(?<IPv6v4Comp>▒▒▒)|(?<generalAddressLiteral>▒▒▒))\])
IPv4 Literals

IPv4 literals are made with 4 sequences of 1 to 3 digits, joined by periods. This can be reflected in the regex as follows:
(?<IPv4>\d{1,3}(\.\d{1,3}){3})
This will match the address inside the brackets of the following email address.

postmaster@[123.123.123.123]
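The subpattern checks only the shape of the literal. The Snum rule's comment restricts each number to the range 0 through 255, which the regex does not enforce, so a caller might add that check separately. A minimal sketch, with `ipv4Shape` and `isIPv4Literal` as hypothetical names:

```javascript
// Shape only: four groups of 1 to 3 digits joined by periods.
const ipv4Shape = /^\d{1,3}(\.\d{1,3}){3}$/;

// Hypothetical helper: shape check first, then the 0-255 range from the Snum comment.
function isIPv4Literal(text) {
  return ipv4Shape.test(text) &&
    text.split(".").every((part) => Number(part) <= 255);
}

console.log(ipv4Shape.test("999.999.999.999"));  // true: the shape is fine
console.log(isIPv4Literal("999.999.999.999"));   // false: 999 is out of range
console.log(isIPv4Literal("123.123.123.123"));   // true
console.log(isIPv4Literal("1.2.3"));             // false: only three groups
```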

IPv6 Literals

There are multiple forms of IPv6 literals. The first, an unabbreviated IPv6 address, can be matched by the following regex:
(?<IPv6Full>IPv6:[0-9a-f]{1,4}(:[0-9a-f]{1,4}){7})
The second form is an IPv6 address with sections of zeroes abbreviated to :: (per the rule IPv6-comp = [IPv6-hex *5(":" IPv6-hex)] "::" [IPv6-hex *5(":" IPv6-hex)]). It can be matched in regex as follows:
(?<IPv6Comp>IPv6:([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,5})?::([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,5})?)
Then there are two forms of IPv6 addresses that end in IPv4 addresses. Those can be matched with the following two patterns.
(?<IPv6v4Full>IPv6:[0-9a-f]{1,4}(:[0-9a-f]{1,4}){5}:\d{1,3}(\.\d{1,3}){3})

(?<IPv6v4Comp>IPv6:([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,3})?::([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,3}:)?\d{1,3}(\.\d{1,3}){3})

These subpatterns will match the text inside the brackets of the following email addresses.

postmaster@[IPv6:2001:0db8:85a3:0000:0000:8a2e:0370:7334]
postmaster@[IPv6:2001:0db8:85a3::8a2e:0370:7334]
postmaster@[IPv6:2001:0db8:85a3:0000:0000:8a2e:123.123.123.123]
postmaster@[IPv6:2001:0db8:85a3::8a2e:123.123.123.123]
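To see which of the four subpatterns is responsible for each example, they can be combined into one anchored regex and the matching group name read back. This is a sketch only; `ipv6Literal` and `ipv6Form` are names made up for the example.

```javascript
const hex = "[0-9a-f]{1,4}";
const ipv4 = "\\d{1,3}(\\.\\d{1,3}){3}";

// The four IPv6 forms as named alternatives, anchored for standalone testing.
const ipv6Literal = new RegExp(
  "^(?:" +
    `(?<IPv6Full>IPv6:${hex}(:${hex}){7})|` +
    `(?<IPv6Comp>IPv6:(${hex}(:${hex}){0,5})?::(${hex}(:${hex}){0,5})?)|` +
    `(?<IPv6v4Full>IPv6:${hex}(:${hex}){5}:${ipv4})|` +
    `(?<IPv6v4Comp>IPv6:(${hex}(:${hex}){0,3})?::(${hex}(:${hex}){0,3}:)?${ipv4})` +
  ")$",
  "iu"
);

// Hypothetical helper: returns the name of the group that matched, or null.
function ipv6Form(literal) {
  const match = ipv6Literal.exec(literal);
  if (!match) return null;
  return Object.keys(match.groups).find((name) => match.groups[name] !== undefined);
}

console.log(ipv6Form("IPv6:2001:0db8:85a3:0000:0000:8a2e:0370:7334"));     // IPv6Full
console.log(ipv6Form("IPv6:2001:0db8:85a3::8a2e:0370:7334"));              // IPv6Comp
console.log(ipv6Form("IPv6:2001:0db8:85a3:0000:0000:8a2e:123.123.123.123")); // IPv6v4Full
console.log(ipv6Form("IPv6:2001:0db8:85a3::8a2e:123.123.123.123"));        // IPv6v4Comp
console.log(ipv6Form("IPv6:2001:0db8"));                                   // null
```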

General Address Literals

General address literals are defined in the email RFC specifications as follows:
   General-address-literal  = Standardized-tag ":" 1*dcontent

   Standardized-tag  = Ldh-str
                     ; Standardized-tag MUST be specified in a
                     ; Standards-Track RFC and registered with IANA

   dcontent       = %d33-90 / ; Printable US-ASCII
                  %d94-126 ; excl. "[", "\", "]"

   Let-dig        = ALPHA / DIGIT

   Ldh-str        = *( ALPHA / DIGIT / "-" ) Let-dig

These grammatical rules can be reflected in regex as follows:
(?<generalAddressLiteral>[a-z0-9-]*[a-z0-9]:[\x21-\x5A\x5E-\x7E]+)
Note that this does not enforce the constraint that "Standardized-tag MUST be specified in a Standards-Track RFC and registered with IANA".
This subpattern would match the text inside the brackets of the following address, even though the tag is not standardized or registered with IANA.

postmaster@[abc:a]
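A quick sketch of what the tag and dcontent rules accept and reject. The subpattern is anchored here for standalone testing, and the Let-dig class before the colon is written as `[a-z0-9]`; the sample tags are invented and not registered anywhere.

```javascript
// Standardized-tag ":" 1*dcontent, where the tag is an Ldh-str ending in Let-dig.
const generalAddressLiteral = /^[a-z0-9-]*[a-z0-9]:[\x21-\x5A\x5E-\x7E]+$/iu;

console.log(generalAddressLiteral.test("abc:a"));     // true
console.log(generalAddressLiteral.test("x-tag:1.2")); // true: hyphen inside the tag
console.log(generalAddressLiteral.test("ab-:a"));     // false: tag must end in a letter or digit
console.log(generalAddressLiteral.test("abc:"));      // false: at least one dcontent character
console.log(generalAddressLiteral.test("abc:a]b"));   // false: "]" is outside dcontent
```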

Domains

The rules for domains are:
   Domain         = sub-domain *("." sub-domain)

   sub-domain     = Let-dig [Ldh-str]

   Let-dig        = ALPHA / DIGIT

   Ldh-str        = *( ALPHA / DIGIT / "-" ) Let-dig

As I stated above, I will treat internationalization as allowing non-ASCII Unicode characters as extra alternatives where ALPHA and DIGIT appear above.
(?<Domain>(?!.{256,})(([0-9a-z\u{80}-\u{10FFFF}]([0-9a-z-\u{80}-\u{10FFFF}]*[0-9a-z\u{80}-\u{10FFFF}])?))(\.([0-9a-z\u{80}-\u{10FFFF}]([0-9a-z-\u{80}-\u{10FFFF}]*[0-9a-z\u{80}-\u{10FFFF}])?))*)
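This pattern contains the subpattern ([0-9a-z\u{80}-\u{10FFFF}]([0-9a-z-\u{80}-\u{10FFFF}]*[0-9a-z\u{80}-\u{10FFFF}])?) twice. This subpattern represents the sub-domain from the RFC rules. Inside it, the class [0-9a-z\u{80}-\u{10FFFF}] matches any ASCII alphanumeric character or non-ASCII Unicode character. The optional group that follows allows hyphens in the middle of a sub-domain while still requiring the last character to come from that same class, so a sub-domain can neither start nor end with a hyphen.
This subpattern would match the domain (the portion after the @) of the following email addresses.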

admin@mailserver1
user%example.com@example.org
медведь@с-балалайкой.рф
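The domain rules can be exercised on their own with a sketch like this one, where `sub` and `domain` are illustrative names and the anchors are added for standalone testing:

```javascript
// sub-domain = Let-dig [Ldh-str], with non-ASCII code points allowed as well.
const sub =
  /[0-9a-z\u{80}-\u{10FFFF}]([0-9a-z-\u{80}-\u{10FFFF}]*[0-9a-z\u{80}-\u{10FFFF}])?/u.source;

// Domain = sub-domain *("." sub-domain), limited to 255 characters by the lookahead.
const domain = new RegExp(`^(?!.{256,})${sub}(\\.${sub})*$`, "iu");

for (const name of ["mailserver1", "example.org", "с-балалайкой.рф"]) {
  console.log(name, domain.test(name)); // true for each
}

for (const name of ["-example.org", "example-.org", "example..org", "a".repeat(256)]) {
  console.log(name.slice(0, 20), domain.test(name)); // false for each
}
```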

Final Commentary

This regex is the most complex regex that I have ever written, and I have yet to use it in running code. I developed this regex mostly as an exercise to practice reading RFCs and following them as closely as possible. That said, I have tested this pattern on the examples from the Wikipedia article on email addresses. See my regex101.com post for those test results.
I am somewhat concerned that the complexity of this regex pattern could cause the parsing engine to slow down. It may be better to use a simpler regex like /(".+"|\S+)@\S+/ui.
If you do use this regex in a project, please let me know how it performs for you.
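For comparison, here is a rough sketch of how the simpler fallback pattern mentioned above behaves. It is unanchored and far more permissive, so it answers "does this text contain something that looks like an address" rather than "is this a valid address". The constant name `simple` is illustrative.

```javascript
const simple = /(".+"|\S+)@\S+/ui;

console.log(simple.test("user@example.org"));                 // true
console.log(simple.test("not an email"));                     // false: no @ at all
console.log(simple.test("user@"));                            // false: nothing after the @
console.log(simple.test("a@b@c@example.com"));                // true: extra @ signs are not rejected
console.log(simple.test("see user@example.org for details")); // true: the pattern is unanchored
```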
Author

This was written by Brian Baker. Feel free to use the pattern or subpatterns in this gist. If you have any comments, please leave them below.