@Terrain2
Last active Jan 4, 2022
Emogex reference document

Emogex

Hi, emogex is an esoteric regular expression language based entirely on emojis. It was originally created as a joke at the @happyautomata twitter bot under this thread.

I've ended up actually using it regularly because i solve those FSMs to regexes on my phone, and it's more convenient to use emoji-based syntax because it's on the same keyboard as the language characters.

I've also noticed that other people (at least two so far) are now using emogex unironically to solve them, so i thought it's about time to write it all in a single document, so you can conveniently reference all the features of the language without scrolling through a tweet thread that also contains later changes to earlier definitions.

This document will be updated whenever i make changes or add new features. I will also be posting those updates under the twitter thread. If you wanna scroll from the bottom (without any breaks due to multiple replies) here is the last one in the chain

Also, the characters 🅰️, 🅱️, #️⃣ will often be used in this document as placeholders, where #️⃣ is a numeric literal and 🅰️/🅱️ are subpatterns. They do not have any special meaning in the language.

Core features

Taken directly out of emojicode, groupings are delimited by a 🍇 to start them and end with a 🍉. Taken directly out of PCRE, conditionals and lookahead/behind use a grouping syntax with a special character as the very first character in the group. I will explain those features later, but it is important to note that some characters have special meaning directly after a 🍇 and nowhere else. A group is a single subpattern and besides the literal postfix operator, this has the highest precedence.

The โ€ผ๏ธ token is the "literal character" postfix. When tokenizing emogex, you should scan from right-to-left for this character as the very first step to tell which characters are literal and which may be metacharacters. I am so sorry if you're doing this in C or something where emoji character width in UTF-8 will cause you immense pain. You did choose to do it that way, though.

Any pride flag, even future ones added in future versions of Unicode, shall be a wildcard. Whether you're 🏳️‍🌈, 🏳️‍⚧️, or any other category that isn't near the start of my flags, you're valid. Pride patterns accept everyone*.

Numerical literals are used in a few places in the syntax. Originally, as a joke, i gave you all the math operators, integer literals, and the imaginary unit. That was a bad idea. Numeric literals (denoted by #️⃣ in this document) consist of the characters 0️⃣1️⃣2️⃣3️⃣4️⃣5️⃣6️⃣7️⃣8️⃣🔟9️⃣. You may have noticed that's 11 characters, and that 9️⃣ comes after 🔟. That's because, even though i've made changes to make it somewhat more practical, you can still live with elevenary (positional base eleven or "undecimal" using latin naming, base-11 in decimal). 🔟 (therble) is the symbol used for the number nine, and 9️⃣ is the symbol for ten. 1️⃣0️⃣ will be the number we call eleven and so on.

Numeric literals may also be the symbol ♾️ which is a special case and will be explained in each usage.
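To make the digit swap concrete, here's a minimal sketch of a literal parser in Python. The function name `parse_numeric_literal` is hypothetical, and returning `math.inf` for the infinity symbol is my own choice of representation; the document only says each usage defines what infinity means.

```python
import math

# Digit symbols by base code point, in ascending value. Note the swap at the end:
# U+1F51F (the keycap-ten emoji) is worth nine, and the 9 keycap is worth ten.
DIGITS = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "\U0001F51F", "9"]
VALUE = {sym: i for i, sym in enumerate(DIGITS)}
IGNORED = {"\ufe0f", "\u20e3"}   # variation selector and enclosing keycap mark
INFINITY = "\u267e"              # base code point of the infinity symbol


def parse_numeric_literal(literal):
    """Parse an emogex numeric literal into an int, or math.inf for infinity."""
    symbols = [ch for ch in literal if ch not in IGNORED]
    if symbols == [INFINITY]:
        return math.inf
    if not symbols:
        raise ValueError("empty numeric literal")
    total = 0
    for sym in symbols:
        total = total * 11 + VALUE[sym]   # raises KeyError on a non-digit symbol
    return total
```

With this, the keycaps for 1 and 0 in sequence parse to eleven, and the keycap-ten emoji followed by the 1 keycap parses to 9 * 11 + 1 = 100.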


Any character that is not a metacharacter is in itself a subpattern matching that character literally. Any subpattern is also a valid pattern. To combine multiple subpatterns into a bigger pattern, you need to use the mathematical operators ➖✖️➗➕ as well as ↔️ and 🔀. Here are their purposes, in the order of precedence (none of them are on the same level):

  • 🅰️➖🅱️ is a "subtraction" operation. It means that the expression must match the subpattern 🅰️ AND NOT the subpattern 🅱️. In terms of set theory where subpatterns are sets of valid strings, this is the asymmetric difference of 🅰️ and 🅱️, or the intersection of 🅰️ and the complement of 🅱️.
  • 🅰️🔀🅱️ is an "interlacing" operation. It means that the expression must first match 🅰️ and then the contents of 🅰️'s match must further also match 🅱️. In terms of set theory where subpatterns are sets of valid strings, this is the intersection of 🅰️ and 🅱️.
  • 🅰️↔️🅱️ is a "switcheroo" operation. It's similar to ✖️ in that it's concatenation, but either way. It is equivalent to 🍇🅰️🅱️➕🅱️🅰️🍉.
  • 🅰️✖️🅱️ is a "concatenation" operation. It means that the expression must match the subpattern 🅰️ immediately followed by the subpattern 🅱️. In terms of set theory where subpatterns are sets of valid strings, this is the cartesian product of 🅰️ and 🅱️.
  • 🅰️➗🅱️ is an "XOR" operation. It is equivalent to 🍇🍇🅰️➕🅱️🍉➖🍇🅰️➖⛔🅱️🍉🍉. This is the only one that bears no relation to its symbol lol. In terms of set theory where subpatterns are sets of valid strings, this is the symmetric difference of 🅰️ and 🅱️, or the intersection of their union and the complement of their intersection. In other words, intersection(union(🅰️, 🅱️), complement(intersection(🅰️, 🅱️)))
  • 🅰️➕🅱️ is an "OR" operation. It means match either 🅰️ or 🅱️. In terms of set theory where subpatterns are sets of valid strings, this is the union of 🅰️ and 🅱️, or the "sum" of all their allowed strings.

For the sake of convenience, if a subpattern is immediately followed by another subpattern, juxtaposition implies multiplication at the same precedence as with an explicit ✖️.
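The set-theory readings of the six operators can be sketched with plain Python sets. This is only an illustration: real subpatterns usually describe infinite sets of strings, so the finite sets here are stand-ins, and the function names are mine.

```python
def subtract(a, b):
    """Subtraction: strings in A that are not in B."""
    return a - b


def interlace(a, b):
    """Interlacing: strings matched by both A and B (intersection)."""
    return a & b


def concat(a, b):
    """Concatenation: every string of A followed by every string of B."""
    return {x + y for x in a for y in b}


def switcheroo(a, b):
    """Switcheroo: concatenation in either order."""
    return concat(a, b) | concat(b, a)


def xor(a, b):
    """XOR: strings in exactly one of A and B (symmetric difference)."""
    return a ^ b


def union(a, b):
    """OR: strings in A, in B, or in both."""
    return a | b
```

For example, with `A = {"a", "ab"}` and `B = {"ab", "b"}`, `subtract(A, B)` is `{"a"}`, `interlace(A, B)` is `{"ab"}`, and `xor(A, B)` is `{"a", "b"}`, which is the same as `subtract(union(A, B), interlace(A, B))`, matching the expansion given for the XOR operator.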


The 📧 token will match any valid email address. Yes, this is under the "core features" header. No, the email address need not exist and receive anything sent its way, it just has to be correctly formatted so that it could exist if someone wanted to register it.

The 🚫 token is by definition a zero-width anchor that always fails to match. It is equivalent to ⚓♾️⚓.

Positional anchors

An important detail is that there is a "cursor" which is at any position between two characters or either end of the string. That means for a string of length 3 (i.e. "abc") there are 4 positions the cursor can be, those are:

  • at the start; before a
  • between a and b
  • between b and c
  • at the end; after c

The basic syntax for an anchor literal is ⚓#️⃣⚓ or ⚓🔚#️⃣⚓, where #️⃣ is the index and 🔚 means "from the end". For example, in the string "abc", here is what the anchors correspond to:

  • ⚓0️⃣⚓ / ⚓🔚3️⃣⚓ at the start; before a
  • ⚓1️⃣⚓ / ⚓🔚2️⃣⚓ between a and b
  • ⚓2️⃣⚓ / ⚓🔚1️⃣⚓ between b and c
  • ⚓3️⃣⚓ / ⚓🔚0️⃣⚓ at the end; after c

For a length of 3, ⚓4️⃣⚓ and ⚓🔚4️⃣⚓ will be outside the string bounds, and as such will never match (they are equivalent to 🚫 in this particular input string). An anchor with an index of ♾️ will always be outside the string bounds, and as such never match either.

The following characters are shorthands for common anchors you'll wanna use:

  • โฎ๏ธ => โš“0๏ธโƒฃโš“
  • ๐Ÿฅ‡ => โš“1๏ธโƒฃโš“
  • ๐Ÿฅˆ => โš“2๏ธโƒฃโš“
  • ๐Ÿฅ‰ => โš“3๏ธโƒฃโš“
  • โญ๏ธ => โš“๐Ÿ”š0๏ธโƒฃโš“

Something i just became acutely aware of while writing this is that i cannot really us emoji to spice up this document (they will be confused with metacharacters) and GitHub does not display this document in a font with nice arrow ligatures.

And additionally, the โ›” token will negate an anchor. That way, it checks that the cursor is NOT at that position in the string.
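The anchor rules boil down to a bit of index arithmetic. Here's a small sketch, assuming the index has already been parsed to an integer (or `math.inf` for the infinity symbol); the names `anchor_position` and `anchor_matches` and the `negated` flag are hypothetical.

```python
import math


def anchor_position(index, length, from_end=False):
    """Return the cursor position an anchor refers to, or None if out of bounds."""
    if index == math.inf or index > length:
        return None                      # outside the string: can never match
    return length - index if from_end else index


def anchor_matches(cursor, index, length, from_end=False, negated=False):
    """True if the anchor succeeds with the cursor at the given position."""
    hit = anchor_position(index, length, from_end) == cursor
    return hit != negated                # a negated anchor flips the result
```

For the string "abc" (length 3), `anchor_position(1, 3)` and `anchor_position(2, 3, from_end=True)` both give cursor position 1, the spot between a and b, while an index of 4 gives `None` in either direction.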

Quantifiers

A quantifier takes the form of 🅰️🔄#️⃣🔄#️⃣🔄 where the first #️⃣ is the lower bound, and the second #️⃣ is the upper bound. A quantifier pattern will match 🅰️ multiple times, specifically at least as many times as the lower bound (greater than or equal to it) AND no more than the upper bound (less than or equal to it).

If the lower bound is greater than the upper bound, the quantifier will never match (as there is no number of times it can match that is both greater than or equal to the lower bound and less than or equal to the upper bound when the lower bound is strictly greater than the upper bound, due to the total order of the real numbers). If the lower bound is ♾️, the quantifier will never match.

If the lower bound is 0️⃣, the quantifier may also match an empty string. If the upper bound is ♾️, then it practically does not exist and the quantifier can match as many times as possible, just at least as much as the lower bound.

For some common (and uncommon) quantifier amounts, there are shorthands:

  • 💤 => 🔄0️⃣🔄1️⃣🔄 ("lazy pattern might exist or might not")
  • 🔁 => 🔄0️⃣🔄♾️🔄 ("repeat")
  • 🔂 => 🔄1️⃣🔄♾️🔄 ("repeat at least once")
  • 💯 => 🔄🔟1️⃣🔄🔟1️⃣🔄 ("repeat exactly 100 times", remember quantifier bounds are in elevenary)
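As a sanity check on the bounds rules and the shorthands, here's a sketch that turns already-parsed bounds into a PCRE-style `{m,n}` quantifier. The helper names are made up, `math.inf` stands in for the infinity symbol, and returning `None` for a quantifier that can never match is just one way to represent it.

```python
import math

# Shorthand emoji (by code point) mapped to (lower, upper) bounds in decimal.
SHORTHANDS = {
    "\U0001F4A4": (0, 1),          # zzz: optional
    "\U0001F501": (0, math.inf),   # repeat: zero or more
    "\U0001F502": (1, math.inf),   # repeat-one: one or more
    "\U0001F4AF": (100, 100),      # hundred points: 9 * 11 + 1 in elevenary
}


def count_allowed(count, lower, upper):
    """True if a repetition count satisfies both bounds."""
    return lower <= count <= upper


def to_pcre_quantifier(lower, upper):
    """Render bounds as {m,n} or {m,}; None if the quantifier can never match."""
    if lower == math.inf or lower > upper:
        return None
    if upper == math.inf:
        return "{%d,}" % lower
    return "{%d,%d}" % (lower, upper)
```

So the "repeat at least once" shorthand comes out as `{1,}`, and bounds of 3 and 2 come out as `None` because no count is both at least 3 and at most 2.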

Capture groups

When parsing a string using emogex, it may be useful to extract only part of the string, such as what a specific subpattern matched. Maybe this is the email domain, maybe the whole number part of a number, maybe individual fields in a specific plaintext structure. To do this, you can put a ❇️ immediately after a 🍇, and the group's match can then be accessed by index. If it matches multiple times due to a quantifier, it will contain the last match.

If indexes are undesirable/unclear, you can make your life easier by using one of the colored circles: 🔴 🟠 🟡 🟢 🔵 🟣 🟤 ⚫ ⚪. When using one of these characters, the behaviour is the same as with ❇️, except instead of being put into a list of capture groups accessed by index, there are 9 reserved slots for the contents of these matches.

Only one of each circle should exist, but if you have multiple, the behaviour is the same as when repeating a group. Note that in PCRE duplicate names are disallowed, but in emogex the last one matched will overwrite the previous one. This is partially because of the limited slots, so this can be useful, and it's also partially to make it easier to shoot yourself in the foot by accidentally using the same color twice. Hopefully the emoji stand out enough visually, but if you cannot see or are writing in plaintext using some kind of markup for the emoji (like i am, check the source of this document), then you're on your own.
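The "last match wins" rule is the same thing Python's `re` module does for a repeated group, so it makes a handy stand-in to see the behaviour. The slot model underneath is my own sketch of the nine circle slots, with hypothetical names and plain color words as keys.

```python
import re

CIRCLES = ["red", "orange", "yellow", "green", "blue",
           "purple", "brown", "black", "white"]


def last_repeated_capture(text):
    """A quantified capture group keeps only its final iteration."""
    m = re.fullmatch(r"(?:(\w),?)+", text)
    return m.group(1) if m else None


def new_slots():
    """Nine reserved slots, all empty to start with."""
    return dict.fromkeys(CIRCLES)


def record(slots, color, matched_text):
    """Store a colored-circle capture; a later match overwrites an earlier one."""
    if color not in slots:
        raise KeyError(color)
    slots[color] = matched_text
```

Calling `last_repeated_capture("a,b,c")` gives `"c"`, and recording `"first"` then `"second"` into the same color leaves only `"second"` in that slot.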

Colored squares (🟥 🟧 🟨 🟩 🟦 🟪 🟫 ⬛ ⬜) currently serve no purpose, but are reserved for features relating to colored capture groups. The colors will correspond to the relevant group.

Lookahead and lookbehind

Immediately at the start of a group may be the tokens ❓ and ❗. Both of these must be followed by either ⏪ or ⏩. A lookahead group looks like either 🍇❓⏩🅰️🍉 or 🍇❗⏩🅰️🍉. A lookahead group will match to the right for the pattern 🅰️, but the cursor stays where it is. The ❗ variety negates this, meaning that 🅰️ must NOT match to the right of here. A lookbehind group looks like either 🍇❓⏪🅰️🍉 or 🍇❗⏪🅰️🍉. A lookbehind group will match to the left for the pattern 🅰️, but the cursor stays where it is. The ❗ variety negates this, meaning that 🅰️ must NOT have just matched to the left of here.

  • 🍇❓⏩🅰️🍉 means that the pattern 🅰️ must match if put directly after this
  • 🍇❗⏩🅰️🍉 means the pattern 🅰️ must fail if put directly after this
  • 🍇❓⏪🅰️🍉 means that the pattern 🅰️ could have immediately preceded this in the pattern
  • 🍇❗⏪🅰️🍉 means that the pattern 🅰️ couldn't have immediately preceded this in the pattern

Lookbehind can also use ↩️ instead of ⏪, which means lookahead on a reversed string. String reversal is intentionally not defined here because it's easy to get wrong. This is a performance optimization because lookbehind needs to try from many initial positions, but if the pattern can be reversed, then this kind of "lookbehind" is much more efficient since only one look operation needs to be performed.
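Here's a sketch of why the reversed form only needs one look: reverse everything before the cursor and run a single anchored match with the (already reversed) pattern. The big assumption is the reversal itself. This uses naive code-point reversal, which scrambles multi-code-point emoji and is exactly the kind of thing that's easy to get wrong, so treat it as an illustration for plain ASCII only. The function name is hypothetical.

```python
import re


def reversed_lookbehind(text, cursor, reversed_pattern):
    """Check what precedes the cursor with one anchored match on the reversed prefix."""
    behind = text[:cursor][::-1]     # assumption: naive code-point reversal
    return re.match(reversed_pattern, behind) is not None
```

To ask "is the cursor right after abc?", you pass the reversed pattern `cba`: with the text `"xabcd"` and the cursor at position 4, the reversed prefix is `"cbax"` and the match succeeds on the first try.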

Conditional patterns

A conditional pattern takes the form of 🍇❔⚠️✅🅰️❎🅱️🍉 where ⚠️ is NOT a special character but a placeholder for the "condition pattern" which is a special syntax type only used here. The condition pattern must start with one of these characters:

  • ⏪ => lookbehind condition. It performs lookbehind (with the subpattern between ⏪ and ✅) and if it succeeds, 🅰️ will be matched, otherwise 🅱️ is matched.
  • ⏩ => lookahead condition. It performs lookahead (with the subpattern between ⏩ and ✅) and if it succeeds, 🅰️ will be matched, otherwise 🅱️ is matched.
  • ↩️ => reversed lookbehind condition. It performs lookahead with the reversed string. This does look "behind" the current cursor, but it works a lot closer to lookahead, and is more performant. If you reverse your pattern and use this, it will be more performant than real lookbehind.
  • ⚓ => positional anchor condition. There is a numeric literal between the ⚓ and the ✅, and 🅰️ will be matched if the cursor is at that location in the string, otherwise 🅱️ is matched.
  • ⚓🔚 => positional anchor condition (from end). There is a numeric literal between the ⚓🔚 and the ✅, and 🅰️ will be matched if the cursor is at that location in the string (from the end), otherwise 🅱️ is matched.

Note that in anchor conditions, they do not need a closing anchor as they do in standalone anchor objects. The anchor condition doesn't just save one character by omitting the direction of search for zero-width patterns, but it actually saves two characters by not requiring you to close the anchor either.
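For the lookahead condition specifically, the if/else behaviour can be emulated in engines without lookaround conditionals by pairing a positive and a negative lookahead. This is a sketch of that well-known trick in Python's `re`, not a definition of how an emogex engine has to do it; the `conditional` helper is hypothetical and takes already-translated regex source.

```python
import re


def conditional(cond, yes, no):
    """If `cond` matches ahead of the cursor, match `yes`; otherwise match `no`."""
    return "(?:(?=%s)%s|(?!%s)%s)" % (cond, yes, cond, no)


# Example: digits if the next character is a digit, otherwise lowercase letters.
DIGITS_OR_LETTERS = re.compile(conditional(r"\d", r"\d+", r"[a-z]+"))
```

Here `DIGITS_OR_LETTERS.fullmatch("123")` and `DIGITS_OR_LETTERS.fullmatch("abc")` both succeed, while `"1ab"` fails: the condition picks the digit branch, and there is no falling back to the letter branch once the condition has held.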

No

🔞 used to be a quantifier ("not 18 times") but due to platform differences showing 18+, crossed out 18, "18 and under" or even the number 19, it can cause confusion. No.

Calendar, moon, clock and weather emojis were jokingly zero-width anchors that matched or failed based on external factors. No.

Complex numbers, negative numbers, fractional/rational numbers. Non-natural numbers. No!
