Hi, emogex is an esoteric regular expression language based entirely on emojis. It was originally created as a joke at the @happyautomata twitter bot under this thread.
I've ended up actually using it regularly because i solve those FSMs as regexes on my phone, and emoji-based syntax is more convenient there because the emoji are on the same keyboard as the language characters.
I've also noticed that other people (at least two so far) are now using emogex unironically to solve them, so i thought it was about time to write it all down in a single document. That way you can reference all the features of the language without scrolling through a tweet thread, which also contains later changes to earlier definitions.
This document will be updated whenever i make changes or add new features. I will also be posting those updates under the twitter thread. If you wanna scroll from the bottom (without any breaks due to multiple replies) here is the last one in the chain
Also, the characters
Taken directly out of emojicode, groupings are delimited by a ๐ to start them and end with a ๐. Taken directly out of PCRE, conditionals and lookahead/behind use a grouping syntax with a special character as the very first character in the group. I will explain those features later, but it is important to note that some characters have special meaning directly after a ๐ and nowhere else. A group is a single subpattern and besides the literal postfix operator, this has the highest precedence.
The
Any pride flag, even future ones added in future versions of Unicode, shall be a wildcard. Whether you're ๐ณ๏ธโ๐, ๐ณ๏ธโโง๏ธ, or any other category that isn't near the start of my flags, you're valid. Pride patterns accept everyone*.
Numerical literals are used in a few places in the syntax. Originally, as a joke, i gave you all the math operators, integer literals, and the imaginary unit. That was a bad idea. Numeric literals (denoted by #๏ธโฃ in this document) consist of the characters 0๏ธโฃ1๏ธโฃ2๏ธโฃ3๏ธโฃ4๏ธโฃ5๏ธโฃ6๏ธโฃ7๏ธโฃ8๏ธโฃ๐9๏ธโฃ. You may have noticed that's 11 characters, and that 9๏ธโฃ comes after ๐. That's because, even though i've made changes to make it somewhat more practical, you can still live with elevenary (positional base eleven or "undecimal" using latin naming, base-11 in decimal). ๐ (therble) is the symbol used for the number nine, and 9๏ธโฃ is the symbol for ten. 1๏ธโฃ0๏ธโฃ will be the number we call eleven and so on.
Numeric literals may also be the symbol โพ๏ธ which is a special case and will be explained in each usage.
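As a rough illustration of how elevenary literals could be read, here is a minimal Python sketch. It works on ASCII digit names rather than the emoji themselves: `DIGITS`, `INFINITY` and `parse_elevenary` are hypothetical names made up for this example, and `"therble"` stands in for the emoji digit worth nine.

```python
# Digit names in value order. "therble" stands in for the emoji digit
# worth nine; the keycap "9" is worth ten, as described above.
DIGITS = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "therble", "9"]
INFINITY = "inf"  # hypothetical stand-in for the infinity literal


def parse_elevenary(symbols):
    """Convert a list of digit names (most significant first) to a number.

    Returns float("inf") for the special infinity literal.
    """
    if symbols == [INFINITY]:
        return float("inf")
    if not symbols:
        raise ValueError("a numeric literal needs at least one digit")
    value = 0
    for symbol in symbols:
        # list.index raises ValueError for anything that is not a digit name
        value = value * 11 + DIGITS.index(symbol)
    return value
```

With this sketch, `parse_elevenary(["1", "0"])` gives eleven, and `parse_elevenary(["therble", "1"])` gives 9 * 11 + 1 = 100.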
Any character that is not a metacharacter is in itself a subpattern matching that character literally. Any subpattern is also a valid pattern. To combine multiple subpatterns into a bigger pattern, you need to use the mathematical operators โโ๏ธโโ as well as
- ๐ ฐ๏ธ โ๐ ฑ๏ธ is a "subtraction" operation. It means that the expression must match the subpattern ๐ ฐ๏ธ AND NOT the subpattern ๐ ฑ๏ธ. In terms of set theory where subpatterns are sets of valid strings, this is the asymmetric difference of ๐ ฐ๏ธ and ๐ ฑ๏ธ, or the intersection of ๐ ฐ๏ธ and the complement of ๐ ฑ๏ธ.
- ๐ ฐ๏ธ ๐๐ ฑ๏ธ is an "interlacing" operation. It means that the expression must first match ๐ ฐ๏ธ and then the contents of ๐ ฐ๏ธ's match must further also match ๐ ฑ๏ธ. In terms of set theory where subpatterns are sets of valid strings, this is the intersection of ๐ ฐ๏ธ and ๐ ฑ๏ธ.
- ๐ ฐ๏ธ โ๏ธ ๐ ฑ๏ธ is a "switcheroo" operation. It's similar to โ๏ธ in that it's concatenation, but in either order. It is equivalent to ๐๐ ฐ๏ธ ๐ ฑ๏ธ โ๐ ฑ๏ธ ๐ ฐ๏ธ ๐.
- ๐ ฐ๏ธ โ๏ธ๐ ฑ๏ธ is a "concatenation" operation. It means that the expression must match the subpattern ๐ ฐ๏ธ immediately followed by the subpattern ๐ ฑ๏ธ. In terms of set theory where subpatterns are sets of valid strings, this is the cartesian product of ๐ ฐ๏ธ and ๐ ฑ๏ธ.
- ๐ ฐ๏ธ โ๐ ฑ๏ธ is an "XOR" operation. It is equivalent to ๐๐๐ ฐ๏ธ โ๐ ฑ๏ธ ๐โ๐๐ ฐ๏ธ โโ๐ ฑ๏ธ ๐๐. This is the only one that bears no relation to its symbol lol. In terms of set theory where subpatterns are sets of valid strings, this is the symmetric difference of ๐ ฐ๏ธ and ๐ ฑ๏ธ, or the intersection of their union and the complement of their intersection. In other words, intersection(union(๐ ฐ๏ธ, ๐ ฑ๏ธ), complement(intersection(๐ ฐ๏ธ, ๐ ฑ๏ธ))).
- ๐ ฐ๏ธ โ๐ ฑ๏ธ is an "OR" operation. It means match either ๐ ฐ๏ธ or ๐ ฑ๏ธ. In terms of set theory where subpatterns are sets of valid strings, this is the union of ๐ ฐ๏ธ and ๐ ฑ๏ธ, or the "sum" of all their allowed strings.
For the sake of convenience, if a subpattern is immediately followed by another subpattern, juxtaposition implies multiplication at the same precedence as with an explicit โ๏ธ.
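The set-theory readings of the operators can be tried out on small finite sets of strings. The Python sketch below is only an illustration under that assumption (real patterns can describe infinite sets), and the function names are invented for this example rather than taken from any emogex implementation.

```python
def subtract(a, b):
    """Strings matched by a AND NOT by b (set difference)."""
    return a - b


def interlace(a, b):
    """Strings matched by both a and b (intersection)."""
    return a & b


def concat(a, b):
    """Every string of a immediately followed by every string of b."""
    return {x + y for x in a for y in b}


def switcheroo(a, b):
    """Concatenation in either order."""
    return concat(a, b) | concat(b, a)


def xor(a, b):
    """Strings matched by exactly one of a and b (symmetric difference)."""
    return (a | b) - (a & b)


def either(a, b):
    """Strings matched by a or b (union)."""
    return a | b


A = {"a", "ab"}
B = {"ab", "b"}
```

Here `subtract(A, B)` is `{"a"}`, `interlace(A, B)` is `{"ab"}` and `xor(A, B)` is `{"a", "b"}`. One thing the sketch makes visible: `concat` builds one string per pair, but different pairs can produce the same string, so the result may have fewer elements than a true cartesian product.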
The ๐ง token will match any valid email address. Yes, this is under the "core features" header. No, the email address need not exist and receive anything sent its way, it just has to be correctly formatted so that it could exist if someone wanted to register it.
The ๐ซ token is by definition a zero-width anchor that always fails to match. It is equivalent to โโพ๏ธโ
An important detail is that there is a "cursor" which sits at a position between two characters or at either end of the string. That means that for a string of length 3 (e.g. "abc") there are 4 positions where the cursor can be:
- at the start; before a
- between a and b
- between b and c
- at the end; after c
The basic syntax for an anchor literal is โ#๏ธโฃโ or โ๐#๏ธโฃโ, where the #๏ธโฃ is the index and the ๐ means "from the end". For example, in the string "abc", here is what the anchors correspond to:
- โ0๏ธโฃโ / โ๐3๏ธโฃโ at the start; before a
- โ1๏ธโฃโ / โ๐2๏ธโฃโ between a and b
- โ2๏ธโฃโ / โ๐1๏ธโฃโ between b and c
- โ3๏ธโฃโ / โ๐0๏ธโฃโ at the end; after c
For a length of 3, โ4๏ธโฃโ and โ๐4๏ธโฃโ will be outside the string bounds, and as such will never match (they are equivalent to ๐ซ in this particular input string). An anchor with an index of โพ๏ธ will always be outside the string bounds, and as such never match either.
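The index rules above can be sketched in a few lines of Python. This is an illustration only: `anchor_position` and `anchor_matches` are hypothetical helpers, indexes are plain integers (already converted from elevenary), and `math.inf` stands in for the infinity literal.

```python
import math


def anchor_position(length, index, from_end=False):
    """Return the cursor position an anchor points at, or None if it is
    outside the string bounds (which includes any infinite index)."""
    if index == math.inf or index > length:
        return None
    return length - index if from_end else index


def anchor_matches(cursor, length, index, from_end=False):
    """True if the cursor currently sits where the anchor points."""
    return anchor_position(length, index, from_end) == cursor
```

For "abc" (length 3), `anchor_position(3, 0)` and `anchor_position(3, 3, from_end=True)` both give position 0, while `anchor_position(3, 4)` gives `None`, so that anchor can never match.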
The following characters are shorthands for common anchors you'll wanna use:
- โฎ๏ธ => โ0๏ธโฃโ
- ๐ฅ => โ1๏ธโฃโ
- ๐ฅ => โ2๏ธโฃโ
- ๐ฅ => โ3๏ธโฃโ
- โญ๏ธ => โ๐0๏ธโฃโ
Something i just became acutely aware of while writing this is that i cannot really use emoji to spice up this document (they will be confused with metacharacters) and GitHub does not display this document in a font with nice arrow ligatures.
And additionally, the โ token will negate an anchor. That way, it checks that the cursor is NOT at that position in the string.
A quantifier takes the form of
If the lower bound is greater than the upper bound, the quantifier will never match (there is no number of repetitions that is greater than or equal to the lower bound and less than or equal to the upper bound when the lower bound is strictly greater than the upper bound, due to the total order of the real numbers). If the lower bound is โพ๏ธ, the quantifier will never match.
If the lower bound is 0๏ธโฃ, the quantifier may also match an empty string. If the upper bound is โพ๏ธ, then there is effectively no upper bound and the quantifier can match as many times as possible, as long as that is at least as many times as the lower bound.
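The bound rules can be summed up as a single check on the number of repetitions. The Python sketch below assumes the bounds have already been converted to ordinary numbers, with `math.inf` standing in for the infinity literal; `count_allowed` is a hypothetical name for this example.

```python
import math


def count_allowed(count, lower, upper):
    """True if repeating `count` times satisfies the quantifier bounds."""
    if lower == math.inf or lower > upper:
        return False  # such a quantifier can never match
    return lower <= count <= upper
```

For example, `count_allowed(0, 0, 1)` is true (the empty match is fine when the lower bound is zero), `count_allowed(5, 1, math.inf)` is true, and `count_allowed(3, 4, 2)` is false because the lower bound exceeds the upper bound.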
For some common (and uncommon) quantifier amounts, there are shorthands:
- ๐ค => ๐0๏ธโฃ๐1๏ธโฃ๐ ("lazy pattern might exist or might not")
- ๐ => ๐0๏ธโฃ๐โพ๏ธ๐ ("repeat")
- ๐ => ๐1๏ธโฃ๐โพ๏ธ๐ ("repeat at least once")
- ๐ฏ => ๐๐1๏ธโฃ๐๐1๏ธโฃ๐ ("repeat exactly 100 times", remember quantifier bounds are in elevenary)
When parsing a string using emogex, it may be useful to extract only part of the string, such as what a specific subpattern matched. Maybe this is the email domain, maybe the whole number part of a number, maybe individual fields in a specific plaintext structure. To do this, you can put a โ๏ธ immediately after a ๐ and it can be accessed by index. If it matches multiple times due to a quantifier, it will contain the last match.
If indexes are undesirable/unclear, you can make your life easier by using one of the colored circles: ๐ด ๐ ๐ก ๐ข ๐ต ๐ฃ ๐ค โซ โช. When using one of these characters, the behaviour is the same as with โ๏ธ, except instead of being put into a list of capture groups accessed by index, there are 9 reserved slots for the contents of these matches.
Only one of each circle should exist, but if you have multiple, the behaviour is the same as when repeating a group. Note that in PCRE duplicate names are disallowed, but in emogex the last one matched will overwrite the previous one. This is partially because of the limited slots, so this can be useful, and it's also partially to make it easier to shoot yourself in the foot by accidentally using the same color twice. Hopefully the emoji stand out enough visually, but if you cannot see or are writing in plaintext using some kind of markup for the emoji (like i am, check the source of this document), then you're on your own.
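To make the "last match wins" behaviour concrete, here is a small Python sketch of how captures might be stored. Everything in it is an assumption made for illustration: the `Captures` class, the 0-based indexes, and the ASCII color names standing in for the nine circles.

```python
# ASCII names standing in for the nine colored circles.
COLORS = ("red", "orange", "yellow", "green", "blue",
          "purple", "brown", "black", "white")


class Captures:
    def __init__(self, indexed_groups):
        # One entry per indexed capture group, empty until it matches.
        self.indexed = [None] * indexed_groups
        # Exactly nine reserved slots, one per color.
        self.slots = {color: None for color in COLORS}

    def record_indexed(self, index, text):
        # A later match of the same group overwrites the earlier one.
        self.indexed[index] = text

    def record_color(self, color, text):
        if color not in self.slots:
            raise KeyError(f"no slot for color {color!r}")
        # Reusing a color is allowed; the last match simply wins.
        self.slots[color] = text
```

If a group under a quantifier matches "a", then "b", then "c", calling `record_indexed(0, ...)` three times leaves only "c" behind, and recording the same color twice behaves the same way.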
Colored squares (:red_square: :orange_square: :yellow_square: :green_square: :blue_square: :purple_square: :brown_square: :black_large_square: :white_large_square:) currently serve no purpose, but are reserved for features relating to colored capture groups. The colors will correspond to the relevant group.
Immediately at the start of a group may be the tokens โ and โ. Both of these must be followed by either โช or โฉ.
A lookahead group looks like either ๐โโฉ
- ๐โโฉ๐ ฐ๏ธ ๐ means that the pattern ๐ ฐ๏ธ must match if put directly after this
- ๐โโฉ๐ ฐ๏ธ ๐ means that the pattern ๐ ฐ๏ธ must fail if put directly after this
- ๐โโช๐ ฐ๏ธ ๐ means that the pattern ๐ ฐ๏ธ could have immediately preceded this in the pattern
- ๐โโช๐ ฐ๏ธ ๐ means that the pattern ๐ ฐ๏ธ couldn't have immediately preceded this in the pattern
Lookbehind can also use โฉ๏ธ instead of โช, which means lookahead on a reversed string. String reversal is intentionally not defined here because it's easy to get wrong. This is a performance optimization: real lookbehind needs to try many starting positions, but if the pattern can be reversed, this kind of "lookbehind" is much more efficient since only one look operation needs to be performed.
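The difference between real lookbehind and the reversed variant can be sketched with Python's `re` module standing in for emogex patterns. The helper names are hypothetical, and the reversal used here (`[::-1]`, which reverses code points) is only an assumption for plain ASCII input, since this document deliberately leaves string reversal undefined.

```python
import re


def lookahead(pattern, text, pos):
    """True if `pattern` matches starting exactly at the cursor `pos`."""
    return re.compile(pattern).match(text, pos) is not None


def lookbehind(pattern, text, pos):
    """Real lookbehind: try every start position and check whether
    `pattern` matches a substring that ends exactly at `pos`."""
    return any(re.fullmatch(pattern, text[start:pos])
               for start in range(pos + 1))


def reversed_lookbehind(reversed_pattern, text, pos):
    """Reversed lookbehind: a single anchored match of an already
    reversed pattern against the reversed text before the cursor."""
    return re.match(reversed_pattern, text[:pos][::-1]) is not None
```

For the text "xab" with the cursor at the end, `lookbehind("ab", "xab", 3)` tries up to four start positions, while `reversed_lookbehind("ba", "xab", 3)` does one match against "bax". Both report success, and both fail with the cursor at position 2. The negative forms are just the negation of these results.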
A conditional pattern takes the form of ๐โ
- โช => lookbehind condition. It performs lookbehind (with the subpattern between โช and โ) and if it succeeds, ๐ ฐ๏ธ will be matched, otherwise ๐ ฑ๏ธ is matched.
- โฉ => lookahead condition. It performs lookahead (with the subpattern between โฉ and โ) and if it succeeds, ๐ ฐ๏ธ will be matched, otherwise ๐ ฑ๏ธ is matched.
- โฉ๏ธ => reversed lookbehind condition. It performs lookahead with the reversed string. This does look "behind" the current cursor, but it works a lot closer to lookahead, and is more performant. If you reverse your pattern and use this, it will be more performant than real lookbehind.
- โ => positional anchor condition. There is a numeric literal between the โ and the โ, and ๐ ฐ๏ธ will be matched if the cursor is at that location in the string, otherwise ๐ ฑ๏ธ is matched.
- โ๐ => positional anchor condition (from end). There is a numeric literal between the โ and the โ, and ๐ ฐ๏ธ will be matched if the cursor is at that location in the string (from the end), otherwise ๐ ฑ๏ธ is matched.
Note that in anchor conditions, they do not need a closing anchor as they do in standalone anchor objects. The anchor condition doesn't just save one character by omitting the direction of search for zero-width patterns, but it actually saves two characters by not requiring you to close the anchor either.
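A conditional boils down to "evaluate a zero-width condition at the cursor, then match one of two subpatterns". The Python sketch below shows that idea with `re` patterns standing in for emogex subpatterns. `match_conditional`, `looks_ahead` and `at_anchor` are hypothetical helpers, not part of any existing implementation.

```python
import re


def looks_ahead(pattern):
    """Build a lookahead condition from a pattern."""
    compiled = re.compile(pattern)
    return lambda text, pos: compiled.match(text, pos) is not None


def at_anchor(index, from_end=False):
    """Build a positional anchor condition from an index."""
    def check(text, pos):
        target = len(text) - index if from_end else index
        return 0 <= target <= len(text) and pos == target
    return check


def match_conditional(text, pos, condition, pattern_a, pattern_b):
    """Match pattern_a at `pos` if the condition holds there, otherwise
    pattern_b. Returns the new cursor position, or None on failure."""
    chosen = pattern_a if condition(text, pos) else pattern_b
    m = re.compile(chosen).match(text, pos)
    return None if m is None else m.end()
```

For instance, `match_conditional("abc", 0, at_anchor(0), "a", "x")` picks the first branch because the cursor is at the start, and returns 1. With the cursor at position 1 the same condition fails, so the second branch is tried instead.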
๐ used to be a quantifier ("not 18 times") but due to platform differences showing 18+, crossed out 18, "18 and under" or even the number 19, it can cause confusion. No.
Calendar, moon, clock and weather emojis were jokingly zero-width anchors that matched or failed based on external factors. No.
Complex numbers, negative numbers, fractional/rational numbers. Non-natural numbers. No!
Is this just a specification, or is there an implementation of this? Any place I can play around with this?