Emogex
Hi, emogex is an esoteric regular expression language based entirely on emojis. It was originally created as a joke at the @happyautomata twitter bot under this thread.
I've ended up actually using it regularly because i solve those FSMs to regexes on my phone, and it's more convenient to use emoji-based syntax because it's on the same keyboard as the language characters.
I've also noticed that other people (at least two so far) are now using emogex unironically to solve them, so i thought it's about time to write it all down in a single document. That way you can reference all the features of the language without scrolling through a tweet thread, which also contains later changes to earlier definitions.
This document will be updated whenever i make changes or add new features. I will also be posting those updates under the twitter thread. If you wanna scroll from the bottom (without any breaks due to multiple replies) here is the last one in the chain
Also, the characters
Core features
Taken directly out of emojicode, groupings are delimited by a
The
Any pride flag, even future ones added in future versions of Unicode, shall be a wildcard. Whether you're
Numerical literals are used in a few places in the syntax. Originally, as a joke, i gave you all the math operators, integer literals, and the imaginary unit.
That was a bad idea. Numeric literals (denoted by
Numeric literals may also be the symbol
Any character that is not a metacharacter is in itself a subpattern matching that character literally. Any subpattern is also a valid pattern. To combine multiple subpatterns into a bigger pattern, you need to use the mathematical operators
- 🅰️ โ 🅱️ is a "subtraction" operation. It means that the expression must match the subpattern 🅰️ AND NOT the subpattern 🅱️. In terms of set theory, where subpatterns are sets of valid strings, this is the set difference of 🅰️ and 🅱️, or the intersection of 🅰️ and the complement of 🅱️.
- 🅰️ ๐ 🅱️ is an "interlacing" operation. It means that the expression must first match 🅰️, and then the contents of 🅰️'s match must also match 🅱️. In terms of set theory, where subpatterns are sets of valid strings, this is the intersection of 🅰️ and 🅱️.
- 🅰️ โ๏ธ 🅱️ is a "switcheroo" operation. It's similar to โ๏ธ in that it's concatenation, but in either order. It is equivalent to ๐ 🅰️ 🅱️ โ 🅱️ 🅰️ ๐.
- 🅰️ โ๏ธ 🅱️ is a "concatenation" operation. It means that the expression must match the subpattern 🅰️ immediately followed by the subpattern 🅱️. In terms of set theory, where subpatterns are sets of valid strings, this is the cartesian product of 🅰️ and 🅱️, with each pair joined into one string.
- 🅰️ โ 🅱️ is an "XOR" operation. It is equivalent to ๐ ๐ 🅰️ โ 🅱️ ๐ โ ๐ 🅰️ โ โ 🅱️ ๐ ๐. This is the only one that bears no relation to its symbol lol. In terms of set theory, where subpatterns are sets of valid strings, this is the symmetric difference of 🅰️ and 🅱️, or the intersection of their union and the complement of their intersection. In other words, intersection(union(🅰️, 🅱️), complement(intersection(🅰️, 🅱️))).
- 🅰️ โ 🅱️ is an "OR" operation. It means match either 🅰️ or 🅱️. In terms of set theory, where subpatterns are sets of valid strings, this is the union of 🅰️ and 🅱️, or the "sum" of all their allowed strings.
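The set-theory reading of the operators can be tried out on small finite sets. This is only a sketch in Python, not an emogex implementation: the names `concat` and `switcheroo` and the sample sets `A` and `B` are made up for illustration, and real patterns can of course describe infinite sets of strings.

```python
from itertools import product

def concat(a, b):
    """Concatenation: every string of a followed by every string of b."""
    return {x + y for x, y in product(a, b)}

def switcheroo(a, b):
    """Concatenation in either order: (a then b) OR (b then a)."""
    return concat(a, b) | concat(b, a)

# Two tiny example "subpatterns", written as the sets of strings they accept.
A = {"a", "ab"}
B = {"ab", "b"}

subtraction = A - B              # matches A AND NOT B
interlacing = A & B              # matches A and also B
either = A | B                   # matches A or B
xor = (A | B) - (A & B)          # union minus intersection

print(sorted(subtraction))       # ['a']
print(sorted(interlacing))       # ['ab']
print(sorted(xor))               # ['a', 'b']
print(sorted(concat(A, B)))      # ['aab', 'ab', 'abab', 'abb']
```

The XOR definition above is the same thing as Python's built-in symmetric difference, so `xor == A ^ B` holds for any two sets.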
For the sake of convenience, if a subpattern is immediately followed by another subpattern, juxtaposition implies multiplication at the same precedence as with an explicit
The
The
Positional anchors
An important detail is that there is a "cursor" which is at any position between two characters or either end of the string. That means for a string of length 3 (i.e. "abc") there are 4 positions the cursor can be, those are:
- at the start; before a
- between a and b
- between b and c
- at the end; after c
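To make the cursor idea concrete, here is a small Python sketch that lists every cursor position of a string, counted both from the start and from the end. The function name `cursor_positions` is hypothetical and not part of emogex; it only illustrates that a string of length n has n + 1 positions.

```python
def cursor_positions(text):
    """Return (from_start, from_end, before, after) for each cursor position."""
    n = len(text)
    return [(i, n - i, text[:i], text[i:]) for i in range(n + 1)]

for from_start, from_end, before, after in cursor_positions("abc"):
    print(from_start, from_end, repr(before), repr(after))
# 0 3 '' 'abc'
# 1 2 'a' 'bc'
# 2 1 'ab' 'c'
# 3 0 'abc' ''
```

Note that the two indexes of a position always add up to the length of the string.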
The basic syntax for an anchor literal is
- โ 0️⃣ โ / โ ๐ 3️⃣ โ => at the start; before a
- โ 1️⃣ โ / โ ๐ 2️⃣ โ => between a and b
- โ 2️⃣ โ / โ ๐ 1️⃣ โ => between b and c
- โ 3️⃣ โ / โ ๐ 0️⃣ โ => at the end; after c
For a length of 3,
The following characters are shorthands for common anchors you'll wanna use:
- ⏮️ => โ 0️⃣ โ
- 🥇 => โ 1️⃣ โ
- 🥈 => โ 2️⃣ โ
- 🥉 => โ 3️⃣ โ
- ⏭️ => โ ๐ 0️⃣ โ
Something i just became acutely aware of while writing this is that i cannot really use emoji to spice up this document (they will be confused with metacharacters) and GitHub does not display this document in a font with nice arrow ligatures.
And additionally, the
Quantifiers
A quantifier takes the form of
If the lower bound is greater than the upper bound, the quantifier will never match (as there is no number of times it can match that is greater than or equal to the lower bound and less than or equal to the upper bound when the lower bound is strictly greater than the upper bound, due to the total order of the real numbers). If the lower bound is
If the lower bound is
For some common (and uncommon) quantifier amounts, there are shorthands:
- 🤔 => ๐ 0️⃣ ๐ 1️⃣ ๐ ("lazy pattern might exist or might not")
- ๐ => ๐ 0️⃣ ๐ ♾️ ๐ ("repeat")
- ๐ => ๐ 1️⃣ ๐ ♾️ ๐ ("repeat at least once")
- 💯 => ๐ ๐ 1️⃣ ๐ ๐ 1️⃣ ๐ ("repeat exactly 100 times"; remember quantifier bounds are in elevenary)
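The bounds behind the first three shorthands, and the rule that a lower bound above the upper bound can never match, can be sketched in a few lines of Python. The dictionary keys and both function names are invented for this example; they are not emogex syntax, and the elevenary shorthand is left out on purpose.

```python
import math

# (lower bound, upper bound) for the first three shorthands listed above.
SHORTHANDS = {
    "might_exist": (0, 1),
    "repeat": (0, math.inf),
    "repeat_at_least_once": (1, math.inf),
}

def can_ever_match(lower, upper):
    """A quantifier can only match if some count lies within its bounds."""
    return lower <= upper

def count_allowed(count, lower, upper):
    """True if repeating the subpattern `count` times satisfies the bounds."""
    return lower <= count <= upper

print(can_ever_match(3, 2))                                     # False
print(count_allowed(0, *SHORTHANDS["repeat_at_least_once"]))    # False
print(count_allowed(5, *SHORTHANDS["repeat"]))                  # True
```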
Capture groups
When parsing a string using emogex, it may be useful to extract only part of the string, such as what a specific subpattern matched. Maybe this is the email domain, maybe the whole number part of a number, maybe individual fields in a specific plaintext structure. To do this, you can put a
If indexes are undesirable/unclear, you can make your life easier by using one of the colored circles:
Only one of each circle should exist, but if you have multiple, the behaviour is the same as when repeating a group. Note that in PCRE duplicate names are disallowed, but in emogex the last one matched will overwrite the previous one. This is partially because of the limited slots, so this can be useful, and it's also partially to make it easier to shoot yourself in the foot by accidentally using the same color twice. Hopefully the emoji stand out enough visually, but if you cannot see or are writing in plaintext using some kind of markup for the emoji (like i am, check the source of this document), then you're on your own.
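For comparison, here is how Python's `re` module (not PCRE, but it behaves similarly here) treats the two cases mentioned above. A group that repeats keeps only its last match, which is the "overwrite" behaviour described for circles, while a duplicate group name is rejected outright. The group names `field` and `x` are just placeholders for this sketch.

```python
import re

# A repeated group: each repetition overwrites what the group captured before.
pattern = re.compile(r"(?:(?P<field>[a-z]+),?)+")
match = pattern.fullmatch("red,green,blue")
last_capture = match.group("field")
print(last_capture)  # blue

# Using the same group name twice is an error in Python's re module.
try:
    re.compile(r"(?P<x>a)(?P<x>b)")
    duplicate_allowed = True
except re.error:
    duplicate_allowed = False
print(duplicate_allowed)  # False
```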
Colored squares (
Lookahead and lookbehind
Immediately at the start of a group may be the tokens
- ๐ โ ⏩ 🅰️ ๐ means that the pattern 🅰️ must match if put directly after this
- ๐ โ ⏩ 🅰️ ๐ means that the pattern 🅰️ must fail if put directly after this
- ๐ โ ⏪ 🅰️ ๐ means that the pattern 🅰️ could have immediately preceded this in the pattern
- ๐ โ ⏪ 🅰️ ๐ means that the pattern 🅰️ couldn't have immediately preceded this in the pattern
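If you know lookaround from conventional regex flavours, the four forms above line up with the usual constructs. The sketch below uses Python's `re` syntax purely as an analogy, it is not emogex and the variable names are only for illustration.

```python
import re

text = "abc"

# Positive lookahead: "a" only if "b" would match directly after it.
positive_ahead = re.search(r"a(?=b)", text)
# Negative lookahead: "a" only if "b" would fail directly after it.
negative_ahead = re.search(r"a(?!b)", text)
# Positive lookbehind: "b" only if "a" immediately precedes it.
positive_behind = re.search(r"(?<=a)b", text)
# Negative lookbehind: "b" only if "a" does not immediately precede it.
negative_behind = re.search(r"(?<!a)b", text)

print(positive_ahead is not None)   # True
print(negative_ahead is not None)   # False
print(positive_behind is not None)  # True
print(negative_behind is not None)  # False
```

All four are zero-width: the successful matches above span only the single character outside the lookaround.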
Lookbehind can also use
Conditional patterns
A conditional pattern takes the form of
- ⏪ => lookbehind condition. It performs lookbehind (with the subpattern between ⏪ and โ) and if it succeeds, 🅰️ will be matched, otherwise 🅱️ is matched.
- ⏩ => lookahead condition. It performs lookahead (with the subpattern between ⏩ and โ) and if it succeeds, 🅰️ will be matched, otherwise 🅱️ is matched.
- ↩️ => reversed lookbehind condition. It performs lookahead with the reversed string. This does look "behind" the current cursor, but it works a lot closer to lookahead, and is more performant. If you reverse your pattern and use this, it will be more performant than real lookbehind.
- โ => positional anchor condition. There is a numeric literal between the โ and the โ, and 🅰️ will be matched if the cursor is at that location in the string, otherwise 🅱️ is matched.
- โ ๐ => positional anchor condition (from end). There is a numeric literal between the โ and the โ, and 🅰️ will be matched if the cursor is at that location in the string (from the end), otherwise 🅱️ is matched.
Note that in anchor conditions, they do not need a closing anchor as they do in standalone anchor objects. The anchor condition doesn't just save one character by omitting the direction of search for zero-width patterns, but it actually saves two characters by not requiring you to close the anchor either.
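The lookahead and lookbehind conditions can be emulated in a conventional regex flavour by pairing a positive and a negative lookaround in an alternation. The sketch below does this with Python's `re` module, which has no lookaround conditionals of its own. The helper names are hypothetical, and note that Python only accepts fixed-width lookbehind conditions.

```python
import re

def lookahead_conditional(cond, a, b):
    """If `cond` matches ahead of the cursor, match `a`; otherwise match `b`."""
    return f"(?:(?={cond}){a}|(?!{cond}){b})"

def lookbehind_conditional(cond, a, b):
    """If `cond` matches just before the cursor, match `a`; otherwise `b`."""
    return f"(?:(?<={cond}){a}|(?<!{cond}){b})"

# If a digit comes next, match digits; otherwise match lowercase letters.
ahead = re.compile(lookahead_conditional(r"\d", r"\d+", r"[a-z]+"))
print(bool(ahead.fullmatch("123")))  # True
print(bool(ahead.fullmatch("abc")))  # True
print(bool(ahead.fullmatch("12a")))  # False

# After an optional "x": match "1" if the "x" was there, otherwise match "2".
behind = re.compile("x?" + lookbehind_conditional("x", "1", "2"))
print(bool(behind.fullmatch("x1")))  # True
print(bool(behind.fullmatch("2")))   # True
print(bool(behind.fullmatch("x2")))  # False
```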
No
Calendar, moon, clock and weather emojis were jokingly zero-width anchors that matched or failed based on external factors. No.
Complex numbers, negative numbers, fractional/rational numbers. Non-natural numbers. No!