sodiboo/emogex.md

## emogex.md

      
    Emogex

Hi, emogex is an esoteric regular expression language based entirely on emojis. It was originally created as a joke at the @happyautomata twitter bot under this thread.
I've ended up actually using it regularly because i solve those FSMs to regexes on my phone, and it's more convenient to use emoji-based syntax because it's on the same keyboard as the language characters.
I've also noticed that other people (well, at least two so far) are now using emogex unironically to solve them, so i thought it was about time to write it all down in a single document. That way you can conveniently reference all the features of the language without scrolling through a tweet thread, which also contains later changes to earlier definitions.
This document will be updated whenever i make changes or add new features. I will also be posting those updates under the twitter thread. If you wanna scroll from the bottom (without any breaks due to multiple replies) here is the last one in the chain
Also, the characters 🅰️, 🅱️, #️⃣ will often be used in this document as placeholders, where #️⃣ is a numeric literal and 🅰️/🅱️ are subpatterns. They do not have any special meaning in the language.
Core features

Taken directly out of emojicode, groupings start with a 🍇 and end with a 🍉. Taken directly out of PCRE, conditionals and lookahead/behind use a grouping syntax with a special character as the very first character in the group.
I will explain those features later, but it is important to note that some characters have special meaning directly after a 🍇 and nowhere else. A group is a single subpattern and besides the literal postfix operator, this has the highest precedence.
The ‼️ token is the "literal character" postfix. When tokenizing emogex, you should scan from right-to-left for this character as the very first step to tell which characters are literal and which may be metacharacters. I am so sorry if you're doing this in C or something where emoji character width in UTF-8 will cause you immense pain. You did choose to do it that way, though.
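As a rough illustration of that first pass, here is a minimal Python sketch. It assumes the input has already been split into one string per emoji (grapheme segmentation is the painful part and is not shown), and the name `mark_literals` is made up for this example. Raising an error for a postfix with nothing before it is also an assumption, since this document doesn't say what should happen there.

```python
LITERAL_POSTFIX = "\u203c"      # the double exclamation mark, minus its variation selector
VARIATION_SELECTOR = "\ufe0f"


def mark_literals(tokens):
    """Return (token, is_literal) pairs, scanning right-to-left for the postfix."""
    marked = []
    i = len(tokens) - 1
    while i >= 0:
        token = tokens[i].replace(VARIATION_SELECTOR, "")
        if token == LITERAL_POSTFIX:
            if i == 0:
                raise ValueError("literal postfix with nothing before it")
            # Whatever sits to the left is literal, even if it is another postfix.
            marked.append((tokens[i - 1], True))
            i -= 2
        else:
            # Not literal: this token may still turn out to be a metacharacter.
            marked.append((tokens[i], False))
            i -= 1
    marked.reverse()
    return marked
```

Scanning from the right means that two postfixes in a row come out as a single literal postfix character, because the rightmost one claims the one before it.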
Any pride flag, even future ones added in future versions of Unicode, shall be a wildcard. Whether you're 🏳️‍🌈, 🏳️‍⚧️, or any other category that isn't near the start of my flags, you're valid. Pride patterns accept everyone*.
Numerical literals are used in a few places in the syntax. Originally, as a joke, i gave you all the math operators, integer literals, and the imaginary unit.
That was a bad idea. Numeric literals (denoted by #️⃣ in this document) consist of the characters 0️⃣1️⃣2️⃣3️⃣4️⃣5️⃣6️⃣7️⃣8️⃣🔟9️⃣. You may have noticed that's 11 characters, and that 9️⃣ comes after 🔟. That's because, even though i've made changes to make it somewhat more practical, you can still live with elevenary (positional base eleven or "undecimal" using latin naming, base-11 in decimal). 🔟 (therble) is the symbol used for the number nine, and 9️⃣ is the symbol for ten. 1️⃣0️⃣ will be the number we call eleven and so on.
Numeric literals may also be the symbol ♾️ which is a special case and will be explained in each usage.
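If the digit order is hard to keep in your head, a small Python sketch of a parser might help. The function name `parse_numeric_literal` is hypothetical, and treating ♾️ as `math.inf` is just a convenient assumption for this example. The sketch is lenient: it throws away variation selectors and the keycap mark, so it would also accept plain ASCII digits.

```python
import math

# Digit symbols in ascending order of value: keycaps 0 to 8, then the
# "ten" emoji (worth nine), then keycap 9 (worth ten).
DIGITS = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "\U0001F51F", "9"]
INFINITY = "\u267e"
# Variation selector and combining keycap: they carry no value of their own.
IGNORED = {"\ufe0f", "\u20e3"}


def parse_numeric_literal(literal):
    """Convert an elevenary literal to an int, or to math.inf for the infinity symbol."""
    symbols = [ch for ch in literal if ch not in IGNORED]
    if symbols == [INFINITY]:
        return math.inf
    if not symbols:
        raise ValueError("empty numeric literal")
    value = 0
    for symbol in symbols:
        # list.index raises ValueError for anything that is not a digit symbol
        value = value * 11 + DIGITS.index(symbol)
    return value


print(parse_numeric_literal("1️⃣0️⃣"))   # 11
print(parse_numeric_literal("🔟1️⃣"))   # 100
```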

Any character that is not a metacharacter is in itself a subpattern matching that character literally. Any subpattern is also a valid pattern. To combine multiple subpatterns into a bigger pattern, you need to use the mathematical operators ➖✖️➗➕ as well as ↔️ and 🔀. Here are their purposes, in the order of precedence (none of them are on the same level)

🅰️➖🅱️ is a "subtraction" operation. It means that the expression must match the subpattern 🅰️ AND NOT the subpattern 🅱️. In terms of set theory where subpatterns are sets of valid strings, this is the (asymmetric) set difference of 🅰️ and 🅱️, or the intersection of 🅰️ and the complement of 🅱️.
🅰️🔀🅱️ is an "interlacing" operation. It means that the expression must first match 🅰️ and then the contents of 🅰️'s match must further also match 🅱️. In terms of set theory where subpatterns are sets of valid strings, this is the intersection of 🅰️ and 🅱️.
🅰️↔️🅱️ is a "switcheroo" operation. It's similar to ✖️ in that it's concatenation, but either way around. It is equivalent to 🍇🅰️🅱️➕🅱️🅰️🍉.
🅰️✖️🅱️ is a "concatenation" operation. It means that the expression must match the subpattern 🅰️ immediately followed by the subpattern 🅱️. In terms of set theory where subpatterns are sets of valid strings, this is the cartesian product of 🅰️ and 🅱️, with each pair joined into one string.
🅰️➗🅱️ is an "XOR" operation. It is equivalent to 🍇🍇🅰️➕🅱️🍉➖🍇🅰️➖⛔🅱️🍉🍉. This is the only one that bears no relation to its symbol lol. In terms of set theory where subpatterns are sets of valid strings, this is the symmetric difference of 🅰️ and 🅱️, or the intersection of their union and the complement of their intersection. In other words, intersection(union(🅰️, 🅱️), complement(intersection(🅰️, 🅱️)))
🅰️➕🅱️ is an "OR" operation. It means match either 🅰️ or 🅱️. In terms of set theory where subpatterns are sets of valid strings, this is the union of 🅰️ and 🅱️, or the "sum" of all their allowed strings.
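The set-theory readings of the operators can be tried out directly with Python sets. This is only a sketch: it models each subpattern as a finite set of the strings it accepts, which real patterns generally are not, and all the function names are invented for the example.

```python
def subtract(a, b):
    """Strings accepted by a but not by b."""
    return a - b


def interlace(a, b):
    """Strings accepted by both a and b."""
    return a & b


def concat(a, b):
    """Every string of a followed by every string of b."""
    return {x + y for x in a for y in b}


def switcheroo(a, b):
    """Concatenation in either order."""
    return concat(a, b) | concat(b, a)


def xor(a, b):
    """Strings accepted by exactly one of a and b."""
    return a ^ b


def union(a, b):
    """Strings accepted by a or by b."""
    return a | b


A = {"a", "ab"}
B = {"ab", "b"}
print(sorted(subtract(A, B)))   # ['a']
print(sorted(interlace(A, B)))  # ['ab']
print(sorted(xor(A, B)))        # ['a', 'b']
```

In this model the XOR definition checks out: `xor(A, B)` is the same set as `subtract(union(A, B), interlace(A, B))`.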

For the sake of convenience, if a subpattern is immediately followed by another subpattern, juxtaposition implies multiplication at the same precedence as with an explicit ✖️.

The 📧 token will match any valid email address. Yes, this is under the "core features" header. No, the email address need not exist and receive anything sent its way, it just has to be correctly formatted so that it could exist if someone wanted to register it.
The 🚫 token is by definition a zero-width anchor that always fails to match. It is equivalent to ⚓♾️⚓
Positional anchors

An important detail is that there is a "cursor" which is at any position between two characters or at either end of the string. That means for a string of length 3 (i.e. "abc") there are 4 positions the cursor can be in:

at the start; before a
between a and b
between b and c
at the end; after c

The basic syntax for an anchor literal is ⚓#️⃣⚓ or ⚓🔚#️⃣⚓, where #️⃣ is the index and 🔚 means "from the end". For example, in the string "abc" here are what the anchors correspond to:

⚓0️⃣⚓ / ⚓🔚3️⃣⚓ at the start; before a
⚓1️⃣⚓ / ⚓🔚2️⃣⚓ between a and b
⚓2️⃣⚓ / ⚓🔚1️⃣⚓ between b and c
⚓3️⃣⚓ / ⚓🔚0️⃣⚓ at the end; after c

For a length of 3, ⚓4️⃣⚓ and ⚓🔚4️⃣⚓ will be outside the string bounds, and as such will never match (they are equivalent to 🚫 in this particular input string). An anchor with an index of ♾️ will always be outside the string bounds, and as such never match either.
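The anchor rules boil down to a small check. Here is a Python sketch of it, where `anchor_matches` and its parameters are hypothetical names, the index is assumed to be already parsed into a non-negative number, and ♾️ is represented as `math.inf`.

```python
import math


def anchor_matches(cursor, length, index, from_end=False):
    """True if the cursor sits at the position the anchor names."""
    if index > length:  # outside the string bounds; this also covers math.inf
        return False
    position = length - index if from_end else index
    return cursor == position


text = "abc"
# Index 1 from the start and index 2 from the end are the same position.
print(anchor_matches(1, len(text), 1))                 # True
print(anchor_matches(1, len(text), 2, from_end=True))  # True
# Index 4 and the infinite index match nowhere in a string of length 3.
print(any(anchor_matches(c, len(text), 4) for c in range(4)))         # False
print(any(anchor_matches(c, len(text), math.inf) for c in range(4)))  # False
```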
The following characters are shorthands for common anchors you'll wanna use:

⏮️ => ⚓0️⃣⚓
🥇 => ⚓1️⃣⚓
🥈 => ⚓2️⃣⚓
🥉 => ⚓3️⃣⚓
⏭️ => ⚓🔚0️⃣⚓

Something i just became acutely aware of while writing this is that i cannot really use emoji to spice up this document (they will be confused with metacharacters) and GitHub does not display this document in a font with nice arrow ligatures.
And additionally, the ⛔ token will negate an anchor. That way, it checks that the cursor is NOT at that position in the string.
Quantifiers

A quantifier takes the form of 🅰️🔄#️⃣🔄#️⃣🔄 where the first #️⃣ is the lower bound, and the second #️⃣ is the upper bound. A quantifier pattern will match 🅰️ multiple times, specifically at least as many times as the lower bound (greater than or equal to it) AND no more times than the upper bound (less than or equal to it).
If the lower bound is greater than the upper bound, the quantifier will never match (as there is no amount of times it can match where it is greater than or equal to the lower bound and less than or equal to the upper bound when the lower bound is strictly greater than the upper bound, due to the total order of the real numbers). If the lower bound is ♾️, the quantifier will never match.
If the lower bound is 0️⃣, the quantifier may also match an empty string. If the upper bound is ♾️, then it practically does not exist and the quantifier can match as many times as possible, just at least as much as the lower bound.
For some common (and uncommon) quantifier amounts, there are shorthands:

💤 => 🔄0️⃣🔄1️⃣🔄 ("lazy pattern might exist or might not")
🔁 => 🔄0️⃣🔄♾️🔄 ("repeat")
🔂 => 🔄1️⃣🔄♾️🔄 ("repeat at least once")
💯 => 🔄🔟1️⃣🔄🔟1️⃣🔄 ("repeat exactly 100 times", remember quantifier bounds are in elevenary)
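To make the bound rules concrete, here is a Python sketch that translates a pair of bounds into the quantifier syntax of Python's `re` module. The helper `quantify` is hypothetical, the bounds are assumed to be already converted from elevenary to ordinary integers, and ♾️ is represented as `math.inf`.

```python
import math
import re

NEVER = "(?!)"  # an empty negative lookahead, which always fails


def quantify(subpattern, lower, upper):
    """Translate emogex-style bounds into Python `re` quantifier syntax."""
    if lower == math.inf or lower > upper:
        return NEVER
    if upper == math.inf:
        return f"(?:{subpattern}){{{lower},}}"
    return f"(?:{subpattern}){{{lower},{upper}}}"


print(quantify("a", 0, 1))          # (?:a){0,1}
print(quantify("a", 1, math.inf))   # (?:a){1,}
print(quantify("a", 3, 2))          # (?!)
print(re.fullmatch(quantify("a", 100, 100), "a" * 100) is not None)  # True
```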

Capture groups

When parsing a string using emogex, it may be useful to extract only part of the string, such as what a specific subpattern matched. Maybe this is the email domain, maybe the whole number part of a number, maybe individual fields in a specific plaintext structure. To do this, you can put a ❇️ immediately after a 🍇 and it can be accessed by index. If it matches multiple times due to a quantifier, it will contain the last match.
If indexes are undesirable/unclear, you can make your life easier by using one of the colored circles: 🔴 🟠 🟡 🟢 🔵 🟣 🟤 ⚫ ⚪. When using one of these characters, the behaviour is the same as with ❇️, except instead of being put into a list of capture groups accessed by index, there are 9 reserved slots for the contents of these matches.
Only one of each circle should exist, but if you have multiple, the behaviour is the same as when repeating a group. Note that in PCRE duplicate names are disallowed, but in emogex the last one matched will overwrite the previous one. This is partially because of the limited slots, so this can be useful, and it's also partially to make it easier to shoot yourself in the foot by accidentally using the same color twice. Hopefully the emoji stand out enough visually, but if you cannot see or are writing in plaintext using some kind of markup for the emoji (like i am, check the source of this document), then you're on your own.
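Python's `re` module happens to share the "last match wins" rule for repeated groups, so it can illustrate the behaviour. The slot model underneath it is only a sketch: `SLOT_NAMES` and `record` are invented names, and a dictionary is just one way the nine reserved slots could be represented.

```python
import re

# A capture group inside a repetition keeps only its final iteration.
match = re.fullmatch(r"(?:(\d),?)+", "1,2,3")
print(match.group(1))  # 3

# Hypothetical model of the nine color slots: a later match with the
# same color overwrites the earlier one instead of raising an error.
SLOT_NAMES = ["red", "orange", "yellow", "green", "blue",
              "purple", "brown", "black", "white"]


def record(slots, color, text):
    """Store a match in its color slot, replacing any previous content."""
    if color not in SLOT_NAMES:
        raise KeyError(color)
    slots[color] = text
    return slots


slots = {}
record(slots, "red", "first")
record(slots, "red", "second")
print(slots["red"])  # second
```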
Colored squares (🟥 🟧 🟨 🟩 🟦 🟪 🟫 ⬛ ⬜) currently serve no purpose, but are reserved for features relating to colored capture groups. The colors will correspond to the relevant group.
Lookahead and lookbehind

Immediately at the start of a group may be the tokens ❓ and ❗. Both of these must be followed by either ⏪ or ⏩.
A lookahead group looks like either 🍇❓⏩🅰️🍉 or 🍇❗⏩🅰️🍉. A lookahead group will match to the right for the pattern 🅰️, but the cursor stays where it is. The ❗ variety negates this, meaning that 🅰️ must NOT match to the right of here.
A lookbehind group looks like either 🍇❓⏪🅰️🍉 or 🍇❗⏪🅰️🍉. A lookbehind group will match to the left for the pattern 🅰️, but the cursor stays where it is. The ❗ variety negates this, meaning that 🅰️ must NOT have just matched to the left of here.

🍇❓⏩🅰️🍉 means that the pattern 🅰️ must match if put directly after this
🍇❗⏩🅰️🍉 means the pattern 🅰️ must fail if put directly after this
🍇❓⏪🅰️🍉 means that the pattern 🅰️ could have immediately preceded this in the pattern
🍇❗⏪🅰️🍉 means that the pattern 🅰️ couldn't have immediately preceded this in the pattern
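Since this syntax is borrowed from PCRE, the four forms line up with the four familiar lookaround openers. Here is a Python sketch of that correspondence, using the `re` module; the `lookaround` helper and its word-based keys are made up for the example. One caveat: Python's `re` only accepts fixed-width lookbehind, a restriction this document does not place on emogex.

```python
import re

# (polarity, direction) -> group opener in Python/PCRE syntax
LOOKAROUND = {
    ("positive", "ahead"): "(?=",
    ("negative", "ahead"): "(?!",
    ("positive", "behind"): "(?<=",
    ("negative", "behind"): "(?<!",
}


def lookaround(polarity, direction, subpattern):
    """Build a zero-width lookaround group around subpattern."""
    return LOOKAROUND[(polarity, direction)] + subpattern + ")"


# "a" only when a "b" follows; the "b" itself is not consumed.
print(re.search("a" + lookaround("positive", "ahead", "b"), "ab").group())   # a
# "b" only when no "a" comes right before it.
print(re.search(lookaround("negative", "behind", "a") + "b", "ab"))          # None
```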

Lookbehind can also use ↩️ instead of ⏪, which means lookahead on a reversed string. String reversal is intentionally not defined here because it's easy to get wrong. This is a performance optimization: real lookbehind needs to try many starting positions, but if the pattern can be reversed, then this kind of "lookbehind" is much more efficient since only one look operation needs to be performed.
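Here is a rough Python sketch of the idea. Since reversal is deliberately left undefined above, the example assumes plain ASCII text and reverses by code point with slicing, which would mangle multi-code-point emoji. The name `reversed_lookbehind` is invented, and the caller is assumed to supply the pattern already written in reverse.

```python
import re


def reversed_lookbehind(text, cursor, reversed_pattern):
    """Check what lies left of the cursor by matching forwards on the reversed prefix."""
    reversed_prefix = text[:cursor][::-1]
    # re.match only tries position 0, so this is a single look operation.
    return re.match(reversed_pattern, reversed_prefix) is not None


# Is the cursor (between "c" and "d") preceded by "bc"? Reversed, that is "cb".
print(reversed_lookbehind("abcd", 3, "cb"))     # True
# Variable-length patterns work too: "xa+" reversed is "a+x".
print(reversed_lookbehind("xaaab", 4, "a+x"))   # True
```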
Conditional patterns

A conditional pattern takes the form of 🍇❔⚠️✅🅰️❎🅱️🍉 where ⚠️ is NOT a special character but a placeholder for the "condition pattern" which is a special syntax type only used here. The condition pattern must start with one of these characters:

⏪ => lookbehind condition. It performs lookbehind (with the subpattern between ⏪ and ✅) and if it succeeds 🅰️ will be matched, otherwise 🅱️ is matched.
⏩ => lookahead condition. It performs lookahead (with the subpattern between ⏩ and ✅) and if it succeeds, 🅰️ will be matched, otherwise 🅱️ is matched.
↩️ => reversed lookbehind condition. It performs lookahead with the reversed string (using the subpattern between ↩️ and ✅) and if it succeeds, 🅰️ will be matched, otherwise 🅱️ is matched. This does look "behind" the current cursor, but it works a lot closer to lookahead. If you reverse your pattern and use this, it will be more performant than real lookbehind.
⚓ => positional anchor condition. There is a numeric literal between the ⚓ and the ✅, and 🅰️ will be matched if the cursor is at that location in the string, otherwise 🅱️ is matched.
⚓🔚 => positional anchor condition (from end). There is a numeric literal between the ⚓🔚 and the ✅, and 🅰️ will be matched if the cursor is at that location in the string (from the end), otherwise 🅱️ is matched.

Note that in anchor conditions, they do not need a closing anchor as they do in standalone anchor objects. The anchor condition doesn't just save one character by omitting the direction of search for zero-width patterns, but it actually saves two characters by not requiring you to close the anchor either.
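Python's `re` module has no lookaround conditionals of its own, but the lookahead variant can be emulated with two guarded alternatives. The sketch below shows that emulation; `conditional_lookahead` is a hypothetical helper, and this is a stand-in technique rather than how an emogex engine would have to do it.

```python
import re


def conditional_lookahead(condition, then_branch, else_branch):
    """Emulate a lookahead conditional with a pair of guarded alternatives."""
    return (
        f"(?:(?={condition})(?:{then_branch})"
        f"|(?!{condition})(?:{else_branch}))"
    )


# If a digit is next, match digits; otherwise match lowercase letters.
pattern = conditional_lookahead(r"\d", r"\d+", r"[a-z]+")
print(re.fullmatch(pattern, "123") is not None)  # True
print(re.fullmatch(pattern, "abc") is not None)  # True
print(re.fullmatch(pattern, "1bc") is not None)  # False
```

The negative guard on the second alternative matters: without it, a failed 🅰️ branch could fall through to 🅱️ even though the condition succeeded.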
No

🔞 used to be a quantifier ("not 18 times") but due to platform differences showing 18+, crossed out 18, "18 and under" or even the number 19, it can cause confusion. No.
Calendar, moon, clock and weather emojis were jokingly zero-width anchors that matched or failed based on external factors. No.
Complex numbers, negative numbers, fractional/rational numbers. Non-natural numbers. No!