I've ended up actually using it regularly because i solve those FSMs to regexes on my phone, and it's more convenient to use emoji-based syntax because it's on the same keyboard as the language characters.
I've also noticed that other people (~~well, it's just this one guy so far~~ at least two people) are now using emogex unironically to solve them, so i thought it's about time to write it all up in a single document, so you can conveniently reference all the features of the language without scrolling through a tweet thread, which also contains later changes to earlier definitions.
This document will be updated whenever i make changes or add new features. I will also be posting those updates under the twitter thread. If you wanna scroll from the bottom (without any breaks due to multiple replies) here is the last one in the chain
Also, the characters
Taken directly out of emojicode, groupings are delimited by a
Any pride flag, even future ones added in future versions of Unicode, shall be a wildcard. Whether you're
Numerical literals are used in a few places in the syntax. Originally, as a joke, i gave you all the math operators, integer literals, and the imaginary unit.
That was a bad idea. Numeric literals (denoted by
Numeric literals may also be the symbol
Any character that is not a metacharacter is in itself a subpattern matching that character literally. Any subpattern is also a valid pattern. To combine multiple subpatterns into a bigger pattern, you need to use the mathematical operators
- 🅰️ ➖ 🅱️ is a "subtraction" operation. It means that the expression must match the subpattern 🅰️ AND NOT the subpattern 🅱️. In terms of set theory where subpatterns are sets of valid strings, this is the (asymmetric) set difference of 🅰️ and 🅱️, or the intersection of 🅰️ and the complement of 🅱️.
- 🅰️ 🔀 🅱️ is an "interlacing" operation. It means that the expression must first match 🅰️, and then the contents of 🅰️'s match must also match 🅱️. In terms of set theory where subpatterns are sets of valid strings, this is the intersection of 🅰️ and 🅱️.
- 🅰️ ↔️ 🅱️ is a "switcheroo" operation. It's similar to ✖️ in that it's concatenation, but in either order. It is equivalent to 🍇 🅰️ 🅱️ ➕ 🅱️ 🅰️ 🍉.
- 🅰️ ✖️ 🅱️ is a "concatenation" operation. It means that the expression must match the subpattern 🅰️ immediately followed by the subpattern 🅱️. In terms of set theory where subpatterns are sets of valid strings, this is the cartesian product of 🅰️ and 🅱️, with each pair joined into one string.
- 🅰️ ➗ 🅱️ is an "XOR" operation. It is equivalent to 🍇 🍇 🅰️ ➕ 🅱️ 🍉 ➖ 🍇 🅰️ ➖ ⛔ 🅱️ 🍉 🍉. This is the only one that bears no relation to its symbol lol. In terms of set theory where subpatterns are sets of valid strings, this is the symmetric difference of 🅰️ and 🅱️, or the intersection of their union and the complement of their intersection. In other words, intersection(union(🅰️, 🅱️), complement(intersection(🅰️, 🅱️))).
- 🅰️ ➕ 🅱️ is an "OR" operation. It means match either 🅰️ or 🅱️. In terms of set theory where subpatterns are sets of valid strings, this is the union of 🅰️ and 🅱️, or the "sum" of all their allowed strings.
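To make the set-theory reading of the operators concrete, here is a minimal Python sketch. It assumes a toy model where a subpattern is just a finite set of the strings it accepts; the function names are made up for illustration and are not part of emogex.

```python
from itertools import product

# Toy model (an assumption for illustration): a subpattern is a finite
# set containing every string it accepts.

def minus(a, b):
    """Subtraction: strings in a that are not in b."""
    return a - b

def interlace(a, b):
    """Interlacing: strings accepted by both a and b."""
    return a & b

def times(a, b):
    """Concatenation: every string of a followed by every string of b."""
    return {x + y for x, y in product(a, b)}

def switcheroo(a, b):
    """Concatenation in either order."""
    return times(a, b) | times(b, a)

def xor(a, b):
    """Union minus intersection (symmetric difference)."""
    return (a | b) - (a & b)

def plus(a, b):
    """OR: strings accepted by a or by b."""
    return a | b

A = {"a", "ab"}
B = {"ab", "b"}

print(minus(A, B))      # strings only A accepts
print(interlace(A, B))  # strings both accept
print(xor(A, B))        # strings exactly one of them accepts
```

Real subpatterns usually accept infinitely many strings, so this only works for tiny examples, but the identities between the operators carry over.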
For the sake of convenience, if a subpattern is immediately followed by another subpattern, juxtaposition implies multiplication at the same precedence as with an explicit
An important detail is that there is a "cursor" which sits at any position between two characters or at either end of the string. That means for a string of length 3 (e.g. "abc") there are 4 positions the cursor can be in:
- at the start; before a
- between a and b
- between b and c
- at the end; after c
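The cursor positions can be enumerated with a few lines of Python. This is only an illustration of the counting (a string of length n has n + 1 positions); `cursor_splits` is a made-up helper name.

```python
def cursor_splits(s):
    """Split s at every cursor position, from 0 (start) to len(s) (end)."""
    return [(s[:pos], s[pos:]) for pos in range(len(s) + 1)]

# Show what lies before and after the cursor at each position of "abc".
for pos, (before, after) in enumerate(cursor_splits("abc")):
    print(pos, repr(before), repr(after))
```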
The basic syntax for an anchor literal is
- ⚓ 0️⃣ ⚓ / ⚓ 🔚 3️⃣ ⚓: at the start; before a
- ⚓ 1️⃣ ⚓ / ⚓ 🔚 2️⃣ ⚓: between a and b
- ⚓ 2️⃣ ⚓ / ⚓ 🔚 1️⃣ ⚓: between b and c
- ⚓ 3️⃣ ⚓ / ⚓ 🔚 0️⃣ ⚓: at the end; after c
For a length of 3,
The following characters are shorthands for common anchors you'll wanna use:
- ⏮️ => ⚓ 0️⃣ ⚓
- 🥇 => ⚓ 1️⃣ ⚓
- 🥈 => ⚓ 2️⃣ ⚓
- 🥉 => ⚓ 3️⃣ ⚓
- ⏭️ => ⚓ 🔚 0️⃣ ⚓
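As a rough sketch of how anchors counted from the start and from the end relate, the following Python resolves both to a plain cursor index. The function names are hypothetical, and returning `None` for an out-of-range anchor is an assumption; this document does not say what happens in that case.

```python
def anchor_from_start(n, length):
    """Cursor index of an anchor counted n positions from the start."""
    if not 0 <= n <= length:
        return None  # assumption: an out-of-range anchor never matches
    return n

def anchor_from_end(n, length):
    """Cursor index of an anchor counted n positions back from the end."""
    if not 0 <= n <= length:
        return None  # same assumption as above
    return length - n

length = len("abc")
for n in range(length + 1):
    # Position n from the start is position (length - n) from the end.
    print(n, anchor_from_start(n, length), anchor_from_end(length - n, length))
```

Under this reading, ⏮️ always resolves to index 0 and ⏭️ to the length of the string, whatever that length is.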
Something i just became acutely aware of while writing this is that i cannot really use emoji to spice up this document (they will be confused with metacharacters) and GitHub does not display this document in a font with nice arrow ligatures.
And additionally, the
A quantifier takes the form of
If the lower bound is greater than the upper bound, the quantifier will never match (as there is no number of times it can match that is both greater than or equal to the lower bound and less than or equal to the upper bound when the lower bound is strictly greater than the upper bound, due to the total order of the real numbers). If the lower bound is
If the lower bound is
For some common (and uncommon) quantifier amounts, there are shorthands:
- 💤 => 🔄 0️⃣ 🔄 1️⃣ 🔄 ("lazy pattern might exist or might not")
- 🔁 => 🔄 0️⃣ 🔄 ♾️ 🔄 ("repeat")
- 🔂 => 🔄 1️⃣ 🔄 ♾️ 🔄 ("repeat at least once")
- 💯 => 🔄 🔟 1️⃣ 🔄 🔟 1️⃣ 🔄 ("repeat exactly 100 times", remember quantifier bounds are in elevenary)
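The bound semantics can be sketched in Python as a simple range check. This only models which repetition counts a quantifier allows, using `math.inf` as a stand-in for ♾️; the names and the `limit` parameter are made up for the example.

```python
import math

def allowed_counts(lower, upper, limit=5):
    """Repetition counts n with lower <= n <= upper, listed up to `limit`."""
    return [n for n in range(limit + 1) if lower <= n <= upper]

# The first three shorthands above as (lower, upper) pairs.
MAYBE = (0, 1)                  # "might exist or might not"
REPEAT = (0, math.inf)          # "repeat"
AT_LEAST_ONCE = (1, math.inf)   # "repeat at least once"

print(allowed_counts(*MAYBE))
print(allowed_counts(*REPEAT))
print(allowed_counts(*AT_LEAST_ONCE))
print(allowed_counts(3, 2))  # lower bound above upper bound: nothing is allowed
```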
When parsing a string using emogex, it may be useful to extract only part of the string, such as what a specific subpattern matched. Maybe this is the email domain, maybe the whole number part of a number, maybe individual fields in a specific plaintext structure. To do this, you can put a
If indexes are undesirable/unclear, you can make your life easier by using one of the colored circles:
Only one of each circle should exist, but if you have multiple, the behaviour is the same as when repeating a group. Note that in PCRE duplicate names are disallowed, but in emogex the last one matched will overwrite the previous one. This is partially because of the limited slots, so this can be useful, and it's also partially to make it easier to shoot yourself in the foot by accidentally using the same color twice. Hopefully the emoji stand out enough visually, but if you cannot see or are writing in plaintext using some kind of markup for the emoji (like i am, check the source of this document), then you're on your own.
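For comparison, here is how the "last one matched wins" behaviour looks in Python. The first part uses Python's real `re` module, where a repeated group keeps only its last match and duplicate group names are rejected at compile time. The second part is a hypothetical sketch of the overwrite rule described above, with an invented slot name.

```python
import re

# A repeated group reports only the last thing it matched.
m = re.fullmatch(r"(?:(\w+),?)+", "red,green,blue")
last_repeat = m.group(1)

# Python's re refuses duplicate group names outright.
def duplicate_names_rejected():
    try:
        re.compile(r"(?P<x>a)(?P<x>b)")
    except re.error:
        return True
    return False

# Hypothetical model of a named slot: a later match overwrites an earlier one.
def record(captures, slot, text):
    captures[slot] = text  # last write wins
    return captures

caps = {}
record(caps, "red_circle", "first")
record(caps, "red_circle", "second")
print(last_repeat, duplicate_names_rejected(), caps)
```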
Colored squares (
Lookahead and lookbehind
Immediately at the start of a group may be the tokens
- 🍇 ❓ ⏩ 🅰️ 🍉 means that the pattern 🅰️ must match if put directly after this
- 🍇 ❗ ⏩ 🅰️ 🍉 means that the pattern 🅰️ must fail if put directly after this
- 🍇 ❓ ⏪ 🅰️ 🍉 means that the pattern 🅰️ could have immediately preceded this in the pattern
- 🍇 ❗ ⏪ 🅰️ 🍉 means that the pattern 🅰️ couldn't have immediately preceded this in the pattern
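If you know lookaround from conventional regex engines, the four forms above appear to line up with `(?=...)`, `(?!...)`, `(?<=...)` and `(?<!...)`. That mapping is my reading, not something this document states, so treat the Python below as an analogy using the standard `re` module.

```python
import re

text = "foobar foobaz"

# Positive lookahead: "foo" only where "bar" follows.
ahead_yes = [m.start() for m in re.finditer(r"foo(?=bar)", text)]

# Negative lookahead: "foo" only where "bar" does not follow.
ahead_no = [m.start() for m in re.finditer(r"foo(?!bar)", text)]

prices = "$10 and 20"

# Positive lookbehind: digits immediately preceded by "$".
behind_yes = re.findall(r"(?<=\$)\d+", prices)

# Negative lookbehind: whole numbers not preceded by "$".
behind_no = re.findall(r"(?<!\$)\b\d+", prices)

print(ahead_yes, ahead_no, behind_yes, behind_no)
```

None of the lookaround groups consume any characters; they only decide whether the cursor position is acceptable.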
Lookbehind can also use
A conditional pattern takes the form of
- ⏪ => lookbehind condition. It performs lookbehind (with the subpattern between ⏪ and ✅) and if it succeeds, 🅰️ will be matched, otherwise 🅱️ is matched.
- ⏩ => lookahead condition. It performs lookahead (with the subpattern between ⏩ and ✅) and if it succeeds, 🅰️ will be matched, otherwise 🅱️ is matched.
- ↩️ => reversed lookbehind condition. It performs lookahead with the reversed string. This does look "behind" the current cursor, but it works a lot closer to lookahead, and is more performant. If you reverse your pattern and use this, it will be more performant than real lookbehind.
- ⚓ => positional anchor condition. There is a numeric literal between the ⚓ and the ✅, and 🅰️ will be matched if the cursor is at that location in the string, otherwise 🅱️ is matched.
- ⚓ 🔚 => positional anchor condition (from end). There is a numeric literal between the ⚓ and the ✅, and 🅰️ will be matched if the cursor is at that location in the string (from the end), otherwise 🅱️ is matched.
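The branching logic of a conditional can be sketched in Python like this. It is a hypothetical model, not how emogex is implemented: conditions are plain functions of the string and the cursor, the branches are ordinary Python regexes, and the reversed lookbehind variant is left out.

```python
import re

def lookahead(cond):
    """Condition: cond matches starting at the cursor."""
    return lambda s, pos: re.compile(cond).match(s, pos) is not None

def lookbehind(cond):
    """Condition: some stretch of text ending at the cursor matches cond."""
    return lambda s, pos: any(
        re.fullmatch(cond, s[i:pos]) for i in range(pos + 1)
    )

def at_start(n):
    """Condition: the cursor is n positions from the start."""
    return lambda s, pos: pos == n

def at_end(n):
    """Condition: the cursor is n positions from the end."""
    return lambda s, pos: pos == len(s) - n

def conditional(test, yes, no, s, pos):
    """Match `yes` at the cursor if test holds, else `no`.

    Returns the new cursor position, or None if the chosen branch fails.
    """
    branch = yes if test(s, pos) else no
    m = re.compile(branch).match(s, pos)
    return m.end() if m else None

s = "123abc"
print(conditional(lookahead(r"\d"), r"\d+", r"[a-z]+", s, 0))
print(conditional(lookahead(r"\d"), r"\d+", r"[a-z]+", s, 3))
print(conditional(at_end(0), r"", r"[a-z]", s, 6))
```

The condition itself never moves the cursor; only the chosen branch does.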
Note that anchor conditions do not need a closing anchor, unlike standalone anchor objects. The anchor condition doesn't just save one character by omitting the direction of search for zero-width patterns, it actually saves two characters by not requiring you to close the anchor either.
Calendar, moon, clock and weather emojis were jokingly zero-width anchors that matched or failed based on external factors. No.
Complex numbers, negative numbers, fractional/rational numbers. Non-natural numbers. No!