@HParker
Last active April 26, 2023 19:26
Onigmo Tokens

| Node Type | Usage |
| --- | --- |
| NT_STR | String node |
| NT_CCLASS | Character class, e.g. `[abc]` |
| NT_CTYPE | Character type, as in `\w` |
| NT_CANY | Any-character node, such as `.` |
| NT_BREF | Backreference node |
| NT_QTFR | Quantifier node |
| NT_ENCLOSE | Enclosing node, such as `(abc)` |
| NT_ANCHOR | Location anchor, such as `\A` |
| NT_LIST | List of nodes that must occur in order |
| NT_ALT | Alternation, such as `a\|b` |
| NT_CALL | Call referencing a previous subexpression |
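
To make the node types above concrete, here is a small Ruby catalog that pairs each one with a pattern of the kind the table describes and a string it matches. The constant `NODE_EXAMPLES` and the helper `node_examples_match?` are hypothetical names used only for this illustration, and the pairing follows the table's descriptions rather than a dump of Onigmo's parse tree.

```ruby
# Illustrative pairing of node types with Ruby patterns.
# Assumption: each pattern is chosen from the table's description,
# not read back from Onigmo's parser.
NODE_EXAMPLES = {
  "NT_STR"     => [/abc/,          "abc"],    # plain string
  "NT_CCLASS"  => [/[abc]/,        "b"],      # character class
  "NT_CTYPE"   => [/\w/,           "x"],      # character type
  "NT_CANY"    => [/./,            "x"],      # any character
  "NT_BREF"    => [/(a)\1/,        "aa"],     # backreference to group 1
  "NT_QTFR"    => [/a+/,           "aaa"],    # quantifier
  "NT_ENCLOSE" => [/(abc)/,        "abc"],    # enclosing group
  "NT_ANCHOR"  => [/\Aabc/,        "abc"],    # start-of-string anchor
  "NT_LIST"    => [/a\wc/,         "abc"],    # several nodes in sequence
  "NT_ALT"     => [/a|b/,          "b"],      # alternation
  "NT_CALL"    => [/(abc)\g'1'/,   "abcabc"]  # call to group 1
}.freeze

# True when every sample string matches its pattern.
def node_examples_match?
  NODE_EXAMPLES.all? { |_name, (pattern, sample)| pattern.match?(sample) }
end
```

`Regexp#match?` requires Ruby 2.4 or later; on older versions `pattern =~ sample` can be substituted.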

A partial list of common opcodes.

| OP Code (variants) | Arguments | Example / Notes |
| --- | --- | --- |
| OP_FINISH | None | |
| OP_END | None | |
| OP_EXACT (1-5, MB, IC) | 1-5 bytes, one per character | `/abc/` |
| OP_EXACTN | 1 byte specifying length, 1 byte per character | `/abcdef/` (literals longer than 5 bytes) |
| OP_CCLASS (NOT, MB) | 32 bytes for standard character class | `/[abc]/` |
| OP_ANYCHAR (ML, STAR, PEEK_NEXT) | None | |
| OP_WORD (ASCII) | None | `\w` |
| OP_NOT_WORD (ASCII) | None | `\W` |
| OP_ASCII_WORD_BOUND | None | `\b` |
| OP_NOT_ASCII_WORD_BOUND | None | `\B` |
| OP_BEGIN_BUF | None | `\A` |
| OP_END_BUF | None | `\z` |
| OP_BEGIN_LINE | None | `^` |
| OP_END_LINE | None | `$` |
| OP_BACKREF (1-2) | None | `/(a)\1/` |
| OP_BACKREFN (MULTI, IC) | 1 byte for backref number | `/(a)(b)(c)\3/` |
| OP_FAIL | None | Fails the current path and backtracks to the most recently pushed alternative. |
| OP_JUMP | 4 bytes relative offset | Commonly found in alternation and optional patterns. |
| OP_PUSH | 4 bytes relative offset | Generates an alternative path for backtracking. |
| OP_POP | None | |
| OP_PUSH_OR_JUMP_EXACT1 | 4 bytes relative offset, 1 byte character to consume | Optimization for some push/jump patterns. |
| OP_PUSH_IF_PEEK_NEXT | 4 bytes relative offset, 1 byte character to peek | Pushes only if the next character is the one this code specifies. |
| OP_REPEAT (NG) | 2 bytes memory location, 4 bytes relative offset | Start of a repeat pattern. |
| OP_PUSH_POS | None | `(?=)` start |
| OP_POP_POS | None | `(?=)` end |
| OP_PUSH_POS_NOT | 4 bytes relative offset | `(?!)` start |
| OP_FAIL_POS | None | `(?!)` end |
| OP_PUSH_STOP_BT | 4 bytes relative offset | `(?>)` start |
| OP_POP_STOP_BT | None | `(?>)` end |
| OP_LOOK_BEHIND | 4 bytes relative offset | `(?<=)` start |
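
The way push, jump, and fail cooperate to implement backtracking is easier to see in a toy interpreter. The sketch below is not Onigmo's code: instructions are Ruby arrays instead of packed bytes, offsets count instructions (measured from the following instruction) instead of bytes, matching is only attempted at the start of the string, and the names `run_program`, `ALT_PROGRAM`, and `OPTIONAL_PROGRAM` are hypothetical. The hand-assembled programs only approximate the shape the table describes for alternation and optional patterns.

```ruby
# Toy backtracking VM. NOT Onigmo's implementation: instructions are Ruby
# arrays rather than packed bytes, and offsets count instructions.
def run_program(program, str)
  stack = []   # saved alternatives: [instruction index, string position]
  pc = 0       # index of the current instruction
  pos = 0      # current position in str

  loop do
    op, arg = program[pc]
    matched = true

    case op
    when :exact1                      # consume one specific character
      if str[pos] == arg
        pos += 1
        pc += 1
      else
        matched = false
      end
    when :push                        # remember an alternative path
      stack.push([pc + 1 + arg, pos])
      pc += 1
    when :jump                        # unconditional relative jump
      pc += 1 + arg
    when :fail                        # force backtracking
      matched = false
    when :end                         # success: report where the match ended
      return pos
    else
      raise ArgumentError, "unknown opcode: #{op.inspect}"
    end

    next if matched
    return nil if stack.empty?        # no alternatives left: overall failure
    pc, pos = stack.pop               # resume the most recent alternative
  end
end

# Hand-assembled program for /ab|ac/, anchored at the start of the string.
ALT_PROGRAM = [
  [:push, 3],      # 0: on failure, retry from instruction 4
  [:exact1, "a"],  # 1
  [:exact1, "b"],  # 2
  [:jump, 2],      # 3: skip the second branch
  [:exact1, "a"],  # 4
  [:exact1, "c"],  # 5
  [:end]           # 6
].freeze

# Hand-assembled program for /a?b/: the push makes "a" optional.
OPTIONAL_PROGRAM = [
  [:push, 1],      # 0: on failure, retry from instruction 2
  [:exact1, "a"],  # 1
  [:exact1, "b"],  # 2
  [:end]           # 3
].freeze
```

With these definitions, `run_program(ALT_PROGRAM, "ac")` returns `2`: the first branch fails at `b`, the saved alternative is popped, and the second branch consumes both characters. `run_program(ALT_PROGRAM, "ad")` returns `nil` because both branches fail and the stack is empty.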

| Token | Example | Usage |
| --- | --- | --- |
| TK_EOT | | End of token. One of the two tokens that can end a subexpression. |
| TK_RAW_BYTE | `/\xA1/` | Raw byte. |
| TK_CHAR | | Character literal. Used internally and often changes type before parsing finishes. |
| TK_STRING | `/abc/` | One or more characters. |
| TK_CODE_POINT | `/\n/`, `/\t/` | Code point literal, including control characters. |
| TK_ANYCHAR | `/./` | Any character. |
| TK_CHAR_TYPE | `/\h/`, `/\w/` | A type of character, such as hex digits (`\h`) or word characters (`\w`). |
| TK_BACKREF | `/(a)\1/` | Reference to the text matched by an earlier capture group. |
| TK_CALL | `/(abc)\g'1'/` | Re-runs the referenced subexpression. Here it matches the same text as `/(abc)(abc)/`. |
| TK_ANCHOR | `/\A/`, `/\Z/`, `/^/`, `/$/` | Start, end, or other match locations. |
| TK_OP_REPEAT | `/a+/`, `/a*/` | Repetition operator applied to the preceding element. |
| TK_INTERVAL | `/a{3,4}/` | Repetition bounded between two counts. |
| TK_ANYCHAR_ANYTIME | `/.*/` | Special token for any character, any number of times. |
| TK_ALT | `/a\|b/` | Either one alternative or the other. |
| TK_SUBEXP_OPEN | `/(ab)/` | The `(` that starts a subexpression. |
| TK_SUBEXP_CLOSE | `/(ab)/` | The `)` that ends a subexpression. |
| TK_CC_OPEN | `/[a]/` | The `[` that opens a character class, a set of alternative single-character matches. |
| TK_CC_CLOSE | `/[a]/` | The `]` that closes a character class. |
| TK_QUOTE_OPEN | `/\Q/` | Start of a quote sequence. Not included in the match (not enabled in Ruby). |
| TK_CHAR_PROPERTY | `/\p{Alnum}/`, `/\p{Katakana}/` | Match based on a character property, such as alphanumeric or Katakana. |
| TK_LINEBREAK | `/\R/` | Generic linebreak, matching sequences such as `\n` or `\r\n`. |
| TK_EXTENDED_GRAPHEME_CLUSTER | `/\X/` | One extended grapheme cluster, i.e. a user-perceived character that may span several code points. |
| TK_KEEP | `/abc\Kdef/` | Keep acts like a lookbehind: everything before the `\K` is left out of the match. |
| TK_CC_RANGE | `/[a-z]/` | The `-` in a character class, meaning all characters between the two endpoints are included. |
| TK_POSIX_BRACKET_OPEN | `/[[:word:]]/` | POSIX-style character class. |
| TK_CC_AND | `/[a-k&&h-z]/` | Takes the intersection of two character classes. |
| TK_CC_CC_OPEN | `/[[ab]c]/` | Start of a character class nested within a character class. |
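
Several of the less familiar tokens above are easiest to understand by running the syntax they stand for. The snippet below is a minimal sketch in plain Ruby (2.4 or later, UTF-8 source); the constant name `TOKEN_DEMO` and the sample strings are made up for illustration, and the snippet only exercises the regex syntax, not Onigmo's tokenizer itself.

```ruby
# Each entry runs the syntax associated with one token on a sample string.
TOKEN_DEMO = {
  # \K: "abc" must be present but is left out of the reported match.
  keep:      "abcdef"[/abc\Kdef/],
  # \g'1': re-run group 1, so the pattern needs "abc" twice.
  call:      "abcabc".match?(/\A(abc)\g'1'\z/),
  # &&: only characters in both a-k and h-z (that is, h-k) match.
  cc_and:    "gh".scan(/[a-k&&h-z]/),
  # \R: a CR LF pair is matched as a single linebreak.
  linebreak: "a\r\nb"[/\R/],
  # \X: "e" plus a combining acute accent counts as one cluster.
  grapheme:  "e\u0301x".scan(/\X/).length,
  # \p{Katakana}: U+30AB is KATAKANA LETTER KA.
  property:  "\u30AB".match?(/\p{Katakana}/),
  # POSIX bracket inside a character class.
  posix:     "a_1".scan(/[[:word:]]/),
  # Interval: greedy, so it takes the upper bound of four.
  interval:  "aaaaa"[/a{3,4}/]
}.freeze
```

For example, `TOKEN_DEMO[:keep]` is `"def"` and `TOKEN_DEMO[:cc_and]` is `["h"]`, since `g` falls outside the `h-z` half of the intersection.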