@SoniEx2
Last active December 18, 2016 02:31

ClEx - Clunky Expressions

Version 0.10.

Authors:

  • SoniEx2

About

ClEx is a regex-like pattern matching language that attempts to K.I.S.S. (keep it simple, stupid). Basically, everything is a match class. It is also highly flexible and easily extended.

Warning

ClEx was designed for use with binary data. Attempting to match non-binary data with ClEx may be met with discrimination and bigotry.

Simple ClEx Pattern

A Simple ClEx Pattern is just a string with no "special characters". For example, the pattern example matches the literal string "example".

Match classes

Everything from a simple character to a group capture is a "match class". All match classes are comparable: a match returns a number < 0 for less than, == 0 for equal, and > 0 for greater than.

Quantifiers

Quantifiers are the most basic feature of ClEx. They work differently from the ones in regex so we can KISS.

The + quantifier

The + quantifier creates a match class that matches 1 or more times, always matching as much as possible. Returns the result of the first match attempt.

The - quantifier

The - quantifier creates a match class that matches 1 or more times, always matching as little as possible. Returns the result of the first match attempt.

The ? quantifier

The ? quantifier creates a match class that matches 1 or 0 times, in that order. Thus, the construction a+? matches "a" 0 or more times in a greedy way, and the construction a-? matches "a" 0 or more times in a non-greedy way. Returns 0.
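
The three quantifiers can be pictured as differing only in the order in which candidate match lengths are tried. The following Python sketch is one possible reading, not part of the spec: it assumes a backtracking matcher over a single literal byte, and the names run_length, plus_ends, minus_ends and optional_ends are hypothetical.

```python
def run_length(byte, data, pos):
    """Count how many consecutive copies of `byte` start at `pos`."""
    n = 0
    while pos + n < len(data) and data[pos + n] == byte:
        n += 1
    return n


def plus_ends(byte, data, pos):
    """x+ : one or more copies, longest candidate first (greedy)."""
    n = run_length(byte, data, pos)
    return [pos + k for k in range(n, 0, -1)]


def minus_ends(byte, data, pos):
    """x- : one or more copies, shortest candidate first (non-greedy)."""
    n = run_length(byte, data, pos)
    return [pos + k for k in range(1, n + 1)]


def optional_ends(inner_ends, pos):
    """y? : try y once first, then fall back to matching zero times."""
    return inner_ends + [pos]
```

Under this reading, a+? on the input "aaab" would try end positions 3, 2, 1 and then 0, while a-? would try 1, 2, 3 and then 0.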

Groups

Groups are created by enclosing anything between (). For example, (abc) is a group that matches the string "abc". Comparing groups is simple: (abc) compares to the string "abc" as 0, to "abb" as a positive value, and to "abd" as a negative value. The most significant element is compared first, thus (abc) compares to "aad" as a positive value. Groups also capture their contents. As a special case, an empty group () captures the current position. Groups can be made non-capturing by adding a * right after the (.
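
The comparison examples above can be reproduced with a short Python sketch. It assumes the (expected - found) sign convention and an input at least as long as the group; compare_group is a hypothetical name, not something the spec defines.

```python
def compare_group(expected, found):
    """Compare a literal group against the input, most significant byte first.

    Returns 0 when every compared byte is equal, otherwise the difference
    (expected - found) at the first byte that differs.
    Assumes `found` is at least as long as `expected`.
    """
    for e, f in zip(expected, found):
        if e != f:
            return e - f
    return 0
```

With this convention, compare_group(b"abc", b"abb") is positive because "c" is greater than "b" at the first differing position.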

Groups can be "backmatched" by using the < modifier. This will make the last non-0 comparison more significant than the first. This, combined with sets, is useful when matching little-endian integers in big-endian strings.
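
A minimal Python sketch of backmatching for a literal group, again assuming the (expected - found) convention. The function name is hypothetical; the only change from a plain comparison is that later differences override earlier ones.

```python
def compare_backmatched(expected, found):
    """Compare byte by byte, letting the LAST non-zero difference win.

    This treats the final byte as the most significant one, which is how
    a little-endian integer orders its bytes.
    Assumes `found` is at least as long as `expected`.
    """
    result = 0
    for e, f in zip(expected, found):
        if e != f:
            result = e - f  # a later difference replaces an earlier one
    return result
```

For instance, the bytes 01 02 read as a little-endian integer are 0x0201, which is greater than 02 01 (0x0102), and the backmatched comparison agrees even though the first byte alone would suggest the opposite.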

Sets

ClEx supports ranges and alternations, collectively called "sets". Sets are made by putting a [, then things to match, then ]. Empty sets are allowed and match the empty string. A range is made by putting a : between things to match. A set can be negated with a *.

A set that starts with * matches anything "not in set". Thus, the set [*(abc)(123)(welp)] wouldn't match "abcd", "1234", "welp", but would match "help", "1334", "aelp".
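
The negated-set example can be checked with a small Python sketch. It only covers literal group members matched at a given position and does not model how many bytes the set consumes, which the text above leaves open; negated_set_matches is a hypothetical name.

```python
def negated_set_matches(members, data, pos=0):
    """Return True when none of the literal members matches at `pos`."""
    return not any(data.startswith(m, pos) for m in members)


MEMBERS = [b"abc", b"123", b"welp"]  # the set [*(abc)(123)(welp)]
```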

Range matching is done by comparing the input against the start and end matches. For example, the range [(aaa):(acc)] would match "aaa", "acc" and "aac", but not "acd" or "bbb". For the input "aac", (aaa) compares as "aaa" < "aac", and (acc) compares as "acc" > "aac". It would also match aa\xFF; this is intended.

As a special case, [x:y:z], where x, y and z are match classes, is semantically equivalent to [x:z], although the latter is strongly preferred. (This might change in a later version.)

Ranges follow short-circuit evaluation: If the lower limit of a range doesn't apply (i.e. input < lower), we don't evaluate the upper limit.
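
Range matching with short-circuit evaluation might look like the following Python sketch. It assumes literal groups as limits and the (expected - found) sign convention; compare_group and in_range are hypothetical names.

```python
def compare_group(expected, found):
    """Difference (expected - found) at the first differing byte, else 0."""
    for e, f in zip(expected, found):
        if e != f:
            return e - f
    return 0


def in_range(lower, upper, found):
    """Check lower <= found <= upper, skipping the upper limit when possible."""
    if compare_group(lower, found) > 0:
        # input < lower: short-circuit, the upper limit is never evaluated
        return False
    return compare_group(upper, found) >= 0
```

With the range [(aaa):(acc)], this accepts "aaa", "acc", "aac" and aa\xFF, and rejects "acd" and "bbb".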

Attempting to use a negated set in a less-than or greater-than comparison (as is the case with ranges) is undefined.

A set that doesn't match shall return the last attempt's result.

Start/end of string

Matching at the start and end of the string is done with ^ and $, respectively. ^ returns the current position (thus 0 when position = 0 = start of string), and $ returns length - position (thus 0 when position = length = end of string).
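
The return values of the two anchors are simple enough to state directly in code. This Python sketch uses hypothetical function names and treats the input as a bytes object.

```python
def match_start(data, pos):
    """^ : returns the current position, which is 0 only at the start."""
    return pos


def match_end(data, pos):
    """$ : returns length - position, which is 0 only at the end."""
    return len(data) - pos
```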

"Anychar"

To match any character, use a dot (.).

Escaping and special matchers

To escape characters in any of the above constructs just use %: for example, %(, %-, %+ and %? match the literals (, -, + and ?, respectively. To escape a literal % just prefix it with another %, as in %%. The % character was chosen because it doesn't conflict with most languages' string escaping character (\, used in C, Java, Lua, JavaScript, Python, PHP, etc).

Anything outside the ASCII range [A:Za:z0:9] (that is, anything that's not uppercase letter, lowercase letter or number) can be escaped like this.
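
A pattern parser could classify what follows a % roughly as in this Python sketch. The function name and the returned tuple shape are hypothetical. Treating % followed by a digit as an error is an assumption: the text only says that such characters cannot be escaped this way, and that % followed by a letter introduces a special matcher.

```python
import string

LETTERS = string.ascii_letters.encode("ascii")
ALNUM = (string.ascii_letters + string.digits).encode("ascii")


def read_escape(pattern, i):
    """Classify the escape starting at pattern[i], which must be '%'.

    Returns (kind, byte, next_index) where kind is "literal" or "special".
    """
    if pattern[i] != ord("%"):
        raise ValueError("not an escape")
    if i + 1 >= len(pattern):
        raise ValueError("dangling % at end of pattern")
    c = pattern[i + 1]
    if c in LETTERS:
        return ("special", c, i + 2)  # e.g. %n, %N, %b
    if c in ALNUM:
        # Assumption: % followed by a digit is not defined, so reject it.
        raise ValueError("% followed by a digit is unspecified")
    return ("literal", c, i + 2)  # e.g. %(, %-, %%
```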

Special matchers

Special matchers are written as % followed by one of [A:Za:z], followed by any metadata the matcher requires.

Currently specified special matchers are:

  • %nxy, where x is a number between 1 and 8, optionally preceded by < or >, and y is a match object, matches an unsigned number of size x, and attempts to match y exactly that many times. The number is read in native endianness by default. < means little-endian, > means big-endian. For example, %n<4. matches a string prefixed by a 4-byte/32-bit little-endian unsigned length.
  • %Nxy, where x is a number between 1 and 8, optionally preceded by < or >, and y is a match object, matches a signed number of size x, and attempts to match y exactly that many times. The number is read in native endianness by default. < means little-endian, > means big-endian. For example, %N>4. matches a string prefixed by a 4-byte/32-bit big-endian signed length.
  • %bxy, where x and y are different characters, matches strings that start with x, end with y, and where the x and y are balanced. This means that, if one reads the string from left to right, counting +1 for an x and -1 for a y, the ending y is the first y where the count reaches 0. For example, %b() matches expressions with balanced parentheses.
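
The matchers above can be sketched in Python for the common case where y is the "anychar" matcher. All function names here are hypothetical. Rejecting a negative signed count is an assumption, since the text does not say what %N should do with one. Native endianness would correspond to passing sys.byteorder.

```python
def match_length_prefixed(data, pos, size, byteorder, signed=False):
    """Sketch of %n (signed=False) or %N (signed=True) followed by '.'.

    `byteorder` is "little" for <, "big" for >.
    Returns the end position of the match, or None if it does not fit.
    """
    if pos + size > len(data):
        return None
    count = int.from_bytes(data[pos:pos + size], byteorder, signed=signed)
    end = pos + size + count
    if count < 0 or end > len(data):  # assumption: negative counts fail
        return None
    return end


def match_balanced(data, pos, open_byte, close_byte):
    """Sketch of %bxy: count +1 for x and -1 for y until the count hits 0.

    Returns the position just past the closing y, or None.
    """
    if pos >= len(data) or data[pos] != open_byte:
        return None
    depth = 0
    for i in range(pos, len(data)):
        if data[i] == open_byte:
            depth += 1
        elif data[i] == close_byte:
            depth -= 1
            if depth == 0:
                return i + 1
    return None
```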

All unknown special matchers are an error condition.

End of string

When reaching end of string, it is recommended to signal an end of string. If an end of string is signaled while matching a range, the range must be discarded, and the set matching should continue.

Simple character

Any simple character returns the difference between the expected character and the character found. This can be either (expected - found) or (found - expected). Ranges must be coded accordingly.
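
Because either sign convention is permitted, a range check has to be written to match the one chosen. The Python sketch below shows both; the function names and the expected_minus_found flag are hypothetical.

```python
def compare_byte(expected, found, expected_minus_found=True):
    """Return the signed difference under the chosen convention."""
    if expected_minus_found:
        return expected - found
    return found - expected


def byte_in_range(lo, hi, found, expected_minus_found=True):
    """Check lo <= found <= hi using only the comparison results."""
    lo_cmp = compare_byte(lo, found, expected_minus_found)
    hi_cmp = compare_byte(hi, found, expected_minus_found)
    if expected_minus_found:
        # lo - found <= 0 and hi - found >= 0
        return lo_cmp <= 0 <= hi_cmp
    # found - lo >= 0 and found - hi <= 0
    return hi_cmp <= 0 <= lo_cmp
```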

Extensions

Proprietary extensions are allowed.

Encoding

ClEx operates on raw byte streams.

@pablomayobre

Wow! This spec sounds really nice!

Just a small suggestion/question: Could you provide some examples? With example functions and what they are expected to return

I would be interested in writing a Lua module (probably in C), so seeing how this would look compared to the builtin "RegEx" would be nice. You could even take some of the String Recipes here and show how they would be done with ClEx

Just tips! I really like the overall look of it (although the examples would help clarify some stuff haha)

@pablomayobre

Sorry, it won't let me update the comment to add the link to the recipes, so here it is: lua-users.org/wiki/StringRecipes

@SoniEx2
Author

SoniEx2 commented Jul 29, 2015

Sorry, this is still a WIP. Please wait until I at least have a working implementation.
