Last active Dec 14, 2015

Let me start by saying that this is really amazing work. I was genuinely hoping somebody with an iron will and expert regexp chops would step up to the plate and fill out the clojureRegexp matches. And here you are.

This is pretty sweet. Creating this test suite is nothing short of brilliant. Right off the bat I was able to fix two problems (just pushed the fixes). Nice work!

At first I thought that testing Vim syntax was going to be slow (many shellouts to vim) or complicated (using clientserver interface to talk to a vim server), but once I realized that we had macros available, a simple design just fell out.

It would still be faster to talk to a running vim server, but that seems like a lot of complexity for a relatively small speedup.


I'm ignoring \x{h...h} because, honestly, I'm not familiar with it and was confused by the definition. If you can give me an example, I can write a pattern for it.

I believe the purpose of this is to match Unicode characters beyond the Basic Multilingual Plane (0x0000−0xFFFF). For example, here's the first Plane 1 character that my terminal manages to render:

(re-find #"\x{10300}" "𐌀bc") ; "𐌀"
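A small REPL sketch of why this character sits outside the BMP: on the JVM it occupies two UTF-16 code units but is a single code point. The name old-italic-a is only an illustrative placeholder, and the string is built from its code point so the example does not depend on file encoding.

```clojure
;; U+10300 built from its code point (hypothetical name, for illustration only).
(def old-italic-a (String. (Character/toChars 0x10300)))

;; Two UTF-16 code units (a surrogate pair) ...
(count old-italic-a)
;; => 2

;; ... but a single Unicode code point.
(.codePointCount old-italic-a 0 (count old-italic-a))
;; => 1

;; \x{h...h} matches the whole code point, not half of the pair.
(= old-italic-a (re-find #"\x{10300}" (str old-italic-a "bc")))
;; => true
```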

Also, I've ignored comments which are permitted inside (?x) groups. I can study the ruby syntax file for ideas on how to implement this but I've skipped it for now.

Ah, right. Sounds tricky, but it does seem useful. We can definitely put that on the TODO list for now, however.
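For reference, this is the kind of input the syntax file would eventually have to cope with. It is a minimal sketch; the pattern and the name year-month are made up for illustration.

```clojure
;; In (?x) mode, whitespace in the pattern is ignored and # starts a comment
;; that runs to the end of the line. (Hypothetical pattern.)
(def year-month
  #"(?x)
    (\d{4})   # year
    -         # literal separator
    (\d{2})   # month
  ")

(re-find year-month "2015-12")
;; => ["2015-12" "2015" "12"]
```

Note that such a comment cannot contain an unescaped ", since that would end the regexp literal as far as the Clojure reader is concerned.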


The documentation doesn't mention this behavior, but judging from the results above, I believe my assumption is correct. Since I couldn't figure out a way to handle the case where the \E is dropped, I decided to leave it out. On the other hand, implementing it as I've described is a nice way to use syntax highlighting to enforce good coding style.

I've pushed a potential solution here:

Basically, the clojureRegexp region is now declared keepend, which will prevent any nested regions from matching outside its own region. Then, the end pattern end=/"/me=s-1 is added to prevent matching the ending " as clojureRegexpQuote.

In addition, in order to hilink only the delimiters \Q and \E, a subregion called clojureRegexpQuoted is created matching only the contents of the clojureRegexpQuote region. This subregion is linked to the normal clojureRegexp group, and the outer region remains linked to Character (although, do you think it may be better to link in the same style as clojureRegexpBoundary, since the \Q is a delimiter and not a character?)
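As a reminder of the regexp behavior being highlighted here, a quick REPL sketch follows. The last two forms rely on my understanding that java.util.regex treats a \Q with no closing \E as quoting through the end of the pattern; that is an assumption worth checking on your own JVM.

```clojure
;; Inside \Q...\E every character is literal, so . no longer matches anything.
(re-find #"\Qa.b\E" "a.b")   ;; => "a.b"
(re-find #"\Qa.b\E" "axb")   ;; => nil

;; Without quoting, . is the usual wildcard.
(re-find #"a.b" "axb")       ;; => "axb"

;; Dropped \E: the quote appears to extend to the end of the pattern.
(re-find #"\Qa.b" "a.b")     ;; => "a.b"
(re-find #"\Qa.b" "axb")     ;; => nil
```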

A few quick notes/questions:

., +, *, [, ], {, }, (, and ) have been added to clojureRegexpEscape (even though they're not documented)

I believe this part in the documentation covers that:

\ Nothing, but quotes the following character

So the fallback role of the backslash is to make the following character unspecial (#"\." matches literal period .). For this reason these should probably not be matched as clojureRegexpEscape, since the other escape chars are meant to represent characters outside of the printable ASCII table.
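To make the distinction concrete, here is a short sketch contrasting the two roles of the backslash: quoting a metacharacter versus naming a character that is awkward to type.

```clojure
;; Unescaped, . matches any character.
(re-seq #"." "a.b")      ;; => ("a" "." "b")

;; Escaped, the backslash only removes the special meaning.
(re-seq #"\." "a.b")     ;; => (".")
(re-find #"\+" "1+1")    ;; => "+"

;; By contrast, \t stands for a character outside the printable range.
(re-find #"\t" "a\tb")   ;; => "\t"
```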

Should we extract java.lang.Character classes and Unicode classes into their own syntax groups per the docs?

I don't think that's necessary. The only categorical differences among these seem to be their area of origination. They do work on different byte ranges, but not exclusively.

So perhaps clojureRegexpPosixCharClass should be renamed to something more general to match the entire set of \p{…} classes.

The entire set, btw, is apparently quite large (I just learned). Quoting the Android docs on java.util.regex.Pattern:

  • Unicode category names, prefixed by Is. For example \p{IsLu} for all uppercase letters.
  • POSIX class names. These are 'Alnum', 'Alpha', 'ASCII', 'Blank', 'Cntrl', 'Digit', 'Graph', 'Lower', 'Print', 'Punct', 'Upper', 'XDigit'.
  • Unicode block names, as used by forName(String) prefixed by In. For example \p{InHebrew} for all characters in the Hebrew block.
  • Character method names. These are all non-deprecated methods from Character whose name starts with is, but with the is replaced by java. For example, \p{javaLowerCase}.

(Not mentioned is the fact that the two-letter category names seem to be available as both \p{Lu} and \p{IsLu})
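A few REPL checks, one per family from the list above, as a sketch of what a more general \p{…} group would need to cover. The Hebrew letter is built from its code point (U+05D0) to avoid encoding trouble.

```clojure
;; Unicode category, with and without the Is prefix.
(re-find #"\p{Lu}" "abC")               ;; => "C"
(re-find #"\p{IsLu}" "abC")             ;; => "C"

;; POSIX class name.
(re-find #"\p{Alpha}+" "abc123")        ;; => "abc"

;; Unicode block name, prefixed by In.
(def alef (str (char 0x05D0)))
(= alef (re-find #"\p{InHebrew}" (str "a" alef)))   ;; => true

;; java.lang.Character method name, with "is" replaced by "java".
(re-find #"\p{javaLowerCase}+" "ABcd")  ;; => "cd"
```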

So this is a very large list. For example:

(re-find #"\p{InUnifiedCanadianAboriginalSyllabics}" "᐀	U+1400	CANADIAN SYLLABICS HYPHEN")
;; -> "᐀" (I don't have a font that can display this)


I think if we want to be complete, we can build a simple Unicode spec parser, and also use the JVM to find the java.lang.Character is* method names, then dump them for copy&paste in the manner of vim-clojure-static.generate/syntax-words.
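The JVM half of that idea might look roughly like the sketch below. The names valid-property? and java-property-names are hypothetical, and this is not how vim-clojure-static.generate currently works. Instead of checking for deprecation directly, it simply keeps the candidate names that the regexp engine on the running JVM accepts, so the result may vary between Java versions.

```clojure
(require '[clojure.string :as string])
(import '(java.util.regex PatternSyntaxException))

(defn valid-property?
  "True if \\p{prop} compiles as a regexp on this JVM. (Hypothetical helper.)"
  [prop]
  (try
    (re-pattern (str "\\p{" prop "}"))
    true
    (catch PatternSyntaxException _ false)))

(def java-property-names
  "Sorted java* property names derived from java.lang.Character is* methods."
  (->> (.getMethods Character)
       (map #(.getName %))
       (filter #(re-matches #"is[A-Z].*" %))
       (map #(string/replace-first % #"^is" "java"))
       (filter valid-property?)   ; drop names the regexp engine rejects
       distinct
       sort))

;; Dump for copy&paste, one name per line.
(doseq [prop java-property-names]
  (println prop))
```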

Lol, I just chuckled at how ridiculous this sounds! Let me know if you think this is overboard.

Should we extract out ranges and negations [^ ...] ?

Do you mean match these differently to signal an inversion (for instance)? That sounds pretty interesting. I would enjoy it if [abc] flipped to a different color if I changed it to [^abc] or to [a-c].

Also sounds difficult.

I'm planning to add && once I figure out the best way to do it.
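For reference, the character class variants under discussion behave like this at the REPL. This is only a sketch of the regexp semantics and says nothing about how the highlighting should be implemented.

```clojure
;; Plain class, negated class, and range.
(re-seq #"[abc]" "abcd")    ;; => ("a" "b" "c")
(re-seq #"[^abc]" "abcd")   ;; => ("d")
(re-seq #"[a-c]" "abcd")    ;; => ("a" "b" "c")

;; && is intersection: lowercase letters that are not vowels.
(re-seq #"[a-z&&[^aeiou]]" "clojure")
;; => ("c" "l" "j" "r")
```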


That's it for now. I've updated the branch and am taking a break from this for today! :)

Again, this is very much appreciated. You can move on this at your own pace, so don't feel pressured in any way.
