kragen/java-regexps-suck.md

## java-regexps-suck.md

      
    Raw
  

              java-regexps-suck.md
            
          
    This is slightly munged jstack output from a .find() of a Java regular expression that looks like some initial stuff some stuff (\d+) some more stuff (foo|bar) some more sstuff (\d+) being matched against a logfile line of around 100 characters.
"main" prio=10 tid=0x0000000000f31800 nid=0x523e runnable [0x00007f7ece452000]
   java.lang.Thread.State: RUNNABLE
at java.util.regex.Pattern$Curly.match(Pattern.java:3761)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Slice.match(Pattern.java:3499)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$Slice.match(Pattern.java:3499)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Slice.match(Pattern.java:3499)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3799)
at java.util.regex.Pattern$Curly.match(Pattern.java:3761)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$BnM.match(Pattern.java:4990)
at java.util.regex.Matcher.search(Matcher.java:1116)
at java.util.regex.Matcher.find(Matcher.java:552)

These are all megamorphic callsites; inlining the calls would require specialization, and as can be seen, specialization on the node types of the entire ridiculously complicated regexp abstract syntax tree.  That means we're doing multiple megamorphic virtual calls per parse tree node.  (The recursion here is not quite as ridiculous as it looks; it's nearly all recursive calls to next.match, but Curly also repeatedly invokes atom.match, and of course BranchConn has its atoms[n].match call.)  This kind of bytecode completely stymies HotSpot's optimizations, because they all rely pretty heavily on inlining.

The top-level BnM node does a useful Boyer-Moore optimization to avoid looking at places where the pattern certainly can't match, and furthermore it does a reasonably good job of it in this case — when I stepped through it, it used the "some initial stuff some stuff" string to examine about seven of the hundred or so characters in the haystack before concluding that there could be no match.  I'm pretty sure that it's searching for an initial invariant segment of the pattern, though, not the longest or least probable invariant segment.

Compare all this nonsense with its backtracking and incessant incrementing of loop variables and megamorphic virtual calls and bullshit like that to Thompson's 1967 regexp patent, which explains how you can compile your finite automaton to IBM 7094 assembly language, compiling "ABCD" to, for example,
TXL FAIL,1,'A'-1    TEST FOR A
TXH FAIL,1,'A'
TSX NNODE,4
TXL FAIL,1,'B'-1    TEST FOR B
TXH FAIL,1,'B'
TSX NNODE,4
TXL FAIL,1,'C'-1    TEST FOR C
TXH FAIL,1,'C'
TSX NNODE,4
TXL FAIL,1,'D'-1    TEST FOR D
TXH FAIL,1,'D'
TRA FOUND           TEST TEXT EXHAUSTED

(I have taken the liberty of omitting the leftmost column, which I suppose is the label field, from the FAP code; it reads CODE on each line.)  NNODE here is a seven-instruction subroutine being invoked to, among other things, advance the haystack pointer; so we can see that this requires 10 machine instructions per character examined.
This is all to explain why the hotspot of my program is in fact the Java regular expression engine (and related parsing problems with SimpleDateFormat, DecimalFormat, DigitList, and parseLong) rather than anything actually useful, and how I feel that a regexp engine that emitted JVM bytecode could perform dramatically better.