Skip to content

Instantly share code, notes, and snippets.

@kragen
Created January 15, 2015 02:01
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save kragen/bf0ad9ded9843082e7f8 to your computer and use it in GitHub Desktop.
Save kragen/bf0ad9ded9843082e7f8 to your computer and use it in GitHub Desktop.

This is slightly munged jstack output from a .find() of a Java regular expression that looks like some initial stuff some stuff (\d+) some more stuff (foo|bar) some more sstuff (\d+) being matched against a logfile line of around 100 characters.

"main" prio=10 tid=0x0000000000f31800 nid=0x523e runnable [0x00007f7ece452000]
   java.lang.Thread.State: RUNNABLE
at java.util.regex.Pattern$Curly.match(Pattern.java:3761)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Slice.match(Pattern.java:3499)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$Slice.match(Pattern.java:3499)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Slice.match(Pattern.java:3499)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3799)
at java.util.regex.Pattern$Curly.match(Pattern.java:3761)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$BnM.match(Pattern.java:4990)
at java.util.regex.Matcher.search(Matcher.java:1116)
at java.util.regex.Matcher.find(Matcher.java:552)

These are all megamorphic callsites; inlining the calls would require specialization, and as can be seen, specialization on the node types of the entire ridiculously complicated regexp abstract syntax tree. That means we're doing multiple megamorphic virtual calls per parse tree node. (The recursion here is not quite as ridiculous as it looks; it's nearly all recursive calls to next.match, but Curly also repeatedly invokes atom.match, and of course BranchConn has its atoms[n].match call.) This kind of bytecode completely stymies HotSpot's optimizations, because they all rely pretty heavily on inlining.

The top-level BnM node does a useful Boyer-Moore optimization to avoid looking at places where the pattern certainly can't match, and furthermore it does a reasonably good job of it in this case — when I stepped through it, it used the "some initial stuff some stuff" string to examine about seven of the hundred or so characters in the haystack before concluding that there could be no match. I'm pretty sure that it's searching for an initial invariant segment of the pattern, though, not the longest or least probable invariant segment.

Compare all this nonsense with its backtracking and incessant incrementing of loop variables and megamorphic virtual calls and bullshit like that to Thompson's 1967 regexp patent, which explains how you can compile your finite automaton to IBM 7094 assembly language, compiling "ABCD" to, for example,

TXL FAIL,1,'A'-1    TEST FOR A
TXH FAIL,1,'A'
TSX NNODE,4
TXL FAIL,1,'B'-1    TEST FOR B
TXH FAIL,1,'B'
TSX NNODE,4
TXL FAIL,1,'C'-1    TEST FOR C
TXH FAIL,1,'C'
TSX NNODE,4
TXL FAIL,1,'D'-1    TEST FOR D
TXH FAIL,1,'D'
TRA FOUND           TEST TEXT EXHAUSTED

(I have taken the liberty of omitting the leftmost column, which I suppose is the label field, from the FAP code; it reads CODE on each line.) NNODE here is a seven-instruction subroutine being invoked to, among other things, advance the haystack pointer; so we can see that this requires 10 machine instructions per character examined.

This is all to explain why the hotspot of my program is in fact the Java regular expression engine (and related parsing problems with SimpleDateFormat, DecimalFormat, DigitList, and parseLong) rather than anything actually useful, and how I feel that a regexp engine that emitted JVM bytecode could perform dramatically better.

@kragen
Copy link
Author

kragen commented Jan 15, 2015

As a point of comparison for the speed, with GNU grep, LC_ALL=C time egrep 'pattern' logfile >/dev/null was able to chew through 2200 megabytes comprising 31 million log lines in 6.24 wallclock seconds, which is 350 megabytes per second or 5 million lines per second. I wrote a very simple performance test harness for my logfile-line-dispatching function that generated the above jstack trace, parsing the same 61-byte line a million times, and it took 3–5 seconds, on the same machine; that's about 15 megabytes per second or 0.3 million lines per second. It's testing 8 patterns (and stupidly not exiting after the first match, but whatever), but even multiplying by 8, not taking into account the Boyer-Moore optimization that makes failed matches almost free, we get 120 megabytes per second or 2 million lines per second. And grep isn't even using dynamic code generation.

(Grep also finds non-matches rapidly. Searching for one of the 8 patterns that occurred on only 500 of the 31 million lines, it completes in 0.44 seconds rather than 6+ seconds.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment