Skip to content

Instantly share code, notes, and snippets.

@slimsag
Created May 6, 2021 23:58
Show Gist options
  • Save slimsag/c243dd016ab7afe0967450a61869ef97 to your computer and use it in GitHub Desktop.
Save slimsag/c243dd016ab7afe0967450a61869ef97 to your computer and use it in GitHub Desktop.

Index

Package regexp implements regular expression search.

The syntax of the regular expressions accepted is the same general syntax used by Perl, Python, and other languages. More precisely, it is the syntax accepted by RE2 and described at https://golang.org/s/re2syntax, except for \C. For an overview of the syntax, run

go doc regexp/syntax

The regexp implementation provided by this package is guaranteed to run in time linear in the size of the input. (This is a property not guaranteed by most open source implementations of regular expressions.) For more information about this property, see

[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

or any book about automata theory.

All characters are UTF-8-encoded code points.

There are 16 methods of Regexp that match a regular expression and identify the matched text. Their names are matched by this regular expression:

Find(All)?(String)?(Submatch)?(Index)?

If 'All' is present, the routine matches successive non-overlapping matches of the entire expression. Empty matches abutting a preceding match are ignored. The return value is a slice containing the successive return values of the corresponding non-'All' routine. These routines take an extra integer argument, n. If n >= 0, the function returns at most n matches/submatches; otherwise, it returns all of them.

If 'String' is present, the argument is a string; otherwise it is a slice of bytes; return values are adjusted as appropriate.

If 'Submatch' is present, the return value is a slice identifying the successive submatches of the expression. Submatches are matches of parenthesized subexpressions (also known as capturing groups) within the regular expression, numbered from left to right in order of opening parenthesis. Submatch 0 is the match of the entire expression, submatch 1 the match of the first parenthesized subexpression, and so on.

If 'Index' is present, matches and submatches are identified by byte index pairs within the input string: result[2n:2n+1] identifies the indexes of the nth submatch. The pair for n==0 identifies the match of the entire expression. If 'Index' is not present, the match is identified by the text of the match/submatch. If an index is negative or text is nil, it means that subexpression did not match any string in the input. For 'String' versions an empty string means either no match or an empty match.

There is also a subset of the methods that can be applied to text read from a RuneReader:

MatchReader, FindReaderIndex, FindReaderSubmatchIndex

This set may grow. Note that regular expression matches may need to examine text beyond the text returned by a match, so the methods that match text from a RuneReader may read arbitrarily far into the input before returning.

(There are a few other methods that do not match this pattern.)

Index

const visitedBits = 32
const maxBacktrackProg = 500 // len(prog.Inst) <= max
const maxBacktrackVector = 256 * 1024 // bit vector size <= max (bits)
const mergeFailed = uint32(0xffffffff)

mergeRuneSets merges two non-intersecting runesets, and returns the merged result, and a NextIp array. The idea is that if a rune matches the OnePassRunes at index i, NextIp[i/2] is the target. If the input sets intersect, an empty runeset and a NextIp array with the single element mergeFailed is returned. The code assumes that both inputs contain ordered and non-intersecting rune pairs.

const endOfText rune = -1
const startSize = 10 // The size at which to start a slice in the 'All' routines.
var bitStatePool sync.Pool
var t *thread
var flag lazyFlag
var onePassPool sync.Pool
var flag lazyFlag
var arrayNoInts [0]int

arrayNoInts is returned by doExecute match if nil dstCap is passed to it with ncap=0.

var buf strings.Builder

Have prefix; gather characters.

var noRune = ...
var noNext = ...
var lx, rx int
var lx, rx int
var anyRuneNotNL = ...
var anyRune = ...
var instQueue = newQueue(len(p.Inst))
var visitQueue = newQueue(len(p.Inst))
var check func(uint32, []bool) bool
var onePassRunes = make([][]rune, len(p.Inst))
var matchSize = ...

Pools of machine for use during (Regexp).doExecute, split up by the size of the execution queues. matchPool[i] machines have queue size matchSize[i]. On a 64-bit system each queue entry is 16 bytes, so matchPool[0] has 162128 = 4kB queues, etc. The final matchPool is a catch-all for very large queues.

var matchPool [len(matchSize)]sync.Pool

Pools of machine for use during (Regexp).doExecute, split up by the size of the execution queues. matchPool[i] machines have queue size matchSize[i]. On a 64-bit system each queue entry is 16 bytes, so matchPool[0] has 162128 = 4kB queues, etc. The final matchPool is a catch-all for very large queues.

var lnext int
var buf []byte
var endPos int
var dstCap [2]int
var width int

Advance past this match; always advance at least one character.

var specialBytes [16]byte

Bitmap used by func special to check whether a character needs to be escaped.

var i int

A byte loop is correct because all metacharacters are ASCII.

var end int
var width int
var dstCap [2]int
var dstCap [2]int
var dstCap [4]int
var dstCap [4]int
var result [][]byte
var result [][]int
var result []string
var result [][]int
var result [][][]byte
var result [][]int
var result [][]string
var result [][]int
var goodRe = ...
var badRe = ...
var replaceTests = ...
var replaceLiteralTests = ...
var replaceFuncTests = ...
var metaTests = ...
var literalPrefixTests = ...
var emptySubexpIndices = ...
var subexpCases = ...
var splitTests = ...
var sink string
var compileBenchData = ...
var minInputLenTests = ...
var txt io.Reader
var str []string
var input []string
var inStrings bool
var re *Regexp
var refull *Regexp
var nfail int
var ncase int
var text string
var run = ...
var match = ...
var notab = MustCompilePOSIX(`[^\t]+`)
var x []int
var end byte = ')'
var v = -1
var err error
var text []byte
var benchData = ...
var benchSizes = ...
var findTests = ...
var runeMergeTests = ...
var onePassTests = ...
var p *syntax.Prog
var re *syntax.Regexp
var err error
var onePassTests1 = ...

TODO(cespare): Unify with onePassTests and rationalize one-pass test cases.

type job struct {
	pc  uint32
	arg bool
	pos int
}

A job is an entry on the backtracker's job stack. It holds the instruction pc and the position in the input.

type bitState struct {
	end      int
	cap      []int
	matchcap []int
	jobs     []job
	visited  []uint32

	inputs inputs
}

bitState holds state for the backtracker.

func newBitState() *bitState
func (b *bitState) reset(prog *syntax.Prog, end int, ncap int)

reset resets the state of the backtracker. end is the end position in the input. ncap is the number of captures.

func (b *bitState) shouldVisit(pc uint32, pos int) bool

shouldVisit reports whether the combination of (pc, pos) has not been visited yet.

func (b *bitState) push(re *Regexp, pc uint32, pos int, arg bool)

push pushes (pc, pos, arg) onto the job stack if it should be visited.

type queue struct {
	sparse []uint32
	dense  []entry
}

A queue is a 'sparse array' holding pending threads of execution. See https://research.swtch.com/2008/03/using-uninitialized-memory-for-fun-and.html

type entry struct {
	pc uint32
	t  *thread
}

An entry is an entry on a queue. It holds both the instruction pc and the actual thread. Some queue entries are just place holders so that the machine knows it has considered that pc. Such entries have t == nil.

type thread struct {
	inst *syntax.Inst
	cap  []int
}

A thread is the state of a single path through the machine: an instruction and a corresponding capture array. See https://swtch.com/~rsc/regexp/regexp2.html

type machine struct {
	re       *Regexp      // corresponding Regexp
	p        *syntax.Prog // compiled program
	q0, q1   queue        // two queues for runq, nextq
	pool     []*thread    // pool of available threads
	matched  bool         // whether a match was found
	matchcap []int        // capture information for the match

	inputs inputs
}

A machine holds all the state during an NFA simulation for p.

func (m *machine) init(ncap int)
func (m *machine) alloc(i *syntax.Inst) *thread

alloc allocates a new thread with the given instruction. It uses the free pool if possible.

func (m *machine) match(i input, pos int) bool

match runs the machine over the input starting at pos. It reports whether a match was found. If so, m.matchcap holds the submatch information.

func (m *machine) clear(q *queue)

clear frees all threads on the thread queue.

func (m *machine) step(runq, nextq *queue, pos, nextPos int, c rune, nextCond *lazyFlag)

step executes one step of the machine, running each of the threads on runq and appending new threads to nextq. The step processes the rune c (which may be endOfText), which starts at position pos and ends at nextPos. nextCond gives the setting for the empty-width flags after c.

func (m *machine) add(q *queue, pc uint32, pos int, cap []int, cond *lazyFlag, t *thread) *thread

add adds an entry to q for pc, unless the q already has such an entry. It also recursively adds an entry for all instructions reachable from pc by following empty-width conditions satisfied by cond. pos gives the current position in the input.

type inputs struct {
	// cached inputs, to avoid allocation
	bytes  inputBytes
	string inputString
	reader inputReader
}
func (i *inputs) newBytes(b []byte) input
func (i *inputs) newString(s string) input
func (i *inputs) newReader(r io.RuneReader) input
func (i *inputs) clear()
func (i *inputs) init(r io.RuneReader, b []byte, s string) (input, int)
type lazyFlag uint64

A lazyFlag is a lazily-evaluated syntax.EmptyOp, for checking zero-width flags like ^ $ \A \z \B \b. It records the pair of relevant runes and does not determine the implied flags until absolutely necessary (most of the time, that means never).

func newLazyFlag(r1, r2 rune) lazyFlag
func (f lazyFlag) match(op syntax.EmptyOp) bool
type onePassMachine struct {
	inputs   inputs
	matchcap []int
}
func newOnePassMachine() *onePassMachine
type onePassProg struct {
	Inst   []onePassInst
	Start  int // index of start instruction
	NumCap int // number of InstCapture insts in re
}

A onePassProg is a compiled one-pass regular expression program. It is the same as syntax.Prog except for the use of onePassInst.

func onePassCopy(prog *syntax.Prog) *onePassProg

onePassCopy creates a copy of the original Prog, as we'll be modifying it

func makeOnePass(p *onePassProg) *onePassProg

makeOnePass creates a onepass Prog, if possible. It is possible if at any alt, the match engine can always tell which branch to take. The routine may modify p if it is turned into a onepass Prog. If it isn't possible for this to be a onepass Prog, the Prog nil is returned. makeOnePass is recursive to the size of the Prog.

func compileOnePass(prog *syntax.Prog) (p *onePassProg)

compileOnePass returns a new *syntax.Prog suitable for onePass execution if the original Prog can be recharacterized as a one-pass regexp program, or syntax.nil if the Prog cannot be converted. For a one pass prog, the fundamental condition that must be true is: at any InstAlt, there must be no ambiguity about what branch to take.

type onePassInst struct {
	syntax.Inst
	Next []uint32
}

A onePassInst is a single instruction in a one-pass regular expression program. It is the same as syntax.Inst except for the new 'Next' field.

type queueOnePass struct {
	sparse          []uint32
	dense           []uint32
	size, nextIndex uint32
}

Sparse Array implementation is used as a queueOnePass.

func newQueue(size int) (q *queueOnePass)
func (q *queueOnePass) empty() bool
func (q *queueOnePass) next() (n uint32)
func (q *queueOnePass) clear()
func (q *queueOnePass) contains(u uint32) bool
func (q *queueOnePass) insert(u uint32)
func (q *queueOnePass) insertNew(u uint32)
type runeSlice []rune

runeSlice exists to permit sorting the case-folded rune sets.

func (p runeSlice) Len() int
func (p runeSlice) Less(i, j int) bool
func (p runeSlice) Swap(i, j int)
type Regexp struct {
	expr           string       // as passed to Compile
	prog           *syntax.Prog // compiled program
	onepass        *onePassProg // onepass program or nil
	numSubexp      int
	maxBitStateLen int
	subexpNames    []string
	prefix         string         // required prefix in unanchored matches
	prefixBytes    []byte         // prefix, as a []byte
	prefixRune     rune           // first rune in prefix
	prefixEnd      uint32         // pc for last rune in prefix
	mpool          int            // pool for machines
	matchcap       int            // size of recorded match lengths
	prefixComplete bool           // prefix is the entire regexp
	cond           syntax.EmptyOp // empty-width conditions required at start of match
	minInputLen    int            // minimum length of the input in bytes

	// This field can be modified by the Longest method,
	// but it is otherwise read-only.
	longest bool // whether regexp prefers leftmost-longest match
}

Regexp is the representation of a compiled regular expression. A Regexp is safe for concurrent use by multiple goroutines, except for configuration methods, such as Longest.

func Compile(expr string) (*Regexp, error)

Compile parses a regular expression and returns, if successful, a Regexp object that can be used to match against text.

When matching against text, the regexp returns a match that begins as early as possible in the input (leftmost), and among those it chooses the one that a backtracking search would have found first. This so-called leftmost-first matching is the same semantics that Perl, Python, and other implementations use, although this package implements it without the expense of backtracking. For POSIX leftmost-longest matching, see CompilePOSIX.

func CompilePOSIX(expr string) (*Regexp, error)

CompilePOSIX is like Compile but restricts the regular expression to POSIX ERE (egrep) syntax and changes the match semantics to leftmost-longest.

That is, when matching against text, the regexp returns a match that begins as early as possible in the input (leftmost), and among those it chooses a match that is as long as possible. This so-called leftmost-longest matching is the same semantics that early regular expression implementations used and that POSIX specifies.

However, there can be multiple leftmost-longest matches, with different submatch choices, and here this package diverges from POSIX. Among the possible leftmost-longest matches, this package chooses the one that a backtracking search would have found first, while POSIX specifies that the match be chosen to maximize the length of the first subexpression, then the second, and so on from left to right. The POSIX rule is computationally prohibitive and not even well-defined. See https://swtch.com/~rsc/regexp/regexp2.html#posix for details.

func compile(expr string, mode syntax.Flags, longest bool) (*Regexp, error)
func MustCompile(str string) *Regexp

MustCompile is like Compile but panics if the expression cannot be parsed. It simplifies safe initialization of global variables holding compiled regular expressions.

func MustCompilePOSIX(str string) *Regexp

MustCompilePOSIX is like CompilePOSIX but panics if the expression cannot be parsed. It simplifies safe initialization of global variables holding compiled regular expressions.

func compileTest(t *testing.T, expr string, error string) *Regexp
func tryCompile(s string) (re *Regexp, err error)
func (re *Regexp) tryBacktrack(b *bitState, i input, pc uint32, pos int) bool

tryBacktrack runs a backtracking search starting at pos.

func (re *Regexp) backtrack(ib []byte, is string, pos int, ncap int, dstCap []int) []int

backtrack runs a backtracking search of prog on the input starting at pos.

func (re *Regexp) doOnePass(ir io.RuneReader, ib []byte, is string, pos, ncap int, dstCap []int) []int

doOnePass implements r.doExecute using the one-pass execution engine.

func (re *Regexp) doMatch(r io.RuneReader, b []byte, s string) bool

doMatch reports whether either r, b or s match the regexp.

func (re *Regexp) doExecute(r io.RuneReader, b []byte, s string, pos int, ncap int, dstCap []int) []int

doExecute finds the leftmost match in the input, appends the position of its subexpressions to dstCap and returns dstCap.

nil is returned if no matches are found and non-nil if matches are found.

func (re *Regexp) String() string

String returns the source text used to compile the regular expression.

func (re *Regexp) Copy() *Regexp

Copy returns a new Regexp object copied from re. Calling Longest on one copy does not affect another.

Deprecated: In earlier releases, when using a Regexp in multiple goroutines, giving each goroutine its own copy helped to avoid lock contention. As of Go 1.12, using Copy is no longer necessary to avoid lock contention. Copy may still be appropriate if the reason for its use is to make two copies with different Longest settings.

func (re *Regexp) Longest()

Longest makes future searches prefer the leftmost-longest match. That is, when matching against text, the regexp returns a match that begins as early as possible in the input (leftmost), and among those it chooses a match that is as long as possible. This method modifies the Regexp and may not be called concurrently with any other methods.

func (re *Regexp) get() *machine

get returns a machine to use for matching re. It uses the re's machine cache if possible, to avoid unnecessary allocation.

func (re *Regexp) put(m *machine)

put returns a machine to the correct machine pool.

func (re *Regexp) NumSubexp() int

NumSubexp returns the number of parenthesized subexpressions in this Regexp.

func (re *Regexp) SubexpNames() []string

SubexpNames returns the names of the parenthesized subexpressions in this Regexp. The name for the first sub-expression is names[1], so that if m is a match slice, the name for m[i] is SubexpNames()[i]. Since the Regexp as a whole cannot be named, names[0] is always the empty string. The slice should not be modified.

func (re *Regexp) SubexpIndex(name string) int

SubexpIndex returns the index of the first subexpression with the given name, or -1 if there is no subexpression with that name.

Note that multiple subexpressions can be written using the same name, as in (?Pa+)(?Pb+), which declares two subexpressions named "bob". In this case, SubexpIndex returns the index of the leftmost such subexpression in the regular expression.

func (re *Regexp) LiteralPrefix() (prefix string, complete bool)

LiteralPrefix returns a literal string that must begin any match of the regular expression re. It returns the boolean true if the literal string comprises the entire regular expression.

func (re *Regexp) MatchReader(r io.RuneReader) bool

MatchReader reports whether the text returned by the RuneReader contains any match of the regular expression re.

func (re *Regexp) MatchString(s string) bool

MatchString reports whether the string s contains any match of the regular expression re.

func (re *Regexp) Match(b []byte) bool

Match reports whether the byte slice b contains any match of the regular expression re.

func (re *Regexp) ReplaceAllString(src, repl string) string

ReplaceAllString returns a copy of src, replacing matches of the Regexp with the replacement string repl. Inside repl, $ signs are interpreted as in Expand, so for instance $1 represents the text of the first submatch.

func (re *Regexp) ReplaceAllLiteralString(src, repl string) string

ReplaceAllLiteralString returns a copy of src, replacing matches of the Regexp with the replacement string repl. The replacement repl is substituted directly, without using Expand.

func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string

ReplaceAllStringFunc returns a copy of src in which all matches of the Regexp have been replaced by the return value of function repl applied to the matched substring. The replacement returned by repl is substituted directly, without using Expand.

func (re *Regexp) replaceAll(bsrc []byte, src string, nmatch int, repl func(dst []byte, m []int) []byte) []byte
func (re *Regexp) ReplaceAll(src, repl []byte) []byte

ReplaceAll returns a copy of src, replacing matches of the Regexp with the replacement text repl. Inside repl, $ signs are interpreted as in Expand, so for instance $1 represents the text of the first submatch.

func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte

ReplaceAllLiteral returns a copy of src, replacing matches of the Regexp with the replacement bytes repl. The replacement repl is substituted directly, without using Expand.

func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte

ReplaceAllFunc returns a copy of src in which all matches of the Regexp have been replaced by the return value of function repl applied to the matched byte slice. The replacement returned by repl is substituted directly, without using Expand.

func (re *Regexp) pad(a []int) []int

The number of capture values in the program may correspond to fewer capturing expressions than are in the regexp. For example, "(a){0}" turns into an empty program, so the maximum capture in the program is 0 but we need to return an expression for \1. Pad appends -1s to the slice a as needed.

func (re *Regexp) allMatches(s string, b []byte, n int, deliver func([]int))

allMatches calls deliver at most n times with the location of successive matches in the input text. The input text is b if non-nil, otherwise s.

func (re *Regexp) Find(b []byte) []byte

Find returns a slice holding the text of the leftmost match in b of the regular expression. A return value of nil indicates no match.

func (re *Regexp) FindIndex(b []byte) (loc []int)

FindIndex returns a two-element slice of integers defining the location of the leftmost match in b of the regular expression. The match itself is at b[loc[0]:loc[1]]. A return value of nil indicates no match.

func (re *Regexp) FindString(s string) string

FindString returns a string holding the text of the leftmost match in s of the regular expression. If there is no match, the return value is an empty string, but it will also be empty if the regular expression successfully matches an empty string. Use FindStringIndex or FindStringSubmatch if it is necessary to distinguish these cases.

func (re *Regexp) FindStringIndex(s string) (loc []int)

FindStringIndex returns a two-element slice of integers defining the location of the leftmost match in s of the regular expression. The match itself is at s[loc[0]:loc[1]]. A return value of nil indicates no match.

func (re *Regexp) FindReaderIndex(r io.RuneReader) (loc []int)

FindReaderIndex returns a two-element slice of integers defining the location of the leftmost match of the regular expression in text read from the RuneReader. The match text was found in the input stream at byte offset loc[0] through loc[1]-1. A return value of nil indicates no match.

func (re *Regexp) FindSubmatch(b []byte) [][]byte

FindSubmatch returns a slice of slices holding the text of the leftmost match of the regular expression in b and the matches, if any, of its subexpressions, as defined by the 'Submatch' descriptions in the package comment. A return value of nil indicates no match.

func (re *Regexp) Expand(dst []byte, template []byte, src []byte, match []int) []byte

Expand appends template to dst and returns the result; during the append, Expand replaces variables in the template with corresponding matches drawn from src. The match slice should have been returned by FindSubmatchIndex.

In the template, a variable is denoted by a substring of the form $name or ${name}, where name is a non-empty sequence of letters, digits, and underscores. A purely numeric name like $1 refers to the submatch with the corresponding index; other names refer to capturing parentheses named with the (?P...) syntax. A reference to an out of range or unmatched index or a name that is not present in the regular expression is replaced with an empty slice.

In the $name form, name is taken to be as long as possible: $1x is equivalent to ${1x}, not ${1}x, and, $10 is equivalent to ${10}, not ${1}0.

To insert a literal $ in the output, use $$ in the template.

func (re *Regexp) ExpandString(dst []byte, template string, src string, match []int) []byte

ExpandString is like Expand but the template and source are strings. It appends to and returns a byte slice in order to give the calling code control over allocation.

func (re *Regexp) expand(dst []byte, template string, bsrc []byte, src string, match []int) []byte
func (re *Regexp) FindSubmatchIndex(b []byte) []int

FindSubmatchIndex returns a slice holding the index pairs identifying the leftmost match of the regular expression in b and the matches, if any, of its subexpressions, as defined by the 'Submatch' and 'Index' descriptions in the package comment. A return value of nil indicates no match.

func (re *Regexp) FindStringSubmatch(s string) []string

FindStringSubmatch returns a slice of strings holding the text of the leftmost match of the regular expression in s and the matches, if any, of its subexpressions, as defined by the 'Submatch' description in the package comment. A return value of nil indicates no match.

func (re *Regexp) FindStringSubmatchIndex(s string) []int

FindStringSubmatchIndex returns a slice holding the index pairs identifying the leftmost match of the regular expression in s and the matches, if any, of its subexpressions, as defined by the 'Submatch' and 'Index' descriptions in the package comment. A return value of nil indicates no match.

func (re *Regexp) FindReaderSubmatchIndex(r io.RuneReader) []int

FindReaderSubmatchIndex returns a slice holding the index pairs identifying the leftmost match of the regular expression of text read by the RuneReader, and the matches, if any, of its subexpressions, as defined by the 'Submatch' and 'Index' descriptions in the package comment. A return value of nil indicates no match.

func (re *Regexp) FindAll(b []byte, n int) [][]byte

FindAll is the 'All' version of Find; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.

func (re *Regexp) FindAllIndex(b []byte, n int) [][]int

FindAllIndex is the 'All' version of FindIndex; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.

func (re *Regexp) FindAllString(s string, n int) []string

FindAllString is the 'All' version of FindString; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.

func (re *Regexp) FindAllStringIndex(s string, n int) [][]int

FindAllStringIndex is the 'All' version of FindStringIndex; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.

func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte

FindAllSubmatch is the 'All' version of FindSubmatch; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.

func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int

FindAllSubmatchIndex is the 'All' version of FindSubmatchIndex; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.

func (re *Regexp) FindAllStringSubmatch(s string, n int) [][]string

FindAllStringSubmatch is the 'All' version of FindStringSubmatch; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.

func (re *Regexp) FindAllStringSubmatchIndex(s string, n int) [][]int

FindAllStringSubmatchIndex is the 'All' version of FindStringSubmatchIndex; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.

func (re *Regexp) Split(s string, n int) []string

Split slices s into substrings separated by the expression and returns a slice of the substrings between those expression matches.

The slice returned by this method consists of all the substrings of s not contained in the slice returned by FindAllString. When called on an expression that contains no metacharacters, it is equivalent to strings.SplitN.

Example:

s := regexp.MustCompile("a*").Split("abaabaccadaaae", 5)
// s: ["", "b", "b", "c", "cadaaae"]

The count determines the number of substrings to return:

n > 0: at most n substrings; the last substring will be the unsplit remainder.
n == 0: the result is nil (zero substrings)
n < 0: all substrings

type input interface {
	step(pos int) (r rune, width int) // advance one rune
	canCheckPrefix() bool             // can we look ahead without losing info?
	hasPrefix(re *Regexp) bool
	index(re *Regexp, pos int) int
	context(pos int) lazyFlag
}

input abstracts different representations of the input text. It provides one-character lookahead.

type inputString struct {
	str string
}

inputString scans a string.

func (i *inputString) step(pos int) (rune, int)
func (i *inputString) canCheckPrefix() bool
func (i *inputString) hasPrefix(re *Regexp) bool
func (i *inputString) index(re *Regexp, pos int) int
func (i *inputString) context(pos int) lazyFlag
type inputBytes struct {
	str []byte
}

inputBytes scans a byte slice.

func (i *inputBytes) step(pos int) (rune, int)
func (i *inputBytes) canCheckPrefix() bool
func (i *inputBytes) hasPrefix(re *Regexp) bool
func (i *inputBytes) index(re *Regexp, pos int) int
func (i *inputBytes) context(pos int) lazyFlag
type inputReader struct {
	r     io.RuneReader
	atEOT bool
	pos   int
}

inputReader scans a RuneReader.

func (i *inputReader) step(pos int) (rune, int)
func (i *inputReader) canCheckPrefix() bool
func (i *inputReader) hasPrefix(re *Regexp) bool
func (i *inputReader) index(re *Regexp, pos int) int
func (i *inputReader) context(pos int) lazyFlag
type stringError struct {
	re  string
	err string
}
type ReplaceTest struct {
	pattern, replacement, input, output string
}
type ReplaceFuncTest struct {
	pattern       string
	replacement   func(string) string
	input, output string
}
type MetaTest struct {
	pattern, output, literal string
	isLiteral                bool
}
type subexpIndex struct {
	name  string
	index int
}
type subexpCase struct {
	input   string
	num     int
	names   []string
	indices []subexpIndex
}
type FindTest struct {
	pat     string
	text    string
	matches [][]int
}

For each pattern/text pair, what is the expected output of each function? We can derive the textual results from the indexed results, the non-submatch results from the submatched results, the single results from the 'all' results, and the byte results from the string results. Therefore the table includes only the FindAllStringSubmatchIndex result.

func (t FindTest) String() string
func freeBitState(b *bitState)
func maxBitStateLen(prog *syntax.Prog) int

maxBitStateLen returns the maximum length of a string to search with the backtracker using prog.

func shouldBacktrack(prog *syntax.Prog) bool

shouldBacktrack reports whether the program is too long for the backtracker to run.

func freeOnePassMachine(m *onePassMachine)
func onePassPrefix(p *syntax.Prog) (prefix string, complete bool, pc uint32)

OnePassPrefix returns a literal string that all matches for the regexp must start with. Complete is true if the prefix is the entire match. Pc is the index of the last rune instruction in the string. The OnePassPrefix skips over the mandatory EmptyBeginText

func onePassNext(i *onePassInst, r rune) uint32

OnePassNext selects the next actionable state of the prog, based on the input character. It should only be called when i.Op == InstAlt or InstAltMatch, and from the one-pass machine. One of the alternates may ultimately lead without input to end of line. If the instruction is InstAltMatch the path to the InstMatch is in i.Out, the normal node in i.Next.

func iop(i *syntax.Inst) syntax.InstOp
func mergeRuneSets(leftRunes, rightRunes *[]rune, leftPC, rightPC uint32) ([]rune, []uint32)
func cleanupOnePass(prog *onePassProg, original *syntax.Prog)

cleanupOnePass drops working memory, and restores certain shortcut instructions.

func minInputLen(re *syntax.Regexp) int

minInputLen walks the regexp to find the minimum length of any matchable input

func quote(s string) string
func MatchReader(pattern string, r io.RuneReader) (matched bool, err error)

MatchReader reports whether the text returned by the RuneReader contains any match of the regular expression pattern. More complicated queries need to use Compile and the full Regexp interface.

func MatchString(pattern string, s string) (matched bool, err error)

MatchString reports whether the string s contains any match of the regular expression pattern. More complicated queries need to use Compile and the full Regexp interface.

func Match(pattern string, b []byte) (matched bool, err error)

Match reports whether the byte slice b contains any match of the regular expression pattern. More complicated queries need to use Compile and the full Regexp interface.

func special(b byte) bool

special reports whether byte b needs to be escaped by QuoteMeta.

func init()
func QuoteMeta(s string) string

QuoteMeta returns a string that escapes all regular expression metacharacters inside the argument text; the returned string is a regular expression matching the literal text.

func extract(str string) (name string, num int, rest string, ok bool)

extract returns the name from a leading "$name" or "${name}" in str. If it is a number, extract returns num set to that number; otherwise num = -1.

func TestGoodCompile(t *testing.T)
func TestBadCompile(t *testing.T)
func matchTest(t *testing.T, test *FindTest)
func TestMatch(t *testing.T)
func matchFunctionTest(t *testing.T, test *FindTest)
func TestMatchFunction(t *testing.T)
func copyMatchTest(t *testing.T, test *FindTest)
func TestCopyMatch(t *testing.T)
func TestReplaceAll(t *testing.T)
func TestReplaceAllLiteral(t *testing.T)
func TestReplaceAllFunc(t *testing.T)
func TestQuoteMeta(t *testing.T)
func TestLiteralPrefix(t *testing.T)
func TestSubexp(t *testing.T)
func TestSplit(t *testing.T)
func TestParseAndCompile(t *testing.T)

The following sequence of Match calls used to panic. See issue #12980.

func TestOnePassCutoff(t *testing.T)

Check that one-pass cutoff does trigger.

func TestSwitchBacktrack(t *testing.T)

Check that the same machine can be used with the standard matcher and then the backtracker when there are no captures.

func BenchmarkFind(b *testing.B)
func BenchmarkFindAllNoMatches(b *testing.B)
func BenchmarkFindString(b *testing.B)
func BenchmarkFindSubmatch(b *testing.B)
func BenchmarkFindStringSubmatch(b *testing.B)
func BenchmarkLiteral(b *testing.B)
func BenchmarkNotLiteral(b *testing.B)
func BenchmarkMatchClass(b *testing.B)
func BenchmarkMatchClass_InRange(b *testing.B)
func BenchmarkReplaceAll(b *testing.B)
func BenchmarkAnchoredLiteralShortNonMatch(b *testing.B)
func BenchmarkAnchoredLiteralLongNonMatch(b *testing.B)
func BenchmarkAnchoredShortMatch(b *testing.B)
func BenchmarkAnchoredLongMatch(b *testing.B)
func BenchmarkOnePassShortA(b *testing.B)
func BenchmarkNotOnePassShortA(b *testing.B)
func BenchmarkOnePassShortB(b *testing.B)
func BenchmarkNotOnePassShortB(b *testing.B)
func BenchmarkOnePassLongPrefix(b *testing.B)
func BenchmarkOnePassLongNotPrefix(b *testing.B)
func BenchmarkMatchParallelShared(b *testing.B)
func BenchmarkMatchParallelCopied(b *testing.B)
func BenchmarkQuoteMetaAll(b *testing.B)
func BenchmarkQuoteMetaNone(b *testing.B)
func BenchmarkCompile(b *testing.B)
func TestDeepEqual(t *testing.T)
func TestMinInputLen(t *testing.T)
func TestRE2Exhaustive(t *testing.T)

This test is excluded when running under the race detector because it is a very expensive test and takes too long.

func TestRE2Search(t *testing.T)

TestRE2 tests this package's regexp API against test cases considered during RE2's exhaustive tests, which run all possible regexps over a given set of atoms and operators, up to a given complexity, over all possible strings over a given alphabet, up to a given size. Rather than try to link with RE2, we read a log file containing the test cases and the expected matches. The log file, re2-exhaustive.txt, is generated by running 'make log' in the open source RE2 distribution https://github.com/google/re2/.

The test file format is a sequence of stanzas like:

strings
"abc"
"123x"
regexps
"[a-z]+"
0-3;0-3
-;-
"([0-9])([0-9])([0-9])"
-;-
-;0-3 0-1 1-2 2-3

The stanza begins by defining a set of strings, quoted using Go double-quote syntax, one per line. Then the regexps section gives a sequence of regexps to run on the strings. In the block that follows a regexp, each line gives the semicolon-separated match results of running the regexp on the corresponding string. Each match result is either a single -, meaning no match, or a space-separated sequence of pairs giving the match and submatch indices. An unmatched subexpression formats its pair as a single - (not illustrated above). For now each regexp run produces two match results, one for a full match' that restricts the regexp to matching the entire string or nothing, and one for a partial match' that gives the leftmost first match found in the string.

Lines beginning with # are comments. Lines beginning with a capital letter are test names printed during RE2's test suite and are echoed into t but otherwise ignored.

At time of writing, re2-exhaustive.txt is 59 MB but compresses to 385 kB, so we store re2-exhaustive.txt.bz2 in the repository and decompress it on the fly.

func testRE2(t *testing.T, file string)
func runFull(re, refull *Regexp, text string) ([]int, string)
func runPartial(re, refull *Regexp, text string) ([]int, string)
func runFullLongest(re, refull *Regexp, text string) ([]int, string)
func runPartialLongest(re, refull *Regexp, text string) ([]int, string)
func matchFull(re, refull *Regexp, text string) (bool, string)
func matchPartial(re, refull *Regexp, text string) (bool, string)
func matchFullLongest(re, refull *Regexp, text string) (bool, string)
func matchPartialLongest(re, refull *Regexp, text string) (bool, string)
func isSingleBytes(s string) bool
func parseResult(t *testing.T, file string, lineno int, res string) []int
func same(x, y []int) bool
func TestFowler(t *testing.T)

TestFowler runs this package's regexp API against the POSIX regular expression tests collected by Glenn Fowler at http://www2.research.att.com/~astopen/testregex/testregex.html.

func testFowler(t *testing.T, file string)
func parseFowlerResult(s string) (ok, compiled, matched bool, pos []int)
func makeText(n int) []byte
func BenchmarkMatch(b *testing.B)
func BenchmarkMatch_onepass_regex(b *testing.B)
func TestLongest(t *testing.T)
func TestProgramTooLongForBacktrack(t *testing.T)

TestProgramTooLongForBacktrack tests that a regex which is too long for the backtracker still executes properly.

func build(n int, x ...int) [][]int

build is a helper to construct a [][]int by extracting n sequences from x. This represents n matches with len(x)/n submatches each.

func TestFind(t *testing.T)
func TestFindString(t *testing.T)
func testFindIndex(test *FindTest, result []int, t *testing.T)
func TestFindIndex(t *testing.T)
func TestFindStringIndex(t *testing.T)
func TestFindReaderIndex(t *testing.T)
func TestFindAll(t *testing.T)
func TestFindAllString(t *testing.T)
func testFindAllIndex(test *FindTest, result [][]int, t *testing.T)
func TestFindAllIndex(t *testing.T)
func TestFindAllStringIndex(t *testing.T)
func testSubmatchBytes(test *FindTest, n int, submatches []int, result [][]byte, t *testing.T)
func TestFindSubmatch(t *testing.T)
func testSubmatchString(test *FindTest, n int, submatches []int, result []string, t *testing.T)
func TestFindStringSubmatch(t *testing.T)
func testSubmatchIndices(test *FindTest, n int, expect, result []int, t *testing.T)
func testFindSubmatchIndex(test *FindTest, result []int, t *testing.T)
func TestFindSubmatchIndex(t *testing.T)
func TestFindStringSubmatchIndex(t *testing.T)
func TestFindReaderSubmatchIndex(t *testing.T)
func TestFindAllSubmatch(t *testing.T)
func TestFindAllStringSubmatch(t *testing.T)
func testFindAllSubmatchIndex(test *FindTest, result [][]int, t *testing.T)
func TestFindAllSubmatchIndex(t *testing.T)
func TestFindAllStringSubmatchIndex(t *testing.T)
func TestMergeRuneSet(t *testing.T)
func TestCompileOnePass(t *testing.T)
func TestRunOnePass(t *testing.T)

Package syntax parses regular expressions into parse trees and compiles parse trees into programs. Most clients of regular expressions will use the facilities of package regexp (such as Compile and Match) instead of this package.

hdr-SyntaxSyntax

The regular expression syntax understood by this package when parsing with the Perl flag is as follows. Parts of the syntax can be disabled by passing alternate flags to Parse.

Single characters:

.              any character, possibly including newline (flag s=true)
[xyz]          character class
[^xyz]         negated character class
\d             Perl character class
\D             negated Perl character class
[[:alpha:]]    ASCII character class
[[:^alpha:]]   negated ASCII character class
\pN            Unicode character class (one-letter name)
\p{Greek}      Unicode character class
\PN            negated Unicode character class (one-letter name)
\P{Greek}      negated Unicode character class

Composites:

xy             x followed by y
x|y            x or y (prefer x)

Repetitions:

x*             zero or more x, prefer more
x+             one or more x, prefer more
x?             zero or one x, prefer one
x{n,m}         n or n+1 or ... or m x, prefer more
x{n,}          n or more x, prefer more
x{n}           exactly n x
x*?            zero or more x, prefer fewer
x+?            one or more x, prefer fewer
x??            zero or one x, prefer zero
x{n,m}?        n or n+1 or ... or m x, prefer fewer
x{n,}?         n or more x, prefer fewer
x{n}?          exactly n x

Implementation restriction: The counting forms x{n,m}, x{n,}, and x{n} reject forms that create a minimum or maximum repetition count above 1000. Unlimited repetitions are not subject to this restriction.

Grouping:

(re)           numbered capturing group (submatch)
(?P<name>re)   named & numbered capturing group (submatch)
(?:re)         non-capturing group
(?flags)       set flags within current group; non-capturing
(?flags:re)    set flags during re; non-capturing

Flag syntax is xyz (set) or -xyz (clear) or xy-z (set xy, clear z). The flags are:

i              case-insensitive (default false)
m              multi-line mode: ^ and $ match begin/end line in addition to begin/end text (default false)
s              let . match \n (default false)
U              ungreedy: swap meaning of x* and x*?, x+ and x+?, etc (default false)

Empty strings:

^              at beginning of text or line (flag m=true)
$              at end of text (like \z not \Z) or line (flag m=true)
\A             at beginning of text
\b             at ASCII word boundary (\w on one side and \W, \A, or \z on the other)
\B             not at ASCII word boundary
\z             at end of text

Escape sequences:

\a             bell (== \007)
\f             form feed (== \014)
\t             horizontal tab (== \011)
\n             newline (== \012)
\r             carriage return (== \015)
\v             vertical tab character (== \013)
\*             literal *, for any punctuation character *
\123           octal character code (up to three digits)
\x7F           hex character code (exactly two digits)
\x{10FFFF}     hex character code
\Q...\E        literal text ... even if ... has punctuation

Character class elements:

x              single character
A-Z            character range (inclusive)
\d             Perl character class
[:foo:]        ASCII character class foo
\p{Foo}        Unicode character class Foo
\pF            Unicode character class F (one-letter name)

Named character classes as character class elements:

[\d]           digits (== \d)
[^\d]          not digits (== \D)
[\D]           not digits (== \D)
[^\D]          not not digits (== \d)
[[:name:]]     named ASCII class inside character class (== [:name:])
[^[:name:]]    named ASCII class inside negated character class (== [:^name:])
[\p{Name}]     named Unicode property inside character class (== \p{Name})
[^\p{Name}]    named Unicode property inside negated character class (== \P{Name})

Perl character classes (all ASCII-only):

\d             digits (== [0-9])
\D             not digits (== [^0-9])
\s             whitespace (== [\t\n\f\r ])
\S             not whitespace (== [^\t\n\f\r ])
\w             word characters (== [0-9A-Za-z_])
\W             not word characters (== [^0-9A-Za-z_])

ASCII character classes:

[[:alnum:]]    alphanumeric (== [0-9A-Za-z])
[[:alpha:]]    alphabetic (== [A-Za-z])
[[:ascii:]]    ASCII (== [\x00-\x7F])
[[:blank:]]    blank (== [\t ])
[[:cntrl:]]    control (== [\x00-\x1F\x7F])
[[:digit:]]    digits (== [0-9])
[[:graph:]]    graphical (== [!-~] == [A-Za-z0-9!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~])
[[:lower:]]    lower case (== [a-z])
[[:print:]]    printable (== [ -~] == [ [:graph:]])
[[:punct:]]    punctuation (== [!-/:-@[-`{-~])
[[:space:]]    whitespace (== [\t\n\v\f\r ])
[[:upper:]]    upper case (== [A-Z])
[[:word:]]     word characters (== [0-9A-Za-z_])
[[:xdigit:]]   hex digit (== [0-9A-Fa-f])

Unicode character classes are those in unicode.Categories and unicode.Scripts.

Index

const _Op_name_0 = "NoMatchEmptyMatchLiteralCharClassAnyCharNotNLAnyCharBeginLineEndLineBeginTextEndTextWordBoundaryNoWordBoundaryCaptureStarPlusQuestRepeatConcatAlternate"
const _Op_name_1 = "opPseudo"
const ErrInternalError ErrorCode = "regexp/syntax: internal error"

Unexpected error

const ErrInvalidCharClass ErrorCode = "invalid character class"

Parse errors

const ErrInvalidCharRange ErrorCode = "invalid character class range"
const ErrInvalidEscape ErrorCode = "invalid escape sequence"
const ErrInvalidNamedCapture ErrorCode = "invalid named capture"
const ErrInvalidPerlOp ErrorCode = "invalid or unsupported Perl syntax"
const ErrInvalidRepeatOp ErrorCode = "invalid nested repetition operator"
const ErrInvalidRepeatSize ErrorCode = "invalid repeat count"
const ErrInvalidUTF8 ErrorCode = "invalid UTF-8"
const ErrMissingBracket ErrorCode = "missing closing ]"
const ErrMissingParen ErrorCode = "missing closing )"
const ErrMissingRepeatArgument ErrorCode = "missing argument to repetition operator"
const ErrTrailingBackslash ErrorCode = "trailing backslash at end of expression"
const ErrUnexpectedParen ErrorCode = "unexpected )"
const FoldCase Flags = 1 << iota // case-insensitive match
const Literal // treat pattern as literal string
const ClassNL // allow character classes like [^a-z] and [[:space:]] to match newline
const DotNL // allow . to match newline
const OneLine // treat ^ and $ as only matching at beginning and end of text
const NonGreedy // make repetition operators default to non-greedy
const PerlX // allow Perl extensions
const UnicodeGroups // allow \p{Han}, \P{Han} for Unicode group and negation
const WasDollar // regexp OpEndText was $, not \z
const Simple // regexp contains no counted repetition
const MatchNL = ClassNL | DotNL
const Perl = ClassNL | OneLine | PerlX | UnicodeGroups // as close to Perl as possible
const POSIX Flags = 0 // POSIX syntax
const opLeftParen = opPseudo + iota

Pseudo-ops for parsing stack.

const opVerticalBar

Pseudo-ops for parsing stack.

const minFold = 0x0041

minimum and maximum runes involved in folding. checked during test.

const maxFold = 0x1e943
const InstAlt InstOp = iota
const InstAltMatch
const InstCapture
const InstEmptyWidth
const InstMatch
const InstFail
const InstNop
const InstRune
const InstRune1
const InstRuneAny
const InstRuneAnyNotNL
const EmptyBeginLine EmptyOp = 1 << iota
const EmptyEndLine
const EmptyBeginText
const EmptyEndText
const EmptyWordBoundary
const EmptyNoWordBoundary
const noMatch = -1
const OpNoMatch Op = 1 + iota // matches no strings
const OpEmptyMatch // matches empty string
const OpLiteral // matches Runes sequence
const OpCharClass // matches Runes interpreted as range pair list
const OpAnyCharNotNL // matches any character except newline
const OpAnyChar // matches any character
const OpBeginLine // matches empty string at beginning of line
const OpEndLine // matches empty string at end of line
const OpBeginText // matches empty string at beginning of text
const OpEndText // matches empty string at end of text
const OpWordBoundary // matches word boundary `\b`
const OpNoWordBoundary // matches word non-boundary `\B`
const OpCapture // capturing subexpression with index Cap, optional name Name
const OpStar // matches Sub[0] zero or more times
const OpPlus // matches Sub[0] one or more times
const OpQuest // matches Sub[0] zero or one times
const OpRepeat // matches Sub[0] at least Min times, at most Max (Max == -1 is no limit)
const OpConcat // matches concatenation of Subs
const OpAlternate // matches alternation of Subs
const opPseudo Op = 128 // where pseudo-ops start
const meta = `\.+*?()|[]{}^$`
const testFlags = MatchNL | PerlX | UnicodeGroups
var c compiler
var anyRuneNotNL = ...
var anyRune = ...
var f frag
var f frag
var f frag
var _Op_index_0 = ...
var str []rune

Round 1: Factor out common literal prefixes.

var strflags Flags
var istr []rune

Invariant: the Regexps that were in sub[0:start] have been used or marked for reuse, and the slice space has been reused for out (len(out) <= start).

Invariant: sub[start:i] consists of regexps that all begin with str as modified by strflags.

var iflags Flags
var first *Regexp
var ifirst *Regexp

Invariant: the Regexps that were in sub[0:start] have been used or marked for reuse, and the slice space has been reused for out (len(out) <= start).

Invariant: sub[start:i] consists of regexps that all begin with ifirst.

var p parser

Otherwise, must do real work.

var err error

Otherwise, must do real work.

var c rune

Otherwise, must do real work.

var op Op

Otherwise, must do real work.

var lastRepeat string

Otherwise, must do real work.

var lit string

\Q ... \E: the ... is always literals

var ok1 bool
var c rune

Non-capturing group. Might also twiddle Perl flags.

var anyTable = ...
var seq, name string
var seq, name string
var lo, hi rune
var lo, hi rune
var code1 = ...
var code2 = ...
var code3 = ...
var perlGroup = ...
var code4 = ...
var code5 = ...
var code6 = ...
var code7 = ...
var code8 = ...
var code9 = ...
var code10 = ...
var code11 = ...
var code12 = ...
var code13 = ...
var code14 = ...
var code15 = ...
var code16 = ...
var code17 = ...
var posixGroup = ...
var instOpNames = ...
var op EmptyOp = EmptyNoWordBoundary
var boundary byte
var b strings.Builder
var buf strings.Builder

Have prefix; gather characters.

var flag EmptyOp
var b strings.Builder
var b strings.Builder
var prefix *Regexp

Build leading prefix: xx.

var parseTests = ...
var foldcaseTests = ...
var literalTests = ...
var matchnlTests = ...
var nomatchnlTests = ...
var b strings.Builder
var opNames = ...
var r []rune

AppendRange should collapse each of the new ranges into the earlier ones (it looks back two ranges), so that the slice never grows very large. Note that we are not calling cleanClass.

var invalidRegexps = ...
var onlyPerl = ...
var onlyPOSIX = ...
var compileTests = ...
var r1 rune = -1
var simplifyTests = ...
type patchList struct {
	head, tail uint32
}

A patchList is a list of instruction pointers that need to be filled in (patched). Because the pointers haven't been filled in yet, we can reuse their storage to hold the list. It's kind of sleazy, but works well in practice. See https://swtch.com/~rsc/regexp/regexp1.html for inspiration.

These aren't really pointers: they're integers, so we can reinterpret them this way without using package unsafe. A value l.head denotes p.inst[l.head>>1].Out (l.head&1==0) or .Arg (l.head&1==1). head == 0 denotes the empty list, okay because we start every program with a fail instruction, so we'll never want to point at its output link.

func makePatchList(n uint32) patchList
func (l patchList) patch(p *Prog, val uint32)
func (l1 patchList) append(p *Prog, l2 patchList) patchList
type frag struct {
	i   uint32    // index of first instruction
	out patchList // where to record end instruction
}

A frag represents a compiled program fragment.

type compiler struct {
	p *Prog
}
func (c *compiler) init()
func (c *compiler) compile(re *Regexp) frag
func (c *compiler) inst(op InstOp) frag
func (c *compiler) nop() frag
func (c *compiler) fail() frag
func (c *compiler) cap(arg uint32) frag
func (c *compiler) cat(f1, f2 frag) frag
func (c *compiler) alt(f1, f2 frag) frag
func (c *compiler) quest(f1 frag, nongreedy bool) frag
func (c *compiler) star(f1 frag, nongreedy bool) frag
func (c *compiler) plus(f1 frag, nongreedy bool) frag
func (c *compiler) empty(op EmptyOp) frag
func (c *compiler) rune(r []rune, flags Flags) frag
type Error struct {
	Code ErrorCode
	Expr string
}

An Error describes a failure to parse a regular expression and gives the offending expression.

func (e *Error) Error() string
type ErrorCode string

An ErrorCode describes a failure to parse a regular expression.

func (e ErrorCode) String() string
type Flags uint16

Flags control the behavior of the parser and record information about regexp context.

type parser struct {
	flags       Flags     // parse mode flags
	stack       []*Regexp // stack of parsed expressions
	free        *Regexp
	numCap      int // number of capturing groups seen
	wholeRegexp string
	tmpClass    []rune // temporary char class work space
}
func (p *parser) newRegexp(op Op) *Regexp
func (p *parser) reuse(re *Regexp)
func (p *parser) push(re *Regexp) *Regexp

push pushes the regexp re onto the parse stack and returns the regexp.

func (p *parser) maybeConcat(r rune, flags Flags) bool

maybeConcat implements incremental concatenation of literal runes into string nodes. The parser calls this before each push, so only the top fragment of the stack might need processing. Since this is called before a push, the topmost literal is no longer subject to operators like * (Otherwise ab* would turn into (ab)*.) If r >= 0 and there's a node left over, maybeConcat uses it to push r with the given flags. maybeConcat reports whether r was pushed.

func (p *parser) literal(r rune)

literal pushes a literal regexp for the rune r on the stack.

func (p *parser) op(op Op) *Regexp

op pushes a regexp with the given op onto the stack and returns that regexp.

func (p *parser) repeat(op Op, min, max int, before, after, lastRepeat string) (string, error)

repeat replaces the top stack element with itself repeated according to op, min, max. before is the regexp suffix starting at the repetition operator. after is the regexp suffix following after the repetition operator. repeat returns an updated 'after' and an error, if any.

func (p *parser) concat() *Regexp

concat replaces the top of the stack (above the topmost '|' or '(') with its concatenation.

func (p *parser) alternate() *Regexp

alternate replaces the top of the stack (above the topmost '(') with its alternation.

func (p *parser) collapse(subs []*Regexp, op Op) *Regexp

collapse returns the result of applying op to sub. If sub contains op nodes, they all get hoisted up so that there is never a concat of a concat or an alternate of an alternate.

func (p *parser) factor(sub []*Regexp) []*Regexp

factor factors common prefixes from the alternation list sub. It returns a replacement list that reuses the same storage and frees (passes to p.reuse) any removed *Regexps.

For example,

ABC|ABD|AEF|BCX|BCY

simplifies by literal prefix extraction to

A(B(C|D)|EF)|BC(X|Y)

which simplifies by character class introduction to

A(B[CD]|EF)|BC[XY]

func (p *parser) leadingString(re *Regexp) ([]rune, Flags)

leadingString returns the leading literal string that re begins with. The string refers to storage in re or its children.

func (p *parser) removeLeadingString(re *Regexp, n int) *Regexp

removeLeadingString removes the first n leading runes from the beginning of re. It returns the replacement for re.

func (p *parser) leadingRegexp(re *Regexp) *Regexp

leadingRegexp returns the leading regexp that re begins with. The regexp refers to storage in re or its children.

func (p *parser) removeLeadingRegexp(re *Regexp, reuse bool) *Regexp

removeLeadingRegexp removes the leading regexp in re. It returns the replacement for re. If reuse is true, it passes the removed regexp (if no longer needed) to p.reuse.

func (p *parser) parseRepeat(s string) (min, max int, rest string, ok bool)

parseRepeat parses {min} (max=min) or {min,} (max=-1) or {min,max}. If s is not of that form, it returns ok == false. If s has the right form but the values are too big, it returns min == -1, ok == true.

func (p *parser) parsePerlFlags(s string) (rest string, err error)

parsePerlFlags parses a Perl flag setting or non-capturing group or both, like (?i) or (?: or (?i:. It removes the prefix from s and updates the parse state. The caller must have ensured that s begins with "(?".

func (p *parser) parseInt(s string) (n int, rest string, ok bool)

parseInt parses a decimal integer.

func (p *parser) parseVerticalBar() error

parseVerticalBar handles a | in the input.

func (p *parser) swapVerticalBar() bool

If the top of the stack is an element followed by an opVerticalBar swapVerticalBar swaps the two and returns true. Otherwise it returns false.

func (p *parser) parseRightParen() error

parseRightParen handles a ) in the input.

func (p *parser) parseEscape(s string) (r rune, rest string, err error)

parseEscape parses an escape sequence at the beginning of s and returns the rune.

func (p *parser) parseClassChar(s, wholeClass string) (r rune, rest string, err error)

parseClassChar parses a character class character at the beginning of s and returns it.

func (p *parser) parsePerlClassEscape(s string, r []rune) (out []rune, rest string)

parsePerlClassEscape parses a leading Perl character class escape like \d from the beginning of s. If one is present, it appends the characters to r and returns the new slice r and the remainder of the string.

func (p *parser) parseNamedClass(s string, r []rune) (out []rune, rest string, err error)

parseNamedClass parses a leading POSIX named character class like [:alnum:] from the beginning of s. If one is present, it appends the characters to r and returns the new slice r and the remainder of the string.

func (p *parser) appendGroup(r []rune, g charGroup) []rune
func (p *parser) parseUnicodeClass(s string, r []rune) (out []rune, rest string, err error)

parseUnicodeClass parses a leading Unicode character class like \p{Han} from the beginning of s. If one is present, it appends the characters to r and returns the new slice r and the remainder of the string.

func (p *parser) parseClass(s string) (rest string, err error)

parseClass parses a character class at the beginning of s and pushes it onto the parse stack.

type charGroup struct {
	sign  int
	class []rune
}
type ranges struct {
	p *[]rune
}

ranges implements sort.Interface on a []rune. The choice of receiver type definition is strange but avoids an allocation since we already have a *[]rune.

func (ra ranges) Less(i, j int) bool
func (ra ranges) Len() int
func (ra ranges) Swap(i, j int)
type Prog struct {
	Inst   []Inst
	Start  int // index of start instruction
	NumCap int // number of InstCapture insts in re
}

A Prog is a compiled regular expression program.

func Compile(re *Regexp) (*Prog, error)

Compile compiles the regexp into a program to be executed. The regexp should have been simplified already (returned from re.Simplify).

func (p *Prog) String() string
func (p *Prog) skipNop(pc uint32) *Inst

skipNop follows any no-op or capturing instructions.

func (p *Prog) Prefix() (prefix string, complete bool)

Prefix returns a literal string that all matches for the regexp must start with. Complete is true if the prefix is the entire match.

func (p *Prog) StartCond() EmptyOp

StartCond returns the leading empty-width conditions that must be true in any match. It returns ^EmptyOp(0) if no matches are possible.

type InstOp uint8

An InstOp is an instruction opcode.

func (i InstOp) String() string
type EmptyOp uint8

An EmptyOp specifies a kind or mixture of zero-width assertions.

func EmptyOpContext(r1, r2 rune) EmptyOp

EmptyOpContext returns the zero-width assertions satisfied at the position between the runes r1 and r2. Passing r1 == -1 indicates that the position is at the beginning of the text. Passing r2 == -1 indicates that the position is at the end of the text.

type Inst struct {
	Op   InstOp
	Out  uint32 // all but InstMatch, InstFail
	Arg  uint32 // InstAlt, InstAltMatch, InstCapture, InstEmptyWidth
	Rune []rune
}

An Inst is a single instruction in a regular expression program.

func (i *Inst) op() InstOp

op returns i.Op but merges all the Rune special cases into InstRune

func (i *Inst) MatchRune(r rune) bool

MatchRune reports whether the instruction matches (and consumes) r. It should only be called when i.Op == InstRune.

func (i *Inst) MatchRunePos(r rune) int

MatchRunePos checks whether the instruction matches (and consumes) r. If so, MatchRunePos returns the index of the matching rune pair (or, when len(i.Rune) == 1, rune singleton). If not, MatchRunePos returns -1. MatchRunePos should only be called when i.Op == InstRune.

func (i *Inst) MatchEmptyWidth(before rune, after rune) bool

MatchEmptyWidth reports whether the instruction matches an empty string between the runes before and after. It should only be called when i.Op == InstEmptyWidth.

func (i *Inst) String() string
type Regexp struct {
	Op       Op // operator
	Flags    Flags
	Sub      []*Regexp  // subexpressions, if any
	Sub0     [1]*Regexp // storage for short Sub
	Rune     []rune     // matched runes, for OpLiteral, OpCharClass
	Rune0    [2]rune    // storage for short Rune
	Min, Max int        // min, max for OpRepeat
	Cap      int        // capturing index, for OpCapture
	Name     string     // capturing name, for OpCapture
}

A Regexp is a node in a regular expression syntax tree.

func literalRegexp(s string, flags Flags) *Regexp
func Parse(s string, flags Flags) (*Regexp, error)

Parse parses a regular expression string s, controlled by the specified Flags, and returns a regular expression parse tree. The syntax is described in the top-level comment.

func simplify1(op Op, flags Flags, sub, re *Regexp) *Regexp

simplify1 implements Simplify for the unary OpStar, OpPlus, and OpQuest operators. It returns the simple regexp equivalent to

Regexp{Op: op, Flags: flags, Sub: {sub}}

under the assumption that sub is already simple, and without first allocating that structure. If the regexp to be returned turns out to be equivalent to re, simplify1 returns re instead.

simplify1 is factored out of Simplify because the implementation for other operators generates these unary expressions. Letting them call simplify1 makes sure the expressions they generate are simple.

func (x *Regexp) Equal(y *Regexp) bool

Equal reports whether x and y have identical structure.

func (re *Regexp) String() string
func (re *Regexp) MaxCap() int

MaxCap walks the regexp to find the maximum capture index.

func (re *Regexp) CapNames() []string

CapNames walks the regexp to find the names of capturing groups.

func (re *Regexp) capNames(names []string)
func (re *Regexp) Simplify() *Regexp

Simplify returns a regexp equivalent to re but without counted repetitions and with various other simplifications, such as rewriting /(?:a+)+/ to /a+/. The resulting regexp will execute correctly but its string representation will not produce the same parse tree, because capturing parentheses may have been duplicated or removed. For example, the simplified form for /(x){1,2}/ is /(x)(x)?/ but both parentheses capture as $1. The returned regexp may share structure with or be the original.

type Op uint8

An Op is a single regular expression operator.

func (i Op) String() string
type parseTest struct {
	Regexp string
	Dump   string
}
func minFoldRune(r rune) rune

minFoldRune returns the minimum rune fold-equivalent to r.

func repeatIsValid(re *Regexp, n int) bool

repeatIsValid reports whether the repetition re is valid. Valid means that the combination of the top-level repetition and any inner repetitions does not exceed n copies of the innermost thing. This function rewalks the regexp tree and is called for every repetition, so we have to worry about inducing quadratic behavior in the parser. We avoid this by only calling repeatIsValid when min or max >= 2. In that case the depth of any >= 2 nesting can only get to 9 without triggering a parse error, so each subtree can only be rewalked 9 times.

func cleanAlt(re *Regexp)

cleanAlt cleans re for eventual inclusion in an alternation.

func isValidCaptureName(name string) bool

isValidCaptureName reports whether name is a valid capture name: [A-Za-z0-9_]+. PCRE limits names to 32 bytes. Python rejects names starting with digits. We don't enforce either of those.

func isCharClass(re *Regexp) bool

can this be represented as a character class? single-rune literal string, char class, ., and .|\n.

func matchRune(re *Regexp, r rune) bool

does re match r?

func mergeCharClass(dst, src *Regexp)

mergeCharClass makes dst = dst|src. The caller must ensure that dst.Op >= src.Op, to reduce the amount of copying.

func unicodeTable(name string) (*unicode.RangeTable, *unicode.RangeTable)

unicodeTable returns the unicode.RangeTable identified by name and the table of additional fold-equivalent code points.

func cleanClass(rp *[]rune) []rune

cleanClass sorts the ranges (pairs of elements of r), merges them, and eliminates duplicates.

func appendLiteral(r []rune, x rune, flags Flags) []rune

appendLiteral returns the result of appending the literal x to the class r.

func appendRange(r []rune, lo, hi rune) []rune

appendRange returns the result of appending the range lo-hi to the class r.

func appendFoldedRange(r []rune, lo, hi rune) []rune

appendFoldedRange returns the result of appending the range lo-hi and its case folding-equivalent runes to the class r.

func appendClass(r []rune, x []rune) []rune

appendClass returns the result of appending the class x to the class r. It assume x is clean.

func appendFoldedClass(r []rune, x []rune) []rune

appendFolded returns the result of appending the case folding of the class x to the class r.

func appendNegatedClass(r []rune, x []rune) []rune

appendNegatedClass returns the result of appending the negation of the class x to the class r. It assumes x is clean.

func appendTable(r []rune, x *unicode.RangeTable) []rune

appendTable returns the result of appending x to the class r.

func appendNegatedTable(r []rune, x *unicode.RangeTable) []rune

appendNegatedTable returns the result of appending the negation of x to the class r.

func negateClass(r []rune) []rune

negateClass overwrites r and returns r's negation. It assumes the class r is already clean.

func checkUTF8(s string) error
func nextRune(s string) (c rune, t string, err error)
func isalnum(c rune) bool
func unhex(c rune) rune
func IsWordChar(r rune) bool

IsWordChar reports whether r is consider a `word character' during the evaluation of the \b and \B zero-width assertions. These assertions are ASCII-only: the word characters are [A-Za-z0-9_].

func bw(b *strings.Builder, args ...string)
func dumpProg(b *strings.Builder, p *Prog)
func u32(i uint32) string
func dumpInst(b *strings.Builder, i *Inst)
func writeRegexp(b *strings.Builder, re *Regexp)

writeRegexp writes the Perl syntax for the regular expression re to b.

func escape(b *strings.Builder, r rune, force bool)
func TestParseSimple(t *testing.T)
func TestParseFoldCase(t *testing.T)
func TestParseLiteral(t *testing.T)
func TestParseMatchNL(t *testing.T)
func TestParseNoMatchNL(t *testing.T)
func testParseDump(t *testing.T, tests []parseTest, flags Flags)

Test Parse -> Dump.

func dump(re *Regexp) string

dump prints a string representation of the regexp showing the structure explicitly.

func dumpRegexp(b *strings.Builder, re *Regexp)

dumpRegexp writes an encoding of the syntax tree for the regexp re to b. It is used during testing to distinguish between parses that might print the same using re's String method.

func mkCharClass(f func(rune) bool) string
func isUpperFold(r rune) bool
func TestFoldConstants(t *testing.T)
func TestAppendRangeCollapse(t *testing.T)
func TestParseInvalidRegexps(t *testing.T)
func TestToStringEquivalentParse(t *testing.T)
func TestCompile(t *testing.T)
func BenchmarkEmptyOpContext(b *testing.B)
func TestSimplify(t *testing.T)

Index

var validID = regexp.MustCompile(`^[a-z]+\[[0-9]+\]$`)

Compile the expression once, usually at init time. Use raw strings to avoid having to quote the backslashes.

func Example()
func ExampleMatch()
func ExampleMatchString()
func ExampleQuoteMeta()
func ExampleRegexp_Find()
func ExampleRegexp_FindAll()
func ExampleRegexp_FindAllSubmatch()
func ExampleRegexp_FindSubmatch()
func ExampleRegexp_Match()
func ExampleRegexp_FindString()
func ExampleRegexp_FindStringIndex()
func ExampleRegexp_FindStringSubmatch()
func ExampleRegexp_FindAllString()
func ExampleRegexp_FindAllStringSubmatch()
func ExampleRegexp_FindAllStringSubmatchIndex()
func ExampleRegexp_FindSubmatchIndex()
func ExampleRegexp_Longest()
func ExampleRegexp_MatchString()
func ExampleRegexp_NumSubexp()
func ExampleRegexp_ReplaceAll()
func ExampleRegexp_ReplaceAllLiteralString()
func ExampleRegexp_ReplaceAllString()
func ExampleRegexp_ReplaceAllStringFunc()
func ExampleRegexp_SubexpNames()
func ExampleRegexp_SubexpIndex()
func ExampleRegexp_Split()
func ExampleRegexp_Expand()
func ExampleRegexp_ExpandString()
func ExampleRegexp_FindIndex()
func ExampleRegexp_FindAllSubmatchIndex()
func ExampleRegexp_FindAllIndex()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment