Package regexp implements regular expression search.
The syntax of the regular expressions accepted is the same general syntax used by Perl, Python, and other languages. More precisely, it is the syntax accepted by RE2 and described at https://golang.org/s/re2syntax, except for \C. For an overview of the syntax, run
go doc regexp/syntax
The regexp implementation provided by this package is guaranteed to run in time linear in the size of the input. (This is a property not guaranteed by most open source implementations of regular expressions.) For more information about this property, see
[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)
or any book about automata theory.
All characters are UTF-8-encoded code points.
There are 16 methods of Regexp that match a regular expression and identify the matched text. Their names are matched by this regular expression:
Find(All)?(String)?(Submatch)?(Index)?
If 'All' is present, the routine matches successive non-overlapping matches of the entire expression. Empty matches abutting a preceding match are ignored. The return value is a slice containing the successive return values of the corresponding non-'All' routine. These routines take an extra integer argument, n. If n >= 0, the function returns at most n matches/submatches; otherwise, it returns all of them.
If 'String' is present, the argument is a string; otherwise it is a slice of bytes; return values are adjusted as appropriate.
If 'Submatch' is present, the return value is a slice identifying the successive submatches of the expression. Submatches are matches of parenthesized subexpressions (also known as capturing groups) within the regular expression, numbered from left to right in order of opening parenthesis. Submatch 0 is the match of the entire expression, submatch 1 the match of the first parenthesized subexpression, and so on.
If 'Index' is present, matches and submatches are identified by byte index pairs within the input string: result[2n:2n+1] identifies the indexes of the nth submatch. The pair for n==0 identifies the match of the entire expression. If 'Index' is not present, the match is identified by the text of the match/submatch. If an index is negative or text is nil, it means that subexpression did not match any string in the input. For 'String' versions an empty string means either no match or an empty match.
There is also a subset of the methods that can be applied to text read from a RuneReader:
MatchReader, FindReaderIndex, FindReaderSubmatchIndex
This set may grow. Note that regular expression matches may need to examine text beyond the text returned by a match, so the methods that match text from a RuneReader may read arbitrarily far into the input before returning.
(There are a few other methods that do not match this pattern.)
- Constants
- Variables
- var bitStatePool
- var t
- var flag
- var onePassPool
- var flag
- var arrayNoInts
- var buf
- var noRune
- var noNext
- var lx
- var rx
- var anyRuneNotNL
- var anyRune
- var instQueue
- var visitQueue
- var check
- var onePassRunes
- var matchSize
- var matchPool
- var lnext
- var buf
- var endPos
- var dstCap
- var width
- var specialBytes
- var i
- var end
- var width
- var dstCap
- var dstCap
- var dstCap
- var dstCap
- var result
- var result
- var result
- var result
- var result
- var result
- var result
- var result
- var goodRe
- var badRe
- var replaceTests
- var replaceLiteralTests
- var replaceFuncTests
- var metaTests
- var literalPrefixTests
- var emptySubexpIndices
- var subexpCases
- var splitTests
- var sink
- var compileBenchData
- var minInputLenTests
- var txt
- var str
- var input
- var inStrings
- var re
- var refull
- var nfail
- var ncase
- var text
- var run
- var match
- var notab
- var x
- var end
- var v
- var err
- var text
- var benchData
- var benchSizes
- var findTests
- var runeMergeTests
- var onePassTests
- var p
- var re
- var err
- var onePassTests1
- Types
- type job struct
- type bitState struct
- type queue struct
- type entry struct
- type thread struct
- type machine struct
- func (m *machine) init(ncap int)
- func (m *machine) alloc(i *syntax.Inst) *thread
- func (m *machine) match(i input, pos int) bool
- func (m *machine) clear(q *queue)
- func (m *machine) step(runq, nextq *queue, pos, nextPos int, c rune, nextCond *lazyFlag)
- func (m *machine) add(q *queue, pc uint32, pos int, cap []int, cond *lazyFlag, t *thread) *thread
- type inputs struct
- type lazyFlag uint64
- type onePassMachine struct
- type onePassProg struct
- type onePassInst struct
- type queueOnePass struct
- type runeSlice []rune
- type Regexp struct
- func Compile(expr string) (*Regexp, error)
- func CompilePOSIX(expr string) (*Regexp, error)
- func compile(expr string, mode syntax.Flags, longest bool) (*Regexp, error)
- func MustCompile(str string) *Regexp
- func MustCompilePOSIX(str string) *Regexp
- func compileTest(t *testing.T, expr string, error string) *Regexp
- func tryCompile(s string) (re *Regexp, err error)
- func (re *Regexp) tryBacktrack(b *bitState, i input, pc uint32, pos int) bool
- func (re *Regexp) backtrack(ib []byte, is string, pos int, ncap int, dstCap []int) []int
- func (re *Regexp) doOnePass(ir io.RuneReader, ib []byte, is string, pos, ncap int, dstCap []int) []int
- func (re *Regexp) doMatch(r io.RuneReader, b []byte, s string) bool
- func (re *Regexp) doExecute(r io.RuneReader, b []byte, s string, pos int, ncap int, dstCap []int) []int
- func (re *Regexp) String() string
- func (re *Regexp) Copy() *Regexp
- func (re *Regexp) Longest()
- func (re *Regexp) get() *machine
- func (re *Regexp) put(m *machine)
- func (re *Regexp) NumSubexp() int
- func (re *Regexp) SubexpNames() []string
- func (re *Regexp) SubexpIndex(name string) int
- func (re *Regexp) LiteralPrefix() (prefix string, complete bool)
- func (re *Regexp) MatchReader(r io.RuneReader) bool
- func (re *Regexp) MatchString(s string) bool
- func (re *Regexp) Match(b []byte) bool
- func (re *Regexp) ReplaceAllString(src, repl string) string
- func (re *Regexp) ReplaceAllLiteralString(src, repl string) string
- func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string
- func (re *Regexp) replaceAll(bsrc []byte, src string, nmatch int, repl func(dst []byte, m []int) []byte) []byte
- func (re *Regexp) ReplaceAll(src, repl []byte) []byte
- func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte
- func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte
- func (re *Regexp) pad(a []int) []int
- func (re *Regexp) allMatches(s string, b []byte, n int, deliver func([]int))
- func (re *Regexp) Find(b []byte) []byte
- func (re *Regexp) FindIndex(b []byte) (loc []int)
- func (re *Regexp) FindString(s string) string
- func (re *Regexp) FindStringIndex(s string) (loc []int)
- func (re *Regexp) FindReaderIndex(r io.RuneReader) (loc []int)
- func (re *Regexp) FindSubmatch(b []byte) [][]byte
- func (re *Regexp) Expand(dst []byte, template []byte, src []byte, match []int) []byte
- func (re *Regexp) ExpandString(dst []byte, template string, src string, match []int) []byte
- func (re *Regexp) expand(dst []byte, template string, bsrc []byte, src string, match []int) []byte
- func (re *Regexp) FindSubmatchIndex(b []byte) []int
- func (re *Regexp) FindStringSubmatch(s string) []string
- func (re *Regexp) FindStringSubmatchIndex(s string) []int
- func (re *Regexp) FindReaderSubmatchIndex(r io.RuneReader) []int
- func (re *Regexp) FindAll(b []byte, n int) [][]byte
- func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
- func (re *Regexp) FindAllString(s string, n int) []string
- func (re *Regexp) FindAllStringIndex(s string, n int) [][]int
- func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
- func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
- func (re *Regexp) FindAllStringSubmatch(s string, n int) [][]string
- func (re *Regexp) FindAllStringSubmatchIndex(s string, n int) [][]int
- func (re *Regexp) Split(s string, n int) []string
- type input interface
- type inputString struct
- type inputBytes struct
- type inputReader struct
- type stringError struct
- type ReplaceTest struct
- type ReplaceFuncTest struct
- type MetaTest struct
- type subexpIndex struct
- type subexpCase struct
- type FindTest struct
- Functions
- func freeBitState(b *bitState)
- func maxBitStateLen(prog *syntax.Prog) int
- func shouldBacktrack(prog *syntax.Prog) bool
- func freeOnePassMachine(m *onePassMachine)
- func onePassPrefix(p *syntax.Prog) (prefix string, complete bool, pc uint32)
- func onePassNext(i *onePassInst, r rune) uint32
- func iop(i *syntax.Inst) syntax.InstOp
- func mergeRuneSets(leftRunes, rightRunes *[]rune, leftPC, rightPC uint32) ([]rune, []uint32)
- func cleanupOnePass(prog *onePassProg, original *syntax.Prog)
- func minInputLen(re *syntax.Regexp) int
- func quote(s string) string
- func MatchReader(pattern string, r io.RuneReader) (matched bool, err error)
- func MatchString(pattern string, s string) (matched bool, err error)
- func Match(pattern string, b []byte) (matched bool, err error)
- func special(b byte) bool
- func init()
- func QuoteMeta(s string) string
- func extract(str string) (name string, num int, rest string, ok bool)
- func TestGoodCompile(t *testing.T)
- func TestBadCompile(t *testing.T)
- func matchTest(t *testing.T, test *FindTest)
- func TestMatch(t *testing.T)
- func matchFunctionTest(t *testing.T, test *FindTest)
- func TestMatchFunction(t *testing.T)
- func copyMatchTest(t *testing.T, test *FindTest)
- func TestCopyMatch(t *testing.T)
- func TestReplaceAll(t *testing.T)
- func TestReplaceAllLiteral(t *testing.T)
- func TestReplaceAllFunc(t *testing.T)
- func TestQuoteMeta(t *testing.T)
- func TestLiteralPrefix(t *testing.T)
- func TestSubexp(t *testing.T)
- func TestSplit(t *testing.T)
- func TestParseAndCompile(t *testing.T)
- func TestOnePassCutoff(t *testing.T)
- func TestSwitchBacktrack(t *testing.T)
- func BenchmarkFind(b *testing.B)
- func BenchmarkFindAllNoMatches(b *testing.B)
- func BenchmarkFindString(b *testing.B)
- func BenchmarkFindSubmatch(b *testing.B)
- func BenchmarkFindStringSubmatch(b *testing.B)
- func BenchmarkLiteral(b *testing.B)
- func BenchmarkNotLiteral(b *testing.B)
- func BenchmarkMatchClass(b *testing.B)
- func BenchmarkMatchClass_InRange(b *testing.B)
- func BenchmarkReplaceAll(b *testing.B)
- func BenchmarkAnchoredLiteralShortNonMatch(b *testing.B)
- func BenchmarkAnchoredLiteralLongNonMatch(b *testing.B)
- func BenchmarkAnchoredShortMatch(b *testing.B)
- func BenchmarkAnchoredLongMatch(b *testing.B)
- func BenchmarkOnePassShortA(b *testing.B)
- func BenchmarkNotOnePassShortA(b *testing.B)
- func BenchmarkOnePassShortB(b *testing.B)
- func BenchmarkNotOnePassShortB(b *testing.B)
- func BenchmarkOnePassLongPrefix(b *testing.B)
- func BenchmarkOnePassLongNotPrefix(b *testing.B)
- func BenchmarkMatchParallelShared(b *testing.B)
- func BenchmarkMatchParallelCopied(b *testing.B)
- func BenchmarkQuoteMetaAll(b *testing.B)
- func BenchmarkQuoteMetaNone(b *testing.B)
- func BenchmarkCompile(b *testing.B)
- func TestDeepEqual(t *testing.T)
- func TestMinInputLen(t *testing.T)
- func TestRE2Exhaustive(t *testing.T)
- func TestRE2Search(t *testing.T)
- func testRE2(t *testing.T, file string)
- func runFull(re, refull *Regexp, text string) ([]int, string)
- func runPartial(re, refull *Regexp, text string) ([]int, string)
- func runFullLongest(re, refull *Regexp, text string) ([]int, string)
- func runPartialLongest(re, refull *Regexp, text string) ([]int, string)
- func matchFull(re, refull *Regexp, text string) (bool, string)
- func matchPartial(re, refull *Regexp, text string) (bool, string)
- func matchFullLongest(re, refull *Regexp, text string) (bool, string)
- func matchPartialLongest(re, refull *Regexp, text string) (bool, string)
- func isSingleBytes(s string) bool
- func parseResult(t *testing.T, file string, lineno int, res string) []int
- func same(x, y []int) bool
- func TestFowler(t *testing.T)
- func testFowler(t *testing.T, file string)
- func parseFowlerResult(s string) (ok, compiled, matched bool, pos []int)
- func makeText(n int) []byte
- func BenchmarkMatch(b *testing.B)
- func BenchmarkMatch_onepass_regex(b *testing.B)
- func TestLongest(t *testing.T)
- func TestProgramTooLongForBacktrack(t *testing.T)
- func build(n int, x ...int) [][]int
- func TestFind(t *testing.T)
- func TestFindString(t *testing.T)
- func testFindIndex(test *FindTest, result []int, t *testing.T)
- func TestFindIndex(t *testing.T)
- func TestFindStringIndex(t *testing.T)
- func TestFindReaderIndex(t *testing.T)
- func TestFindAll(t *testing.T)
- func TestFindAllString(t *testing.T)
- func testFindAllIndex(test *FindTest, result [][]int, t *testing.T)
- func TestFindAllIndex(t *testing.T)
- func TestFindAllStringIndex(t *testing.T)
- func testSubmatchBytes(test *FindTest, n int, submatches []int, result [][]byte, t *testing.T)
- func TestFindSubmatch(t *testing.T)
- func testSubmatchString(test *FindTest, n int, submatches []int, result []string, t *testing.T)
- func TestFindStringSubmatch(t *testing.T)
- func testSubmatchIndices(test *FindTest, n int, expect, result []int, t *testing.T)
- func testFindSubmatchIndex(test *FindTest, result []int, t *testing.T)
- func TestFindSubmatchIndex(t *testing.T)
- func TestFindStringSubmatchIndex(t *testing.T)
- func TestFindReaderSubmatchIndex(t *testing.T)
- func TestFindAllSubmatch(t *testing.T)
- func TestFindAllStringSubmatch(t *testing.T)
- func testFindAllSubmatchIndex(test *FindTest, result [][]int, t *testing.T)
- func TestFindAllSubmatchIndex(t *testing.T)
- func TestFindAllStringSubmatchIndex(t *testing.T)
- func TestMergeRuneSet(t *testing.T)
- func TestCompileOnePass(t *testing.T)
- func TestRunOnePass(t *testing.T)
const visitedBits = 32
const maxBacktrackProg = 500 // len(prog.Inst) <= max
const maxBacktrackVector = 256 * 1024 // bit vector size <= max (bits)
const mergeFailed = uint32(0xffffffff)
mergeRuneSets merges two non-intersecting runesets, and returns the merged result, and a NextIp array. The idea is that if a rune matches the OnePassRunes at index i, NextIp[i/2] is the target. If the input sets intersect, an empty runeset and a NextIp array with the single element mergeFailed is returned. The code assumes that both inputs contain ordered and non-intersecting rune pairs.
const endOfText rune = -1
const startSize = 10 // The size at which to start a slice in the 'All' routines.
var bitStatePool sync.Pool
var t *thread
var flag lazyFlag
var onePassPool sync.Pool
var flag lazyFlag
var arrayNoInts [0]int
arrayNoInts is returned by doExecute match if nil dstCap is passed to it with ncap=0.
var buf strings.Builder
Have prefix; gather characters.
var noRune = ...
var noNext = ...
var lx, rx int
var lx, rx int
var anyRuneNotNL = ...
var anyRune = ...
var instQueue = newQueue(len(p.Inst))
var visitQueue = newQueue(len(p.Inst))
var check func(uint32, []bool) bool
var onePassRunes = make([][]rune, len(p.Inst))
var matchSize = ...
Pools of machine for use during (Regexp).doExecute, split up by the size of the execution queues. matchPool[i] machines have queue size matchSize[i]. On a 64-bit system each queue entry is 16 bytes, so matchPool[0] has 162128 = 4kB queues, etc. The final matchPool is a catch-all for very large queues.
var matchPool [len(matchSize)]sync.Pool
Pools of machine for use during (Regexp).doExecute, split up by the size of the execution queues. matchPool[i] machines have queue size matchSize[i]. On a 64-bit system each queue entry is 16 bytes, so matchPool[0] has 162128 = 4kB queues, etc. The final matchPool is a catch-all for very large queues.
var lnext int
var buf []byte
var endPos int
var dstCap [2]int
var width int
Advance past this match; always advance at least one character.
var specialBytes [16]byte
Bitmap used by func special to check whether a character needs to be escaped.
var i int
A byte loop is correct because all metacharacters are ASCII.
var end int
var width int
var dstCap [2]int
var dstCap [2]int
var dstCap [4]int
var dstCap [4]int
var result [][]byte
var result [][]int
var result []string
var result [][]int
var result [][][]byte
var result [][]int
var result [][]string
var result [][]int
var goodRe = ...
var badRe = ...
var replaceTests = ...
var replaceLiteralTests = ...
var replaceFuncTests = ...
var metaTests = ...
var literalPrefixTests = ...
var emptySubexpIndices = ...
var subexpCases = ...
var splitTests = ...
var sink string
var compileBenchData = ...
var minInputLenTests = ...
var txt io.Reader
var str []string
var input []string
var inStrings bool
var re *Regexp
var refull *Regexp
var nfail int
var ncase int
var text string
var run = ...
var match = ...
var notab = MustCompilePOSIX(`[^\t]+`)
var x []int
var end byte = ')'
var v = -1
var err error
var text []byte
var benchData = ...
var benchSizes = ...
var findTests = ...
var runeMergeTests = ...
var onePassTests = ...
var p *syntax.Prog
var re *syntax.Regexp
var err error
var onePassTests1 = ...
TODO(cespare): Unify with onePassTests and rationalize one-pass test cases.
type job struct {
pc uint32
arg bool
pos int
}
A job is an entry on the backtracker's job stack. It holds the instruction pc and the position in the input.
type bitState struct {
end int
cap []int
matchcap []int
jobs []job
visited []uint32
inputs inputs
}
bitState holds state for the backtracker.
func newBitState() *bitState
func (b *bitState) reset(prog *syntax.Prog, end int, ncap int)
reset resets the state of the backtracker. end is the end position in the input. ncap is the number of captures.
func (b *bitState) shouldVisit(pc uint32, pos int) bool
shouldVisit reports whether the combination of (pc, pos) has not been visited yet.
func (b *bitState) push(re *Regexp, pc uint32, pos int, arg bool)
push pushes (pc, pos, arg) onto the job stack if it should be visited.
type queue struct {
sparse []uint32
dense []entry
}
A queue is a 'sparse array' holding pending threads of execution. See https://research.swtch.com/2008/03/using-uninitialized-memory-for-fun-and.html
type entry struct {
pc uint32
t *thread
}
An entry is an entry on a queue. It holds both the instruction pc and the actual thread. Some queue entries are just place holders so that the machine knows it has considered that pc. Such entries have t == nil.
type thread struct {
inst *syntax.Inst
cap []int
}
A thread is the state of a single path through the machine: an instruction and a corresponding capture array. See https://swtch.com/~rsc/regexp/regexp2.html
type machine struct {
re *Regexp // corresponding Regexp
p *syntax.Prog // compiled program
q0, q1 queue // two queues for runq, nextq
pool []*thread // pool of available threads
matched bool // whether a match was found
matchcap []int // capture information for the match
inputs inputs
}
A machine holds all the state during an NFA simulation for p.
func (m *machine) init(ncap int)
func (m *machine) alloc(i *syntax.Inst) *thread
alloc allocates a new thread with the given instruction. It uses the free pool if possible.
func (m *machine) match(i input, pos int) bool
match runs the machine over the input starting at pos. It reports whether a match was found. If so, m.matchcap holds the submatch information.
func (m *machine) clear(q *queue)
clear frees all threads on the thread queue.
func (m *machine) step(runq, nextq *queue, pos, nextPos int, c rune, nextCond *lazyFlag)
step executes one step of the machine, running each of the threads on runq and appending new threads to nextq. The step processes the rune c (which may be endOfText), which starts at position pos and ends at nextPos. nextCond gives the setting for the empty-width flags after c.
func (m *machine) add(q *queue, pc uint32, pos int, cap []int, cond *lazyFlag, t *thread) *thread
add adds an entry to q for pc, unless the q already has such an entry. It also recursively adds an entry for all instructions reachable from pc by following empty-width conditions satisfied by cond. pos gives the current position in the input.
type inputs struct {
// cached inputs, to avoid allocation
bytes inputBytes
string inputString
reader inputReader
}
func (i *inputs) newBytes(b []byte) input
func (i *inputs) newString(s string) input
func (i *inputs) newReader(r io.RuneReader) input
func (i *inputs) clear()
func (i *inputs) init(r io.RuneReader, b []byte, s string) (input, int)
type lazyFlag uint64
A lazyFlag is a lazily-evaluated syntax.EmptyOp, for checking zero-width flags like ^ $ \A \z \B \b. It records the pair of relevant runes and does not determine the implied flags until absolutely necessary (most of the time, that means never).
func newLazyFlag(r1, r2 rune) lazyFlag
func (f lazyFlag) match(op syntax.EmptyOp) bool
type onePassMachine struct {
inputs inputs
matchcap []int
}
func newOnePassMachine() *onePassMachine
type onePassProg struct {
Inst []onePassInst
Start int // index of start instruction
NumCap int // number of InstCapture insts in re
}
A onePassProg is a compiled one-pass regular expression program. It is the same as syntax.Prog except for the use of onePassInst.
func onePassCopy(prog *syntax.Prog) *onePassProg
onePassCopy creates a copy of the original Prog, as we'll be modifying it
func makeOnePass(p *onePassProg) *onePassProg
makeOnePass creates a onepass Prog, if possible. It is possible if at any alt, the match engine can always tell which branch to take. The routine may modify p if it is turned into a onepass Prog. If it isn't possible for this to be a onepass Prog, the Prog nil is returned. makeOnePass is recursive to the size of the Prog.
func compileOnePass(prog *syntax.Prog) (p *onePassProg)
compileOnePass returns a new *syntax.Prog suitable for onePass execution if the original Prog can be recharacterized as a one-pass regexp program, or syntax.nil if the Prog cannot be converted. For a one pass prog, the fundamental condition that must be true is: at any InstAlt, there must be no ambiguity about what branch to take.
type onePassInst struct {
syntax.Inst
Next []uint32
}
A onePassInst is a single instruction in a one-pass regular expression program. It is the same as syntax.Inst except for the new 'Next' field.
type queueOnePass struct {
sparse []uint32
dense []uint32
size, nextIndex uint32
}
Sparse Array implementation is used as a queueOnePass.
func newQueue(size int) (q *queueOnePass)
func (q *queueOnePass) empty() bool
func (q *queueOnePass) next() (n uint32)
func (q *queueOnePass) clear()
func (q *queueOnePass) contains(u uint32) bool
func (q *queueOnePass) insert(u uint32)
func (q *queueOnePass) insertNew(u uint32)
type runeSlice []rune
runeSlice exists to permit sorting the case-folded rune sets.
func (p runeSlice) Len() int
func (p runeSlice) Less(i, j int) bool
func (p runeSlice) Swap(i, j int)
type Regexp struct {
expr string // as passed to Compile
prog *syntax.Prog // compiled program
onepass *onePassProg // onepass program or nil
numSubexp int
maxBitStateLen int
subexpNames []string
prefix string // required prefix in unanchored matches
prefixBytes []byte // prefix, as a []byte
prefixRune rune // first rune in prefix
prefixEnd uint32 // pc for last rune in prefix
mpool int // pool for machines
matchcap int // size of recorded match lengths
prefixComplete bool // prefix is the entire regexp
cond syntax.EmptyOp // empty-width conditions required at start of match
minInputLen int // minimum length of the input in bytes
// This field can be modified by the Longest method,
// but it is otherwise read-only.
longest bool // whether regexp prefers leftmost-longest match
}
Regexp is the representation of a compiled regular expression. A Regexp is safe for concurrent use by multiple goroutines, except for configuration methods, such as Longest.
func Compile(expr string) (*Regexp, error)
Compile parses a regular expression and returns, if successful, a Regexp object that can be used to match against text.
When matching against text, the regexp returns a match that begins as early as possible in the input (leftmost), and among those it chooses the one that a backtracking search would have found first. This so-called leftmost-first matching is the same semantics that Perl, Python, and other implementations use, although this package implements it without the expense of backtracking. For POSIX leftmost-longest matching, see CompilePOSIX.
func CompilePOSIX(expr string) (*Regexp, error)
CompilePOSIX is like Compile but restricts the regular expression to POSIX ERE (egrep) syntax and changes the match semantics to leftmost-longest.
That is, when matching against text, the regexp returns a match that begins as early as possible in the input (leftmost), and among those it chooses a match that is as long as possible. This so-called leftmost-longest matching is the same semantics that early regular expression implementations used and that POSIX specifies.
However, there can be multiple leftmost-longest matches, with different submatch choices, and here this package diverges from POSIX. Among the possible leftmost-longest matches, this package chooses the one that a backtracking search would have found first, while POSIX specifies that the match be chosen to maximize the length of the first subexpression, then the second, and so on from left to right. The POSIX rule is computationally prohibitive and not even well-defined. See https://swtch.com/~rsc/regexp/regexp2.html#posix for details.
func compile(expr string, mode syntax.Flags, longest bool) (*Regexp, error)
func MustCompile(str string) *Regexp
MustCompile is like Compile but panics if the expression cannot be parsed. It simplifies safe initialization of global variables holding compiled regular expressions.
func MustCompilePOSIX(str string) *Regexp
MustCompilePOSIX is like CompilePOSIX but panics if the expression cannot be parsed. It simplifies safe initialization of global variables holding compiled regular expressions.
func compileTest(t *testing.T, expr string, error string) *Regexp
func tryCompile(s string) (re *Regexp, err error)
func (re *Regexp) tryBacktrack(b *bitState, i input, pc uint32, pos int) bool
tryBacktrack runs a backtracking search starting at pos.
func (re *Regexp) backtrack(ib []byte, is string, pos int, ncap int, dstCap []int) []int
backtrack runs a backtracking search of prog on the input starting at pos.
func (re *Regexp) doOnePass(ir io.RuneReader, ib []byte, is string, pos, ncap int, dstCap []int) []int
func (re *Regexp) doOnePass(ir io.RuneReader, ib []byte, is string, pos, ncap int, dstCap []int) []int
doOnePass implements r.doExecute using the one-pass execution engine.
func (re *Regexp) doMatch(r io.RuneReader, b []byte, s string) bool
doMatch reports whether either r, b or s match the regexp.
func (re *Regexp) doExecute(r io.RuneReader, b []byte, s string, pos int, ncap int, dstCap []int) []int
func (re *Regexp) doExecute(r io.RuneReader, b []byte, s string, pos int, ncap int, dstCap []int) []int
doExecute finds the leftmost match in the input, appends the position of its subexpressions to dstCap and returns dstCap.
nil is returned if no matches are found and non-nil if matches are found.
func (re *Regexp) String() string
String returns the source text used to compile the regular expression.
func (re *Regexp) Copy() *Regexp
Copy returns a new Regexp object copied from re. Calling Longest on one copy does not affect another.
Deprecated: In earlier releases, when using a Regexp in multiple goroutines, giving each goroutine its own copy helped to avoid lock contention. As of Go 1.12, using Copy is no longer necessary to avoid lock contention. Copy may still be appropriate if the reason for its use is to make two copies with different Longest settings.
func (re *Regexp) Longest()
Longest makes future searches prefer the leftmost-longest match. That is, when matching against text, the regexp returns a match that begins as early as possible in the input (leftmost), and among those it chooses a match that is as long as possible. This method modifies the Regexp and may not be called concurrently with any other methods.
func (re *Regexp) get() *machine
get returns a machine to use for matching re. It uses the re's machine cache if possible, to avoid unnecessary allocation.
func (re *Regexp) put(m *machine)
put returns a machine to the correct machine pool.
func (re *Regexp) NumSubexp() int
NumSubexp returns the number of parenthesized subexpressions in this Regexp.
func (re *Regexp) SubexpNames() []string
SubexpNames returns the names of the parenthesized subexpressions in this Regexp. The name for the first sub-expression is names[1], so that if m is a match slice, the name for m[i] is SubexpNames()[i]. Since the Regexp as a whole cannot be named, names[0] is always the empty string. The slice should not be modified.
func (re *Regexp) SubexpIndex(name string) int
SubexpIndex returns the index of the first subexpression with the given name, or -1 if there is no subexpression with that name.
Note that multiple subexpressions can be written using the same name, as in (?Pa+)(?Pb+), which declares two subexpressions named "bob". In this case, SubexpIndex returns the index of the leftmost such subexpression in the regular expression.
func (re *Regexp) LiteralPrefix() (prefix string, complete bool)
LiteralPrefix returns a literal string that must begin any match of the regular expression re. It returns the boolean true if the literal string comprises the entire regular expression.
func (re *Regexp) MatchReader(r io.RuneReader) bool
MatchReader reports whether the text returned by the RuneReader contains any match of the regular expression re.
func (re *Regexp) MatchString(s string) bool
MatchString reports whether the string s contains any match of the regular expression re.
func (re *Regexp) Match(b []byte) bool
Match reports whether the byte slice b contains any match of the regular expression re.
func (re *Regexp) ReplaceAllString(src, repl string) string
ReplaceAllString returns a copy of src, replacing matches of the Regexp with the replacement string repl. Inside repl, $ signs are interpreted as in Expand, so for instance $1 represents the text of the first submatch.
func (re *Regexp) ReplaceAllLiteralString(src, repl string) string
ReplaceAllLiteralString returns a copy of src, replacing matches of the Regexp with the replacement string repl. The replacement repl is substituted directly, without using Expand.
func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string
ReplaceAllStringFunc returns a copy of src in which all matches of the Regexp have been replaced by the return value of function repl applied to the matched substring. The replacement returned by repl is substituted directly, without using Expand.
func (re *Regexp) replaceAll(bsrc []byte, src string, nmatch int, repl func(dst []byte, m []int) []byte) []byte
func (re *Regexp) replaceAll(bsrc []byte, src string, nmatch int, repl func(dst []byte, m []int) []byte) []byte
func (re *Regexp) ReplaceAll(src, repl []byte) []byte
ReplaceAll returns a copy of src, replacing matches of the Regexp with the replacement text repl. Inside repl, $ signs are interpreted as in Expand, so for instance $1 represents the text of the first submatch.
func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte
ReplaceAllLiteral returns a copy of src, replacing matches of the Regexp with the replacement bytes repl. The replacement repl is substituted directly, without using Expand.
func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte
ReplaceAllFunc returns a copy of src in which all matches of the Regexp have been replaced by the return value of function repl applied to the matched byte slice. The replacement returned by repl is substituted directly, without using Expand.
func (re *Regexp) pad(a []int) []int
The number of capture values in the program may correspond to fewer capturing expressions than are in the regexp. For example, "(a){0}" turns into an empty program, so the maximum capture in the program is 0 but we need to return an expression for \1. Pad appends -1s to the slice a as needed.
func (re *Regexp) allMatches(s string, b []byte, n int, deliver func([]int))
allMatches calls deliver at most n times with the location of successive matches in the input text. The input text is b if non-nil, otherwise s.
func (re *Regexp) Find(b []byte) []byte
Find returns a slice holding the text of the leftmost match in b of the regular expression. A return value of nil indicates no match.
func (re *Regexp) FindIndex(b []byte) (loc []int)
FindIndex returns a two-element slice of integers defining the location of the leftmost match in b of the regular expression. The match itself is at b[loc[0]:loc[1]]. A return value of nil indicates no match.
func (re *Regexp) FindString(s string) string
FindString returns a string holding the text of the leftmost match in s of the regular expression. If there is no match, the return value is an empty string, but it will also be empty if the regular expression successfully matches an empty string. Use FindStringIndex or FindStringSubmatch if it is necessary to distinguish these cases.
func (re *Regexp) FindStringIndex(s string) (loc []int)
FindStringIndex returns a two-element slice of integers defining the location of the leftmost match in s of the regular expression. The match itself is at s[loc[0]:loc[1]]. A return value of nil indicates no match.
func (re *Regexp) FindReaderIndex(r io.RuneReader) (loc []int)
FindReaderIndex returns a two-element slice of integers defining the location of the leftmost match of the regular expression in text read from the RuneReader. The match text was found in the input stream at byte offset loc[0] through loc[1]-1. A return value of nil indicates no match.
func (re *Regexp) FindSubmatch(b []byte) [][]byte
FindSubmatch returns a slice of slices holding the text of the leftmost match of the regular expression in b and the matches, if any, of its subexpressions, as defined by the 'Submatch' descriptions in the package comment. A return value of nil indicates no match.
func (re *Regexp) Expand(dst []byte, template []byte, src []byte, match []int) []byte
Expand appends template to dst and returns the result; during the append, Expand replaces variables in the template with corresponding matches drawn from src. The match slice should have been returned by FindSubmatchIndex.
In the template, a variable is denoted by a substring of the form
In the $name form, name is taken to be as long as possible:
To insert a literal $ in the output, use $$ in the template.
func (re *Regexp) ExpandString(dst []byte, template string, src string, match []int) []byte (exported)
func (re *Regexp) ExpandString(dst []byte, template string, src string, match []int) []byte
ExpandString is like Expand but the template and source are strings. It appends to and returns a byte slice in order to give the calling code control over allocation.
func (re *Regexp) expand(dst []byte, template string, bsrc []byte, src string, match []int) []byte
func (re *Regexp) FindSubmatchIndex(b []byte) []int
FindSubmatchIndex returns a slice holding the index pairs identifying the leftmost match of the regular expression in b and the matches, if any, of its subexpressions, as defined by the 'Submatch' and 'Index' descriptions in the package comment. A return value of nil indicates no match.
func (re *Regexp) FindStringSubmatch(s string) []string
FindStringSubmatch returns a slice of strings holding the text of the leftmost match of the regular expression in s and the matches, if any, of its subexpressions, as defined by the 'Submatch' description in the package comment. A return value of nil indicates no match.
func (re *Regexp) FindStringSubmatchIndex(s string) []int
FindStringSubmatchIndex returns a slice holding the index pairs identifying the leftmost match of the regular expression in s and the matches, if any, of its subexpressions, as defined by the 'Submatch' and 'Index' descriptions in the package comment. A return value of nil indicates no match.
func (re *Regexp) FindReaderSubmatchIndex(r io.RuneReader) []int
FindReaderSubmatchIndex returns a slice holding the index pairs identifying the leftmost match of the regular expression of text read by the RuneReader, and the matches, if any, of its subexpressions, as defined by the 'Submatch' and 'Index' descriptions in the package comment. A return value of nil indicates no match.
func (re *Regexp) FindAll(b []byte, n int) [][]byte
FindAll is the 'All' version of Find; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.
func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
FindAllIndex is the 'All' version of FindIndex; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.
func (re *Regexp) FindAllString(s string, n int) []string
FindAllString is the 'All' version of FindString; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.
func (re *Regexp) FindAllStringIndex(s string, n int) [][]int
FindAllStringIndex is the 'All' version of FindStringIndex; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.
func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
FindAllSubmatch is the 'All' version of FindSubmatch; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.
func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
FindAllSubmatchIndex is the 'All' version of FindSubmatchIndex; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.
func (re *Regexp) FindAllStringSubmatch(s string, n int) [][]string
FindAllStringSubmatch is the 'All' version of FindStringSubmatch; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.
func (re *Regexp) FindAllStringSubmatchIndex(s string, n int) [][]int
FindAllStringSubmatchIndex is the 'All' version of FindStringSubmatchIndex; it returns a slice of all successive matches of the expression, as defined by the 'All' description in the package comment. A return value of nil indicates no match.
func (re *Regexp) Split(s string, n int) []string
Split slices s into substrings separated by the expression and returns a slice of the substrings between those expression matches.
The slice returned by this method consists of all the substrings of s not contained in the slice returned by FindAllString. When called on an expression that contains no metacharacters, it is equivalent to strings.SplitN.
Example:
s := regexp.MustCompile("a*").Split("abaabaccadaaae", 5)
// s: ["", "b", "b", "c", "cadaaae"]
The count determines the number of substrings to return:
n > 0: at most n substrings; the last substring will be the unsplit remainder.
n == 0: the result is nil (zero substrings)
n < 0: all substrings
type input interface {
step(pos int) (r rune, width int) // advance one rune
canCheckPrefix() bool // can we look ahead without losing info?
hasPrefix(re *Regexp) bool
index(re *Regexp, pos int) int
context(pos int) lazyFlag
}
input abstracts different representations of the input text. It provides one-character lookahead.
type inputString struct {
str string
}
inputString scans a string.
func (i *inputString) step(pos int) (rune, int)
func (i *inputString) canCheckPrefix() bool
func (i *inputString) hasPrefix(re *Regexp) bool
func (i *inputString) index(re *Regexp, pos int) int
func (i *inputString) context(pos int) lazyFlag
type inputBytes struct {
str []byte
}
inputBytes scans a byte slice.
func (i *inputBytes) step(pos int) (rune, int)
func (i *inputBytes) canCheckPrefix() bool
func (i *inputBytes) hasPrefix(re *Regexp) bool
func (i *inputBytes) index(re *Regexp, pos int) int
func (i *inputBytes) context(pos int) lazyFlag
type inputReader struct {
r io.RuneReader
atEOT bool
pos int
}
inputReader scans a RuneReader.
func (i *inputReader) step(pos int) (rune, int)
func (i *inputReader) canCheckPrefix() bool
func (i *inputReader) hasPrefix(re *Regexp) bool
func (i *inputReader) index(re *Regexp, pos int) int
func (i *inputReader) context(pos int) lazyFlag
type stringError struct {
re string
err string
}
type ReplaceTest struct {
pattern, replacement, input, output string
}
type ReplaceFuncTest struct {
pattern string
replacement func(string) string
input, output string
}
type MetaTest struct {
pattern, output, literal string
isLiteral bool
}
type subexpIndex struct {
name string
index int
}
type subexpCase struct {
input string
num int
names []string
indices []subexpIndex
}
type FindTest struct {
pat string
text string
matches [][]int
}
For each pattern/text pair, what is the expected output of each function? We can derive the textual results from the indexed results, the non-submatch results from the submatched results, the single results from the 'all' results, and the byte results from the string results. Therefore the table includes only the FindAllStringSubmatchIndex result.
func (t FindTest) String() string
func freeBitState(b *bitState)
func maxBitStateLen(prog *syntax.Prog) int
maxBitStateLen returns the maximum length of a string to search with the backtracker using prog.
func shouldBacktrack(prog *syntax.Prog) bool
shouldBacktrack reports whether the program is too long for the backtracker to run.
func freeOnePassMachine(m *onePassMachine)
func onePassPrefix(p *syntax.Prog) (prefix string, complete bool, pc uint32)
OnePassPrefix returns a literal string that all matches for the regexp must start with. Complete is true if the prefix is the entire match. Pc is the index of the last rune instruction in the string. The OnePassPrefix skips over the mandatory EmptyBeginText
func onePassNext(i *onePassInst, r rune) uint32
OnePassNext selects the next actionable state of the prog, based on the input character. It should only be called when i.Op == InstAlt or InstAltMatch, and from the one-pass machine. One of the alternates may ultimately lead without input to end of line. If the instruction is InstAltMatch the path to the InstMatch is in i.Out, the normal node in i.Next.
func iop(i *syntax.Inst) syntax.InstOp
func mergeRuneSets(leftRunes, rightRunes *[]rune, leftPC, rightPC uint32) ([]rune, []uint32)
func cleanupOnePass(prog *onePassProg, original *syntax.Prog)
cleanupOnePass drops working memory, and restores certain shortcut instructions.
func minInputLen(re *syntax.Regexp) int
minInputLen walks the regexp to find the minimum length of any matchable input
func quote(s string) string
func MatchReader(pattern string, r io.RuneReader) (matched bool, err error)
MatchReader reports whether the text returned by the RuneReader contains any match of the regular expression pattern. More complicated queries need to use Compile and the full Regexp interface.
func MatchString(pattern string, s string) (matched bool, err error)
MatchString reports whether the string s contains any match of the regular expression pattern. More complicated queries need to use Compile and the full Regexp interface.
func Match(pattern string, b []byte) (matched bool, err error)
Match reports whether the byte slice b contains any match of the regular expression pattern. More complicated queries need to use Compile and the full Regexp interface.
func special(b byte) bool
special reports whether byte b needs to be escaped by QuoteMeta.
func init()
func QuoteMeta(s string) string
QuoteMeta returns a string that escapes all regular expression metacharacters inside the argument text; the returned string is a regular expression matching the literal text.
func extract(str string) (name string, num int, rest string, ok bool)
extract returns the name from a leading "$name" or "${name}" in str. If it is a number, extract returns num set to that number; otherwise num = -1.
func TestGoodCompile(t *testing.T)
func TestBadCompile(t *testing.T)
func matchTest(t *testing.T, test *FindTest)
func TestMatch(t *testing.T)
func matchFunctionTest(t *testing.T, test *FindTest)
func TestMatchFunction(t *testing.T)
func copyMatchTest(t *testing.T, test *FindTest)
func TestCopyMatch(t *testing.T)
func TestReplaceAll(t *testing.T)
func TestReplaceAllLiteral(t *testing.T)
func TestReplaceAllFunc(t *testing.T)
func TestQuoteMeta(t *testing.T)
func TestLiteralPrefix(t *testing.T)
func TestSubexp(t *testing.T)
func TestSplit(t *testing.T)
func TestParseAndCompile(t *testing.T)
The following sequence of Match calls used to panic. See issue #12980.
func TestOnePassCutoff(t *testing.T)
Check that one-pass cutoff does trigger.
func TestSwitchBacktrack(t *testing.T)
Check that the same machine can be used with the standard matcher and then the backtracker when there are no captures.
func BenchmarkFind(b *testing.B)
func BenchmarkFindAllNoMatches(b *testing.B)
func BenchmarkFindString(b *testing.B)
func BenchmarkFindSubmatch(b *testing.B)
func BenchmarkFindStringSubmatch(b *testing.B)
func BenchmarkLiteral(b *testing.B)
func BenchmarkNotLiteral(b *testing.B)
func BenchmarkMatchClass(b *testing.B)
func BenchmarkMatchClass_InRange(b *testing.B)
func BenchmarkReplaceAll(b *testing.B)
func BenchmarkAnchoredLiteralShortNonMatch(b *testing.B)
func BenchmarkAnchoredLiteralLongNonMatch(b *testing.B)
func BenchmarkAnchoredShortMatch(b *testing.B)
func BenchmarkAnchoredLongMatch(b *testing.B)
func BenchmarkOnePassShortA(b *testing.B)
func BenchmarkNotOnePassShortA(b *testing.B)
func BenchmarkOnePassShortB(b *testing.B)
func BenchmarkNotOnePassShortB(b *testing.B)
func BenchmarkOnePassLongPrefix(b *testing.B)
func BenchmarkOnePassLongNotPrefix(b *testing.B)
func BenchmarkMatchParallelShared(b *testing.B)
func BenchmarkMatchParallelCopied(b *testing.B)
func BenchmarkQuoteMetaAll(b *testing.B)
func BenchmarkQuoteMetaNone(b *testing.B)
func BenchmarkCompile(b *testing.B)
func TestDeepEqual(t *testing.T)
func TestMinInputLen(t *testing.T)
func TestRE2Exhaustive(t *testing.T)
This test is excluded when running under the race detector because it is a very expensive test and takes too long.
func TestRE2Search(t *testing.T)
TestRE2 tests this package's regexp API against test cases considered during RE2's exhaustive tests, which run all possible regexps over a given set of atoms and operators, up to a given complexity, over all possible strings over a given alphabet, up to a given size. Rather than try to link with RE2, we read a log file containing the test cases and the expected matches. The log file, re2-exhaustive.txt, is generated by running 'make log' in the open source RE2 distribution https://github.com/google/re2/.
The test file format is a sequence of stanzas like:
strings
"abc"
"123x"
regexps
"[a-z]+"
0-3;0-3
-;-
"([0-9])([0-9])([0-9])"
-;-
-;0-3 0-1 1-2 2-3
The stanza begins by defining a set of strings, quoted using Go double-quote syntax, one per line. Then the regexps section gives a sequence of regexps to run on the strings. In the block that follows a regexp, each line gives the semicolon-separated match results of running the regexp on the corresponding string. Each match result is either a single -, meaning no match, or a space-separated sequence of pairs giving the match and submatch indices. An unmatched subexpression formats its pair as a single - (not illustrated above). For now each regexp run produces two match results, one for a full match' that restricts the regexp to matching the entire string or nothing, and one for a
partial match' that gives the leftmost first match found in the string.
Lines beginning with # are comments. Lines beginning with a capital letter are test names printed during RE2's test suite and are echoed into t but otherwise ignored.
At time of writing, re2-exhaustive.txt is 59 MB but compresses to 385 kB, so we store re2-exhaustive.txt.bz2 in the repository and decompress it on the fly.
func testRE2(t *testing.T, file string)
func runFull(re, refull *Regexp, text string) ([]int, string)
func runPartial(re, refull *Regexp, text string) ([]int, string)
func runFullLongest(re, refull *Regexp, text string) ([]int, string)
func runPartialLongest(re, refull *Regexp, text string) ([]int, string)
func matchFull(re, refull *Regexp, text string) (bool, string)
func matchPartial(re, refull *Regexp, text string) (bool, string)
func matchFullLongest(re, refull *Regexp, text string) (bool, string)
func matchPartialLongest(re, refull *Regexp, text string) (bool, string)
func isSingleBytes(s string) bool
func parseResult(t *testing.T, file string, lineno int, res string) []int
func same(x, y []int) bool
func TestFowler(t *testing.T)
TestFowler runs this package's regexp API against the POSIX regular expression tests collected by Glenn Fowler at http://www2.research.att.com/~astopen/testregex/testregex.html.
func testFowler(t *testing.T, file string)
func parseFowlerResult(s string) (ok, compiled, matched bool, pos []int)
func makeText(n int) []byte
func BenchmarkMatch(b *testing.B)
func BenchmarkMatch_onepass_regex(b *testing.B)
func TestLongest(t *testing.T)
func TestProgramTooLongForBacktrack(t *testing.T)
TestProgramTooLongForBacktrack tests that a regex which is too long for the backtracker still executes properly.
func build(n int, x ...int) [][]int
build is a helper to construct a [][]int by extracting n sequences from x. This represents n matches with len(x)/n submatches each.
func TestFind(t *testing.T)
func TestFindString(t *testing.T)
func testFindIndex(test *FindTest, result []int, t *testing.T)
func TestFindIndex(t *testing.T)
func TestFindStringIndex(t *testing.T)
func TestFindReaderIndex(t *testing.T)
func TestFindAll(t *testing.T)
func TestFindAllString(t *testing.T)
func testFindAllIndex(test *FindTest, result [][]int, t *testing.T)
func TestFindAllIndex(t *testing.T)
func TestFindAllStringIndex(t *testing.T)
func testSubmatchBytes(test *FindTest, n int, submatches []int, result [][]byte, t *testing.T)
func TestFindSubmatch(t *testing.T)
func testSubmatchString(test *FindTest, n int, submatches []int, result []string, t *testing.T)
func TestFindStringSubmatch(t *testing.T)
func testSubmatchIndices(test *FindTest, n int, expect, result []int, t *testing.T)
func testFindSubmatchIndex(test *FindTest, result []int, t *testing.T)
func TestFindSubmatchIndex(t *testing.T)
func TestFindStringSubmatchIndex(t *testing.T)
func TestFindReaderSubmatchIndex(t *testing.T)
func TestFindAllSubmatch(t *testing.T)
func TestFindAllStringSubmatch(t *testing.T)
func testFindAllSubmatchIndex(test *FindTest, result [][]int, t *testing.T)
func TestFindAllSubmatchIndex(t *testing.T)
func TestFindAllStringSubmatchIndex(t *testing.T)
func TestMergeRuneSet(t *testing.T)
func TestCompileOnePass(t *testing.T)
func TestRunOnePass(t *testing.T)
Package syntax parses regular expressions into parse trees and compiles parse trees into programs. Most clients of regular expressions will use the facilities of package regexp (such as Compile and Match) instead of this package.
The regular expression syntax understood by this package when parsing with the Perl flag is as follows. Parts of the syntax can be disabled by passing alternate flags to Parse.
Single characters:
. any character, possibly including newline (flag s=true)
[xyz] character class
[^xyz] negated character class
\d Perl character class
\D negated Perl character class
[[:alpha:]] ASCII character class
[[:^alpha:]] negated ASCII character class
\pN Unicode character class (one-letter name)
\p{Greek} Unicode character class
\PN negated Unicode character class (one-letter name)
\P{Greek} negated Unicode character class
Composites:
xy x followed by y
x|y x or y (prefer x)
Repetitions:
x* zero or more x, prefer more
x+ one or more x, prefer more
x? zero or one x, prefer one
x{n,m} n or n+1 or ... or m x, prefer more
x{n,} n or more x, prefer more
x{n} exactly n x
x*? zero or more x, prefer fewer
x+? one or more x, prefer fewer
x?? zero or one x, prefer zero
x{n,m}? n or n+1 or ... or m x, prefer fewer
x{n,}? n or more x, prefer fewer
x{n}? exactly n x
Implementation restriction: The counting forms x{n,m}, x{n,}, and x{n} reject forms that create a minimum or maximum repetition count above 1000. Unlimited repetitions are not subject to this restriction.
Grouping:
(re) numbered capturing group (submatch)
(?P<name>re) named & numbered capturing group (submatch)
(?:re) non-capturing group
(?flags) set flags within current group; non-capturing
(?flags:re) set flags during re; non-capturing
Flag syntax is xyz (set) or -xyz (clear) or xy-z (set xy, clear z). The flags are:
i case-insensitive (default false)
m multi-line mode: ^ and $ match begin/end line in addition to begin/end text (default false)
s let . match \n (default false)
U ungreedy: swap meaning of x* and x*?, x+ and x+?, etc (default false)
Empty strings:
^ at beginning of text or line (flag m=true)
$ at end of text (like \z not \Z) or line (flag m=true)
\A at beginning of text
\b at ASCII word boundary (\w on one side and \W, \A, or \z on the other)
\B not at ASCII word boundary
\z at end of text
Escape sequences:
\a bell (== \007)
\f form feed (== \014)
\t horizontal tab (== \011)
\n newline (== \012)
\r carriage return (== \015)
\v vertical tab character (== \013)
\* literal *, for any punctuation character *
\123 octal character code (up to three digits)
\x7F hex character code (exactly two digits)
\x{10FFFF} hex character code
\Q...\E literal text ... even if ... has punctuation
Character class elements:
x single character
A-Z character range (inclusive)
\d Perl character class
[:foo:] ASCII character class foo
\p{Foo} Unicode character class Foo
\pF Unicode character class F (one-letter name)
Named character classes as character class elements:
[\d] digits (== \d)
[^\d] not digits (== \D)
[\D] not digits (== \D)
[^\D] not not digits (== \d)
[[:name:]] named ASCII class inside character class (== [:name:])
[^[:name:]] named ASCII class inside negated character class (== [:^name:])
[\p{Name}] named Unicode property inside character class (== \p{Name})
[^\p{Name}] named Unicode property inside negated character class (== \P{Name})
Perl character classes (all ASCII-only):
\d digits (== [0-9])
\D not digits (== [^0-9])
\s whitespace (== [\t\n\f\r ])
\S not whitespace (== [^\t\n\f\r ])
\w word characters (== [0-9A-Za-z_])
\W not word characters (== [^0-9A-Za-z_])
ASCII character classes:
[[:alnum:]] alphanumeric (== [0-9A-Za-z])
[[:alpha:]] alphabetic (== [A-Za-z])
[[:ascii:]] ASCII (== [\x00-\x7F])
[[:blank:]] blank (== [\t ])
[[:cntrl:]] control (== [\x00-\x1F\x7F])
[[:digit:]] digits (== [0-9])
[[:graph:]] graphical (== [!-~] == [A-Za-z0-9!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~])
[[:lower:]] lower case (== [a-z])
[[:print:]] printable (== [ -~] == [ [:graph:]])
[[:punct:]] punctuation (== [!-/:-@[-`{-~])
[[:space:]] whitespace (== [\t\n\v\f\r ])
[[:upper:]] upper case (== [A-Z])
[[:word:]] word characters (== [0-9A-Za-z_])
[[:xdigit:]] hex digit (== [0-9A-Fa-f])
Unicode character classes are those in unicode.Categories and unicode.Scripts.
- Constants
- const _Op_name_0
- const _Op_name_1
- const ErrInternalError
- const ErrInvalidCharClass
- const ErrInvalidCharRange
- const ErrInvalidEscape
- const ErrInvalidNamedCapture
- const ErrInvalidPerlOp
- const ErrInvalidRepeatOp
- const ErrInvalidRepeatSize
- const ErrInvalidUTF8
- const ErrMissingBracket
- const ErrMissingParen
- const ErrMissingRepeatArgument
- const ErrTrailingBackslash
- const ErrUnexpectedParen
- const FoldCase
- const Literal
- const ClassNL
- const DotNL
- const OneLine
- const NonGreedy
- const PerlX
- const UnicodeGroups
- const WasDollar
- const Simple
- const MatchNL
- const Perl
- const POSIX
- const opLeftParen
- const opVerticalBar
- const minFold
- const maxFold
- const InstAlt
- const InstAltMatch
- const InstCapture
- const InstEmptyWidth
- const InstMatch
- const InstFail
- const InstNop
- const InstRune
- const InstRune1
- const InstRuneAny
- const InstRuneAnyNotNL
- const EmptyBeginLine
- const EmptyEndLine
- const EmptyBeginText
- const EmptyEndText
- const EmptyWordBoundary
- const EmptyNoWordBoundary
- const noMatch
- const OpNoMatch
- const OpEmptyMatch
- const OpLiteral
- const OpCharClass
- const OpAnyCharNotNL
- const OpAnyChar
- const OpBeginLine
- const OpEndLine
- const OpBeginText
- const OpEndText
- const OpWordBoundary
- const OpNoWordBoundary
- const OpCapture
- const OpStar
- const OpPlus
- const OpQuest
- const OpRepeat
- const OpConcat
- const OpAlternate
- const opPseudo
- const meta
- const testFlags
- Variables
- var c
- var anyRuneNotNL
- var anyRune
- var f
- var f
- var f
- var _Op_index_0
- var str
- var strflags
- var istr
- var iflags
- var first
- var ifirst
- var p
- var err
- var c
- var op
- var lastRepeat
- var lit
- var ok1
- var c
- var anyTable
- var seq
- var name
- var lo
- var hi
- var code1
- var code2
- var code3
- var perlGroup
- var code4
- var code5
- var code6
- var code7
- var code8
- var code9
- var code10
- var code11
- var code12
- var code13
- var code14
- var code15
- var code16
- var code17
- var posixGroup
- var instOpNames
- var op
- var boundary
- var b
- var buf
- var flag
- var b
- var b
- var prefix
- var parseTests
- var foldcaseTests
- var literalTests
- var matchnlTests
- var nomatchnlTests
- var b
- var opNames
- var r
- var invalidRegexps
- var onlyPerl
- var onlyPOSIX
- var compileTests
- var r1
- var simplifyTests
- Types
- type patchList struct
- type frag struct
- type compiler struct
- func (c *compiler) init()
- func (c *compiler) compile(re *Regexp) frag
- func (c *compiler) inst(op InstOp) frag
- func (c *compiler) nop() frag
- func (c *compiler) fail() frag
- func (c *compiler) cap(arg uint32) frag
- func (c *compiler) cat(f1, f2 frag) frag
- func (c *compiler) alt(f1, f2 frag) frag
- func (c *compiler) quest(f1 frag, nongreedy bool) frag
- func (c *compiler) star(f1 frag, nongreedy bool) frag
- func (c *compiler) plus(f1 frag, nongreedy bool) frag
- func (c *compiler) empty(op EmptyOp) frag
- func (c *compiler) rune(r []rune, flags Flags) frag
- type Error struct
- type ErrorCode string
- type Flags uint16
- type parser struct
- func (p *parser) newRegexp(op Op) *Regexp
- func (p *parser) reuse(re *Regexp)
- func (p *parser) push(re *Regexp) *Regexp
- func (p *parser) maybeConcat(r rune, flags Flags) bool
- func (p *parser) literal(r rune)
- func (p *parser) op(op Op) *Regexp
- func (p *parser) repeat(op Op, min, max int, before, after, lastRepeat string) (string, error)
- func (p *parser) concat() *Regexp
- func (p *parser) alternate() *Regexp
- func (p *parser) collapse(subs []*Regexp, op Op) *Regexp
- func (p *parser) factor(sub []*Regexp) []*Regexp
- func (p *parser) leadingString(re *Regexp) ([]rune, Flags)
- func (p *parser) removeLeadingString(re *Regexp, n int) *Regexp
- func (p *parser) leadingRegexp(re *Regexp) *Regexp
- func (p *parser) removeLeadingRegexp(re *Regexp, reuse bool) *Regexp
- func (p *parser) parseRepeat(s string) (min, max int, rest string, ok bool)
- func (p *parser) parsePerlFlags(s string) (rest string, err error)
- func (p *parser) parseInt(s string) (n int, rest string, ok bool)
- func (p *parser) parseVerticalBar() error
- func (p *parser) swapVerticalBar() bool
- func (p *parser) parseRightParen() error
- func (p *parser) parseEscape(s string) (r rune, rest string, err error)
- func (p *parser) parseClassChar(s, wholeClass string) (r rune, rest string, err error)
- func (p *parser) parsePerlClassEscape(s string, r []rune) (out []rune, rest string)
- func (p *parser) parseNamedClass(s string, r []rune) (out []rune, rest string, err error)
- func (p *parser) appendGroup(r []rune, g charGroup) []rune
- func (p *parser) parseUnicodeClass(s string, r []rune) (out []rune, rest string, err error)
- func (p *parser) parseClass(s string) (rest string, err error)
- type charGroup struct
- type ranges struct
- type Prog struct
- type InstOp uint8
- type EmptyOp uint8
- type Inst struct
- type Regexp struct
- func literalRegexp(s string, flags Flags) *Regexp
- func Parse(s string, flags Flags) (*Regexp, error)
- func simplify1(op Op, flags Flags, sub, re *Regexp) *Regexp
- func (x *Regexp) Equal(y *Regexp) bool
- func (re *Regexp) String() string
- func (re *Regexp) MaxCap() int
- func (re *Regexp) CapNames() []string
- func (re *Regexp) capNames(names []string)
- func (re *Regexp) Simplify() *Regexp
- type Op uint8
- type parseTest struct
- Functions
- func minFoldRune(r rune) rune
- func repeatIsValid(re *Regexp, n int) bool
- func cleanAlt(re *Regexp)
- func isValidCaptureName(name string) bool
- func isCharClass(re *Regexp) bool
- func matchRune(re *Regexp, r rune) bool
- func mergeCharClass(dst, src *Regexp)
- func unicodeTable(name string) (*unicode.RangeTable, *unicode.RangeTable)
- func cleanClass(rp *[]rune) []rune
- func appendLiteral(r []rune, x rune, flags Flags) []rune
- func appendRange(r []rune, lo, hi rune) []rune
- func appendFoldedRange(r []rune, lo, hi rune) []rune
- func appendClass(r []rune, x []rune) []rune
- func appendFoldedClass(r []rune, x []rune) []rune
- func appendNegatedClass(r []rune, x []rune) []rune
- func appendTable(r []rune, x *unicode.RangeTable) []rune
- func appendNegatedTable(r []rune, x *unicode.RangeTable) []rune
- func negateClass(r []rune) []rune
- func checkUTF8(s string) error
- func nextRune(s string) (c rune, t string, err error)
- func isalnum(c rune) bool
- func unhex(c rune) rune
- func IsWordChar(r rune) bool
- func bw(b *strings.Builder, args ...string)
- func dumpProg(b *strings.Builder, p *Prog)
- func u32(i uint32) string
- func dumpInst(b *strings.Builder, i *Inst)
- func writeRegexp(b *strings.Builder, re *Regexp)
- func escape(b *strings.Builder, r rune, force bool)
- func TestParseSimple(t *testing.T)
- func TestParseFoldCase(t *testing.T)
- func TestParseLiteral(t *testing.T)
- func TestParseMatchNL(t *testing.T)
- func TestParseNoMatchNL(t *testing.T)
- func testParseDump(t *testing.T, tests []parseTest, flags Flags)
- func dump(re *Regexp) string
- func dumpRegexp(b *strings.Builder, re *Regexp)
- func mkCharClass(f func(rune) bool) string
- func isUpperFold(r rune) bool
- func TestFoldConstants(t *testing.T)
- func TestAppendRangeCollapse(t *testing.T)
- func TestParseInvalidRegexps(t *testing.T)
- func TestToStringEquivalentParse(t *testing.T)
- func TestCompile(t *testing.T)
- func BenchmarkEmptyOpContext(b *testing.B)
- func TestSimplify(t *testing.T)
const _Op_name_0 = "NoMatchEmptyMatchLiteralCharClassAnyCharNotNLAnyCharBeginLineEndLineBeginTextEndTextWordBoundaryNoWordBoundaryCaptureStarPlusQuestRepeatConcatAlternate"
const _Op_name_1 = "opPseudo"
const ErrInternalError ErrorCode = "regexp/syntax: internal error"
Unexpected error
const ErrInvalidCharClass ErrorCode = "invalid character class"
Parse errors
const ErrInvalidCharRange ErrorCode = "invalid character class range"
const ErrInvalidEscape ErrorCode = "invalid escape sequence"
const ErrInvalidNamedCapture ErrorCode = "invalid named capture"
const ErrInvalidPerlOp ErrorCode = "invalid or unsupported Perl syntax"
const ErrInvalidRepeatOp ErrorCode = "invalid nested repetition operator"
const ErrInvalidRepeatSize ErrorCode = "invalid repeat count"
const ErrInvalidUTF8 ErrorCode = "invalid UTF-8"
const ErrMissingBracket ErrorCode = "missing closing ]"
const ErrMissingParen ErrorCode = "missing closing )"
const ErrMissingRepeatArgument ErrorCode = "missing argument to repetition operator"
const ErrTrailingBackslash ErrorCode = "trailing backslash at end of expression"
const ErrUnexpectedParen ErrorCode = "unexpected )"
const FoldCase Flags = 1 << iota // case-insensitive match
const Literal // treat pattern as literal string
const ClassNL // allow character classes like [^a-z] and [[:space:]] to match newline
const DotNL // allow . to match newline
const OneLine // treat ^ and $ as only matching at beginning and end of text
const NonGreedy // make repetition operators default to non-greedy
const PerlX // allow Perl extensions
const UnicodeGroups // allow \p{Han}, \P{Han} for Unicode group and negation
const WasDollar // regexp OpEndText was $, not \z
const Simple // regexp contains no counted repetition
const MatchNL = ClassNL | DotNL
const Perl = ClassNL | OneLine | PerlX | UnicodeGroups // as close to Perl as possible
const POSIX Flags = 0 // POSIX syntax
const opLeftParen = opPseudo + iota
Pseudo-ops for parsing stack.
const opVerticalBar
Pseudo-ops for parsing stack.
const minFold = 0x0041
minimum and maximum runes involved in folding. checked during test.
const maxFold = 0x1e943
const InstAlt InstOp = iota
const InstAltMatch
const InstCapture
const InstEmptyWidth
const InstMatch
const InstFail
const InstNop
const InstRune
const InstRune1
const InstRuneAny
const InstRuneAnyNotNL
const EmptyBeginLine EmptyOp = 1 << iota
const EmptyEndLine
const EmptyBeginText
const EmptyEndText
const EmptyWordBoundary
const EmptyNoWordBoundary
const noMatch = -1
const OpNoMatch Op = 1 + iota // matches no strings
const OpEmptyMatch // matches empty string
const OpLiteral // matches Runes sequence
const OpCharClass // matches Runes interpreted as range pair list
const OpAnyCharNotNL // matches any character except newline
const OpAnyChar // matches any character
const OpBeginLine // matches empty string at beginning of line
const OpEndLine // matches empty string at end of line
const OpBeginText // matches empty string at beginning of text
const OpEndText // matches empty string at end of text
const OpWordBoundary // matches word boundary `\b`
const OpNoWordBoundary // matches word non-boundary `\B`
const OpCapture // capturing subexpression with index Cap, optional name Name
const OpStar // matches Sub[0] zero or more times
const OpPlus // matches Sub[0] one or more times
const OpQuest // matches Sub[0] zero or one times
const OpRepeat // matches Sub[0] at least Min times, at most Max (Max == -1 is no limit)
const OpConcat // matches concatenation of Subs
const OpAlternate // matches alternation of Subs
const opPseudo Op = 128 // where pseudo-ops start
const meta = `\.+*?()|[]{}^$`
const testFlags = MatchNL | PerlX | UnicodeGroups
var c compiler
var anyRuneNotNL = ...
var anyRune = ...
var f frag
var f frag
var f frag
var _Op_index_0 = ...
var str []rune
Round 1: Factor out common literal prefixes.
var strflags Flags
var istr []rune
Invariant: the Regexps that were in sub[0:start] have been used or marked for reuse, and the slice space has been reused for out (len(out) <= start).
Invariant: sub[start:i] consists of regexps that all begin with str as modified by strflags.
var iflags Flags
var first *Regexp
var ifirst *Regexp
Invariant: the Regexps that were in sub[0:start] have been used or marked for reuse, and the slice space has been reused for out (len(out) <= start).
Invariant: sub[start:i] consists of regexps that all begin with ifirst.
var p parser
Otherwise, must do real work.
var err error
Otherwise, must do real work.
var c rune
Otherwise, must do real work.
var op Op
Otherwise, must do real work.
var lastRepeat string
Otherwise, must do real work.
var lit string
\Q ... \E: the ... is always literals
var ok1 bool
var c rune
Non-capturing group. Might also twiddle Perl flags.
var anyTable = ...
var seq, name string
var seq, name string
var lo, hi rune
var lo, hi rune
var code1 = ...
var code2 = ...
var code3 = ...
var perlGroup = ...
var code4 = ...
var code5 = ...
var code6 = ...
var code7 = ...
var code8 = ...
var code9 = ...
var code10 = ...
var code11 = ...
var code12 = ...
var code13 = ...
var code14 = ...
var code15 = ...
var code16 = ...
var code17 = ...
var posixGroup = ...
var instOpNames = ...
var op EmptyOp = EmptyNoWordBoundary
var boundary byte
var b strings.Builder
var buf strings.Builder
Have prefix; gather characters.
var flag EmptyOp
var b strings.Builder
var b strings.Builder
var prefix *Regexp
Build leading prefix: xx.
var parseTests = ...
var foldcaseTests = ...
var literalTests = ...
var matchnlTests = ...
var nomatchnlTests = ...
var b strings.Builder
var opNames = ...
var r []rune
AppendRange should collapse each of the new ranges into the earlier ones (it looks back two ranges), so that the slice never grows very large. Note that we are not calling cleanClass.
var invalidRegexps = ...
var onlyPerl = ...
var onlyPOSIX = ...
var compileTests = ...
var r1 rune = -1
var simplifyTests = ...
type patchList struct {
head, tail uint32
}
A patchList is a list of instruction pointers that need to be filled in (patched). Because the pointers haven't been filled in yet, we can reuse their storage to hold the list. It's kind of sleazy, but works well in practice. See https://swtch.com/~rsc/regexp/regexp1.html for inspiration.
These aren't really pointers: they're integers, so we can reinterpret them this way without using package unsafe. A value l.head denotes p.inst[l.head>>1].Out (l.head&1==0) or .Arg (l.head&1==1). head == 0 denotes the empty list, okay because we start every program with a fail instruction, so we'll never want to point at its output link.
func makePatchList(n uint32) patchList
func (l patchList) patch(p *Prog, val uint32)
func (l1 patchList) append(p *Prog, l2 patchList) patchList
type frag struct {
i uint32 // index of first instruction
out patchList // where to record end instruction
}
A frag represents a compiled program fragment.
type compiler struct {
p *Prog
}
func (c *compiler) init()
func (c *compiler) compile(re *Regexp) frag
func (c *compiler) inst(op InstOp) frag
func (c *compiler) nop() frag
func (c *compiler) fail() frag
func (c *compiler) cap(arg uint32) frag
func (c *compiler) cat(f1, f2 frag) frag
func (c *compiler) alt(f1, f2 frag) frag
func (c *compiler) quest(f1 frag, nongreedy bool) frag
func (c *compiler) star(f1 frag, nongreedy bool) frag
func (c *compiler) plus(f1 frag, nongreedy bool) frag
func (c *compiler) empty(op EmptyOp) frag
func (c *compiler) rune(r []rune, flags Flags) frag
type Error struct {
Code ErrorCode
Expr string
}
An Error describes a failure to parse a regular expression and gives the offending expression.
func (e *Error) Error() string
type ErrorCode string
An ErrorCode describes a failure to parse a regular expression.
func (e ErrorCode) String() string
type Flags uint16
Flags control the behavior of the parser and record information about regexp context.
type parser struct {
flags Flags // parse mode flags
stack []*Regexp // stack of parsed expressions
free *Regexp
numCap int // number of capturing groups seen
wholeRegexp string
tmpClass []rune // temporary char class work space
}
func (p *parser) newRegexp(op Op) *Regexp
func (p *parser) reuse(re *Regexp)
func (p *parser) push(re *Regexp) *Regexp
push pushes the regexp re onto the parse stack and returns the regexp.
func (p *parser) maybeConcat(r rune, flags Flags) bool
maybeConcat implements incremental concatenation of literal runes into string nodes. The parser calls this before each push, so only the top fragment of the stack might need processing. Since this is called before a push, the topmost literal is no longer subject to operators like * (Otherwise ab* would turn into (ab)*.) If r >= 0 and there's a node left over, maybeConcat uses it to push r with the given flags. maybeConcat reports whether r was pushed.
func (p *parser) literal(r rune)
literal pushes a literal regexp for the rune r on the stack.
func (p *parser) op(op Op) *Regexp
op pushes a regexp with the given op onto the stack and returns that regexp.
func (p *parser) repeat(op Op, min, max int, before, after, lastRepeat string) (string, error)
repeat replaces the top stack element with itself repeated according to op, min, max. before is the regexp suffix starting at the repetition operator. after is the regexp suffix following after the repetition operator. repeat returns an updated 'after' and an error, if any.
func (p *parser) concat() *Regexp
concat replaces the top of the stack (above the topmost '|' or '(') with its concatenation.
func (p *parser) alternate() *Regexp
alternate replaces the top of the stack (above the topmost '(') with its alternation.
func (p *parser) collapse(subs []*Regexp, op Op) *Regexp
collapse returns the result of applying op to sub. If sub contains op nodes, they all get hoisted up so that there is never a concat of a concat or an alternate of an alternate.
func (p *parser) factor(sub []*Regexp) []*Regexp
factor factors common prefixes from the alternation list sub. It returns a replacement list that reuses the same storage and frees (passes to p.reuse) any removed *Regexps.
For example,
ABC|ABD|AEF|BCX|BCY
simplifies by literal prefix extraction to
A(B(C|D)|EF)|BC(X|Y)
which simplifies by character class introduction to
A(B[CD]|EF)|BC[XY]
func (p *parser) leadingString(re *Regexp) ([]rune, Flags)
leadingString returns the leading literal string that re begins with. The string refers to storage in re or its children.
func (p *parser) removeLeadingString(re *Regexp, n int) *Regexp
removeLeadingString removes the first n leading runes from the beginning of re. It returns the replacement for re.
func (p *parser) leadingRegexp(re *Regexp) *Regexp
leadingRegexp returns the leading regexp that re begins with. The regexp refers to storage in re or its children.
func (p *parser) removeLeadingRegexp(re *Regexp, reuse bool) *Regexp
removeLeadingRegexp removes the leading regexp in re. It returns the replacement for re. If reuse is true, it passes the removed regexp (if no longer needed) to p.reuse.
func (p *parser) parseRepeat(s string) (min, max int, rest string, ok bool)
parseRepeat parses {min} (max=min) or {min,} (max=-1) or {min,max}. If s is not of that form, it returns ok == false. If s has the right form but the values are too big, it returns min == -1, ok == true.
func (p *parser) parsePerlFlags(s string) (rest string, err error)
parsePerlFlags parses a Perl flag setting or non-capturing group or both, like (?i) or (?: or (?i:. It removes the prefix from s and updates the parse state. The caller must have ensured that s begins with "(?".
func (p *parser) parseInt(s string) (n int, rest string, ok bool)
parseInt parses a decimal integer.
func (p *parser) parseVerticalBar() error
parseVerticalBar handles a | in the input.
func (p *parser) swapVerticalBar() bool
If the top of the stack is an element followed by an opVerticalBar swapVerticalBar swaps the two and returns true. Otherwise it returns false.
func (p *parser) parseRightParen() error
parseRightParen handles a ) in the input.
func (p *parser) parseEscape(s string) (r rune, rest string, err error)
parseEscape parses an escape sequence at the beginning of s and returns the rune.
func (p *parser) parseClassChar(s, wholeClass string) (r rune, rest string, err error)
parseClassChar parses a character class character at the beginning of s and returns it.
func (p *parser) parsePerlClassEscape(s string, r []rune) (out []rune, rest string)
parsePerlClassEscape parses a leading Perl character class escape like \d from the beginning of s. If one is present, it appends the characters to r and returns the new slice r and the remainder of the string.
func (p *parser) parseNamedClass(s string, r []rune) (out []rune, rest string, err error)
parseNamedClass parses a leading POSIX named character class like [:alnum:] from the beginning of s. If one is present, it appends the characters to r and returns the new slice r and the remainder of the string.
func (p *parser) appendGroup(r []rune, g charGroup) []rune
func (p *parser) parseUnicodeClass(s string, r []rune) (out []rune, rest string, err error)
parseUnicodeClass parses a leading Unicode character class like \p{Han} from the beginning of s. If one is present, it appends the characters to r and returns the new slice r and the remainder of the string.
func (p *parser) parseClass(s string) (rest string, err error)
parseClass parses a character class at the beginning of s and pushes it onto the parse stack.
type charGroup struct {
sign int
class []rune
}
type ranges struct {
p *[]rune
}
ranges implements sort.Interface on a []rune. The choice of receiver type definition is strange but avoids an allocation since we already have a *[]rune.
func (ra ranges) Less(i, j int) bool
func (ra ranges) Len() int
func (ra ranges) Swap(i, j int)
type Prog struct {
Inst []Inst
Start int // index of start instruction
NumCap int // number of InstCapture insts in re
}
A Prog is a compiled regular expression program.
func Compile(re *Regexp) (*Prog, error)
Compile compiles the regexp into a program to be executed. The regexp should have been simplified already (returned from re.Simplify).
func (p *Prog) String() string
func (p *Prog) skipNop(pc uint32) *Inst
skipNop follows any no-op or capturing instructions.
func (p *Prog) Prefix() (prefix string, complete bool)
Prefix returns a literal string that all matches for the regexp must start with. Complete is true if the prefix is the entire match.
func (p *Prog) StartCond() EmptyOp
StartCond returns the leading empty-width conditions that must be true in any match. It returns ^EmptyOp(0) if no matches are possible.
type InstOp uint8
An InstOp is an instruction opcode.
func (i InstOp) String() string
type EmptyOp uint8
An EmptyOp specifies a kind or mixture of zero-width assertions.
func EmptyOpContext(r1, r2 rune) EmptyOp
EmptyOpContext returns the zero-width assertions satisfied at the position between the runes r1 and r2. Passing r1 == -1 indicates that the position is at the beginning of the text. Passing r2 == -1 indicates that the position is at the end of the text.
type Inst struct {
Op InstOp
Out uint32 // all but InstMatch, InstFail
Arg uint32 // InstAlt, InstAltMatch, InstCapture, InstEmptyWidth
Rune []rune
}
An Inst is a single instruction in a regular expression program.
func (i *Inst) op() InstOp
op returns i.Op but merges all the Rune special cases into InstRune
func (i *Inst) MatchRune(r rune) bool
MatchRune reports whether the instruction matches (and consumes) r. It should only be called when i.Op == InstRune.
func (i *Inst) MatchRunePos(r rune) int
MatchRunePos checks whether the instruction matches (and consumes) r. If so, MatchRunePos returns the index of the matching rune pair (or, when len(i.Rune) == 1, rune singleton). If not, MatchRunePos returns -1. MatchRunePos should only be called when i.Op == InstRune.
func (i *Inst) MatchEmptyWidth(before rune, after rune) bool
MatchEmptyWidth reports whether the instruction matches an empty string between the runes before and after. It should only be called when i.Op == InstEmptyWidth.
func (i *Inst) String() string
type Regexp struct {
Op Op // operator
Flags Flags
Sub []*Regexp // subexpressions, if any
Sub0 [1]*Regexp // storage for short Sub
Rune []rune // matched runes, for OpLiteral, OpCharClass
Rune0 [2]rune // storage for short Rune
Min, Max int // min, max for OpRepeat
Cap int // capturing index, for OpCapture
Name string // capturing name, for OpCapture
}
A Regexp is a node in a regular expression syntax tree.
func literalRegexp(s string, flags Flags) *Regexp
func Parse(s string, flags Flags) (*Regexp, error)
Parse parses a regular expression string s, controlled by the specified Flags, and returns a regular expression parse tree. The syntax is described in the top-level comment.
func simplify1(op Op, flags Flags, sub, re *Regexp) *Regexp
simplify1 implements Simplify for the unary OpStar, OpPlus, and OpQuest operators. It returns the simple regexp equivalent to
Regexp{Op: op, Flags: flags, Sub: {sub}}
under the assumption that sub is already simple, and without first allocating that structure. If the regexp to be returned turns out to be equivalent to re, simplify1 returns re instead.
simplify1 is factored out of Simplify because the implementation for other operators generates these unary expressions. Letting them call simplify1 makes sure the expressions they generate are simple.
func (x *Regexp) Equal(y *Regexp) bool
Equal reports whether x and y have identical structure.
func (re *Regexp) String() string
func (re *Regexp) MaxCap() int
MaxCap walks the regexp to find the maximum capture index.
func (re *Regexp) CapNames() []string
CapNames walks the regexp to find the names of capturing groups.
func (re *Regexp) capNames(names []string)
func (re *Regexp) Simplify() *Regexp
Simplify returns a regexp equivalent to re but without counted repetitions and with various other simplifications, such as rewriting /(?:a+)+/ to /a+/. The resulting regexp will execute correctly but its string representation will not produce the same parse tree, because capturing parentheses may have been duplicated or removed. For example, the simplified form for /(x){1,2}/ is /(x)(x)?/ but both parentheses capture as $1. The returned regexp may share structure with or be the original.
type Op uint8
An Op is a single regular expression operator.
func (i Op) String() string
type parseTest struct {
Regexp string
Dump string
}
func minFoldRune(r rune) rune
minFoldRune returns the minimum rune fold-equivalent to r.
func repeatIsValid(re *Regexp, n int) bool
repeatIsValid reports whether the repetition re is valid. Valid means that the combination of the top-level repetition and any inner repetitions does not exceed n copies of the innermost thing. This function rewalks the regexp tree and is called for every repetition, so we have to worry about inducing quadratic behavior in the parser. We avoid this by only calling repeatIsValid when min or max >= 2. In that case the depth of any >= 2 nesting can only get to 9 without triggering a parse error, so each subtree can only be rewalked 9 times.
func cleanAlt(re *Regexp)
cleanAlt cleans re for eventual inclusion in an alternation.
func isValidCaptureName(name string) bool
isValidCaptureName reports whether name is a valid capture name: [A-Za-z0-9_]+. PCRE limits names to 32 bytes. Python rejects names starting with digits. We don't enforce either of those.
func isCharClass(re *Regexp) bool
can this be represented as a character class? single-rune literal string, char class, ., and .|\n.
func matchRune(re *Regexp, r rune) bool
does re match r?
func mergeCharClass(dst, src *Regexp)
mergeCharClass makes dst = dst|src. The caller must ensure that dst.Op >= src.Op, to reduce the amount of copying.
func unicodeTable(name string) (*unicode.RangeTable, *unicode.RangeTable)
unicodeTable returns the unicode.RangeTable identified by name and the table of additional fold-equivalent code points.
func cleanClass(rp *[]rune) []rune
cleanClass sorts the ranges (pairs of elements of r), merges them, and eliminates duplicates.
func appendLiteral(r []rune, x rune, flags Flags) []rune
appendLiteral returns the result of appending the literal x to the class r.
func appendRange(r []rune, lo, hi rune) []rune
appendRange returns the result of appending the range lo-hi to the class r.
func appendFoldedRange(r []rune, lo, hi rune) []rune
appendFoldedRange returns the result of appending the range lo-hi and its case folding-equivalent runes to the class r.
func appendClass(r []rune, x []rune) []rune
appendClass returns the result of appending the class x to the class r. It assume x is clean.
func appendFoldedClass(r []rune, x []rune) []rune
appendFolded returns the result of appending the case folding of the class x to the class r.
func appendNegatedClass(r []rune, x []rune) []rune
appendNegatedClass returns the result of appending the negation of the class x to the class r. It assumes x is clean.
func appendTable(r []rune, x *unicode.RangeTable) []rune
appendTable returns the result of appending x to the class r.
func appendNegatedTable(r []rune, x *unicode.RangeTable) []rune
appendNegatedTable returns the result of appending the negation of x to the class r.
func negateClass(r []rune) []rune
negateClass overwrites r and returns r's negation. It assumes the class r is already clean.
func checkUTF8(s string) error
func nextRune(s string) (c rune, t string, err error)
func isalnum(c rune) bool
func unhex(c rune) rune
func IsWordChar(r rune) bool
IsWordChar reports whether r is consider a `word character' during the evaluation of the \b and \B zero-width assertions. These assertions are ASCII-only: the word characters are [A-Za-z0-9_].
func bw(b *strings.Builder, args ...string)
func dumpProg(b *strings.Builder, p *Prog)
func u32(i uint32) string
func dumpInst(b *strings.Builder, i *Inst)
func writeRegexp(b *strings.Builder, re *Regexp)
writeRegexp writes the Perl syntax for the regular expression re to b.
func escape(b *strings.Builder, r rune, force bool)
func TestParseSimple(t *testing.T)
func TestParseFoldCase(t *testing.T)
func TestParseLiteral(t *testing.T)
func TestParseMatchNL(t *testing.T)
func TestParseNoMatchNL(t *testing.T)
func testParseDump(t *testing.T, tests []parseTest, flags Flags)
Test Parse -> Dump.
func dump(re *Regexp) string
dump prints a string representation of the regexp showing the structure explicitly.
func dumpRegexp(b *strings.Builder, re *Regexp)
dumpRegexp writes an encoding of the syntax tree for the regexp re to b. It is used during testing to distinguish between parses that might print the same using re's String method.
func mkCharClass(f func(rune) bool) string
func isUpperFold(r rune) bool
func TestFoldConstants(t *testing.T)
func TestAppendRangeCollapse(t *testing.T)
func TestParseInvalidRegexps(t *testing.T)
func TestToStringEquivalentParse(t *testing.T)
func TestCompile(t *testing.T)
func BenchmarkEmptyOpContext(b *testing.B)
func TestSimplify(t *testing.T)
- Variables
- Functions
- func Example()
- func ExampleMatch()
- func ExampleMatchString()
- func ExampleQuoteMeta()
- func ExampleRegexp_Find()
- func ExampleRegexp_FindAll()
- func ExampleRegexp_FindAllSubmatch()
- func ExampleRegexp_FindSubmatch()
- func ExampleRegexp_Match()
- func ExampleRegexp_FindString()
- func ExampleRegexp_FindStringIndex()
- func ExampleRegexp_FindStringSubmatch()
- func ExampleRegexp_FindAllString()
- func ExampleRegexp_FindAllStringSubmatch()
- func ExampleRegexp_FindAllStringSubmatchIndex()
- func ExampleRegexp_FindSubmatchIndex()
- func ExampleRegexp_Longest()
- func ExampleRegexp_MatchString()
- func ExampleRegexp_NumSubexp()
- func ExampleRegexp_ReplaceAll()
- func ExampleRegexp_ReplaceAllLiteralString()
- func ExampleRegexp_ReplaceAllString()
- func ExampleRegexp_ReplaceAllStringFunc()
- func ExampleRegexp_SubexpNames()
- func ExampleRegexp_SubexpIndex()
- func ExampleRegexp_Split()
- func ExampleRegexp_Expand()
- func ExampleRegexp_ExpandString()
- func ExampleRegexp_FindIndex()
- func ExampleRegexp_FindAllSubmatchIndex()
- func ExampleRegexp_FindAllIndex()
var validID = regexp.MustCompile(`^[a-z]+\[[0-9]+\]$`)
Compile the expression once, usually at init time. Use raw strings to avoid having to quote the backslashes.
func Example()
func ExampleMatch()
func ExampleMatchString()
func ExampleQuoteMeta()
func ExampleRegexp_Find()
func ExampleRegexp_FindAll()
func ExampleRegexp_FindAllSubmatch()
func ExampleRegexp_FindSubmatch()
func ExampleRegexp_Match()
func ExampleRegexp_FindString()
func ExampleRegexp_FindStringIndex()
func ExampleRegexp_FindStringSubmatch()
func ExampleRegexp_FindAllString()
func ExampleRegexp_FindAllStringSubmatch()
func ExampleRegexp_FindAllStringSubmatchIndex()
func ExampleRegexp_FindSubmatchIndex()
func ExampleRegexp_Longest()
func ExampleRegexp_MatchString()
func ExampleRegexp_NumSubexp()
func ExampleRegexp_ReplaceAll()
func ExampleRegexp_ReplaceAllLiteralString()
func ExampleRegexp_ReplaceAllString()
func ExampleRegexp_ReplaceAllStringFunc()
func ExampleRegexp_SubexpNames()
func ExampleRegexp_SubexpIndex()
func ExampleRegexp_Split()
func ExampleRegexp_Expand()
func ExampleRegexp_ExpandString()
func ExampleRegexp_FindIndex()
func ExampleRegexp_FindAllSubmatchIndex()
func ExampleRegexp_FindAllIndex()