@varenc
Forked from roblogic/zshexpn-explained.md
Created November 26, 2022 02:01
Regular Expressions in Zsh

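The following is taken from a brilliant answer on unix.se, posted here for personal reference. The question is not reproduced here; the answer below addresses how to express a few regexp-replace calls with zsh's own patterns.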

${var//pattern/replacement} uses zsh wildcard patterns for pattern: the same ones used for filename generation (globbing), which are a superset of the sh wildcard patterns. The syntax is also affected by the kshglob and extendedglob options. The ${var//pattern/replacement} operator originally comes from the Korn shell.

I'd recommend enabling extendedglob (set -o extendedglob in your ~/.zshrc), which gives you the most features (more than standard EREs) at the cost of backward incompatibility in a few corner cases.

You'll find it documented at info zsh 'filename generation'.
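As a minimal sketch of the two substitution forms (the variable names and the sample string are made up for illustration), `/` replaces the first match and `//` replaces every match. The `.` here is a literal dot, since these are wildcard patterns rather than regular expressions:

```zsh
setopt extendedglob        # same effect as `set -o extendedglob`

var='foo.bar.baz'          # hypothetical sample value
first=${var/./_}           # replace the first match only
all=${var//./_}            # replace every match

print -r -- $first         # foo_bar.baz
print -r -- $all           # foo_bar_baz
```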

A cheat sheet for the mapping between ERE and extended zsh wildcards:

Standard sh ones:

  • . -> ?
  • .* -> *
  • [...] -> [...]

zsh extensions:

  • * -> #
  • + -> ##
  • {x,y} -> (#cx,y)
  • (...|...) -> (...|...)

Some extra features not available in standard EREs:

  • ^pattern (negation)
  • x~y (except)
  • <12-234> match decimal number ranges
  • (#i) case insensitive matching
  • (#a2) approximate matching allowing up to 2 errors.
  • many more

Whether wildcard patterns are anchored at start or end of the subject depends on what operator is used.

  • Globs, case patterns, [[ string = pattern ]] and ${var:#pattern} are anchored at both (f*.txt will match on foo.txt, not Xfoo.txtY)
  • ${var#pattern} and ${var##pattern} are anchored at the start
  • ${var%pattern} and ${var%%pattern} are anchored at the end
  • ${var/pattern/repl} and ${var//pattern/repl} are not anchored but can be made so with ${var/#pattern/repl} (start) or ${var/%pattern/repl} (end).

(#s) and (#e) can also be used as the equivalents of ^/$ (ERE) or \A/\z (PCRE).

Whether repeating operators (#, ##, *, (#cx,y), <x-y>) are greedy also depends on the operator: matching is greedy with ##, %%, // and /, but not with # and %. That can be changed with the S parameter expansion flag.

So for your examples:

  • regexp-replace nname "[^[:alnum:]]" "_": ${var//[^[:alnum:]]/_}
  • regexp-replace nname "_{2,}" "_": ${var//_(#c2,)/_}
  • regexp-replace nname "_+$" "": ${var%%_#} or ${var/%_#} (here using # for the * equivalent, you can use ## for a + equivalent but that won't make any difference in this case).
  • regexp-replace nname "^_+" "": ${var##_#} or ${var/#_#}

Here, you could combine them with ${${${var//[^[:alnum:]]##/_}#_}%_} (convert sequences of non-alnums to _ and remove any leading or trailing _).

Another approach could be to extract all the sequences of alnums and join them with _, using this hack:

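setopt extendedglob   # needed for (#m) and ##
words=()
# (#m) sets $MATCH for each match; ::= appends it to words as a side effect
: ${var//(#m)[[:alnum:]]##/${words[1+$#words]::=$MATCH}}
var=${(j:_:)words}    # join the collected words with _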

regexp-replace itself is an autoloadable function that calls [[ $var =~ pattern ]] in a loop. Note that as a result, it doesn't work properly with the ^ anchor or word boundary or look-behind operators (if using the rematchpcre option):

$ a='aaab'; regexp-replace a '^a' x; echo "$a"
xxxb
$ a='abab'; regexp-replace a '\<ab' '<$MATCH>'; echo $a
<ab><ab>

(in the first example, ^a is matched in turn against aaab, aab, ab, b in that loop).
