@roblogic
Created November 19, 2020 01:49
Regular Expressions in Zsh

The following is taken from a brilliant answer on unix.se. Posting it here for personal reference. The question was:

${var//pattern/replacement} uses zsh wildcard patterns for pattern, the same ones used for filename generation (globbing), which are a superset of the sh wildcard patterns. The syntax is also affected by the kshglob and extendedglob options. The ${var//pattern/replacement} operator originally comes from the Korn shell.

I'd recommend enabling extendedglob (set -o extendedglob in your ~/.zshrc), which gives you the most features (more than standard EREs) at the expense of some backward incompatibility in a few corner cases.

You'll find it documented at info zsh 'filename generation'.

A cheat sheet for the mapping between ERE and extended zsh wildcards:

Standard sh ones:

  • . -> ?
  • .* -> *
  • [...] -> [...]

zsh extensions:

  • * -> #
  • + -> ##
  • {x,y} -> (#cx,y)
  • (...|...) -> (...|...)

Some extra features not available in standard EREs:

  • ^pattern (negation)
  • x~y (except)
  • <12-234> match decimal number ranges
  • (#i) case insensitive matching
  • (#a2) approximate matching allowing up to 2 errors.
  • many more

Whether wildcard patterns are anchored at start or end of the subject depends on what operator is used.

  • Globs, case patterns, [[ string = pattern ]] and ${var:#pattern} are anchored at both (f*.txt will match on foo.txt, not Xfoo.txtY)
  • ${var#pattern} and ${var##pattern} are anchored at the start
  • ${var%pattern} and ${var%%pattern} are anchored at the end
  • ${var/pattern/repl} and ${var//pattern/repl} are not anchored but can be made so with ${var/#pattern} (start) or ${var/%pattern} (end).

(#s) and (#e) can also be used as the equivalents of ^/$ (ERE) or \A/\z (PCRE).

Whether repeating operators (#, ##, *, (#cx,y), <x-y>) are greedy also depends on the operator: they are greedy with ##, %%, // and /, and not with # and %. This can be changed with the S parameter expansion flag.

So for your examples:

  • regexp-replace nname "[^[:alnum:]]" "_": ${var//[^[:alnum:]]/_}
  • regexp-replace nname "_{2,}" "_": ${var//_(#c2,)/_}
  • regexp-replace nname "_+$" "": ${var%%_#} or ${var/%_#} (here using # as the * equivalent; you could use ## as the + equivalent, but that won't make any difference in this case).
  • regexp-replace nname "^_+" "": ${var##_#} or ${var/#_#}

Here, you could combine them with ${${${var//[^[:alnum:]]##/_}#_}%_} (convert sequences of non-alnums to _ and remove any leading or trailing _).

Another approach could be to extract all the sequences of alnums and join them with _, using this hack:

words=()
: ${var//(#m)[[:alnum:]]##/${words[1+$#words]::=$MATCH}}
var=${(j:_:)words}

regexp-replace itself is an autoloadable function that calls [[ $var =~ pattern ]] in a loop. Note that as a result, it doesn't work properly with the ^ anchor or word boundary or look-behind operators (if using the rematchpcre option):

$ a='aaab'; regexp-replace a '^a' x; echo "$a"
xxxb
$ a='abab'; regexp-replace a '\<ab' '<$MATCH>'; echo $a
<ab><ab>

(in the first example, ^a is matched in turn against aaab, aab, ab, b in that loop).
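
When only a start anchor is needed, the /# form of parameter expansion described earlier is one way to avoid that loop. A minimal sketch on the same input (the variable name anchored is illustrative):

```zsh
a=aaab
# /# anchors the pattern at the start, so only the first a is replaced
anchored=${a/#a/x}
print -r -- "$anchored"    # xaab
```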

@roblogic
Author

Made this gist so that I could come up with a zsh/regex solution to the Spoonerism problem on codegolf.se 🤓
