Accent counting with regular expressions in 7 different scripting languages (number 6 will surprise you)
I was messing around with regular expressions on some default interpreters on my Debian machine, wondering what the default encoding behaviour for string literals might be. As you do. So, given a string with a "pesky foreign accent" in it, how many characters do various languages think it has?
Parent blog post - unfortunately my self-built blog has too many bugs in its markdown parser to present this kind of article natively
My computer is in the UK, it runs Linux, and it uses a UTF-8 encoded Unicode locale
cms@ringo:~/tmp$ echo $LANG
en_GB.UTF-8
cms@ringo:~/tmp$ perl -e 'print "café"=~ /^.{4}$/ ? "four" : "not four" , "\n" '
not four
Let's start out with the OG of regular expression scripting languages. Perl treats string literals as byte strings unless you explicitly tell it otherwise (with the utf8 pragma, for instance).
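A minimal Python sketch (my analogy, not Perl internals) of what Perl is actually counting here: the UTF-8 encoding of "café" is five bytes, because é encodes as two.

```python
import re

s = "café"
b = s.encode("utf-8")  # é becomes the two bytes 0xC3 0xA9

print(len(s))  # 4 characters
print(len(b))  # 5 bytes, which is what a byte-oriented regex sees
print(bool(re.fullmatch(rb".{5}", b)))  # True: five "characters" at the byte level
```

So a byte-oriented `.{4}` fails against "café" for the same reason a byte-oriented `.{5}` succeeds.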
cms@ringo:~/tmp$ python -c 'import re; print("four" if re.match(r"^.{4}$", "café") else "not four")'
four
Python eventually decided that strings and bytes are different types; you might remember there was a big fight about that which lasted over a decade.
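Python 3 makes the split explicit: the same pattern gives different answers against `str` and `bytes`. A quick sketch of the two types side by side:

```python
import re

text = "café"                 # str: a sequence of Unicode code points
data = text.encode("utf-8")   # bytes: the UTF-8 encoding of the same string

print(bool(re.match(r"^.{4}$", text)))   # True: four characters
print(bool(re.match(rb"^.{4}$", data)))  # False: five bytes, not four
```

Note that `re` won't even let you mix the two: a `str` pattern against a `bytes` string raises a `TypeError`, which is the whole point of the big fight.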
cms@ringo:~/tmp$ ruby -e 'puts "café" =~ /^.{4}$/ ? "four" : "not four"'
four
I thought Ruby might do the same thing as Perl, but then I remembered: strings in Ruby always carry an encoding, and they have defaulted to UTF-8 for quite some time now.
cms@ringo:~/tmp$ php -r 'echo preg_match("/^.{4}$/", "café") ? "four" : "not four", "\n";'
not four
I am not very confident about PHP (I asked a computer how to do this one, apologies if it's wrong). I think PHP still refuses to treat strings as anything other than byte arrays, and pushes everything else into library calls.
update - (Patrick has helpfully pointed out to me on LinkedIn that there are also regular expression modifiers, such as `u`, to switch into Unicode matching)
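I haven't tried the PHP modifier myself, but the principle behind it can be sketched in Python: decode the raw bytes into text first, and `.` starts counting characters instead of bytes.

```python
import re

raw = b"caf\xc3\xa9"  # the UTF-8 bytes for "café"

print(bool(re.match(rb"^.{4}$", raw)))                 # False: five bytes
print(bool(re.match(r"^.{4}$", raw.decode("utf-8"))))  # True: four characters
```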
cms@ringo:~/tmp$ node -e 'console.log(/^.{4}$/.test("café") ? "four" : "not four")'
four
I think JavaScript strings are UTF-16; it copied Java, I expect, so I imagine it was originally some kind of mutant UCS-2, but nice modern JavaScript is UTF-16, unless somebody out there knows different.
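Python can illustrate the UTF-16 angle (an analogy of mine, not how any JS engine stores things internally): é fits in a single UTF-16 code unit, so a plain JavaScript `.` counts it once, whereas a character outside the Basic Multilingual Plane takes a surrogate pair and would count twice.

```python
def utf16_units(s: str) -> int:
    """Count UTF-16 code units, roughly what a non-/u JavaScript regex '.' matches one of."""
    return len(s.encode("utf-16-le")) // 2  # each code unit is 2 bytes

print(utf16_units("café"))  # 4: é is one code unit, so /^.{4}$/ matches in JS
print(utf16_units("𝄞"))     # 2: a symbol outside the BMP needs a surrogate pair
```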
cms@ringo:~/tmp$ [[ "café" =~ ^.{4}$ ]] && echo "four" || echo "not four"
four
cms@ringo:~/tmp$ LANG=C; [[ "café" =~ ^.{4}$ ]] && echo "four" || echo "not four"
not four
Wait, what? Apparently...
- bash has native regular expressions!
- they are unicode aware
Let's try a truly ancient, obsolete language from way before Unicode was invented. So old it has to use an external package to do regular expressions. So weird it's really hard to write one-liners in.
cms@ringo:~$ sbcl --noinform --non-interactive --eval '(ql:quickload "cl-ppcre")' --eval '(format t "~a~%" (if (cl-ppcre:scan "^.{4}$" "café") "four" "not four"))'
To load "cl-ppcre":
Load 1 ASDF system:
cl-ppcre
; Loading "cl-ppcre"
..
four
Maybe you find this result surprising? (I don't)
There's no wrong or right behaviour here by the way, in case you're confused. Without explicit instructions about what the data literals are, each one falls back to its implicit, default behaviour. You may be surprised that the default behaviour varies a bit from language to language. You should be surprised that bash can do this at all. I certainly was!