@cmstrickland
Last active November 27, 2025 12:12

Accent counting with regular expressions in 7 different scripting languages (number 6 will surprise you)

I was messing around with regular expressions on some default interpreters on my Debian machine, wondering about what the default encoding behaviour for string literals might be. As you do. So, given a string with a "pesky foreign accent" in it, how many characters do various languages think it has?

Parent blog post. Unfortunately my self-built blog has too many bugs in its Markdown parser to present this kind of article natively.

Preamble

My computer is in the UK, it runs Linux, and it uses a UTF-8 encoded Unicode locale:

cms@ringo:~/tmp$ echo $LANG
en_GB.UTF-8

The examples

1. Perl

cms@ringo:~/tmp$ perl -e 'print "café"=~ /^.{4}$/ ? "four" : "not four" , "\n" '
not four

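Let's start out with the OG of regular expression scripting languages. Perl treats string literals in source code as byte strings unless you explicitly tell it otherwise (the utf8 pragma is one of several ways), so the two bytes of the UTF-8 encoded é count as two characters and the string comes out five long.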
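To make the byte-versus-character difference concrete, here is a minimal sketch in Python (picked only because it is one of the languages in this list). It assumes the é is the precomposed U+00E9 stored as UTF-8, which is what a UTF-8 terminal would hand over; the variable and function names are made up for illustration.

```python
import re

# "café" with a precomposed é (U+00E9); an assumption about how the literal was typed
word = "caf\u00e9"
raw = word.encode("utf-8")  # the bytes a UTF-8 terminal would pass along

def label(match):
    # Mirror the one-liners in this post: "four" on a match, "not four" otherwise
    return "four" if match else "not four"

# Bytes pattern against bytes: "." consumes one byte at a time
byte_view = label(re.match(rb"^.{4}$", raw))
# Text pattern against text: "." consumes one code point at a time
text_view = label(re.match(r"^.{4}$", word))

print(len(raw), byte_view)    # 5 not four
print(len(word), text_view)   # 4 four
```

The byte view is roughly what the Perl one-liner sees by default; the text view is what you get once something has decoded the bytes first.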

2. Python

cms@ringo:~/tmp$ python -c 'import re; print("four" if re.match(r"^.{4}$", "café") else "not four")'
four

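Python eventually decided that strings and bytes are different types. You might remember there was a big fight about that, one that lasted several decades. The upshot is that a plain string literal is a sequence of code points, so the dot matches é exactly once.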
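A small sketch of how strict that separation is in Python 3: the re module refuses to run a text pattern over bytes at all, so you have to decode first. The names here are illustrative, and the é is again assumed to be the precomposed U+00E9.

```python
import re

raw = "caf\u00e9".encode("utf-8")  # bytes, not text

# A str pattern applied to a bytes subject is rejected outright
try:
    re.match(r"^.{4}$", raw)
    mixed = "allowed"
except TypeError:
    mixed = "TypeError"

# Decoding turns the bytes back into text, and the pattern then applies
decoded = raw.decode("utf-8")
after_decode = "four" if re.match(r"^.{4}$", decoded) else "not four"

print(mixed)         # TypeError
print(after_decode)  # four
```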

3. Ruby

cms@ringo:~/tmp$ ruby -e 'puts "café" =~ /^.{4}$/ ? "four" : "not four"'
four

I thought Ruby might do the same thing as Perl, but then I remembered that strings in Ruby always carry an encoding, and that the default has been UTF-8 for quite some time.

4. PHP

cms@ringo:~/tmp$ php -r 'echo preg_match("/^.{4}$/", "café") ? "four" : "not four", "\n";'
not four

I am not very confident about PHP (I asked a computer how to do this one, so apologies if it's wrong). I think PHP still refuses to treat strings as anything other than byte arrays to this day, and pushes everything encoding-aware into library calls.

Update: Patrick has helpfully pointed out to me on LinkedIn that there are also regular expression modifiers (the u modifier, for UTF-8, being the relevant one here) to switch into extended character set matching.

5. JavaScript

cms@ringo:~/tmp$ node -e 'console.log(/^.{4}$/.test("café") ? "four" : "not four")'
four

I think JavaScript strings are UTF-16. It copied Java, I expect, so I imagine it was originally some kind of mutant UCS-2, but nice modern JavaScript is UTF-16, unless somebody out there knows different. Either way, é fits in a single 16-bit code unit, so the dot matches it once.
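If the UTF-16 guess is right, what the dot is counting here is 16-bit code units. A rough Python sketch of that arithmetic follows; the helper name utf16_units is hypothetical, and the emoji is an extra illustration that is not part of the test above.

```python
def utf16_units(text):
    # Encode without a byte-order mark; every UTF-16 code unit is two bytes
    return len(text.encode("utf-16-le")) // 2

cafe_units = utf16_units("caf\u00e9")     # precomposed é sits inside the 16-bit range
emoji_units = utf16_units("\U0001F600")   # outside that range, so it needs a surrogate pair

print(cafe_units)   # 4
print(emoji_units)  # 2
```

So "café" comes out as four under this model, but a single character outside the 16-bit range would count as two unless the regular expression is told to work in code points.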

6. Bash

cms@ringo:~/tmp$ [[ "café" =~ ^.{4}$ ]] && echo "four" || echo "not four"
four
cms@ringo:~/tmp$ LANG=C; [[ "café" =~ ^.{4}$ ]] && echo "four" || echo "not four"
not four

Wait, what? Apparently...

  • bash has native regular expressions!
  • they are Unicode aware, or more precisely locale aware: with LANG=C the same test sees bytes again
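The two bash runs amount to reading the same bytes under two different locales. Here is a hedged sketch of that idea in Python. It uses latin-1 as a stand-in for the C locale, which is an assumption (the C locale is not really Latin-1; it is just a convenient one-byte-per-character encoding), and the function name is made up.

```python
import re

RAW = b"caf\xc3\xa9"  # "café" as UTF-8 bytes, precomposed é assumed

def four_chars(raw, encoding):
    # Interpret the bytes under the given encoding, then run the same pattern
    text = raw.decode(encoding)
    return "four" if re.match(r"^.{4}$", text) else "not four"

utf8_result = four_chars(RAW, "utf-8")      # like LANG=en_GB.UTF-8
c_like_result = four_chars(RAW, "latin-1")  # stand-in for LANG=C: one byte, one character

print(utf8_result)    # four
print(c_like_result)  # not four
```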

One last example.

Let's try a truly ancient, obsolete language, from way before Unicode was invented. So old it has to use an external package to do regular expressions. So weird it's really hard to write one-liners in.

cms@ringo:~$ sbcl --noinform --non-interactive --eval '(ql:quickload "cl-ppcre")' --eval '(format t "~a~%" (if (cl-ppcre:scan "^.{4}$" "café") "four" "not four"))' 
To load "cl-ppcre":
  Load 1 ASDF system:
    cl-ppcre
; Loading "cl-ppcre"
..
four

Maybe you find this result surprising? (I don't)

Summing up

There's no right or wrong behaviour here, by the way, in case you're confused. Without explicit instructions about how the string literals are encoded, each language falls back to its implicit default behaviour. You may be surprised that the default behaviour varies a bit from language to language. You should be surprised that bash can do this at all. I certainly was!
