While writing a string manipulation library I ran into what I think is a bug in how substr
handles malformed UTF-8 strings. I was testing corner cases, in particular a string that ends with the first byte of a multi-byte UTF-8 sequence, with no continuation bytes after it:
string <- "abc\xEE" # \xEE indicates the start of a 3 byte UTF-8 sequence
Encoding(string) <- "UTF-8"
substr(string, 1, 10)
When run under valgrind with level 2 instrumentation, we get:
> string <- "abc\xEE" # \xEE indicates the start of a 3 byte UTF-8 sequence