-
-
Save tylerneylon/9773800 to your computer and use it in GitHub Desktop.
// This macro tests if a char is a continuation byte in utf8. | |
#define IS_CONT(x) (((x) & 0xc0) == 0x80) | |
// This returns the code point encoded at **s and advances *s to point to the | |
// next character. Thus it can easily be used in a loop. | |
int decode_code_point(char **s) { | |
int k = **s ? __builtin_clz(~(**s << 24)) : 0; // Count # of leading 1 bits. | |
int mask = (1 << (8 - k)) - 1; // All 1s with k leading 0s. | |
int value = **s & mask; | |
// k = 0 for one-byte code points; otherwise, k = #total bytes. | |
for (++(*s), --k; k > 0 && IS_CONT(**s); --k, ++(*s)) { | |
value <<= 6; | |
value += (**s & 0x3F); | |
} | |
return value; | |
} | |
// This assumes that `code` is <= 0x10FFFF and ensures that nothing will be | |
// written at or beyond `end`. It advances *s so it's easy to use in a loop. | |
void encode_code_point(char **s, char *end, int code) { | |
char val[4]; | |
int lead_byte_max = 0x7F; | |
int val_index = 0; | |
while (code > lead_byte_max) { | |
val[val_index++] = (code & 0x3F) | 0x80; | |
code >>= 6; | |
lead_byte_max >>= (val_index == 1 ? 2 : 1); | |
} | |
val[val_index++] = (code & lead_byte_max) | (~lead_byte_max << 1); | |
while (val_index-- && *s < end) { | |
**s = val[val_index]; | |
(*s)++; | |
} | |
} | |
// This returns 0 if no split was needed. | |
int split_into_surrogates(int code, int *surr1, int *surr2) { | |
if (code <= 0xFFFF) return 0; | |
*surr2 = 0xDC00 | (code & 0x3FF); // Save the low 10 bits. | |
code >>= 10; // Drop the low 10 bits. | |
// If `code` now has low bits "uuu uuxx xxxx", then the bits of *surr are | |
// "1101 10ww wwxx xxxx" where wwww = (uuuuu - 1). | |
*surr1 = 0xD800 | ((code & 0x7FF) - 0x40); | |
return 1; | |
} | |
// This expects to be used in a loop and see all code points in *code. Start | |
// *old at 0; this function updates *old for you - don't change it after | |
// initialization. This returns 0 when *code is the 1st of a surrogate pair; | |
// otherwise use *code as the final code point. | |
int join_from_surrogates(int *old, int *code) { | |
if (*old) *code = (((*old & 0x3FF) + 0x40) << 10) + (*code & 0x3FF); | |
*old = ((*code & 0xD800) == 0xD800 ? *code : 0); | |
return !(*old); | |
} |
I have not yet tested with the test file linked below, but I'm leaving it here as a note for future-Tyler to possibly check out. My intention is to keep the decoder lenient, so if the only "errors" I can find in the decoder have to do with accepting ill-formed utf-8, then that is just matching the design principle of this decoder. (But I'd like it to strictly provide correct output on all correctly-formed input.)
@tylerneylon Thank you! What is the license for this code?
there is a sample that show how to use these functions?
@cxw42, I'm a bit late in replying but this is public domain.
@ianciose, I'm not testing the code below (so 🤞 ) but this is the idea of usage. Note that different code conventions may denote the end of the string in different ways. In the code below, I'll end each buffer with a 0 value.
// Convert a utf-8 byte array into unicode code points:
char *utf8_bytes = get_utf8_bytes(); // The end is marked by a zero byte here.
int *code_points = get_unicode_buffer();
while (( *code_points++ = decode_code_point(&utf8_bytes) ));
or
// Convert unicode code points into a utf8 byte array.
int *code_points = get_unicode_code_points(); // The end is marked by a zero value.
int *code_point_cursor = code_points; // Copy of the base pointer so we can move this one around.
char *utf8_bytes = get_utf8_buffer(BUFFER_LEN);
char *end_byte = utf8_bytes + BUFFER_LEN;
do {
encode_code_point(&utf_bytes, end_byte, *code_point_cursor++);
} while (*(code_point_cursor - 1));
@tylerneylon tks for reply!
it is worth noting that decode_code_point will read out of bounds in case the bytestream actually isn't utf8 encoded.
for example, if the first byte is 0xFF it'll read 4 bytes over
Right, @limpkin, decode_code_point()
assumes the input is valid utf-8, and does no error checking. If you want to read potentially-invalid utf-8, you can pad the end of the string with 8 zero bytes, and it will guarantee to not overrun, and to end with a zero return value.
I have been looking for something like this, but I would like to understand what is happening here a bit more. Would you be willing to step through the algorithm?
A small suggestion for your decode_code_point
function:
You can change the for-loop to be like this. With the **s
conditional becoming (unsigned)**s - 0x80 <= 0x3F
, you can make the loop stop when the continuation byte is invalid, reducing the chance of overreading the *s
buffer.
And use **s - 0x80
rather than **s & 0x3F
when adding the code point value.
for (++(*s), --k; k > 0 && (unsigned)**s - 0x80 <= 0x3F; --k, ++(*s)) {
value <<= 6;
value += ((unsigned)**s - 0x80);
}
This isn't a full check of invalid sequences of decode_code_point
, but it would make the function a little saner.
Also, your split_into_surrogates
code is wrong, and I have a corrected one for you:
int split_into_surrogates(int code, int *surr1, int *surr2) {
if (code <= 0xFFFF) return 0;
*surr2 = 0xDC00 | (code & 0x3FF);
code >>= 10;
*surr1 = 0xD800 | (code - 0x40); // Assuming (code - 0x40 <= 0x3FF)
return 1;
}
Hi @Explorer09 , thanks for your helpful suggestions. I appreciate it.
For your comment about decode_code_point
, I tried to add a tiny bit of error-checking similar to what you suggest by adding the new IS_CONT()
macro. Many people may not consider this code readable, but if you know the bit formats, I think it is readable, and I'm trying to hold on to that readability. Hence the macro.
For your comment about the bug — you're right! It was a silly bug. I made a little unit test on my local machine to avoid it in the future, and I used essentially your correction in the new code.
I've also switched to 4-space indents and tried to clarify some of the comments.
@tylerneylon I wrote my own version of UTF-8 decode function before giving suggestion here.
Your code can be quite compact if the only goal is to decode valid UTF-8 sequences (as there is no error checking at all). My version would perform error checking while also be compact to be included in applications.
I'm not ready to open a Gist page for this, but here is a sneak peek:
/*
Copyright 2024 Kang-Che Sung
Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
/* SPDX-License-Identifier: MIT */
#include <stdint.h>
uint32_t utf8_to_code_point(const uint8_t **sequence) {
/* We assume ASCII characters appear most frequently. */
if (**sequence <= 0x7F) {
return *(*sequence)++;
}
/* Multibyte sequence. We assume valid sequences appear more
frequently than invalid ones. */
unsigned int max_length;
uint32_t min_code_point;
uint32_t code_point;
if ((**sequence & 0xF0) < 0xE0) {
/* First byte in the range [0x80, 0xBF] is covered here.
It wraps around to an invalid code point. */
max_length = 2;
min_code_point = 0x80;
code_point = (uint32_t)**sequence - 0xC0;
} else if ((**sequence & 0xF0) == 0xE0) {
max_length = 3;
min_code_point = 0x0800;
code_point = (uint32_t)**sequence - 0xE0;
} else {
max_length = 4;
min_code_point = 0x10000;
code_point = (uint32_t)**sequence - 0xF0;
}
while (1) {
(*sequence)++;
if (--max_length == 0) {
break;
}
unsigned int offset = (unsigned int)**sequence - 0x80;
if (offset > 0x3F) {
/* (*sequence) points to the byte where the next sequence
may begin. The next sequence may be valid or invalid. */
return (uint32_t)-1;
}
code_point = (code_point << 6) + offset;
}
if (code_point < min_code_point) {
/* Overlong sequence */
return (uint32_t)-1;
}
if (code_point >= 0xD800 && code_point <= 0xDFFF) {
/* UTF-16 surrogates are invalid in UTF-8. They are used in
CESU-8, which is a different encoding. */
return (uint32_t)-1;
}
/* Any value outside the [0x00, 0x10FFFF] range indicates an
invalid sequence. The check is left for the caller. */
return code_point;
}
@tylerneylon There is reason that I won't check the continuation byte like IS_CONT()
. Naïve programmers will think of a bitwise AND and compare like what you did, but when you are check the valid bits and extract other bits for value at the same time, a subtraction and compare could yield a smaller assembly code.
It also make things simple.
@Explorer09 thanks for the snapshot of your code and the notes. Of course many people will appreciate error-checking code! 👍
@tylerneylon
The interesting thing is, neither ChatGPT or Google Gemini I played with would generate code as compact as mine.
My version compiles to 144 bytes with x86-64 GCC.
ChatGPT
https://chatgpt.com/share/0c457058-2e4a-44bb-93cd-cbd06334039a
Google Gemini
https://g.co/gemini/share/d3c4ed05bbcf
This small, mostly-portably library encodes and decodes utf-8 byte streams.
The only not-super-portable part is the
__builtin_clz
function, which can bereplaced on non-gcc-like compilers (more notes on that below).
The surrogate pair functions are useful for working with systems that stick to
16 bits when representing unicode characters. This happens more than you'd think;
json strings can only encode 16-bit code points, and node.js was built on top of
UCS-2 internal storage, to give two examples.
The function
__builtin_clz
gives a count of leading zeroes; docs are here.Similar functions for other compilers might be found from this page.
The correspondence between surrogates pairs and the code points
they represent is described on the 40th page of this pdf.