@tylerneylon
Last active May 3, 2023 12:17
C utf-8 encoder/decoder
// Stops at any null characters.
int decode_code_point(char **s) {
    // Count # of leading 1 bits of the lead byte; the unsigned-char cast keeps
    // the shift well defined when char is signed. Note that k = #total bytes, or 0.
    int k = **s ? __builtin_clz(~((unsigned)(unsigned char)**s << 24)) : 0;
    int mask = (1 << (8 - k)) - 1;  // All 1's with k leading 0's.
    int value = **s & mask;
    for (++(*s), --k; k > 0 && **s; --k, ++(*s)) {
        value <<= 6;
        value += (**s & 0x3F);
    }
    return value;
}
// Assumes that code is <= 0x10FFFF.
// Ensures that nothing will be written at or beyond end.
void encode_code_point(char **s, char *end, int code) {
    char val[4];
    int lead_byte_max = 0x7F;
    int val_index = 0;
    while (code > lead_byte_max) {
        val[val_index++] = (code & 0x3F) | 0x80;
        code >>= 6;
        lead_byte_max >>= (val_index == 1 ? 2 : 1);
    }
    val[val_index++] = (code & lead_byte_max) | (~lead_byte_max << 1);
    while (val_index-- && *s < end) {
        **s = val[val_index];
        (*s)++;
    }
}
// Returns 0 if no split was needed.
int split_into_surrogates(int code, int *surr1, int *surr2) {
    if (code <= 0xFFFF) return 0;
    *surr2 = 0xDC00 | (code & 0x3FF);  // Save the low 10 bits.
    code >>= 10;                       // Drop the low 10 bits.
    *surr1 = 0xD800 | (code & 0x03F);  // Save the next 6 bits.
    *surr1 |= (code & 0x7C0) - 0x40;   // Insert the last 5 bits, less 1. (These
                                       // bits are already in place; the extra
                                       // << 6 here was a bug.)
    return 1;
}
// Expects to be used in a loop and see all code points in *code. Start *old at 0;
// this function updates *old for you - don't change it. Returns 0 when *code is
// the 1st of a surrogate pair; otherwise use *code as the final code point.
int join_from_surrogates(int *old, int *code) {
    if (*old) *code = (((*old & 0x3FF) + 0x40) << 10) + (*code & 0x3FF);
    // A range check, not a bit mask: (*code & 0xD800) == 0xD800 would also match
    // non-surrogate values like 0xF882 and some joined code points above 0xFFFF.
    *old = (0xD800 <= *code && *code < 0xDC00) ? *code : 0;
    return !(*old);
}
@tylerneylon
Author

This small, mostly-portable library encodes and decodes utf-8 byte streams.
The only not-super-portable part is the __builtin_clz function, which can be
replaced on non-gcc-like compilers (more notes on that below).

The surrogate pair functions are useful for working with systems that stick to
16 bits when representing Unicode characters. This happens more than you'd think:
JSON string escapes can only encode 16-bit code units, and node.js was built on
top of UCS-2 internal storage, to give two examples.

The function __builtin_clz gives a count of leading zeroes; docs are here.
Similar functions for other compilers might be found from this page.

The correspondence between surrogate pairs and the code points
they represent is described on the 40th page of this pdf.

@tylerneylon
Author

I have not yet tested with the test file linked below, but I'm leaving it here as a note for future-Tyler to possibly check out. My intention is to keep the decoder lenient, so if the only "errors" I can find in the decoder have to do with accepting ill-formed utf-8, then that is just matching the design principle of this decoder. (But I'd like it to strictly provide correct output on all correctly-formed input.)

Markus Kuhn's utf-8 stress test file.

@cxw42

cxw42 commented Nov 19, 2018

@tylerneylon Thank you! What is the license for this code?

@inaciose

Is there a sample that shows how to use these functions?

@tylerneylon
Author

@cxw42, I'm a bit late in replying but this is public domain.

@tylerneylon
Author

tylerneylon commented Nov 13, 2019

@inaciose, I haven't tested the code below (so 🤞 ) but this is the idea of the usage. Note that different code conventions may denote the end of the string in different ways. In the code below, I'll end each buffer with a 0 value.

// Convert a utf-8 byte array into unicode code points:

char *utf8_bytes = get_utf8_bytes();  // The end is marked by a zero byte here.
int *code_points = get_unicode_buffer();

while (( *code_points++ = decode_code_point(&utf8_bytes) ));

or

// Convert unicode code points into a utf8 byte array.

int *code_points = get_unicode_code_points();  // The end is marked by a zero value.
int *code_point_cursor = code_points;  // Copy of the base pointer so we can move this one around.
char *utf8_bytes = get_utf8_buffer(BUFFER_LEN);
char *end_byte = utf8_bytes + BUFFER_LEN;

do {
    encode_code_point(&utf8_bytes, end_byte, *code_point_cursor++);
} while (*(code_point_cursor - 1));

@inaciose

@tylerneylon thanks for the reply!

@limpkin

limpkin commented Jan 20, 2020

It is worth noting that decode_code_point will read out of bounds if the byte stream actually isn't utf-8 encoded.
For example, if the first byte is 0xFF, it will try to read up to 7 continuation bytes past it.

@tylerneylon
Author

Right, @limpkin, decode_code_point() assumes the input is valid utf-8, and does no error checking. If you want to read potentially-invalid utf-8, you can pad the end of the string with 8 zero bytes; it is then guaranteed not to overrun, and to end with a zero return value.

@ericomeehan

I have been looking for something like this, but I would like to understand what is happening here a bit more. Would you be willing to step through the algorithm?
