-
-
Save tylerneylon/9773800 to your computer and use it in GitHub Desktop.
// Stops at any null characters. | |
int decode_code_point(char **s) { | |
int k = **s ? __builtin_clz(~(**s << 24)) : 0; // Count # of leading 1 bits. | |
int mask = (1 << (8 - k)) - 1; // All 1's with k leading 0's. | |
int value = **s & mask; | |
for (++(*s), --k; k > 0 && **s; --k, ++(*s)) { // Note that k = #total bytes, or 0. | |
value <<= 6; | |
value += (**s & 0x3F); | |
} | |
return value; | |
} | |
// Assumes that code is <= 0x10FFFF. | |
// Ensures that nothing will be written at or beyond end. | |
void encode_code_point(char **s, char *end, int code) { | |
char val[4]; | |
int lead_byte_max = 0x7F; | |
int val_index = 0; | |
while (code > lead_byte_max) { | |
val[val_index++] = (code & 0x3F) | 0x80; | |
code >>= 6; | |
lead_byte_max >>= (val_index == 1 ? 2 : 1); | |
} | |
val[val_index++] = (code & lead_byte_max) | (~lead_byte_max << 1); | |
while (val_index-- && *s < end) { | |
**s = val[val_index]; | |
(*s)++; | |
} | |
} | |
// Returns 0 if no split was needed. | |
int split_into_surrogates(int code, int *surr1, int *surr2) { | |
if (code <= 0xFFFF) return 0; | |
*surr2 = 0xDC00 | (code & 0x3FF); // Save the low 10 bits. | |
code >>= 10; // Drop the low 10 bits. | |
*surr1 = 0xD800 | (code & 0x03F); // Save the next 6 bits. | |
*surr1 |= ((code & 0x7C0) - 0x40) << 6; // Insert the last 5 bits less 1. | |
return 1; | |
} | |
// Expects to be used in a loop and see all code points in *code. Start *old at 0; | |
// this function updates *old for you - don't change it. Returns 0 when *code is | |
// the 1st of a surrogate pair; otherwise use *code as the final code point. | |
int join_from_surrogates(int *old, int *code) { | |
if (*old) *code = (((*old & 0x3FF) + 0x40) << 10) + (*code & 0x3FF); | |
*old = ((*code & 0xD800) == 0xD800 ? *code : 0); | |
return !(*old); | |
} |
I have not yet tested with the test file linked below, but I'm leaving it here as a note for future-Tyler to possibly check out. My intention is to keep the decoder lenient, so if the only "errors" I can find in the decoder have to do with accepting ill-formed utf-8, then that is just matching the design principle of this decoder. (But I'd like it to strictly provide correct output on all correctly-formed input.)
@tylerneylon Thank you! What is the license for this code?
there is a sample that show how to use these functions?
@cxw42, I'm a bit late in replying but this is public domain.
@ianciose, I'm not testing the code below (so 🤞 ) but this is the idea of usage. Note that different code conventions may denote the end of the string in different ways. In the code below, I'll end each buffer with a 0 value.
// Convert a utf-8 byte array into unicode code points:
char *utf8_bytes = get_utf8_bytes(); // The end is marked by a zero byte here.
int *code_points = get_unicode_buffer();
while (( *code_points++ = decode_code_point(&utf8_bytes) ));
or
// Convert unicode code points into a utf8 byte array.
int *code_points = get_unicode_code_points(); // The end is marked by a zero value.
int *code_point_cursor = code_points; // Copy of the base pointer so we can move this one around.
char *utf8_bytes = get_utf8_buffer(BUFFER_LEN);
char *end_byte = utf8_bytes + BUFFER_LEN;
do {
encode_code_point(&utf_bytes, end_byte, *code_point_cursor++);
} while (*(code_point_cursor - 1));
@tylerneylon tks for reply!
it is worth noting that decode_code_point will read out of bounds in case the bytestream actually isn't utf8 encoded.
for example, if the first byte is 0xFF it'll read 4 bytes over
Right, @limpkin, decode_code_point()
assumes the input is valid utf-8, and does no error checking. If you want to read potentially-invalid utf-8, you can pad the end of the string with 8 zero bytes, and it will guarantee to not overrun, and to end with a zero return value.
I have been looking for something like this, but I would like to understand what is happening here a bit more. Would you be willing to step through the algorithm?
This small, mostly-portably library encodes and decodes utf-8 byte streams.
The only not-super-portable part is the
__builtin_clz
function, which can bereplaced on non-gcc-like compilers (more notes on that below).
The surrogate pair functions are useful for working with systems that stick to
16 bits when representing unicode characters. This happens more than you'd think;
json strings can only encode 16-bit code points, and node.js was built on top of
UCS-2 internal storage, to give two examples.
The function
__builtin_clz
gives a count of leading zeroes; docs are here.Similar functions for other compilers might be found from this page.
The correspondence between surrogates pairs and the code points
they represent is described on the 40th page of this pdf.