Skip to content

Instantly share code, notes, and snippets.

@tylerneylon
Last active May 27, 2024 16:21
Show Gist options
  • Save tylerneylon/9773800 to your computer and use it in GitHub Desktop.
Save tylerneylon/9773800 to your computer and use it in GitHub Desktop.
C utf-8 encoder/decoder
// This macro tests if a char is a continuation byte in utf8.
#define IS_CONT(x) (((x) & 0xc0) == 0x80)
// This returns the code point encoded at **s and advances *s to point to the
// next character. Thus it can easily be used in a loop.
int decode_code_point(char **s) {
int k = **s ? __builtin_clz(~(**s << 24)) : 0; // Count # of leading 1 bits.
int mask = (1 << (8 - k)) - 1; // All 1s with k leading 0s.
int value = **s & mask;
// k = 0 for one-byte code points; otherwise, k = #total bytes.
for (++(*s), --k; k > 0 && IS_CONT(**s); --k, ++(*s)) {
value <<= 6;
value += (**s & 0x3F);
}
return value;
}
// This assumes that `code` is <= 0x10FFFF and ensures that nothing will be
// written at or beyond `end`. It advances *s so it's easy to use in a loop.
void encode_code_point(char **s, char *end, int code) {
char val[4];
int lead_byte_max = 0x7F;
int val_index = 0;
while (code > lead_byte_max) {
val[val_index++] = (code & 0x3F) | 0x80;
code >>= 6;
lead_byte_max >>= (val_index == 1 ? 2 : 1);
}
val[val_index++] = (code & lead_byte_max) | (~lead_byte_max << 1);
while (val_index-- && *s < end) {
**s = val[val_index];
(*s)++;
}
}
// This returns 0 if no split was needed.
int split_into_surrogates(int code, int *surr1, int *surr2) {
if (code <= 0xFFFF) return 0;
*surr2 = 0xDC00 | (code & 0x3FF); // Save the low 10 bits.
code >>= 10; // Drop the low 10 bits.
// If `code` now has low bits "uuu uuxx xxxx", then the bits of *surr are
// "1101 10ww wwxx xxxx" where wwww = (uuuuu - 1).
*surr1 = 0xD800 | ((code & 0x7FF) - 0x40);
return 1;
}
// This expects to be used in a loop and see all code points in *code. Start
// *old at 0; this function updates *old for you - don't change it after
// initialization. This returns 0 when *code is the 1st of a surrogate pair;
// otherwise use *code as the final code point.
int join_from_surrogates(int *old, int *code) {
if (*old) *code = (((*old & 0x3FF) + 0x40) << 10) + (*code & 0x3FF);
*old = ((*code & 0xD800) == 0xD800 ? *code : 0);
return !(*old);
}
@tylerneylon
Copy link
Author

This small, mostly-portably library encodes and decodes utf-8 byte streams.
The only not-super-portable part is the __builtin_clz function, which can be
replaced on non-gcc-like compilers (more notes on that below).

The surrogate pair functions are useful for working with systems that stick to
16 bits when representing unicode characters. This happens more than you'd think;
json strings can only encode 16-bit code points, and node.js was built on top of
UCS-2 internal storage, to give two examples.

The function __builtin_clz gives a count of leading zeroes; docs are here.
Similar functions for other compilers might be found from this page.

The correspondence between surrogates pairs and the code points
they represent is described on the 40th page of this pdf.

@tylerneylon
Copy link
Author

I have not yet tested with the test file linked below, but I'm leaving it here as a note for future-Tyler to possibly check out. My intention is to keep the decoder lenient, so if the only "errors" I can find in the decoder have to do with accepting ill-formed utf-8, then that is just matching the design principle of this decoder. (But I'd like it to strictly provide correct output on all correctly-formed input.)

Markus Kuhn's utf-8 stress test file.

@cxw42
Copy link

cxw42 commented Nov 19, 2018

@tylerneylon Thank you! What is the license for this code?

@inaciose
Copy link

there is a sample that show how to use these functions?

@tylerneylon
Copy link
Author

@cxw42, I'm a bit late in replying but this is public domain.

@tylerneylon
Copy link
Author

tylerneylon commented Nov 13, 2019

@ianciose, I'm not testing the code below (so 🤞 ) but this is the idea of usage. Note that different code conventions may denote the end of the string in different ways. In the code below, I'll end each buffer with a 0 value.

// Convert a utf-8 byte array into unicode code points:

char *utf8_bytes = get_utf8_bytes();  // The end is marked by a zero byte here.
int *code_points = get_unicode_buffer();

while (( *code_points++ = decode_code_point(&utf8_bytes) ));

or

// Convert unicode code points into a utf8 byte array.

int *code_points = get_unicode_code_points();  // The end is marked by a zero value.
int *code_point_cursor = code_points;  // Copy of the base pointer so we can move this one around.
char *utf8_bytes = get_utf8_buffer(BUFFER_LEN);
char *end_byte = utf8_bytes + BUFFER_LEN;

do {
    encode_code_point(&utf_bytes, end_byte, *code_point_cursor++);
} while (*(code_point_cursor - 1));

@inaciose
Copy link

@tylerneylon tks for reply!

@limpkin
Copy link

limpkin commented Jan 20, 2020

it is worth noting that decode_code_point will read out of bounds in case the bytestream actually isn't utf8 encoded.
for example, if the first byte is 0xFF it'll read 4 bytes over

@tylerneylon
Copy link
Author

Right, @limpkin, decode_code_point() assumes the input is valid utf-8, and does no error checking. If you want to read potentially-invalid utf-8, you can pad the end of the string with 8 zero bytes, and it will guarantee to not overrun, and to end with a zero return value.

@ericomeehan
Copy link

I have been looking for something like this, but I would like to understand what is happening here a bit more. Would you be willing to step through the algorithm?

@Explorer09
Copy link

A small suggestion for your decode_code_point function:

You can change the for-loop to be like this. With the **s conditional becoming (unsigned)**s - 0x80 <= 0x3F, you can make the loop stop when the continuation byte is invalid, reducing the chance of overreading the *s buffer.
And use **s - 0x80 rather than **s & 0x3F when adding the code point value.

for (++(*s), --k; k > 0 && (unsigned)**s - 0x80 <= 0x3F; --k, ++(*s)) {
  value <<= 6;
  value += ((unsigned)**s - 0x80);
}

This isn't a full check of invalid sequences of decode_code_point, but it would make the function a little saner.

@Explorer09
Copy link

Also, your split_into_surrogates code is wrong, and I have a corrected one for you:

int split_into_surrogates(int code, int *surr1, int *surr2) {
  if (code <= 0xFFFF) return 0;
  *surr2 = 0xDC00 | (code & 0x3FF);
  code >>= 10;
  *surr1 = 0xD800 | (code - 0x40); // Assuming (code - 0x40 <= 0x3FF)
  return 1;
}

@tylerneylon
Copy link
Author

Hi @Explorer09 , thanks for your helpful suggestions. I appreciate it.

For your comment about decode_code_point, I tried to add a tiny bit of error-checking similar to what you suggest by adding the new IS_CONT() macro. Many people may not consider this code readable, but if you know the bit formats, I think it is readable, and I'm trying to hold on to that readability. Hence the macro.

For your comment about the bug — you're right! It was a silly bug. I made a little unit test on my local machine to avoid it in the future, and I used essentially your correction in the new code.

I've also switched to 4-space indents and tried to clarify some of the comments.

@Explorer09
Copy link

@tylerneylon I wrote my own version of UTF-8 decode function before giving suggestion here.

Your code can be quite compact if the only goal is to decode valid UTF-8 sequences (as there is no error checking at all). My version would perform error checking while also be compact to be included in applications.

I'm not ready to open a Gist page for this, but here is a sneak peek:

/*
Copyright 2024 Kang-Che Sung

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
/* SPDX-License-Identifier: MIT */
#include <stdint.h>

uint32_t utf8_to_code_point(const uint8_t **sequence) {
    /* We assume ASCII characters appear most frequently. */
    if (**sequence <= 0x7F) {
        return *(*sequence)++;
    }
    /* Multibyte sequence. We assume valid sequences appear more
       frequently than invalid ones. */
    unsigned int max_length;
    uint32_t min_code_point;
    uint32_t code_point;
    if ((**sequence & 0xF0) < 0xE0) {
        /* First byte in the range [0x80, 0xBF] is covered here.
           It wraps around to an invalid code point. */
        max_length = 2;
        min_code_point = 0x80;
        code_point = (uint32_t)**sequence - 0xC0;
    } else if ((**sequence & 0xF0) == 0xE0) {
        max_length = 3;
        min_code_point = 0x0800;
        code_point = (uint32_t)**sequence - 0xE0; 
    } else {
        max_length = 4;
        min_code_point = 0x10000;
        code_point = (uint32_t)**sequence - 0xF0; 
    }
    while (1) {
        (*sequence)++;
        if (--max_length == 0) {
            break;
        }
        unsigned int offset = (unsigned int)**sequence - 0x80;
        if (offset > 0x3F) {
            /* (*sequence) points to the byte where the next sequence
               may begin. The next sequence may be valid or invalid. */
            return (uint32_t)-1;
        }
        code_point = (code_point << 6) + offset;
    }

    if (code_point < min_code_point) {
        /* Overlong sequence */
        return (uint32_t)-1;
    }
    if (code_point >= 0xD800 && code_point <= 0xDFFF) {
        /* UTF-16 surrogates are invalid in UTF-8. They are used in
           CESU-8, which is a different encoding. */
        return (uint32_t)-1;
    }
    /* Any value outside the [0x00, 0x10FFFF] range indicates an
       invalid sequence. The check is left for the caller. */
    return code_point;
}

@Explorer09
Copy link

@tylerneylon There is reason that I won't check the continuation byte like IS_CONT(). Naïve programmers will think of a bitwise AND and compare like what you did, but when you are check the valid bits and extract other bits for value at the same time, a subtraction and compare could yield a smaller assembly code.
It also make things simple.

@tylerneylon
Copy link
Author

@Explorer09 thanks for the snapshot of your code and the notes. Of course many people will appreciate error-checking code! 👍

@Explorer09
Copy link

@tylerneylon
The interesting thing is, neither ChatGPT or Google Gemini I played with would generate code as compact as mine.
My version compiles to 144 bytes with x86-64 GCC.

ChatGPT
https://chatgpt.com/share/0c457058-2e4a-44bb-93cd-cbd06334039a

Google Gemini
https://g.co/gemini/share/d3c4ed05bbcf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment