Skip to content

Instantly share code, notes, and snippets.

@tylerneylon
Last active May 27, 2024 16:21
Show Gist options
  • Save tylerneylon/9773800 to your computer and use it in GitHub Desktop.
Save tylerneylon/9773800 to your computer and use it in GitHub Desktop.
C utf-8 encoder/decoder
// This macro tests if a char is a continuation byte in utf8.
#define IS_CONT(x) (((x) & 0xc0) == 0x80)
// This returns the code point encoded at **s and advances *s to point to the
// next character. Thus it can easily be used in a loop.
int decode_code_point(char **s) {
int k = **s ? __builtin_clz(~(**s << 24)) : 0; // Count # of leading 1 bits.
int mask = (1 << (8 - k)) - 1; // All 1s with k leading 0s.
int value = **s & mask;
// k = 0 for one-byte code points; otherwise, k = #total bytes.
for (++(*s), --k; k > 0 && IS_CONT(**s); --k, ++(*s)) {
value <<= 6;
value += (**s & 0x3F);
}
return value;
}
// This assumes that `code` is <= 0x10FFFF and ensures that nothing will be
// written at or beyond `end`. It advances *s so it's easy to use in a loop.
void encode_code_point(char **s, char *end, int code) {
char val[4];
int lead_byte_max = 0x7F;
int val_index = 0;
while (code > lead_byte_max) {
val[val_index++] = (code & 0x3F) | 0x80;
code >>= 6;
lead_byte_max >>= (val_index == 1 ? 2 : 1);
}
val[val_index++] = (code & lead_byte_max) | (~lead_byte_max << 1);
while (val_index-- && *s < end) {
**s = val[val_index];
(*s)++;
}
}
// This returns 0 if no split was needed.
int split_into_surrogates(int code, int *surr1, int *surr2) {
if (code <= 0xFFFF) return 0;
*surr2 = 0xDC00 | (code & 0x3FF); // Save the low 10 bits.
code >>= 10; // Drop the low 10 bits.
// If `code` now has low bits "uuu uuxx xxxx", then the bits of *surr are
// "1101 10ww wwxx xxxx" where wwww = (uuuuu - 1).
*surr1 = 0xD800 | ((code & 0x7FF) - 0x40);
return 1;
}
// This expects to be used in a loop and see all code points in *code. Start
// *old at 0; this function updates *old for you - don't change it after
// initialization. This returns 0 when *code is the 1st of a surrogate pair;
// otherwise use *code as the final code point.
int join_from_surrogates(int *old, int *code) {
if (*old) *code = (((*old & 0x3FF) + 0x40) << 10) + (*code & 0x3FF);
*old = ((*code & 0xD800) == 0xD800 ? *code : 0);
return !(*old);
}
@Explorer09
Copy link

Also, your split_into_surrogates code is wrong, and I have a corrected one for you:

int split_into_surrogates(int code, int *surr1, int *surr2) {
  if (code <= 0xFFFF) return 0;
  *surr2 = 0xDC00 | (code & 0x3FF);
  code >>= 10;
  *surr1 = 0xD800 | (code - 0x40); // Assuming (code - 0x40 <= 0x3FF)
  return 1;
}

@tylerneylon
Copy link
Author

Hi @Explorer09 , thanks for your helpful suggestions. I appreciate it.

For your comment about decode_code_point, I tried to add a tiny bit of error-checking similar to what you suggest by adding the new IS_CONT() macro. Many people may not consider this code readable, but if you know the bit formats, I think it is readable, and I'm trying to hold on to that readability. Hence the macro.

For your comment about the bug — you're right! It was a silly bug. I made a little unit test on my local machine to avoid it in the future, and I used essentially your correction in the new code.

I've also switched to 4-space indents and tried to clarify some of the comments.

@Explorer09
Copy link

@tylerneylon I wrote my own version of UTF-8 decode function before giving suggestion here.

Your code can be quite compact if the only goal is to decode valid UTF-8 sequences (as there is no error checking at all). My version would perform error checking while also be compact to be included in applications.

I'm not ready to open a Gist page for this, but here is a sneak peek:

/*
Copyright 2024 Kang-Che Sung

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
/* SPDX-License-Identifier: MIT */
#include <stdint.h>

uint32_t utf8_to_code_point(const uint8_t **sequence) {
    /* We assume ASCII characters appear most frequently. */
    if (**sequence <= 0x7F) {
        return *(*sequence)++;
    }
    /* Multibyte sequence. We assume valid sequences appear more
       frequently than invalid ones. */
    unsigned int max_length;
    uint32_t min_code_point;
    uint32_t code_point;
    if ((**sequence & 0xF0) < 0xE0) {
        /* First byte in the range [0x80, 0xBF] is covered here.
           It wraps around to an invalid code point. */
        max_length = 2;
        min_code_point = 0x80;
        code_point = (uint32_t)**sequence - 0xC0;
    } else if ((**sequence & 0xF0) == 0xE0) {
        max_length = 3;
        min_code_point = 0x0800;
        code_point = (uint32_t)**sequence - 0xE0; 
    } else {
        max_length = 4;
        min_code_point = 0x10000;
        code_point = (uint32_t)**sequence - 0xF0; 
    }
    while (1) {
        (*sequence)++;
        if (--max_length == 0) {
            break;
        }
        unsigned int offset = (unsigned int)**sequence - 0x80;
        if (offset > 0x3F) {
            /* (*sequence) points to the byte where the next sequence
               may begin. The next sequence may be valid or invalid. */
            return (uint32_t)-1;
        }
        code_point = (code_point << 6) + offset;
    }

    if (code_point < min_code_point) {
        /* Overlong sequence */
        return (uint32_t)-1;
    }
    if (code_point >= 0xD800 && code_point <= 0xDFFF) {
        /* UTF-16 surrogates are invalid in UTF-8. They are used in
           CESU-8, which is a different encoding. */
        return (uint32_t)-1;
    }
    /* Any value outside the [0x00, 0x10FFFF] range indicates an
       invalid sequence. The check is left for the caller. */
    return code_point;
}

@Explorer09
Copy link

@tylerneylon There is reason that I won't check the continuation byte like IS_CONT(). Naïve programmers will think of a bitwise AND and compare like what you did, but when you are check the valid bits and extract other bits for value at the same time, a subtraction and compare could yield a smaller assembly code.
It also make things simple.

@tylerneylon
Copy link
Author

@Explorer09 thanks for the snapshot of your code and the notes. Of course many people will appreciate error-checking code! 👍

@Explorer09
Copy link

@tylerneylon
The interesting thing is, neither ChatGPT or Google Gemini I played with would generate code as compact as mine.
My version compiles to 144 bytes with x86-64 GCC.

ChatGPT
https://chatgpt.com/share/0c457058-2e4a-44bb-93cd-cbd06334039a

Google Gemini
https://g.co/gemini/share/d3c4ed05bbcf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment