@mreiferson
Last active August 29, 2015 14:15
C Coding Test

UTF-8 validation test

UTF-8 is a text encoding, which means it maps the bytes in a file to meaningful characters. ASCII is the classic, once-ubiquitous encoding in the western world. It uses a single byte per character, and only the lower 7 bits of that byte, so it can represent only the characters on a US keyboard (plus control characters such as tab and carriage return). Other encodings also use a single byte per character but use all 8 bits, like ISO-8859-1, or use multiple bytes per character, like UTF-32, which always uses 4 bytes per character. These encodings can represent many more characters, such as accented ones.

UTF-8 uses a variable number of bytes per character - one, or two, or even more. Even better, all the single-byte characters are ASCII, so all ASCII encoded text is also valid UTF-8 encoded text.

In this test, you'll write a function which takes a byte string that is supposed to be UTF-8 encoded, and decides whether it's valid UTF-8. This is simpler than it sounds: all you have to do is make sure it consists of valid groups of bytes, and doesn't end in the middle of a group. (Each group encodes one character, or more precisely one code point.)

The rules for these groups of bytes are as follows:

Single-byte groups match the pattern 0xxxxxxx, where x means "either 0 or 1". For example, in both ASCII and UTF-8, the character K is represented by a single byte with the value 75 in decimal, or 0x4B in hexadecimal, or 01001011 in binary, which fits the constraint above for a single-byte group.

Multi-byte groups have a start byte which indicates the length of the group, followed by the correct number of continuation bytes. For example, the start byte for a 2-byte group is 110xxxxx, and the continuation byte is of the form 10xxxxxx. To get the actual value of the group, and so figure out which accented character it represents, you would take all the bits in the x positions and put them together. Don't worry about that: for this problem, you don't have to find the values encoded, just whether the groups of bytes are valid.
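
If you're curious what "putting the x bits together" looks like, here is a minimal sketch for the 2-byte case. The helper name decode_2byte_group is made up for illustration and is not part of the test; it assumes the caller has already checked that both bytes match the patterns above.

```c
/* Hypothetical helper (not required for the test): combine the payload bits
 * of a 2-byte group 110xxxxx 10xxxxxx into one value.
 * Assumes both bytes were already checked against those patterns. */
unsigned int decode_2byte_group(unsigned char start, unsigned char cont)
{
    unsigned int high = start & 0x1F;  /* the 5 x bits of 110xxxxx */
    unsigned int low  = cont  & 0x3F;  /* the 6 x bits of 10xxxxxx */

    return (high << 6) | low;          /* high bits first, then low bits */
}
```

For the bytes 0xC3 0xA8 used in test2 below, this gives 0xE8, which is the code point of the accented e (è).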

NUL bytes (value 0x00) have the same meaning in ASCII and UTF-8, so in C, you can NUL-terminate UTF-8 strings just like you can ASCII strings.

Overview of byte groups (each group encodes one code point):

1 byte group: 0xxxxxxx
2 byte group: 110xxxxx 10xxxxxx
3 byte group: 1110xxxx 10xxxxxx 10xxxxxx
4 byte group: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 byte group: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 byte group: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

As you can see, the start-byte of a 3-byte group is 1110xxxx, and all continuation bytes are the same for all sizes of byte groups.
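
One way to turn the table above into code is to mask off the fixed prefix bits of a byte and compare them. The sketch below is only an illustration; the helper names group_length and is_continuation are assumptions and do not appear in the test. It follows the table as written here, including the 5- and 6-byte forms.

```c
/* Hypothetical helper: return the length of the group announced by a start
 * byte according to the table above, or 0 if the byte cannot start a group. */
int group_length(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    if ((b & 0xFC) == 0xF8) return 5;  /* 111110xx */
    if ((b & 0xFE) == 0xFC) return 6;  /* 1111110x */
    return 0;                          /* 10xxxxxx or 1111111x */
}

/* Hypothetical helper: nonzero if b has the continuation form 10xxxxxx. */
int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}
```

Note that a NUL byte (0x00) is not a continuation byte, so a string that ends in the middle of a group fails the is_continuation check at the terminator.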

YOUR TASK:

Fill in the function test_utf8() below so that it returns INVALID when called with an invalid NUL-terminated byte sequence, and VALID otherwise.

SUGGESTIONS:

Comment your code a bit, then compile and run it. Feel free to make the test harness code more elegant.

/* test_utf8.c */
#include <stdio.h>

#define VALID 0
#define INVALID 1

int test_utf8(const unsigned char *str)
{
    /* your code goes here, replace this faulty implementation */

    if ( str[0] & 0x80 ) {
        return INVALID;
    }

    return VALID;
}

/* "K", should be valid */
const unsigned char test1[] = { 0x4B, 0x00 };

/* "hey" with accented e, should be valid */
const unsigned char test2[] = { 0x68, 0xC3, 0xA8, 0x79, 0x00 };

/* junk, should fail */
const unsigned char test3[] = { 0x5A, 0xC3, 0xC3, 0xE9, 0x5A, 0x00 };

/* a random-ish sequence I think is valid */
const unsigned char test4[] = { 0xF4, 0xAF, 0xA7, 0xB2, 0xE6, 0xA1, 0xB3, 0x00 };

/* junk, should fail */
const unsigned char test5[] = { 0x5A, 0x79, 0xF4, 0xAF, 0xA7, 0x00 };

const unsigned char *tests[] = {test1, test2, test3, test4, test5};

int main(void) {
    int i;
    for (i = 0; i < 5; i++) {
        printf("test%d: %s\n", i+1, test_utf8(tests[i]) == VALID ? "VALID" : "INVALID");
    }
    return 0;
}
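
As one possible take on the "more elegant harness" suggestion, the sketch below stores the expected result next to each input and counts mismatches. It reuses the five inputs from the listing above. The names utf8_case, run_cases and placeholder_test_utf8 are made up for this example, and the expected values are taken from the comments in the listing (test4 is assumed valid, as the original comment guesses).

```c
#include <stdio.h>

#define VALID 0
#define INVALID 1

/* Hypothetical table entry: a named input plus the result we expect. */
struct utf8_case {
    const char *name;
    const unsigned char *bytes;
    int expected;
};

static const unsigned char case1[] = { 0x4B, 0x00 };
static const unsigned char case2[] = { 0x68, 0xC3, 0xA8, 0x79, 0x00 };
static const unsigned char case3[] = { 0x5A, 0xC3, 0xC3, 0xE9, 0x5A, 0x00 };
static const unsigned char case4[] = { 0xF4, 0xAF, 0xA7, 0xB2, 0xE6, 0xA1, 0xB3, 0x00 };
static const unsigned char case5[] = { 0x5A, 0x79, 0xF4, 0xAF, 0xA7, 0x00 };

static const struct utf8_case cases[] = {
    { "test1", case1, VALID },
    { "test2", case2, VALID },
    { "test3", case3, INVALID },
    { "test4", case4, VALID },   /* assumed valid, per the original comment */
    { "test5", case5, INVALID },
};

/* Run every case through `validate`; return the number of mismatches. */
int run_cases(int (*validate)(const unsigned char *))
{
    int failures = 0;
    size_t i;

    for (i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        int got = validate(cases[i].bytes);
        int ok = (got == cases[i].expected);

        printf("%s: %s (%s)\n", cases[i].name,
               got == VALID ? "VALID" : "INVALID",
               ok ? "as expected" : "UNEXPECTED");
        if (!ok) {
            failures++;
        }
    }
    return failures;
}

/* The faulty placeholder from the listing, kept only to exercise the harness. */
int placeholder_test_utf8(const unsigned char *str)
{
    return (str[0] & 0x80) ? INVALID : VALID;
}
```

Calling run_cases(placeholder_test_utf8) reports three mismatches (test3, test4 and test5), which is a quick way to see that the placeholder really is faulty. Pass your own test_utf8 instead and aim for zero.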