mreiferson/gist:f370d576ef6b4b8ae8dc Secret

## gistfile1.md

      
### UTF-8 validation test

UTF-8 is a text encoding, that is, a mapping between the bytes in a file and meaningful characters.
ASCII is the classic, once ubiquitous encoding in the western world. It uses a single byte per
character, and uses only the lower 7 bits of that byte. It can only represent the characters on a
US keyboard (plus control characters such as tab and carriage return). Other encodings also use a
single byte per character but use all 8 bits, like ISO-8859-1, or use multiple bytes per character,
like UTF-32 (also known as UCS-4), which always uses 4 bytes per character. These encodings can
represent many more characters, such as accented ones.
UTF-8 uses a variable number of bytes per character - one, or two, or even more. Even better, all
the single-byte characters are ASCII, so all ASCII encoded text is also valid UTF-8 encoded text.
In this test, you'll write a function which takes a supposedly UTF-8 encoded string, and decides
whether it's valid UTF-8. This is simpler than it sounds: all you have to do is make sure it consists
of valid groups of bytes, and doesn't end in the middle of a group. (Each group encodes one code point.)
The rules for these groups of bytes are as follows:
Single-byte groups match the pattern 0xxxxxxx, where
x means "either 0 or 1". For example, in both ASCII and UTF-8, the character K is represented by
a single byte with the value 75 in decimal, or 0x4B in hexadecimal, or 01001011 in binary, which
fits the constraint above for a single-byte group.
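
The single-byte rule comes down to one bit test: the top bit of the byte must be clear. A minimal sketch follows; the helper name `is_single_byte_group` is made up for illustration and is not part of the test harness.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of the exercise code): returns true when
 * the byte matches the single-byte pattern 0xxxxxxx, i.e. bit 7 is 0. */
static bool is_single_byte_group(unsigned char b)
{
    return (b & 0x80) == 0;
}
```

With this helper, `is_single_byte_group(0x4B)` (the letter K) is true, while `is_single_byte_group(0xC3)` is false because 0xC3 has its top bit set.
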
Multi-byte groups have a start byte which indicates the length of the group, followed by the correct
number of continuation bytes (one fewer than the group length). For example, the start byte for a
2-byte group is 110xxxxx, and the continuation byte is of the form 10xxxxxx. To get the actual value
of the group and figure out which accented character it represents, you would take all the bits in
the x positions and put them together. Don't worry about that: for this problem, you don't have to
find the values encoded, just whether the groups of bytes are valid.
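
Although decoding is not needed for this problem, a small sketch may make "putting the x bits together" concrete for a 2-byte group. The function name `decode_two_byte_group` is an assumption made for illustration, and the sketch assumes the caller has already checked that both bytes match their patterns.

```c
/* Hypothetical helper (illustration only): combines the payload bits of a
 * 2-byte group 110xxxxx 10xxxxxx into a single value.
 * Assumes both bytes were already checked against those patterns. */
static unsigned int decode_two_byte_group(unsigned char start,
                                          unsigned char cont)
{
    unsigned int high = start & 0x1F; /* the five x bits of 110xxxxx */
    unsigned int low  = cont  & 0x3F; /* the six x bits of 10xxxxxx  */
    return (high << 6) | low;
}
```

For the bytes 0xC3 0xA8, the start byte contributes 00011 and the continuation byte contributes 101000, giving 0xE8, which is the code point of "è".
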
NUL bytes (0x00) have the same meaning in ASCII and UTF-8, so in C, you can NUL-terminate UTF-8
strings just like you can ASCII strings.
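
This works because every byte of a multi-byte group has its top bit set, so a 0x00 byte can never appear inside a well-formed group. As a rough sketch, walking a string up to its terminator looks the same as for ASCII; the name `utf8_byte_length` is hypothetical, and note that it counts bytes, not characters.

```c
#include <stddef.h>

/* Hypothetical helper (illustration only): counts the bytes before the
 * terminating 0x00. It does not check validity and does not count
 * characters, since one character may span several bytes. */
static size_t utf8_byte_length(const unsigned char *str)
{
    size_t n = 0;
    while (str[n] != 0x00) {
        n++;
    }
    return n;
}
```

For the bytes 0x68 0xC3 0xA8 0x79 0x00 this returns 4, even though they encode only three characters. If the terminator arrives before a group is complete, the string ends in the middle of a group and is invalid.
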
Overview of byte groups (each group encodes one code point):
1-byte group: 0xxxxxxx
2-byte group: 110xxxxx 10xxxxxx
3-byte group: 1110xxxx 10xxxxxx 10xxxxxx
4-byte group: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5-byte group: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6-byte group: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

As you can see, the start-byte of a 3-byte group is 1110xxxx, and all continuation bytes are the
same for all sizes of byte groups.
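
One possible way to turn the overview into code is to mask off the variable x bits of a start byte and compare what is left against each fixed prefix. This is only a sketch under the rules listed here; the name `group_length` and the choice of 0 as the "not a start byte" result are assumptions made for illustration.

```c
/* Hypothetical helper (illustration only): returns the length of the group
 * announced by a start byte, following the overview (1 to 6), or 0 if the
 * byte cannot start a group (a continuation byte 10xxxxxx, or 1111111x). */
static int group_length(unsigned char start)
{
    if ((start & 0x80) == 0x00) return 1; /* 0xxxxxxx */
    if ((start & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((start & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((start & 0xF8) == 0xF0) return 4; /* 11110xxx */
    if ((start & 0xFC) == 0xF8) return 5; /* 111110xx */
    if ((start & 0xFE) == 0xFC) return 6; /* 1111110x */
    return 0;
}
```

For example, `group_length(0xC3)` is 2 and `group_length(0xF4)` is 4, while `group_length(0xA8)` is 0 because 0xA8 is a continuation byte. A continuation byte can be recognised with the same masking idea: `(b & 0xC0) == 0x80`.
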
YOUR TASK:

Fill in the function test_utf8() below so that it returns VALID when called with a valid
NUL-terminated byte sequence, and INVALID otherwise.
SUGGESTIONS:

Comment your code a bit, compile and run it, and feel free to make the test harness code more elegant.
/* test_utf8.c */
#include <stdio.h>

#define VALID 0
#define INVALID 1

int test_utf8(const unsigned char *str)
{
    /* your code goes here, replace this faulty implementation */

    if ( str[0] & 0x80 ) {
        return INVALID;
    }

    return VALID;
}

/* "K", should be valid */
const unsigned char test1[] = { 0x4B, 0x00 };

/* "hey" with accented e, should be valid */
const unsigned char test2[] = { 0x68, 0xC3, 0xA8, 0x79, 0x00 };

/* junk, should fail */
const unsigned char test3[] = { 0x5A, 0xC3, 0xC3, 0xE9, 0x5A, 0x00 };

/* a random-ish sequence, valid under the byte-group rules above */
const unsigned char test4[] = { 0xF4, 0xAF, 0xA7, 0xB2, 0xE6, 0xA1, 0xB3, 0x00 };

/* junk, should fail */
const unsigned char test5[] = { 0x5A, 0x79, 0xF4, 0xAF, 0xA7, 0x00 };

const unsigned char *tests[] = {test1, test2, test3, test4, test5};

int main(void) {
    int i;
    for (i = 0; i < 5; i++) {
        printf("test%d: %s\n", i+1, test_utf8(tests[i]) == VALID ? "VALID" : "INVALID");
    }
    return 0;
}