UTF-8 is a text encoding: a scheme for mapping the bytes in a file to meaningful characters.
ASCII is the classic, once ubiquitous encoding in the western world. It uses a single byte per
character, and only the lower 7 bits of that byte. It can represent only the characters on a
US keyboard, plus control characters such as tab and carriage return. Other encodings also use a
single byte per character but use all 8 bits, like ISO-8859-1, or use multiple bytes per character,
like UTF-32 (also known as UCS-4), which always uses 4 bytes per character. These encodings can
represent many more characters, such as accented ones.
UTF-8 uses a variable number of bytes per character: one, two, or more. Even better, all
the single-byte characters are ASCII, so all ASCII encoded text is also valid UTF-8 encoded text.
In this test, you'll write a function which takes a string of bytes and decides whether it's
valid UTF-8. This is simpler than it sounds: all you have to do is make sure it consists of valid
groups of bytes, and doesn't end in the middle of a group. (Each group encodes one code point.)
The rules for these groups of bytes are as follows:
Single-byte groups match the pattern 0xxxxxxx, where x means "either 0 or 1". For example, in both
ASCII and UTF-8, the character K is represented by a single byte with the value 75 in decimal, or
0x4B in hexadecimal, or 01001011 in binary, which fits the constraint above for a single-byte group.
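
The single-byte rule comes down to one bit test: the top bit of the byte must be clear. Here is a minimal sketch of that check. The helper name is_single_byte_group is made up for illustration and is not part of the test harness.

```c
#include <stdbool.h>

/* Hypothetical helper: true if byte matches 0xxxxxxx (top bit clear). */
bool is_single_byte_group(unsigned char byte)
{
    /* 0x80 is 10000000 in binary, so this masks out everything but the top bit. */
    return (byte & 0x80) == 0;
}
```

With this, is_single_byte_group(0x4B) is true for K, while a byte such as 0xC3 (11000011) is rejected.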
Multi-byte groups have a start byte which indicates the length of the group, followed by the correct
number of continuation bytes (one fewer than the group length). For example, the start byte for a
2-byte group is 110xxxxx, and the continuation byte is of the form 10xxxxxx. To get the actual value
of the group and figure out which character it represents, you would take all the bits in the x
positions and put them together. Don't worry about that: for this problem, you don't have to find
the values encoded, just whether the groups of bytes are valid.
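
You don't need decoding for this problem, but seeing it once makes the x bits less abstract. The sketch below combines the payload bits of a 2-byte group. The function name decode_two_byte_group is hypothetical, and it assumes the caller has already checked that both bytes have the right form.

```c
/* Hypothetical helper: combine the x bits of a 2-byte group (110xxxxx 10xxxxxx).
 * Assumes both bytes were already validated; no checking is done here. */
unsigned int decode_two_byte_group(unsigned char start, unsigned char cont)
{
    unsigned int high = start & 0x1F;  /* the 5 x bits of 110xxxxx */
    unsigned int low  = cont & 0x3F;   /* the 6 x bits of 10xxxxxx */
    return (high << 6) | low;
}
```

For the bytes 0xC3 0xA8 this gives 0xE8, which is the Unicode value of an e with a grave accent.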
NUL bytes have the same meaning in ASCII and UTF-8, and a zero byte never appears inside a
multi-byte group, so in C you can NUL-terminate UTF-8 strings just like you can ASCII strings.
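
One consequence is that the standard C string functions still work on UTF-8 data, as long as you remember they count bytes and not characters. A small sketch, where utf8_byte_length is a made-up wrapper name:

```c
#include <string.h>

/* Hypothetical wrapper: length in bytes of a NUL-terminated UTF-8 string.
 * strlen stops at the first zero byte, exactly as it does for ASCII. */
size_t utf8_byte_length(const unsigned char *str)
{
    return strlen((const char *)str);
}
```

For the bytes 0x68 0xC3 0xA8 0x79 0x00 this returns 4, even though the text has only three characters.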
Overview of byte groups (one group per code point):
1 byte group: 0xxxxxxx
2 byte group: 110xxxxx 10xxxxxx
3 byte group: 1110xxxx 10xxxxxx 10xxxxxx
4 byte group: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 byte group: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 byte group: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
As you can see, the start byte of a 3-byte group is 1110xxxx, and continuation bytes have the
same form for all sizes of byte groups.
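
The overview translates fairly directly into mask-and-compare tests. The sketch below classifies a single byte according to the table above. The names group_length and is_continuation are hypothetical, and this covers only the per-byte checks, not the walk over a whole string.

```c
/* Hypothetical helper: total length (1 to 6) of the group that begins with
 * start, or 0 if start cannot begin a group (a continuation byte, 0xFE or 0xFF). */
int group_length(unsigned char start)
{
    if ((start & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((start & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((start & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((start & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    if ((start & 0xFC) == 0xF8) return 5;  /* 111110xx */
    if ((start & 0xFE) == 0xFC) return 6;  /* 1111110x */
    return 0;
}

/* Hypothetical helper: nonzero if byte has the form 10xxxxxx. */
int is_continuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}
```

Each mask keeps the fixed prefix bits of a pattern plus the 0 that ends the prefix, which is why the masks grow by one bit per row.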
Fill in the function test_utf8() below to make it return INVALID if called with an invalid
NUL-terminated byte sequence, and VALID otherwise.
Comment your code a bit, then compile and run it. Feel free to make the test harness code more elegant.
/* test_utf8.c */
#include <stdio.h>
#define VALID 0
#define INVALID 1
int test_utf8(const unsigned char *str)
{
    /* your code goes here, replace this faulty implementation */
    if (str[0] & 0x80) {
        return INVALID;
    }
    return VALID;
}
/* "K", should be valid */
const unsigned char test1[] = { 0x4B, 0x00 };
/* "hey" with accented e, should be valid */
const unsigned char test2[] = { 0x68, 0xC3, 0xA8, 0x79, 0x00 };
/* junk, should fail */
const unsigned char test3[] = { 0x5A, 0xC3, 0xC3, 0xE9, 0x5A, 0x00 };
/* a random-ish sequence: one 4-byte group then one 3-byte group, valid under the rules above */
const unsigned char test4[] = { 0xF4, 0xAF, 0xA7, 0xB2, 0xE6, 0xA1, 0xB3, 0x00 };
/* junk, should fail */
const unsigned char test5[] = { 0x5A, 0x79, 0xF4, 0xAF, 0xA7, 0x00, };
const unsigned char *tests[] = {test1, test2, test3, test4, test5};
int main(void) {
    int i;
    for (i = 0; i < 5; i++) {
        printf("test%d: %s\n", i+1, test_utf8(tests[i]) == VALID ? "VALID" : "INVALID");
    }
    return 0;
}