@aranjello
Created November 1, 2021 22:58
Simple converter from a UTF-8 encoded Unicode character to its corresponding code point as a decimal value
//Assumes utfPointer points at the first byte of a valid UTF-8 sequence; continuation bytes and malformed input are not detected
int getDecimalOfUtf8(char * utfPointer){
    const unsigned char * bytes = (const unsigned char *)utfPointer; //Read the bytes as unsigned so values above 0x7F are not sign-extended where char is signed
    int outPut = bytes[0]; //Sets the final output number to the value of the first byte of the UTF-8 sequence
    int numStarting1 = 0; //Tracks the number of 1s at the beginning of the first byte to see if it starts a multi-byte sequence
    while ((bytes[0] << numStarting1) & 0x80) //Shifts the bits of the first byte left and checks whether the bit now in the MSB position is 1 or 0
    {
        numStarting1++; //Increments numStarting1
    }
    //At this point numStarting1 could be returned to know the length in bytes of a multi-byte character (it is 0 for ASCII, which is 1 byte long)
    if(numStarting1 != 0){ //If there are no starting 1s then it is a regular ASCII character and outPut is already correct
        int mask = 32; //Create a mask to keep only the bits needed from the first byte of the UTF-8 sequence
        for(int i = 0; i < numStarting1-2; i++){ //For each additional 1 at the beginning of the first byte, fewer bits are needed and the mask needs to be smaller
            mask /= 2;
        }
        mask -= 1; //The final mask needs to have one subtracted so there are 1s in the right places (i.e. 32 = 100000 and 31 = 11111)
        outPut = bytes[0] & mask; //Mask the first byte to keep only the bits that belong to the code point
        for (int x = 1; x < numStarting1; x++) //The number of 1s at the beginning of the first byte tells us how many bytes the sequence has in total, so the remaining numStarting1-1 bytes are read here
        {
            outPut = outPut << 6; //We shift the output value by 6 as each additional byte carries 6 bits of data (its first two bits, 10, only mark it as a continuation byte)
            outPut = outPut | (bytes[x] & 0x3F); //The shifted value is then bitwise ORed with the last 6 bits of the next byte
        }
    }
    return outPut; //Once all of the additional bytes have been used up the final value is returned as output
}
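The comment inside the function notes that the count of leading 1s also gives the length of the character in bytes. A minimal sketch of that idea, together with the same masking and shifting written out by hand for one three-byte sequence, might look like the following. Both `utf8SequenceLength` and `decodeThreeBytes` are hypothetical helpers that are not part of the gist, and neither validates the continuation bytes.

```c
/* Hypothetical helper (not in the gist): number of bytes in the UTF-8
   sequence that starts with firstByte, or 0 if firstByte cannot start one. */
int utf8SequenceLength(unsigned char firstByte)
{
    if ((firstByte & 0x80) == 0x00) return 1; /* 0xxxxxxx: plain ASCII */
    if ((firstByte & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((firstByte & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((firstByte & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0; /* continuation byte (10xxxxxx) or an invalid lead byte */
}

/* Hypothetical helper: decodes a three-byte sequence by hand.
   The lead byte contributes its low 4 bits, each continuation byte
   contributes its low 6 bits. Assumes s points at a 1110xxxx lead byte. */
int decodeThreeBytes(const unsigned char *s)
{
    return ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
}
```

For the euro sign, encoded as the bytes E2 82 AC, the length helper returns 3 for the lead byte and the hand decoding gives 0x2000 | 0x80 | 0x2C = 0x20AC, which is 8364 in decimal.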