Created November 1, 2021 22:58
Simple converter from a UTF-8 encoded Unicode character to its corresponding decimal code point value
int getDecimalOfUtf8(const char *utfPointer){
    const unsigned char *bytes = (const unsigned char *)utfPointer; //Read the bytes as unsigned so values above 0x7F are not sign-extended when shifted or masked
    int outPut = bytes[0]; //Sets the final output number to the value of the first byte of the UTF-8 code
    int numStarting1 = 0; //Tracks the number of 1s at the beginning of the first byte to see if it starts a multi-byte UTF-8 character
    while ((bytes[0] << numStarting1) & 0x80) //Shifts the bits of the first byte left and checks whether the bit that lands in the MSB position is 1
    {
        numStarting1++; //Counts one more leading 1
    }
    //At this point numStarting1 could be returned to know the length in bytes of a multi-byte character (0 means a single-byte ASCII character)
    if(numStarting1 == 1 || numStarting1 > 4){ //A lone continuation byte (10xxxxxx) or more than four leading 1s cannot start a UTF-8 character
        return -1; //Report invalid input instead of reading bytes that may not exist
    }
    if(numStarting1 != 0){ //If there are no starting 1s then it is a regular ASCII character and outPut is already correct
        int mask = 32; //Create a mask to keep only the bits needed from the first byte of the UTF-8 code
        for(int i = 0; i < numStarting1-2; i++){ //For each additional 1 at the beginning of the first byte, fewer payload bits remain, so the mask needs to be smaller
            mask /= 2;
        }
        mask -= 1; //Subtract one so the mask has 1s in the right places (i.e. 32 = 100000 and 31 = 11111)
        outPut = bytes[0] & mask; //Mask the first byte to keep only the bits that belong to the code point
        for (int x = 1; x < numStarting1; x++) //The number of 1s at the beginning of the first byte tells us how many bytes the full code uses, so numStarting1-1 additional bytes follow
        {
            outPut = outPut << 6; //Shift the output left by 6 because each additional byte contributes 6 payload bits
            outPut = outPut | (bytes[x] & 0x3F); //Bitwise OR in the last 6 bits of the next byte (its first two bits, 10, only mark it as a continuation byte)
        }
    }
    return outPut; //Once all of the additional bytes have been used up the final value is returned as output
}
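The comment about returning numStarting1, and the masking and shifting steps in the function above, can be looked at in isolation. The following is a minimal sketch and not part of the original gist: utf8SequenceLength and decodeThreeByteSequence are hypothetical names introduced only for illustration, and the sketch assumes well-formed input (continuation bytes are not validated).

```c
/* Hypothetical helper (not in the original gist): returns how many bytes the
   UTF-8 sequence starting with leadByte occupies, or 0 if leadByte cannot
   start a sequence. */
int utf8SequenceLength(unsigned char leadByte) {
    if ((leadByte & 0x80) == 0x00) return 1; /* 0xxxxxxx: single-byte ASCII */
    if ((leadByte & 0xE0) == 0xC0) return 2; /* 110xxxxx: two-byte sequence */
    if ((leadByte & 0xF0) == 0xE0) return 3; /* 1110xxxx: three-byte sequence */
    if ((leadByte & 0xF8) == 0xF0) return 4; /* 11110xxx: four-byte sequence */
    return 0; /* 10xxxxxx continuation byte, or 11111xxx which is never valid */
}

/* Hypothetical worked example: the same mask-and-shift steps written out by
   hand for the three-byte case. The lead byte keeps its low 4 bits and each
   continuation byte keeps its low 6 bits. Assumes bytes points to a
   well-formed three-byte sequence. */
int decodeThreeByteSequence(const unsigned char *bytes) {
    return ((bytes[0] & 0x0F) << 12)
         | ((bytes[1] & 0x3F) << 6)
         |  (bytes[2] & 0x3F);
}
```

For the Euro sign, encoded as the bytes 0xE2 0x82 0xAC, utf8SequenceLength(0xE2) gives 3 and decodeThreeByteSequence gives 8364, which is U+20AC in decimal.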