@wch

wch/string.md Secret

Last active April 9, 2024 22:48
Claude AI for converting a vector of bytes to `std::u16string`

To create a UTF-16 string in C++ from an array of bytes with a byte order mark (BOM), you can use the following code:

#include <iostream>
#include <stdexcept>  // std::runtime_error
#include <string>
#include <vector>

std::u16string createUTF16String(const std::vector<unsigned char>& bytes) {
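    if (bytes.size() < 2) {
        throw std::runtime_error("Invalid byte array. Size must be at least 2 bytes.");
    }
    // An odd size would leave a trailing byte and make bytes[i + 1] read out of bounds.
    if (bytes.size() % 2 != 0) {
        throw std::runtime_error("Invalid byte array. Size must be a multiple of 2.");
    }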

    // Check byte order mark (BOM)
    bool isLittleEndian = (bytes[0] == 0xFF && bytes[1] == 0xFE);
    bool isBigEndian = (bytes[0] == 0xFE && bytes[1] == 0xFF);

    if (!isLittleEndian && !isBigEndian) {
        throw std::runtime_error("Invalid byte order mark (BOM).");
    }

    std::u16string result;
    for (size_t i = 2; i < bytes.size(); i += 2) {
        char16_t codeUnit;
        if (isLittleEndian) {
            codeUnit = static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8));
        } else {
            codeUnit = static_cast<char16_t>((bytes[i] << 8) | bytes[i + 1]);
        }
        result.push_back(codeUnit);
    }

    return result;
}

int main() {
    std::vector<unsigned char> bytes = {0xFF, 0xFE, 0x48, 0x00, 0x65, 0x00, 0x6C, 0x00, 0x6C, 0x00, 0x6F, 0x00};
    std::u16string utf16String = createUTF16String(bytes);
    std::cout << std::hex;
    for (char16_t c : utf16String) {
        std::cout << "0x" << static_cast<int>(c) << ' ';
    }
    std::cout << std::endl;
    return 0;
}

In this code:

  1. The createUTF16String function takes a std::vector<unsigned char> representing the array of bytes with a BOM.

  2. It first checks if the byte array has at least 2 bytes, which are necessary for the BOM.

  3. It then checks the byte order mark (BOM) to determine the endianness. If the first two bytes are 0xFF and 0xFE, it indicates little-endian encoding. If the first two bytes are 0xFE and 0xFF, it indicates big-endian encoding. If neither of these conditions is met, an exception is thrown.

  4. The function then iterates over the byte array starting from index 2 (after the BOM) and combines every two bytes into a char16_t code unit based on the endianness.

  5. Each code unit is appended to the std::u16string result.

  6. Finally, the std::u16string is returned.
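Steps 3 and 4 can also be factored into two small helpers, which makes the big-endian path easier to see in isolation. This is only a sketch: the names `ByteOrder`, `detectBom`, and `combine` are hypothetical and do not appear in the code above, and `std::optional` assumes a C++17 compiler.

```cpp
#include <optional>
#include <vector>

// Hypothetical type for illustration: the byte order announced by the BOM.
enum class ByteOrder { Little, Big };

// Returns the byte order indicated by a leading BOM,
// or std::nullopt if the input is too short or has no recognized BOM.
std::optional<ByteOrder> detectBom(const std::vector<unsigned char>& bytes) {
    if (bytes.size() < 2) return std::nullopt;
    if (bytes[0] == 0xFF && bytes[1] == 0xFE) return ByteOrder::Little;
    if (bytes[0] == 0xFE && bytes[1] == 0xFF) return ByteOrder::Big;
    return std::nullopt;
}

// Combines two consecutive bytes into one UTF-16 code unit.
// Little-endian: the first byte is the low byte. Big-endian: the first byte is the high byte.
char16_t combine(unsigned char first, unsigned char second, ByteOrder order) {
    return order == ByteOrder::Little
        ? static_cast<char16_t>(first | (second << 8))
        : static_cast<char16_t>((first << 8) | second);
}
```

For the letter "H" (U+0048), the little-endian bytes are 0x48 0x00 and the big-endian bytes are 0x00 0x48; `combine` yields the same code unit for both when given the matching byte order.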

In the main function:

  1. An example byte array with a BOM and UTF-16 encoded characters is created.

  2. The createUTF16String function is called with the byte array to create the UTF-16 string.

  3. The resulting UTF-16 string is printed as hexadecimal values.

Output:

0x48 0x65 0x6c 0x6c 0x6f

The output shows the UTF-16 code units of the string "Hello" in hexadecimal format.

Note: This code assumes that the byte array contains valid UTF-16 and that its size is a multiple of 2. It does not check for unpaired surrogates, and a trailing odd byte cannot form a complete code unit, so validate untrusted input before relying on the result.
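If the input is not trusted, the decoded string can be checked for well-formedness afterwards. The following is a minimal sketch of one such check, assuming that "valid" means every surrogate code unit is correctly paired; the function name `isWellFormedUtf16` is hypothetical and not part of the code above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if every high surrogate (0xD800-0xDBFF)
// is immediately followed by a low surrogate (0xDC00-0xDFFF), and no low
// surrogate appears on its own.
bool isWellFormedUtf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {
            // High surrogate: the next code unit must be a low surrogate.
            if (i + 1 >= s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF) {
                return false;
            }
            ++i;  // Skip the low surrogate that completes the pair.
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            return false;  // Low surrogate without a preceding high surrogate.
        }
    }
    return true;
}
```

A caller could run this on the result of createUTF16String and reject the input if it returns false.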
