@bradleypeabody
Last active November 29, 2023 22:08
golang, convert UTF-16 to UTF-8 string
package main

// http://play.golang.org/p/fVf7duRtdH

import (
	"bytes"
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

func main() {
	b := []byte{
		0xff, // BOM
		0xfe, // BOM
		'T',
		0x00,
		'E',
		0x00,
		'S',
		0x00,
		'T',
		0x00,
		0x6C,
		0x34,
		'\n',
		0x00,
	}
	s, err := DecodeUTF16(b)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}

func DecodeUTF16(b []byte) (string, error) {
	if len(b)%2 != 0 {
		return "", fmt.Errorf("Must have even length byte slice")
	}
	u16s := make([]uint16, 1)
	ret := &bytes.Buffer{}
	b8buf := make([]byte, 4)
	lb := len(b)
	for i := 0; i < lb; i += 2 {
		u16s[0] = uint16(b[i]) + (uint16(b[i+1]) << 8)
		r := utf16.Decode(u16s)
		n := utf8.EncodeRune(b8buf, r[0])
		ret.Write(b8buf[:n])
	}
	return ret.String(), nil
}
@ik5
ik5 commented Aug 19, 2015

This code is for little endian.
For big endian, swap the two bytes in the u16s[0] assignment (line 50) like so:
u16s[0] = uint16(b[i+1]) + (uint16(b[i]) << 8)

I'm looking for a way to detect the endianness from the BOM (the first two bytes), so that this code can handle both byte orders automatically.

@ping1990
Thanks very much! helped a lot~

@bassu
bassu commented Apr 10, 2016

@ik5: It's simple. Read the first two bytes outside the loop and start the loop at i := 2.
For others, see the BOM FAQ at Unicode.org, if you haven't already, which of course is quite a delightful read this evening 🚶
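
A minimal sketch of the BOM approach described above. The name decodeUTF16WithBOM is hypothetical, and treating input without a BOM as little endian is an assumption made here, not something the gist specifies.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16WithBOM is a hypothetical helper. It inspects the first two
// bytes for a byte order mark, skips the mark, and decodes the remaining
// bytes in the detected order.
// Assumption: input without a BOM is treated as little endian.
func decodeUTF16WithBOM(b []byte) (string, error) {
	if len(b)%2 != 0 {
		return "", errors.New("must have even length byte slice")
	}
	var order binary.ByteOrder = binary.LittleEndian
	if len(b) >= 2 {
		switch {
		case b[0] == 0xff && b[1] == 0xfe: // U+FEFF stored little endian
			b = b[2:]
		case b[0] == 0xfe && b[1] == 0xff: // U+FEFF stored big endian
			order = binary.BigEndian
			b = b[2:]
		}
	}
	// Assemble 16-bit code units in the detected byte order.
	u16s := make([]uint16, len(b)/2)
	for i := range u16s {
		u16s[i] = order.Uint16(b[2*i:])
	}
	// utf16.Decode combines surrogate pairs into single runes.
	return string(utf16.Decode(u16s)), nil
}

func main() {
	le := []byte{0xff, 0xfe, 'T', 0x00, 'E', 0x00}
	be := []byte{0xfe, 0xff, 0x00, 'T', 0x00, 'E'}
	for _, in := range [][]byte{le, be} {
		s, err := decodeUTF16WithBOM(in)
		if err != nil {
			panic(err)
		}
		fmt.Println(s) // prints "TE" for both inputs
	}
}
```

Because the BOM is consumed before decoding, it does not end up in the returned string as U+FEFF.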

@Tanz0rz
Tanz0rz commented Sep 28, 2016

Life saver! You are amazing!

@vinniyo
vinniyo commented Feb 28, 2017

Thank you!

@akirabbq
akirabbq commented Nov 10, 2017

This gives an incorrect result when decoding any surrogate pair; the code should take care of the high/low surrogate range.

A quick fix is to increase the size of u16s to 2 (u16s := make([]uint16, 2)) and add:

	if u16s[0] >= 0xD800 && u16s[0] <= 0xDBFF && i+3 < lb {
		// lead (high) surrogate: the next code unit is the trail (low) surrogate
		i = i + 2
		u16s[1] = uint16(b[i]) + (uint16(b[i+1]) << 8)
	}
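
For reference, a sketch of how the original loop could handle surrogate pairs end to end. The name decodeUTF16LE is hypothetical, and replacing an unpaired surrogate with U+FFFD (rather than returning an error) is an assumption chosen to match what utf16.Decode does.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// decodeUTF16LE is a hypothetical variant of DecodeUTF16 from the gist that
// combines surrogate pairs. Input is assumed to be little endian, no BOM.
func decodeUTF16LE(b []byte) (string, error) {
	if len(b)%2 != 0 {
		return "", fmt.Errorf("must have even length byte slice")
	}
	ret := &bytes.Buffer{}
	b8buf := make([]byte, utf8.UTFMax)
	lb := len(b)
	for i := 0; i < lb; i += 2 {
		r := rune(uint16(b[i]) | uint16(b[i+1])<<8)
		if utf16.IsSurrogate(r) {
			// Assumption: an unpaired surrogate becomes U+FFFD.
			pair := utf8.RuneError
			if i+3 < lb {
				lo := rune(uint16(b[i+2]) | uint16(b[i+3])<<8)
				// DecodeRune returns U+FFFD unless r is a high
				// surrogate and lo is a low surrogate.
				pair = utf16.DecodeRune(r, lo)
			}
			if pair != utf8.RuneError {
				i += 2 // the trail unit was consumed as well
			}
			r = pair
		}
		n := utf8.EncodeRune(b8buf, r)
		ret.Write(b8buf[:n])
	}
	return ret.String(), nil
}

func main() {
	// U+1F600 is encoded in UTF-16 as the surrogate pair D83D DE00.
	s, err := decodeUTF16LE([]byte{0x3d, 0xd8, 0x00, 0xde})
	if err != nil {
		panic(err)
	}
	fmt.Println(s == "\U0001F600") // prints "true"
}
```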

@juergenhoetzel
golang already has support for decoding []byte into []uint16 (respecting the endianness):

func DecodeUtf16(b []byte, order binary.ByteOrder) (string, error) {
	ints := make([]uint16, len(b)/2)
	if err := binary.Read(bytes.NewReader(b), order, &ints); err != nil {
		return "", err
	}
	return string(utf16.Decode(ints)), nil
}

@akirabbq @ik5
complete solution (which also works with surrogate pairs): utf16.go

@sail1972
sail1972 commented Feb 20, 2019

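From the blog post at http://angelonotes.blogspot.com/2015/09/golang-utf16-utf8.html (here unicode is golang.org/x/text/encoding/unicode and transform is golang.org/x/text/transform, not the standard library unicode package):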

bs_UTF16LE, _, _ := transform.Bytes(unicode.UTF16(unicode.LittleEndian, unicode.IgnoreBOM).NewEncoder(), []byte("測試"))
bs_UTF16BE, _, _ := transform.Bytes(unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM).NewEncoder(), []byte("測試"))
bs_UTF8LE, _, _ := transform.Bytes(unicode.UTF16(unicode.LittleEndian, unicode.IgnoreBOM).NewDecoder(), bs_UTF16LE)
bs_UTF8BE, _, _ := transform.Bytes(unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM).NewDecoder(), bs_UTF16BE)

@samaita
samaita commented Mar 17, 2020

Saved me a lot, thank you

@wenjy
wenjy commented Oct 13, 2020

Thanks very much!

@maxneo4
maxneo4 commented Mar 11, 2021

golang already has support for decoding []byte into []uint16 (respecting the endianness):

func DecodeUtf16(b []byte, order binary.ByteOrder) (string, error) {
	ints := make([]uint16, len(b)/2)
	if err := binary.Read(bytes.NewReader(b), order, &ints); err != nil {
		return "", err
	}
	return string(utf16.Decode(ints)), nil
}

@akirabbq @ik5
complete solution (which also works with surrogate pairs): utf16.go

You posted just the function I needed to convert Oracle CLOB data to a string.

@korau
korau commented Jan 7, 2022

Life saver, thanks

@marians
marians commented Sep 28, 2022

This helped me to decode UTF-16LE to UTF-8: https://blog.fearcat.in/a?ID=00001-1bd90844-ce0c-4fac-9b8f-fe3d8a30451d

decoder := unicode.UTF16(unicode.LittleEndian, unicode.IgnoreBOM).NewDecoder()
utf8bytes, err := decoder.Bytes(data) // data contains UTF16LE as read from a file
