Skip to content

Instantly share code, notes, and snippets.

@sevaa
Last active December 16, 2024 06:30
Show Gist options
  • Save sevaa/f084a0a5a994c3bc28e518d5c708d5f6 to your computer and use it in GitHub Desktop.
Save sevaa/f084a0a5a994c3bc28e518d5c708d5f6 to your computer and use it in GitHub Desktop.
Converting an NVARCHAR string to a UTF-8 VARBINARY data block in pure Transact-SQL
create function [dbo].[ToUTF8](@s nvarchar(max))
returns varbinary(max)
as
begin
declare @i int = 1, @n int = datalength(@s)/2, @r varbinary(max) = 0x, @c int, @c2 int, @d varbinary(4)
while @i <= @n
begin
set @c = unicode(substring(@s, @i, 1))
if (@c & 0xFC00) = 0xD800
begin
set @i += 1
if @i > @n
return cast(cast('Malformed UTF-16 - two nchar sequence cut short' as int) as varbinary)
set @c2 = unicode(substring(@s, @i, 1))
if (@c2 & 0xFC00) <> 0xDC00
return cast(cast('Malformed UTF-16 - continuation missing in a two nchar sequence' as int) as varbinary)
set @c = (((@c & 0x3FF) * 0x400) | (@c2 & 0x3FF)) + 0x10000
end
if @c < 0x80
set @d = cast(@c as binary(1))
if @c >= 0x80 and @c < 0x800
set @d = cast(((@c * 4) & 0xFF00) | (@c & 0x3F) | 0xC080 as binary(2))
if @c >= 0x800 and @c < 0x10000
set @d = cast(((@c * 0x10) & 0xFF0000) | ((@c * 4) & 0x3F00) | (@c & 0x3F) | 0xe08080 as binary(3))
if @c >= 0x10000
set @d = cast(((@c * 0x40) & 0xFF000000) | ((@c * 0x10) & 0x3F0000) | ((@c * 4) & 0x3F00) | (@c & 0x3F) | 0xf0808080 as binary(4))
set @r += @d
set @i += 1
end
return @r
end
@sevaa
Copy link
Author

sevaa commented Oct 11, 2021

I've added some checks for malformed UTF-16 surrogate pairs. The function will now throw a reasonably descriptive error instead of producing garbage outputs.

@sevaa
Copy link
Author

sevaa commented Nov 7, 2021

Since MSSQL 2019, there is native support for UTF-8. The following will do the same job, most likely much faster:

declare @s nvarchar(100) = N'Москва 東京 🙃', @t varchar(200) 
set @t = CAST(@s AS varchar) collate Latin1_General_100_CI_AI_SC_UTF8
print @t

@komarik0
Copy link

komarik0 commented Apr 21, 2023

The most performant way I could find for MSSQL 2012+, MSSQL 2019-
Works much much faster for long strings

CREATE FUNCTION fn_to_utf8_bytes_4000(@s NVARCHAR(4000))
RETURNS VARBINARY(8000)
AS
BEGIN
	DECLARE @multiByteChar NVARCHAR(150) = N'%[^' + NCHAR(1) + '-' + NCHAR(127) + N']%'
	DECLARE @index INT = PATINDEX(@multibytechar COLLATE Latin1_General_BIN, @s);
	IF @index = 0
		RETURN CAST(CAST(@s AS VARCHAR(8000)) AS VARBINARY(8000))

	DECLARE @result VARBINARY(8000) = CAST(CAST(SUBSTRING(@s, 1, @index - 1) AS VARCHAR(8000)) AS VARBINARY(8000))

	SET @s = SUBSTRING(@s, @index, 4000)
	DECLARE @c INT = UNICODE(@s COLLATE Latin1_General_100_CS_AS_KS_WS_SC)

	WHILE @c > 0x80
	BEGIN
multibyte:
		IF @c >= 0x10000
		BEGIN
			SET @result += CAST(((@c * 0x40) & 0xFF000000) | ((@c * 0x10) & 0x3F0000) | ((@c * 4) & 0x3F00) | (@c & 0x3F) | 0xf0808080 AS VARBINARY(4))
			SET @s = SUBSTRING(@s, 3, 4000)
		END
		ELSE
		BEGIN
			SET @result += CASE
				WHEN @c < 0x800 THEN CAST(((@c * 4) & 0xFF00) | (@c & 0x3F) | 0xC080 AS VARBINARY(2))
				ELSE CAST(((@c * 0x10) & 0xFF0000) | ((@c * 4) & 0x3F00) | (@c & 0x3F) | 0xe08080 AS VARBINARY(3))
			END
			SET @s = SUBSTRING(@s, 2, 4000)
		END
		SET @c = UNICODE(@s COLLATE Latin1_General_100_CS_AS_KS_WS_SC)
	END

	SET @index = PATINDEX(@multibytechar COLLATE Latin1_General_BIN, @s);
	IF @index = 0
		RETURN @result + CAST(CAST(@s AS VARCHAR(8000)) AS VARBINARY(8000))

	SET @result += CAST(CAST(SUBSTRING(@s, 1, @index - 1) AS VARCHAR(8000)) AS VARBINARY(8000))
	SET @s = SUBSTRING(@s, @index, 4000)
	SET @c = UNICODE(@s COLLATE Latin1_General_100_CS_AS_KS_WS_SC)
	GOTO multibyte

	RETURN @result;
END

CREATE FUNCTION fn_to_utf8_bytes_max(@s NVARCHAR(MAX))
RETURNS VARBINARY(MAX)
AS
BEGIN
	DECLARE @size INT = LEN(@s)
	DECLARE @halfsize INT

	IF @size <= 4000
		RETURN dbo.fn_to_utf8_bytes_4000(@s);

	SET @halfsize = @size / 2
	SET @halfsize = IIF(UNICODE(SUBSTRING(@s, @halfsize, 1)) & 0xFC00 = 0xD800, @halfsize - 1, @halfsize)

	RETURN dbo.fn_to_utf8_bytes_max(SUBSTRING(@s, 1, @halfsize)) + dbo.fn_to_utf8_bytes_max(SUBSTRING(@s, @halfsize + 1, @size - @halfsize));
END

@sevaa
Copy link
Author

sevaa commented Apr 21, 2023

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment