Alright, I have a test case. This is going to get into the weeds. I am going to focus only on the bad column, called "sku_name" in our DB.
1. The data that Redshift is exporting is truncated.
Exported Version:
卡汶妮 2014秋冬新款女装修身显瘦镂空针
2. The data I see in my DB viewer is truncated.
Viewer Version:
卡汶妮 2014秋冬新款女装修身显瘦镂空针
3. Python (or another language) gets the full version.
Using Psycopg2 to run the query to pull the data gives the full version:
卡汶妮 2014秋冬新款女装修身显瘦镂空针织打底衫 多色高弹 紫色 均码
Raw Bytes (this will be useful later):
\xe5\x8d\xa1\xe6\xb1\xb6\xe5\xa6\xae 2014\xe7\xa7\x8b\xe5\x86\xac\xe6\x96\xb0\xe6\xac\xbe\xe5\xa5\xb3\xe8\xa3\x85\xe4\xbf\xae\xe8\xba\xab\xe6\x98\xbe\xe7\x98\xa6\xe9\x95\x82\xe7\xa9\xba\xe9\x92\x88\x00\xe7\xbb\x87\x00\xe6\x89\x93\xe5\xba\x95\xe8\xa1\xab \xe5\xa4\x9a\x00\xe8\x89\xb2\x00\xe9\xab\x98\x00\xe5\xbc\xb9 \xe7\xb4\xab\xe8\x89\xb2 \xe5\x9d\x87\xe7\xa0\x81
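As a sanity check that these are legal bytes, a slice of the raw bytes around the truncation point (copied from above) decodes cleanly in Python 3, NULs and all:

```python
# Sanity check: the bytes around the truncation point are valid UTF-8
# even though they contain \x00. U+0000 is an ordinary code point.
raw = (b"\xe5\x8d\xa1\xe6\xb1\xb6\xe5\xa6\xae 2014"
       b"\xe7\xa7\x8b\xe5\x86\xac\xe6\x96\xb0\xe6\xac\xbe"
       b"\xe5\xa5\xb3\xe8\xa3\x85\xe4\xbf\xae\xe8\xba\xab"
       b"\xe6\x98\xbe\xe7\x98\xa6\xe9\x95\x82\xe7\xa9\xba"
       b"\xe9\x92\x88\x00\xe7\xbb\x87\x00\xe6\x89\x93"
       b"\xe5\xba\x95\xe8\xa1\xab")

text = raw.decode("utf-8")   # decodes without raising
print(raw.count(b"\x00"))    # 2 embedded NULL bytes
print("\x00" in text)        # True: the NULs survive decoding as U+0000
```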
So, the Python version is quite a bit longer, but why? Well, if we line up the strings and check which byte Redshift stops at, we get the following:
Redshift Bytes:
\xe5\x8d\xa1\xe6\xb1\xb6\xe5\xa6\xae 2014\xe7\xa7\x8b\xe5\x86\xac\xe6\x96\xb0\xe6\xac\xbe\xe5\xa5\xb3\xe8\xa3\x85\xe4\xbf\xae\xe8\xba\xab\xe6\x98\xbe\xe7\x98\xa6\xe9\x95\x82\xe7\xa9\xba\xe9\x92\x88
Missing bytes after Redshift stops (just the first few):
\x00\xe7\xbb\x87\x00\xe6\x89\x93\xe5\xba\x95\xe8\xa1\xab
So, Redshift stops exporting the data at `\x00`, the NULL byte. The NULL byte signifies the end of a string in many languages and representations; in Unicode, however, it is legal to have a NULL byte inside a string: U+0000 is a valid code point. As far as I can tell, this is what happens during the Redshift unload:
1. Redshift grabs the data for unloading. The data is still correct at this point.
2. Redshift quotes the data, adding `"` characters around each value, which it still sees as Unicode.
3. Redshift escapes the data: it walks the value, sees the NULL byte in the Unicode, and escapes it to prevent misinterpretation.
4. Redshift writes the data out, but the writer forgets that escaping applies to NULL bytes as well. It reaches the NULL byte, assumes it has finished reading the data, and writes what it has to the file. This is why the exported data has a trailing "\" but no closing quote. Even without escaping, the exporting code stops reading any time it sees a NULL byte. This is incorrect behavior.
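The failure in step 4 looks like classic C-string handling, where the first NULL terminates the string. A minimal sketch of the two writer behaviors (hypothetical function names, not Redshift's actual internals):

```python
# Suspected failure mode: a writer that treats the buffer as a C string
# stops at the first NULL byte; a length-aware writer emits everything.
def write_as_c_string(buf: bytes) -> bytes:
    """Stop at the first NUL, the way strlen()-based code would."""
    nul = buf.find(b"\x00")
    return buf if nul == -1 else buf[:nul]

def write_with_length(buf: bytes) -> bytes:
    """Use the known length; NULs are just data."""
    return buf

data = "镂空针\x00织\x00打底衫".encode("utf-8")
print(write_as_c_string(data).decode("utf-8"))  # 镂空针 (truncated)
print(len(write_with_length(data)))             # all 23 bytes survive
```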
##### Steps to Reproduce #####
I have attached a sample data file, unicode_test.json, for testing.
Schema:
CREATE TABLE sandbox.unicode_test (
    "simple_string" VARCHAR(400),
    "unicode_string" VARCHAR(400),
    "simple_string2" VARCHAR(400)
);
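I can't inline the attached file here, but a row with the same shape can be generated with a short script. Note that JSON requires U+0000 to be written as the escape `\u0000`; my assumption is that `COPY ... JSON 'auto'` turns that escape back into a literal NULL byte in the loaded VARCHAR, which matches what I see after loading:

```python
import json

# Hypothetical stand-in for the attached unicode_test.json: one row whose
# middle column contains embedded U+0000 characters (same positions as
# the raw bytes shown earlier). json.dumps always escapes control
# characters, so the file contains \u0000, not raw NUL bytes.
row = {
    "simple_string": "foo",
    "unicode_string": ("卡汶妮 2014秋冬新款女装修身显瘦镂空针\u0000织\u0000打底衫 "
                       "多\u0000色\u0000高\u0000弹 紫色 均码"),
    "simple_string2": "bar",
}

payload = json.dumps(row, ensure_ascii=False)
with open("unicode_test.json", "w", encoding="utf-8") as f:
    f.write(payload + "\n")
```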
Load Command:
COPY sandbox.unicode_test
FROM 's3://myredshift/unicode_test.json'
IAM_ROLE 'arn:aws:iam::1234567890:role/myredshift'
JSON 'auto';
When I query this data I see the same issue I had before, where the representation is truncated in my DB viewer, but fine when queried with Python.
Exporting the Data:
UNLOAD ('select * from sandbox.unicode_test')
to 's3://myredshift/unicode_test/'
IAM_ROLE 'arn:aws:iam::1234567890:role/myredshift'
NULL AS 'XXNULLXX'
DELIMITER '|' ALLOWOVERWRITE ESCAPE ADDQUOTES;
This will produce a data file like the following:
"foo"|"卡汶妮 2014秋冬新款女装修身显瘦镂空针\|"bar"
This has the same trailing '\' in the middle column, and no closing '"' character. This is the exact behavior we saw. When loading this data, CSV parsing tools will not find 3 columns, but only 2.
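The column loss can be sketched with Python's csv module, using the same delimiter, quote, and escape settings as the unload (the line below is a shortened stand-in for the real output):

```python
import csv
import io

# The stray backslash escapes the field delimiter, so the parser never
# sees the break between columns 2 and 3 and they run together.
line = '"foo"|"镂空针\\|"bar"\n'
reader = csv.reader(io.StringIO(line), delimiter="|",
                    quotechar='"', escapechar="\\")
row = next(reader)
print(len(row))  # 2 columns instead of 3
```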
More fun with exporting: I originally thought the escaping process was the problem, but I was wrong! The culprit is whatever writes out the bytes. Even if we don't escape or add quotes, we always get corrupted data.
Unload command:
UNLOAD ('select * from sandbox.unicode_test')
to 's3://myredshift/unicode_test/'
IAM_ROLE 'arn:aws:iam::1234567890:role/myredshift'
NULL AS 'XXNULLXX'
DELIMITER '|' ALLOWOVERWRITE ADDQUOTES;
Result:
"foo"|"卡汶妮 2014秋冬新款女装修身显瘦镂空针 |"bar"
Unload Command:
UNLOAD ('select * from sandbox.unicode_test')
to 's3://myredshift/unicode_test/'
IAM_ROLE 'arn:aws:iam::1234567890:role/myredshift'
NULL AS 'XXNULLXX'
DELIMITER '|' ALLOWOVERWRITE;
Result:
foo|卡汶妮 2014秋冬新款女装修身显瘦镂空针 |bar
Both of these will produce broken data.
Note: when checking this out, be careful copying the Unicode strings. I had a lot of trouble producing this bug report because copying the Unicode strips the NULL bytes in many cases. Let me know your success with reproducing this bug, and ideally when the exporter will be fixed to handle NULL bytes.
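Because copy/paste silently drops the NULL bytes, it's safer to verify programmatically. A sketch follows; the psycopg2 fetch is left as comments (connection details are yours to fill in), and the inspection helper itself is self-contained:

```python
# Fetch the suspect column with psycopg2, then inspect it in code
# rather than by eye:
#
# import psycopg2
# conn = psycopg2.connect("dbname=... host=... user=... password=...")
# cur = conn.cursor()
# cur.execute("SELECT unicode_string FROM sandbox.unicode_test")
# (value,) = cur.fetchone()

def describe_nuls(value: str) -> str:
    """Report positions of embedded U+0000 code points, if any."""
    positions = [i for i, ch in enumerate(value) if ch == "\x00"]
    if not positions:
        return "no NUL bytes"
    return f"{len(positions)} NUL(s) at code-point offsets {positions}"

print(describe_nuls("针\x00织\x00打"))  # 2 NUL(s) at code-point offsets [1, 3]
```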
Andrew