Alright, I have a test case. This is going to get into the weeds. I am going to focus only on the bad column, called "sku_name" in our DB.
1. The data that Redshift is exporting is truncated.
Exported Version:
卡汶妮 2014秋冬新款女装修身显瘦镂空针
2. The data I see in my DB viewer is truncated.
Viewer Version:
卡汶妮 2014秋冬新款女装修身显瘦镂空针
3. Python (or another language) gets the full version.
Using Psycopg2 to run the query to pull the data gives the full version:
卡汶妮 2014秋冬新款女装修身显瘦镂空针织打底衫 多色高弹 紫色 均码
Raw Bytes (this will be useful later):
\xe5\x8d\xa1\xe6\xb1\xb6\xe5\xa6\xae 2014\xe7\xa7\x8b\xe5\x86\xac\xe6\x96\xb0\xe6\xac\xbe\xe5\xa5\xb3\xe8\xa3\x85\xe4\xbf\xae\xe8\xba\xab\xe6\x98\xbe\xe7\x98\xa6\xe9\x95\x82\xe7\xa9\xba\xe9\x92\x88\x00\xe7\xbb\x87\x00\xe6\x89\x93\xe5\xba\x95\xe8\xa1\xab \xe5\xa4\x9a\x00\xe8\x89\xb2\x00\xe9\xab\x98\x00\xe5\xbc\xb9 \xe7\xb4\xab\xe8\x89\xb2 \xe5\x9d\x87\xe7\xa0\x81
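As a sanity check that these are legal bytes, a slice of the raw bytes around the truncation point (copied from above) decodes cleanly in Python 3, NULs and all:

```python
# Sanity check: the bytes around the truncation point are valid UTF-8
# even though they contain \x00. U+0000 is an ordinary code point.
raw = (b"\xe5\x8d\xa1\xe6\xb1\xb6\xe5\xa6\xae 2014"
       b"\xe7\xa7\x8b\xe5\x86\xac\xe6\x96\xb0\xe6\xac\xbe"
       b"\xe5\xa5\xb3\xe8\xa3\x85\xe4\xbf\xae\xe8\xba\xab"
       b"\xe6\x98\xbe\xe7\x98\xa6\xe9\x95\x82\xe7\xa9\xba"
       b"\xe9\x92\x88\x00\xe7\xbb\x87\x00\xe6\x89\x93"
       b"\xe5\xba\x95\xe8\xa1\xab")

text = raw.decode("utf-8")   # decodes without raising
print(raw.count(b"\x00"))    # 2 embedded NULL bytes
print("\x00" in text)        # True: the NULs survive decoding as U+0000
```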
So, the Python version is quite a bit longer, but why? Well, if we line up the strings and check which byte Redshift stops at, we get the following:
Redshift Bytes:
\xe5\x8d\xa1\xe6\xb1\xb6\xe5\xa6\xae 2014\xe7\xa7\x8b\xe5\x86\xac\xe6\x96\xb0\xe6\xac\xbe\xe5\xa5\xb3\xe8\xa3\x85\xe4\xbf\xae\xe8\xba\xab\xe6\x98\xbe\xe7\x98\xa6\xe9\x95\x82\xe7\xa9\xba\xe9\x92\x88
Missing bytes after Redshift stops (just the first few):
\x00\xe7\xbb\x87\x00\xe6\x89\x93\xe5\xba\x95\xe8\xa1\xab
So, Redshift stops exporting the data at `\x00`, the NULL byte. The NULL byte signifies the end of a string in many languages and representations; in Unicode, however, it is legal to have a NULL byte inside a string: U+0000 is a valid code point. As far as I can tell, this is what happens during the Redshift unload:
1. Redshift grabs the data for unloading. The data is still correct at this point.
2. Redshift quotes the data, adding `"` characters around each value, which it still sees as Unicode.
3. Redshift escapes the data: it walks the value, sees the NULL byte in the Unicode, and escapes it to prevent misinterpretation.
4. Redshift writes the data out, but the writer forgets that escaping applies to NULL bytes as well. It reaches the NULL byte, assumes it has finished reading the data, and writes what it has to the file. This is why the exported data has a trailing "\" but no closing quote. Even without escaping, the exporting code stops reading any time it sees a NULL byte. This is incorrect behavior.
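The failure in step 4 looks like classic C-string handling, where the first NULL terminates the string. A minimal sketch of the two writer behaviors (hypothetical function names, not Redshift's actual internals):

```python
# Suspected failure mode: a writer that treats the buffer as a C string
# stops at the first NULL byte; a length-aware writer emits everything.
def write_as_c_string(buf: bytes) -> bytes:
    """Stop at the first NUL, the way strlen()-based code would."""
    nul = buf.find(b"\x00")
    return buf if nul == -1 else buf[:nul]

def write_with_length(buf: bytes) -> bytes:
    """Use the known length; NULs are just data."""
    return buf

data = "镂空针\x00织\x00打底衫".encode("utf-8")
print(write_as_c_string(data).decode("utf-8"))  # 镂空针 (truncated)
print(len(write_with_length(data)))             # all 23 bytes survive
```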
##### Steps to Reproduce #####
I have attached a sample data file, unicode_test.json, for testing.
Schema:
CREATE TABLE sandbox.unicode_test (
    "simple_string" VARCHAR(400),
    "unicode_string" VARCHAR(400),
    "simple_string2" VARCHAR(400)
);
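I can't inline the attached file here, but a row with the same shape can be generated with a short script. Note that JSON requires U+0000 to be written as the escape `\u0000`; my assumption is that `COPY ... JSON 'auto'` turns that escape back into a literal NULL byte in the loaded VARCHAR, which matches what I see after loading:

```python
import json

# Hypothetical stand-in for the attached unicode_test.json: one row whose
# middle column contains embedded U+0000 characters (same positions as
# the raw bytes shown earlier). json.dumps always escapes control
# characters, so the file contains \u0000, not raw NUL bytes.
row = {
    "simple_string": "foo",
    "unicode_string": ("卡汶妮 2014秋冬新款女装修身显瘦镂空针\u0000织\u0000打底衫 "
                       "多\u0000色\u0000高\u0000弹 紫色 均码"),
    "simple_string2": "bar",
}

payload = json.dumps(row, ensure_ascii=False)
with open("unicode_test.json", "w", encoding="utf-8") as f:
    f.write(payload + "\n")
```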
Load Command:
COPY sandbox.unicode_test
FROM 's3://myredshift/unicode_test.json'
IAM_ROLE 'arn:aws:iam::1234567890:role/myredshift'
JSON 'auto';
When I query this data I see the same issue I had before, where the representation is truncated in my DB viewer, but fine when queried with Python.
Exporting the Data:
UNLOAD ('select * from sandbox.unicode_test')
to 's3://myredshift/unicode_test/'
IAM_ROLE 'arn:aws:iam::1234567890:role/myredshift'
NULL AS 'XXNULLXX'
DELIMITER '|' ALLOWOVERWRITE ESCAPE ADDQUOTES;
This will produce a data file like the following:
"foo"|"卡汶妮 2014秋冬新款女装修身显瘦镂空针\|"bar"
This has the same trailing '\' in the middle column, and no closing '"' character. This is the exact behavior we saw. When loading this data, CSV parsing tools will not find 3 columns, but only 2.
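The column loss can be sketched with Python's csv module, using the same delimiter, quote, and escape settings as the unload (the line below is a shortened stand-in for the real output):

```python
import csv
import io

# The stray backslash escapes the field delimiter, so the parser never
# sees the break between columns 2 and 3 and they run together.
line = '"foo"|"镂空针\\|"bar"\n'
reader = csv.reader(io.StringIO(line), delimiter="|",
                    quotechar='"', escapechar="\\")
row = next(reader)
print(len(row))  # 2 columns instead of 3
```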
More fun with exporting: I originally thought the escaping process was the problem, but I was wrong! The culprit is whatever writes out the bytes. Even if we don't escape or add quotes, we always get corrupted data.
Unload command:
UNLOAD ('select * from sandbox.unicode_test')
to 's3://myredshift/unicode_test/'
IAM_ROLE 'arn:aws:iam::1234567890:role/myredshift'
NULL AS 'XXNULLXX'
DELIMITER '|' ALLOWOVERWRITE ADDQUOTES;
Result:
"foo"|"卡汶妮 2014秋冬新款女装修身显瘦镂空针 |"bar"
Unload Command:
UNLOAD ('select * from sandbox.unicode_test')
to 's3://myredshift/unicode_test/'
IAM_ROLE 'arn:aws:iam::1234567890:role/myredshift'
NULL AS 'XXNULLXX'
DELIMITER '|' ALLOWOVERWRITE;
Result:
foo|卡汶妮 2014秋冬新款女装修身显瘦镂空针 |bar
Both of these will produce broken data.
Note: when checking this out, be careful copying the Unicode strings. I had a lot of trouble producing this bug report because copying the Unicode strips the NULL bytes in many cases. Let me know your success with reproducing this bug, and ideally when the exporter will be fixed to handle NULL bytes.
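Because copy/paste silently drops the NULL bytes, it's safer to verify programmatically. A sketch follows; the psycopg2 fetch is left as comments (connection details are yours to fill in), and the inspection helper itself is self-contained:

```python
# Fetch the suspect column with psycopg2, then inspect it in code
# rather than by eye:
#
# import psycopg2
# conn = psycopg2.connect("dbname=... host=... user=... password=...")
# cur = conn.cursor()
# cur.execute("SELECT unicode_string FROM sandbox.unicode_test")
# (value,) = cur.fetchone()

def describe_nuls(value: str) -> str:
    """Report positions of embedded U+0000 code points, if any."""
    positions = [i for i, ch in enumerate(value) if ch == "\x00"]
    if not positions:
        return "no NUL bytes"
    return f"{len(positions)} NUL(s) at code-point offsets {positions}"

print(describe_nuls("针\x00织\x00打"))  # 2 NUL(s) at code-point offsets [1, 3]
```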
Andrew