@andrewgross
Created April 17, 2018 16:45
Alright, I have a test case. This is going to get in the weeds. I am going to focus only on the bad column, called "sku_name" in our DB.
1. The data that Redshift is exporting is truncated.
Exported Version:
卡汶妮 2014秋冬新款女装修身显瘦镂空针
2. The data I see in my DB viewer is truncated.
Viewer Version:
卡汶妮 2014秋冬新款女装修身显瘦镂空针
3. Python (or another language) gets the full version.
Using Psycopg2 to run the query to pull the data gives the full version:
卡汶妮 2014秋冬新款女装修身显瘦镂空针织打底衫 多色高弹 紫色 均码
Raw Bytes (this will be useful later):
\xe5\x8d\xa1\xe6\xb1\xb6\xe5\xa6\xae 2014\xe7\xa7\x8b\xe5\x86\xac\xe6\x96\xb0\xe6\xac\xbe\xe5\xa5\xb3\xe8\xa3\x85\xe4\xbf\xae\xe8\xba\xab\xe6\x98\xbe\xe7\x98\xa6\xe9\x95\x82\xe7\xa9\xba\xe9\x92\x88\x00\xe7\xbb\x87\x00\xe6\x89\x93\xe5\xba\x95\xe8\xa1\xab \xe5\xa4\x9a\x00\xe8\x89\xb2\x00\xe9\xab\x98\x00\xe5\xbc\xb9 \xe7\xb4\xab\xe8\x89\xb2 \xe5\x9d\x87\xe7\xa0\x81
So, the Python version is quite a bit longer, but why? Well, if we line up the strings and check which byte Redshift stops with, we get the following:
Redshift Bytes:
\xe5\x8d\xa1\xe6\xb1\xb6\xe5\xa6\xae 2014\xe7\xa7\x8b\xe5\x86\xac\xe6\x96\xb0\xe6\xac\xbe\xe5\xa5\xb3\xe8\xa3\x85\xe4\xbf\xae\xe8\xba\xab\xe6\x98\xbe\xe7\x98\xa6\xe9\x95\x82\xe7\xa9\xba\xe9\x92\x88
Missing bytes after Redshift stops (just the first few):
\x00\xe7\xbb\x87\x00\xe6\x89\x93\xe5\xba\x95\xe8\xa1\xab
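This comparison can be checked mechanically in Python. The strings below are reconstructed from the raw bytes shown above:

```python
# The value as Redshift exported it, and the full value psycopg2 returns.
exported = "卡汶妮 2014秋冬新款女装修身显瘦镂空针".encode("utf-8")
full = ("卡汶妮 2014秋冬新款女装修身显瘦镂空针"
        "\x00织\x00打底衫 多\x00色\x00高\x00弹 紫色 均码").encode("utf-8")

# The exported value is an exact prefix of the full value...
assert full.startswith(exported)

# ...and the very first byte Redshift dropped is the NUL byte.
assert full[len(exported)] == 0x00
```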
So, Redshift stops exporting the data at `\x00`, the NULL byte. The NULL byte signifies the end of a string in many languages and representations; in Unicode, however, it is legal to have a NULL byte inside a string: it is the valid code point U+0000. As far as I can tell, this is what's happening during the Redshift unload:
1. Redshift grabs the data for unloading. The data is still correct at this point.
2. Redshift needs to quote the data. It adds quote characters around the data, which it still sees as unicode.
3. Redshift needs to escape the data, so it walks the data, sees the NULL byte in the unicode, and tries to escape it to prevent misinterpretation.
4. Redshift goes to export this data, but forgets that escaping applies to NULL bytes as well. It gets to the NULL byte, assumes it has finished reading the data, and writes what it has to the file. This is why the exported data has a trailing "\" but no final quote. Even without escaping, any time the exporting code sees the NULL byte it stops reading. This is incorrect behavior.
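I obviously can't see Redshift's internals, so step 4 is a guess, but the observed behavior matches classic C-string semantics, where a buffer is read until the first NUL byte. A minimal Python sketch of that failure mode:

```python
import ctypes

# The correct value, with an embedded NUL byte (a legal Unicode code point).
full = "针\x00织打底衫".encode("utf-8")

# A strlen()-style reader stops at the first NUL byte, exactly as the
# unload output suggests Redshift's writer does.
buf = ctypes.create_string_buffer(full)   # copies the bytes (plus a trailing NUL)
truncated = ctypes.string_at(buf)         # reads char-by-char until the first \x00

assert truncated == "针".encode("utf-8")  # everything after the NUL is lost
```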
##### Steps to Reproduce
I have attached a sample data file, unicode_test.json, for testing.
Schema:
CREATE TABLE sandbox.unicode_test (
"simple_string" VARCHAR(400),
"unicode_string" VARCHAR(400),
"simple_string2" VARCHAR(400)
);
Load Command:
COPY sandbox.unicode_test
FROM 's3://myredshift/unicode_test.json'
IAM_ROLE 'arn:aws:iam::1234567890:role/myredshift'
JSON 'auto';
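I can't attach the data file inline here, but a line equivalent to unicode_test.json can be generated like this (the field values are stand-ins; the key point is that `json.dumps` escapes the embedded NUL as `\u0000`, so it survives the round trip into Redshift):

```python
import json

# Stand-in row matching the sandbox.unicode_test schema; the middle column
# carries two embedded NUL bytes, like the real data.
row = {
    "simple_string": "foo",
    "unicode_string": "卡汶妮 2014秋冬新款女装修身显瘦镂空针\x00织\x00打底衫",
    "simple_string2": "bar",
}

line = json.dumps(row, ensure_ascii=False)

# The NUL bytes are escaped as \u0000 in the JSON text...
assert "\\u0000" in line
# ...and are restored as real NUL bytes when parsed back.
assert json.loads(line)["unicode_string"].count("\x00") == 2

with open("unicode_test.json", "w", encoding="utf-8") as f:
    f.write(line + "\n")
```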
When I query this data I see the same issue I had before, where the representation is truncated in my DB viewer, but fine when queried with Python.
Exporting the Data:
UNLOAD ('select * from sandbox.unicode_test')
to 's3://myredshift/unicode_test/'
IAM_ROLE 'arn:aws:iam::1234567890:role/myredshift'
NULL AS 'XXNULLXX'
DELIMITER '|' ALLOWOVERWRITE ESCAPE ADDQUOTES;
This will produce a data file like the following:
"foo"|"卡汶妮 2014秋冬新款女装修身显瘦镂空针\|"bar"
This has the same trailing '\' in the middle column, and no closing '"' character. This is the exact behavior we saw. When loading this data with CSV parsing tools, they will not find 3 columns, only 2.
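The column-count failure can be demonstrated with Python's csv module, using a shorter stand-in value for the broken middle field:

```python
import csv
import io

# A stand-in for the broken unload line: the middle field ends with the
# stray backslash and is missing its closing quote.
broken = '"foo"|"abc\\|"bar"\n'

rows = list(csv.reader(io.StringIO(broken),
                       delimiter="|", quotechar='"', escapechar="\\"))

# The escaped delimiter swallows the field boundary, so the parser
# finds only 2 columns instead of 3.
assert len(rows[0]) == 2
```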
More fun with Exporting.
Although I initially thought the escaping process was the problem, I was wrong! The culprit is whatever writes out the NULL bytes: even if we don't escape or add quotes, we always get corrupted data.
Unload command:
UNLOAD ('select * from sandbox.unicode_test')
to 's3://myredshift/unicode_test/'
IAM_ROLE 'arn:aws:iam::1234567890:role/myredshift'
NULL AS 'XXNULLXX'
DELIMITER '|' ALLOWOVERWRITE ADDQUOTES;
Result:
"foo"|"卡汶妮 2014秋冬新款女装修身显瘦镂空针 |"bar"
Unload Command:
UNLOAD ('select * from sandbox.unicode_test')
to 's3://myredshift/unicode_test/'
IAM_ROLE 'arn:aws:iam::1234567890:role/myredshift'
NULL AS 'XXNULLXX'
DELIMITER '|' ALLOWOVERWRITE;
Result:
foo|卡汶妮 2014秋冬新款女装修身显瘦镂空针 |bar
Both of these will produce broken data.
Note: when checking this out, be careful copying the unicode strings. I had a lot of trouble producing this bug report because copying the unicode strips the NULL bytes in many cases. Let me know whether you can reproduce this bug, and ideally when the exporter will be fixed to handle NULL bytes.
Andrew