@gordol
Created November 4, 2020 22:31
Corrupt data landing in S3 via the aws_s3.query_export_to_s3 extension
-- When using aws_s3.query_export_to_s3 on Aurora PostgreSQL 11, data is reliably and predictably corrupted at certain positions in the payloads that land in S3.
-- It appears that once a record crosses a certain byte length, it is reset and resent mid-stream to S3.
-- We have been able to reproduce this both positively and negatively: ensuring that the total record length lands on a power-of-two byte boundary seems to prevent the issue from occurring.
-- As the length of the data increases, so does the frequency of the corruption. For small data sets, it may never surface.
-- Running the SQL below, the corruption occurs once per 100,000 records with 128+9 byte records, twice with 256+9, four times with 512+9, and so on.
-- Each record looks like {"A":""}\n (with the value inside the quotes), where \n is a newline control character; the JSON wrapper plus newline amount to 9 extra bytes on top of the value.
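The 9-byte overhead can be sanity-checked outside the database. A minimal sketch in Python (`record_length` is a hypothetical helper, not part of the extension; it models one exported record as the JSON wrapper plus a trailing newline):

```python
import json

# Hypothetical helper: models one exported record, {"A":"<value>"}\n,
# to confirm the 9-byte wrapper overhead described above.
def record_length(value_len: int) -> int:
    record = json.dumps({"A": "A" * value_len}, separators=(",", ":"))
    return len(record) + 1  # +1 for the trailing \n

print(record_length(128))  # 137 == 128 + 9
```

This is why the 128-9, 256-9, and 512-9 value lengths below produce total record lengths that land exactly on powers of two.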
-- these queries WILL result in broken JSON in S3
select * from aws_s3.query_export_to_s3(
    'select row_to_json(test) from (SELECT array_to_string(ARRAY(SELECT ''A'' FROM generate_series(1,128)), '''') A FROM generate_series(1,100000)) test',
    aws_commons.create_s3_uri(
        'example-bucket/debug',
        'test_data-128.jsonl',
        'us-east-1'),
    options := 'format text');

select * from aws_s3.query_export_to_s3(
    'select row_to_json(test) from (SELECT array_to_string(ARRAY(SELECT ''A'' FROM generate_series(1,256)), '''') A FROM generate_series(1,100000)) test',
    aws_commons.create_s3_uri(
        'example-bucket/debug',
        'test_data-256.jsonl',
        'us-east-1'),
    options := 'format text');

select * from aws_s3.query_export_to_s3(
    'select row_to_json(test) from (SELECT array_to_string(ARRAY(SELECT ''A'' FROM generate_series(1,512)), '''') A FROM generate_series(1,100000)) test',
    aws_commons.create_s3_uri(
        'example-bucket/debug',
        'test_data-512.jsonl',
        'us-east-1'),
    options := 'format text');
-- these queries will NOT result in broken JSON in S3
select * from aws_s3.query_export_to_s3(
    'select row_to_json(test) from (SELECT array_to_string(ARRAY(SELECT ''A'' FROM generate_series(1,128-9)), '''') A FROM generate_series(1,100000)) test',
    aws_commons.create_s3_uri(
        'example-bucket/debug',
        'test_data-128-9.jsonl',
        'us-east-1'),
    options := 'format text');

select * from aws_s3.query_export_to_s3(
    'select row_to_json(test) from (SELECT array_to_string(ARRAY(SELECT ''A'' FROM generate_series(1,256-9)), '''') A FROM generate_series(1,100000)) test',
    aws_commons.create_s3_uri(
        'example-bucket/debug',
        'test_data-256-9.jsonl',
        'us-east-1'),
    options := 'format text');

select * from aws_s3.query_export_to_s3(
    'select row_to_json(test) from (SELECT array_to_string(ARRAY(SELECT ''A'' FROM generate_series(1,512-9)), '''') A FROM generate_series(1,100000)) test',
    aws_commons.create_s3_uri(
        'example-bucket/debug',
        'test_data-512-9.jsonl',
        'us-east-1'),
    options := 'format text');
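One way to spot the corrupted records after export is to try parsing every line of the downloaded .jsonl objects. A minimal sketch (`find_broken_lines` is a hypothetical helper; it takes an iterable of lines so it works equally on an open file handle or a list):

```python
import json

def find_broken_lines(lines):
    """Return (line_number, line) pairs that fail to parse as JSON."""
    broken = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError:
            broken.append((lineno, line))
    return broken

# A record that was reset and resent mid-stream leaves a truncated
# fragment glued to the next record, which no longer parses:
lines = ['{"A":"AAAA"}', '{"A":"AA{"A":"AAAA"}']
print(find_broken_lines(lines))  # line 2 is reported as broken
```

To scan a downloaded export, pass an open file handle, e.g. `find_broken_lines(open("test_data-128.jsonl"))`.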