Skip to content

Instantly share code, notes, and snippets.

@iconara
Last active September 4, 2023 13:39
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save iconara/4969c247e8abb69600cdbe6f4b20f50b to your computer and use it in GitHub Desktop.
Save iconara/4969c247e8abb69600cdbe6f4b20f50b to your computer and use it in GitHub Desktop.
Athena metadata file parser
#!/usr/bin/env ruby
# This code parses the .csv.metadata files written by Athena and produces a
# structure similar to what you get from the GetQueryResults API call.
#
# I have reverse engineered the format and I'm not sure about all the details,
# but it seems to correspond to the GetQueryResults API call well. Some things,
# like nullability, the difference between name and label, and the schema_name
# and table_name fields, I haven't been able to figure out because they seem
# not to be used, or never takes any other values.
#
# The format looks something like this:
# * A header starting with 0x0a followed by a Pascal string (one byte for the
# size followed by the bytes of the string). The string seems to include the
# date and time of the execution, and a random string. I'm not sure if this
# corresponds to the query execution ID somehow.
# * One or more structures corresponding to the columns of the result. The
# structure starts with 0x22 followed by the size of the structure as one byte.
# Next comes a field ID as one byte, and the field contents. The field contents
# are determined by the field ID. The name, label, and type are Pascal strings,
# the precision is a varint, the scale, nullability, and case insensitivity
# are just single bytes.
# * I have not seen the schema name or table name fields in use, and haven't
# been able to provoke their use either, I can only guess that those would be
# fields 0x12 and 0x1a and have values that are Pascal strings.
# * I haven't been able to make Athena produce different values for name and
# label, so I can't actually tell which field is which, I have made a guess
# that name comes first, but it could be the other way around.
# * I haven't been able to make Athena set any other value for nullable than
# "UNKNOWN", corresponding to 0x03, so I can't tell what "NO_NULLS" and
# "NULLABLE" would correspond to. My guess is 0x01 and 0x02 respectively.
class Parser
FIELDS = {
0x0a => :catalog,
0x12 => :schema_name, # guess, never used
0x1a => :table_name, # guess, never used
0x22 => :name, # guess, never different from label
0x2a => :label, # guess, never different from name
0x32 => :type,
0x38 => :precision,
0x40 => :scale,
0x48 => :nullable,
0x50 => :case_sensitive,
}
NULLABILITY = {
0x01 => :no_nulls, # guess, I have never actually seen this value
0x02 => :nullable, # guess, I have never actually seen this value
0x03 => :unknown,
}
def initialize(buffer)
@buffer = buffer
@offset = 0
end
def parse
name = read_name
columns = []
while @offset < @buffer.bytesize
columns << read_column
end
{:name => name, :columns => columns}
end
private def expect(n)
actual = @buffer[@offset].ord
unless actual == n
raise format('Expected 0x%02x at position %d, but was 0x%02x', n, @offset, actual)
end
end
private def read_name
expect(10)
@offset += 1
read_string
end
private def read_column
expect(34)
@offset += 1
size = read_byte
end_offset = @offset + size
fields = []
while @offset < end_offset
field = read_field
fields << field
end
fields.to_h
end
private def read_field
id = read_byte
field = FIELDS[id]
case field
when :catalog, :name, :label, :type
str = read_string
[field, str]
when :nullable
raw_value = read_byte
value = NULLABILITY[raw_value] || raw_value
[field, value]
when :scale
[field, read_byte]
when :case_sensitive
[field, read_byte == 1]
when :precision
value = read_varint
[field, value]
else
raise format('Unsupported field id: 0x%x', id)
end
end
private def read_byte
value = @buffer[@offset, 1].ord
@offset += 1
value
end
private def read_string
size = read_byte
str = @buffer[@offset, size]
@offset += size
str
end
private def read_varint
b = read_byte
n = b & 0b01111111
i = 1
until b & 0b10000000 == 0
b = read_byte
n |= ((b & 0b01111111) << (i * 7))
i += 1
end
n
end
end
if File.identical?(__FILE__, $0)
require 'json'
puts(JSON.pretty_generate(Parser.new(ARGF.read.force_encoding('ascii-8bit')).parse))
end
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment