Skip to content

Instantly share code, notes, and snippets.

@ArthurRocha
Last active July 3, 2020 12:49
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save ArthurRocha/9c95d827de3731216c22244682b2d4c2 to your computer and use it in GitHub Desktop.
Save ArthurRocha/9c95d827de3731216c22244682b2d4c2 to your computer and use it in GitHub Desktop.
AWS Athena create table statement for Application Load Balancer logs (partitioned)
CREATE EXTERNAL TABLE IF NOT EXISTS {{DATABASE_NAME.TABLE_NAME}} (
type string,
time string,
elb string,
client_ip string,
client_port string,
target string,
request_processing_time int,
target_processing_time int,
response_processing_time int,
elb_status_code int,
target_status_code string,
received_bytes int,
sent_bytes int,
request_verb string,
request_url string,
request_proto string,
user_agent string,
ssl_cipher string,
ssl_protocol string,
target_group_arn string,
trace_id string
)
PARTITIONED BY(year string, month string, day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*) ([-0-9]*) ([-0-9]*) ([-0-9]*) ([-0-9]*) ([^ ]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) ([^ ]*)\" \"([^\"]*)\" ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*)'
) LOCATION 's3://{{BUCKET_NAME}}/AWSLogs/{{ACCOUNT_ID}}/elasticloadbalancing/us-west-2/';
ALTER TABLE {{DATABASE_NAME.TABLE_NAME}} add partition (year="2017", month="02", day="21")
location "s3://{{BUCKET_NAME}}/AWSLogs/{{ACCOUNT_ID}}/elasticloadbalancing/us-west-2/2017/02/21";
ALTER TABLE {{DATABASE_NAME.TABLE_NAME}} drop partition (year="2017", month="02", day="21")
@krynble
Copy link

krynble commented May 17, 2019

Can I suggest a little update?

Your gist is awesome!

CREATE EXTERNAL TABLE IF NOT EXISTS <tablename> (
	type string,
	time string,
	elb string,
	client_ip string,
	client_port int,
	target_ip string,
	target_port int,
	request_processing_time double,
	target_processing_time double,
	response_processing_time double,
	elb_status_code string,
	target_status_code string,
	received_bytes bigint,
	sent_bytes bigint,
	request_method string,
	request_url string,
	http_version string,
	user_agent string,
	ssl_cipher string,
	ssl_protocol string,
	target_group_arn string,
	trace_id string,
	domain_name string,
	chosen_cert_arn string, 
	matched_rule_priority string,
	request_creation_time string,
	actions_executed string,
	unknown1 string,
	unknown1 string
 )
PARTITIONED BY(year string, month string, day string) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = 
'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) ([-0-9]*) ([-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) ([^ ]*)\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) (\"[^\"]*\") (\"[^\"]*\") (\"[^\"]*\") ([-.0-9]*) ([^ ]*) (\"[^\"]*\") (\"[^\"]*\") (\"[^\"]*\")' )
LOCATION 's3://<bucket>/AWSLogs/<acctId>/elasticloadbalancing/<region>';

Highlights: split client and target ip and ports, added more fields that weren't matched and

@lylecpa
Copy link

lylecpa commented Aug 15, 2019

need to make 2nd unknown1 and unknown2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment