Skip to content

Instantly share code, notes, and snippets.

@farski
Last active November 21, 2021 14:29
Show Gist options
  • Save farski/c796da7bb31e8e835288360b628c9c78 to your computer and use it in GitHub Desktop.
Save farski/c796da7bb31e8e835288360b628c9c78 to your computer and use it in GitHub Desktop.

More snippets

Access Logs

The included SQL and CloudFormation YAML snippets create AWS Glue tables for access logs from Application Load Balancers, which can be queried using Amazon Athena. The snippets are intended to be equivalent.

The tables that are created utilize partition projections, which allows for efficient, cost-effective querying of the log data in Amazon S3, without the need to manual create new partitions. Instead, the partitions are created based on the filesnames of the log files. Using the year, month, and day portions of the filenames, daily partitions are available when querying the log data.

Date-based partitions have a specific range in the configuration, which should be set to something appropriate for your use case.

When querying a table that includes the partitions, the SELECT should include a WHERE clause that filters data covered by the range of the partition. For example:

SELECT *
FROM __DATABASE_NAME__.__TABLE_NAME__ 
WHERE "date" >= date_format(current_date - interval '7' day, '%Y/%m/%d')

The date field in the WHERE clause matches the date projection configured on the table. The name of this projection is arbitrary.

Note that while ELB access logs include a time field, you must still use the date projection fields to utilize partitioning. You can, for example, use both the date and time fields in a WHERE clause to access certain partitions, and also filter more specifically by the time.

-- See: https://gist.github.com/farski/c796da7bb31e8e835288360b628c9c78
CREATE EXTERNAL TABLE IF NOT EXISTS __DATABASE_NAME__.__TABLE_NAME__ (
type string,
`time` string,
elb string,
client_ip string,
client_port int,
target_ip string,
target_port int,
request_processing_time double,
target_processing_time double,
response_processing_time double,
elb_status_code string,
target_status_code string,
received_bytes bigint,
sent_bytes bigint,
request_method string,
request_uri string,
request_http_version string,
user_agent string,
ssl_cipher string,
ssl_protocol string,
target_group_arn string,
trace_id string,
domain_name string,
chosen_cert_arn string,
matched_rule_priority string,
request_creation_time string,
actions_executed string,
redirect_url string,
lambda_error_reason string,
target_port_list string,
target_status_code_list string,
classification string,
classification_reason string
)
COMMENT "TKTKTK A description of the table"
PARTITIONED BY (
`date` string
)
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.RegexSerDe"
WITH SERDEPROPERTIES (
"serialization.format" = "1",
"input.regex" = '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"'
)
STORED AS
INPUTFORMAT "org.apache.hadoop.mapred.TextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION "s3://__BUCKET_AND_PREFIX__/AWSLogs/__ACCOUNT_ID__/elasticloadbalancing/__REGION__/"
TBLPROPERTIES (
"projection.enabled" = "true",
"projection.date.type" = "date",
"projection.date.range" = "2021/01/01,NOW",
"projection.date.format" = "yyyy/MM/dd",
"projection.date.interval" = "1",
"projection.date.interval.unit" = "DAYS",
"storage.location.template" = "s3://__BUCKET_AND_PREFIX__/AWSLogs/__ACCOUNT_ID__/elasticloadbalancing/__REGION__/${date}"
);
# See: https://gist.github.com/farski/c796da7bb31e8e835288360b628c9c78
AccessLogsGlueTable:
Type: AWS::Glue::Table
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: __DATABASE_NAME__
TableInput:
Description: TKTKTK A description of the table
Name: __TABLE_NAME__
Parameters:
# The name of the projection (e.g., date in projection.date.format) is arbitrary, but
# does need to match in the projection definition, the storage location, the PartitionKeys,
# and any queries made against the table that should honor the partitions.
projection.enabled: "true"
projection.date.type: date
projection.date.range: 2021/01/01,NOW
projection.date.format: yyyy/MM/dd
projection.date.interval: "1"
projection.date.interval.unit: DAYS
storage.location.template: s3://__BUCKET_AND_PREFIX__/AWSLogs/__ACCOUNT_ID__/elasticloadbalancing/__REGION__/${date}
PartitionKeys:
- Name: date
Type: string
StorageDescriptor:
Columns:
- Name: type
Type: string
Comment: http, https, h2, grpcs, ws, or wss
- Name: time
Type: string
Comment: The time the response was generated (YYYY-MM-DDTHH:MM:SS.SSSSSSZ)
- Name: elb
Type: string
Comment: The resource ID of the load balancer
- Name: client_ip
Type: string
Comment: The IP of the requesting client
- Name: client_port
Type: int
Comment: The response return port of the requesting client
- Name: target_ip
Type: string
Comment:
- Name: target_port
Type: int
Comment:
- Name: request_processing_time
Type: double
Comment:
- Name: target_processing_time
Type: double
Comment:
- Name: response_processing_time
Type: double
Comment:
- Name: elb_status_code
Type: string
Comment: The status code of the response
- Name: target_status_code
Type: string
Comment: The status code of the response from the target, or a hyphen
- Name: received_bytes
Type: bigint
Comment: The size of the request
- Name: sent_bytes
Type: bigint
Comment: The size of the response
- Name: request_method
Type: string
Comment: The HTTP method of the request
- Name: request_uri
Type: string
Comment:
- Name: request_http_version
Type: string
Comment:
- Name: user_agent
Type: string
Comment:
- Name: ssl_cipher
Type: string
Comment:
- Name: ssl_protocol
Type: string
Comment:
- Name: target_group_arn
Type: string
Comment:
- Name: trace_id
Type: string
Comment: The value of the X-Amzn-Trace-Id header
- Name: domain_name
Type: string
Comment:
- Name: chosen_cert_arn
Type: string
Comment:
- Name: matched_rule_priority
Type: string
Comment:
- Name: request_creation_time
Type: string
Comment:
- Name: actions_executed
Type: string
Comment:
- Name: redirect_url
Type: string
Comment:
- Name: lambda_error_reason
Type: string
Comment:
- Name: target_port_list
Type: string
Comment:
- Name: target_status_code_list
Type: string
Comment:
- Name: classification
Type: string
Comment:
- Name: classification_reason
Type: string
Comment:
InputFormat: org.apache.hadoop.mapred.TextInputFormat
Location: !Sub s3://${AccessLogsBucket}/AWSLogs/${AWS::AccountId}/elasticloadbalancing/${AWS::Region}/
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
SerdeInfo:
Parameters:
input.regex: >-
([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"
serialization.format: "1"
SerializationLibrary: org.apache.hadoop.hive.serde2.RegexSerDe
TableType: EXTERNAL_TABLE
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment