Skip to content

Instantly share code, notes, and snippets.

@emmanuelnk
Last active April 26, 2023 16:25
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save emmanuelnk/33bf26dbc28d0b054d92c5a74c6da07e to your computer and use it in GitHub Desktop.
Save emmanuelnk/33bf26dbc28d0b054d92c5a74c6da07e to your computer and use it in GitHub Desktop.
Useful Athena Queries

A collection of useful Athena Queries

Web Application Firewall

Create WAF logs with projection (no need to partition)

  • See the docs for reference

  • Replace the values BUCKET_NAME, AWS_ACCOUNT_NUMBER, AWS_REGION, WEB_ACL_NAME accordingly

  • Replace 'projection.date.range' = '2023/01/01/00/00,NOW', with the date your WAF logging begins (find in S3)

    Query
    CREATE EXTERNAL TABLE `waf_logs`(
      `timestamp` bigint,
      `formatversion` int,
      `webaclid` string,
      `terminatingruleid` string,
      `terminatingruletype` string,
      `action` string,
      `terminatingrulematchdetails` array <
                                        struct <
                                            conditiontype: string,
                                            sensitivitylevel: string,
                                            location: string,
                                            matcheddata: array < string >
                                              >
                                         >,
      `httpsourcename` string,
      `httpsourceid` string,
      `rulegrouplist` array <
                          struct <
                              rulegroupid: string,
                              terminatingrule: struct <
                                                  ruleid: string,
                                                  action: string,
                                                  rulematchdetails: array <
                                                                       struct <
                                                                           conditiontype: string,
                                                                           sensitivitylevel: string,
                                                                           location: string,
                                                                           matcheddata: array < string >
                                                                              >
                                                                        >
                                                    >,
                              nonterminatingmatchingrules: array <
                                                                  struct <
                                                                      ruleid: string,
                                                                      action: string,
                                                                      rulematchdetails: array <
                                                                                           struct <
                                                                                               conditiontype: string,
                                                                                               sensitivitylevel: string,
                                                                                               location: string,
                                                                                               matcheddata: array < string >
                                                                                                  >
                                                                                           >
                                                                        >
                                                                 >,
                              excludedrules: string
                                >
                           >,
    `ratebasedrulelist` array <
                             struct <
                                 ratebasedruleid: string,
                                 limitkey: string,
                                 maxrateallowed: int
                                   >
                              >,
      `nonterminatingmatchingrules` array <
                                        struct <
                                            ruleid: string,
                                            action: string,
                                            rulematchdetails: array <
                                                                 struct <
                                                                     conditiontype: string,
                                                                     sensitivitylevel: string,
                                                                     location: string,
                                                                     matcheddata: array < string >
                                                                        >
                                                                 >,
                                            captcharesponse: struct <
                                                                responsecode: string,
                                                                solvetimestamp: string
                                                                 >
                                              >
                                         >,
      `requestheadersinserted` array <
                                    struct <
                                        name: string,
                                        value: string
                                          >
                                     >,
      `responsecodesent` string,
      `httprequest` struct <
                        clientip: string,
                        country: string,
                        headers: array <
                                    struct <
                                        name: string,
                                        value: string
                                          >
                                     >,
                        uri: string,
                        args: string,
                        httpversion: string,
                        httpmethod: string,
                        requestid: string
                          >,
      `labels` array <
                   struct <
                       name: string
                         >
                    >,
      `captcharesponse` struct <
                            responsecode: string,
                            solvetimestamp: string,
                            failureReason: string
                              >
    )
    PARTITIONED BY ( 
    `region` string, 
    `date` string) 
    ROW FORMAT SERDE 
      'org.openx.data.jsonserde.JsonSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://BUCKET_NAME/AWSLogs/AWS_ACCOUNT_NUMBER/WAFLogs/AWS_REGION/WEB_ACL_NAME/'
    TBLPROPERTIES(
     'projection.enabled' = 'true',
     'projection.region.type' = 'enum',
     'projection.region.values' = 'AWS_REGION',
     'projection.date.type' = 'date',
     'projection.date.range' = '2023/01/01/00/00,NOW',
     'projection.date.format' = 'yyyy/MM/dd/HH/mm',
     'projection.date.interval' = '1',
     'projection.date.interval.unit' = 'HOURS',
     'storage.location.template' = 's3://BUCKET_NAME/AWSLogs/AWS_ACCOUNT_NUMBER/WAFLogs/${region}/WEB_ACL_NAME/${date}')

Application Load Balancer Access Logs

Create ALB logs with projection

  • See the docs here

  • Replace the values BUCKET_NAME, AWS_ACCOUNT_NUMBER, AWS_REGION accordingly

  • Replace 'projection.date.range' = '2022/01/01,NOW', with the date your ALB logging begins (find in S3)

    Query
    CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs_projected (
               type string,
               time string,
               elb string,
               client_ip string,
               client_port int,
               target_ip string,
               target_port int,
               request_processing_time double,
               target_processing_time double,
               response_processing_time double,
               elb_status_code int,
               target_status_code string,
               received_bytes bigint,
               sent_bytes bigint,
               request_verb string,
               request_url string,
               request_proto string,
               user_agent string,
               ssl_cipher string,
               ssl_protocol string,
               target_group_arn string,
               trace_id string,
               domain_name string,
               chosen_cert_arn string,
               matched_rule_priority string,
               request_creation_time string,
               actions_executed string,
               redirect_url string,
               lambda_error_reason string,
               target_port_list string,
               target_status_code_list string,
               classification string,
               classification_reason string
               )
               PARTITIONED BY
               (
                `date` STRING
               )
               ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
               WITH SERDEPROPERTIES (
               'serialization.format' = '1',
               'input.regex' = 
               '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) (.*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-_]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"')
               LOCATION 's3://BUCKET_NAME/AWSLogs/AWS_ACCOUNT_NUMBER/elasticloadbalancing/AWS_REGION/'
               TBLPROPERTIES
               (
                "projection.enabled" = "true",
                "projection.date.type" = "date",
                "projection.date.range" = "2022/01/01,NOW",
                "projection.date.format" = "yyyy/MM/dd",
                "projection.date.interval" = "1",
                "projection.date.interval.unit" = "DAYS",
                "storage.location.template" = "s3://BUCKET_NAME/AWSLogs/AWS_ACCOUNT_NUMBER/elasticloadbalancing/AWS_REGION/${date}"
               )

Get last requests of a stopped instance/task

Query
# Get the IP address of a crashed task from the Stopped Tasks Widget on the Production-API-Segmented CW Dashboard
SELECT *
FROM alb_logs_projected
  WHERE date = '2022/10/17' 
  AND target_processing_time > 10 # seconds - requests that take long to process can be problematic
  AND received_bytes > 8000 # bytes- large payloads can typically cause OutOfMemory crashes on ecs tasks
-- AND request_url LIKE '%some_url' 
-- AND regexp_like(request_verb, 'PUT') 
AND regexp_like(target_ip, '172.31.29.76') # check ip of failed task
-- AND regexp_like(target_ip, '172.31.29.76|172.31.14.108|172.31.18.14|172.31.11.49|172.31.27.153') # check multiple ip
ORDER BY time DESC

Get the amount of data sent and received per url in an ALB

Query
SELECT
  request_verb,
  request_url,
  count(*) as count,
  elb_status_code,
  count(*)/(12 * 60) as count_per_min,
  ceil(avg(received_bytes/1024)) as avg_received_kb, 
  ceil(avg(sent_bytes/1024)) as avg_sent_kb, 
  ceil(avg(target_processing_time)) as avg_target_processing_time_secs,
  ceil(max(received_bytes/1024)) as max_received_kb, 
  ceil(max(sent_bytes/1024)) as max_sent_kb, 
  ceil(max(target_processing_time)) as max_target_processing_time_secs,
  ceil((ceil(avg(target_processing_time)) * count(*))/3600) as total_processing_time_hrs
FROM alb_logs_projected
WHERE date like '2023/01/%'
GROUP BY 1, 2, 4
ORDER BY 3 DESC

Cloudtrail

Create projected cloudtrail logs

Query
CREATE EXTERNAL TABLE cloudtrail_logs_pp(
    eventVersion STRING,
    userIdentity STRUCT<
        type: STRING,
        principalId: STRING,
        arn: STRING,
        accountId: STRING,
        invokedBy: STRING,
        accessKeyId: STRING,
        userName: STRING,
        sessionContext: STRUCT<
            attributes: STRUCT<
                mfaAuthenticated: STRING,
                creationDate: STRING>,
            sessionIssuer: STRUCT<
                type: STRING,
                principalId: STRING,
                arn: STRING,
                accountId: STRING,
                userName: STRING>,
            ec2RoleDelivery:string,
            webIdFederationData:map<string,string>
        >
    >,
    eventTime STRING,
    eventSource STRING,
    eventName STRING,
    awsRegion STRING,
    sourceIpAddress STRING,
    userAgent STRING,
    errorCode STRING,
    errorMessage STRING,
    requestparameters STRING,
    responseelements STRING,
    additionaleventdata STRING,
    requestId STRING,
    eventId STRING,
    readOnly STRING,
    resources ARRAY<STRUCT<
        arn: STRING,
        accountId: STRING,
        type: STRING>>,
    eventType STRING,
    apiVersion STRING,
    recipientAccountId STRING,
    serviceEventDetails STRING,
    sharedEventID STRING,
    vpcendpointid STRING,
    tlsDetails struct<
        tlsVersion:string,
        cipherSuite:string,
        clientProvidedHostHeader:string>
  )
PARTITIONED BY (
   `account` STRING,
   `region` STRING,
   `timestamp` STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://CLOUDTRAIL_BUCKET_NAME/AWSLogs/o-l09l6qorsp'
TBLPROPERTIES (
  'projection.enabled'='true', 
  'projection.account.type'='enum',
  'projection.account.values'='ACCOUNT_1,ACCOUNT_2,ACCOUNT_3,ACCOUNT_4,ACCOUNT_5,ACCOUNT_6,ACCOUNT_7',
  'projection.region.type'='enum',
  'projection.region.values'='ca-central-1,us-east-1,us-east-2,eu-central-1',
  'projection.timestamp.format'='yyyy/MM/dd', 
  'projection.timestamp.interval'='1', 
  'projection.timestamp.interval.unit'='DAYS', 
  'projection.timestamp.range'='2023/01/01,NOW', 
  'projection.timestamp.type'='date', 
  'storage.location.template'='s3://CLOUDTRAIL_BUCKET_NAME/AWSLogs/o-l09l6xxxx/${account}/CloudTrail/${region}/${timestamp}')

Notes

  • Choose the right projection type based on the S3 location. Since I was logging cloudtrail across an organization, the s3 location turned out different than the one in the AWS docs. An Id like o-l09l6xxxx is added to the S3 path and I needed to acccount for that.
  • It's super important to note that even though SQL is case insensitive, everything in storage.location.template has to be in lower case, no exceptions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment