Skip to content

Instantly share code, notes, and snippets.

@farski
Last active November 18, 2021 17:56
Show Gist options
  • Save farski/2b86511188079c09c28d6cb3feff41c0 to your computer and use it in GitHub Desktop.
Save farski/2b86511188079c09c28d6cb3feff41c0 to your computer and use it in GitHub Desktop.

More snippets

Organization Trail Logs

Organization trails are a mechanism for capturing AWS CloudTrail logs across entire AWS Organizations. They are similar to standard CloudTrail logs, but have a slightly different file organization structure in S3, which affects their table definitions.

The included SQL and CloudFormation YAML snippets create AWS Glue tables for organization trail logs, which can be queried using Amazon Athena. The snippets are intended to be equivalent.

The tables that are created utilize partition projections, which allows for efficient, cost-effective querying of the log data in Amazon S3, without the need to manual create new partitions. Instead, the partitions are created based on the filesnames of the log files. Using various portions of the filenames, daily partitions are available when querying the log data for specific regions and accounts.

When using a snippet to create a table, you must set the values of the region and account partition projections. Each of these expects a comma-separeted list (e.g., us-east-1,us-west-2 or 111122223333,888899990000), and determines which regions and accounts within the organization to create partitions for. There is no limit, but Athena recommends keeping each partition to 12 or fewer. This allows you to create a single Glue table for multiple regions and accounts, rather than needing to create individual tables for each.

When querying a table that includes the partitions, the SELECT should include a WHERE clause that filters data covered by the range of the partition. For example:

SELECT *
FROM __DATABASE_NAME__.__TABLE_NAME__ 
WHERE "date" >= date_format(current_date - interval '7' day, '%Y/%m/%d')
  AND region IN ('us-east-1,us-west-2')
  AND account IN ('111122223333')

The date, region, and account fields in the WHERE clause match the projections configured on the table. The name of these projections is arbitrary.

Note that when records from CloudTrail logs include fields similar to the projections, such as awsRegion or eventTime, you must still use the specific projection fields to utilize partitioning. You can, for example, use borh the date and eventTime fields in a WHERE clause to access certain partitions, and also filter more specifically by the event time.

-- See: https://gist.github.com/farski/2b86511188079c09c28d6cb3feff41c0
CREATE EXTERNAL TABLE IF NOT EXISTS __DATABASE_NAME__.__TABLE_NAME__ (
eventVersion STRING,
userIdentity struct<
type: string,
principalId: string,
arn: string,
accountId: string,
invokedBy: string,
accessKeyId: string,
userName: string,
sessionContext: struct<
attributes: struct<
mfaAuthenticated: string,
creationDate: string>,
sessionIssuer: struct<
type: string,
principalId: string,
arn: string,
accountId: string,
userName: string>>>,
eventTime string,
eventSource string,
eventName string,
awsRegion string,
sourceIpAddress string,
userAgent string,
errorCode string,
errorMessage string,
requestParameters string,
responseElements string,
additionalEventData string,
requestId string,
eventId string,
readOnly string,
resources ARRAY<struct<
arn: string,
accountId: string,
type: string>>,
eventType string,
apiVersion string,
recipientAccountId string,
serviceEventDetails string,
sharedEventID string,
vpcEndpointId string
)
COMMENT "TKTKTK A description of the table"
PARTITIONED BY (
`date` string,
region string,
account string
)
ROW FORMAT SERDE "com.amazon.emr.hive.serde.CloudTrailSerde"
STORED AS
INPUTFORMAT "com.amazon.emr.cloudtrail.CloudTrailInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION "s3://__BUCKET__/AWSLogs/__ORGANIZATION_ID_"
TBLPROPERTIES (
"projection.enabled" = "true",
"projection.date.type" = "date",
"projection.date.range" = "2021/01/01,NOW",
"projection.date.format" = "yyyy/MM/dd",
"projection.date.interval" = "1",
"projection.date.interval.unit" = "DAYS",
"projection.region.type" = "enum",
"projection.region.values" = "__LIST,OF,REGIONS__",
"projection.account.type" = "enum",
"projection.account.values" = "__LIST_OF_ACCOUNT_IDS__",
"storage.location.template" = "s3://__BUCKET__/AWSLogs/__ORGANIZATION_ID_/${account}/CloudTrail/${region}/${date}"
);
# See: https://gist.github.com/farski/2b86511188079c09c28d6cb3feff41c0
OrganizationTrailTable:
Type: AWS::Glue::Table
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: __DATABASE_NAME__
TableInput:
Description: TKTKTK A description of the table
Name: __TABLE_NAME__
Parameters:
projection.enabled: "true"
projection.date.type: date
projection.date.range: 2021/01/01,NOW
projection.date.format: yyyy/MM/dd
projection.date.interval: "1"
projection.date.interval.unit: DAYS
projection.region.type: enum
projection.region.values: __LIST_OF_REGIONS__
projection.account.type: enum
projection.account.values: __LIST_OF_ACCOUNT_IDS__
storage.location.template: !Sub s3://__BUCKET__/AWSLogs/__ORGANIZATION_ID_/${!account}/CloudTrail/${!region}/${!date}
PartitionKeys:
- Name: date
Type: string
- Name: region
Type: string
- Name: account
Type: string
StorageDescriptor:
Columns:
- Name: eventVersion
Type: string
- Name: userIdentity
Type: struct<type:string,principalId:string,arn:string,accountId:string,invokedBy:string,accessKeyId:string,userName:string,sessionContext:struct<attributes:struct<mfaAuthenticated:string,creationDate:string>,sessionIssuer:struct<type:string,principalId:string,arn:string,accountId:string,userName:string>>>
- Name: eventTime
Type: string
- Name: eventSource
Type: string
- Name: eventName
Type: string
- Name: awsRegion
Type: string
- Name: sourceIpAddress
Type: string
- Name: userAgent
Type: string
- Name: errorCode
Type: string
- Name: errorMessage
Type: string
- Name: requestParameters
Type: string
- Name: responseElements
Type: string
- Name: additionalEventData
Type: string
- Name: requestId
Type: string
- Name: eventId
Type: string
- Name: readOnly
Type: string
- Name: resources
Type: array<struct<arn:string,accountId:string,type:string>>
- Name: eventType
Type: string
- Name: apiVersion
Type: string
- Name: recipientAccountId
Type: string
- Name: serviceEventDetails
Type: string
- Name: sharedEventID
Type: string
- Name: vpcEndpointId
Type: string
InputFormat: com.amazon.emr.cloudtrail.CloudTrailInputFormat
Location: !Sub s3://__BUCKET__/AWSLogs/__ORGANIZATION_ID_
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
SerdeInfo:
SerializationLibrary: com.amazon.emr.hive.serde.CloudTrailSerde
TableType: EXTERNAL_TABLE
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment