Skip to content

Instantly share code, notes, and snippets.

@Ajido
Last active November 26, 2019 11:47
Show Gist options
  • Save Ajido/884cf4ad2ff027e945513dbd345e0f4f to your computer and use it in GitHub Desktop.
Save Ajido/884cf4ad2ff027e945513dbd345e0f4f to your computer and use it in GitHub Desktop.
Amazon Athena メモ

データベース作って

CREATE DATABASE s3_logs;

テーブル作って

CREATE EXTERNAL TABLE IF NOT EXISTS s3_logs.mybucket(
    BucketOwner STRING,
    Bucket STRING,
    RequestDateTime DATE,
    RemoteIP STRING,
    Requester STRING,
    RequestID STRING,
    Operation STRING,
    Key STRING,
    RequestURI_operation STRING,
    RequestURI_key STRING,
    RequestURI_httpProtoversion STRING,
    HTTPstatus STRING,
    ErrorCode STRING,
    BytesSent STRING,
    ObjectSize STRING,
    TotalTime STRING,
    TurnAroundTime STRING,
    Referrer STRING,
    UserAgent STRING,
    VersionId STRING
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'serialization.format' = '1',
    'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\") ([^ ]*)$'
) LOCATION 's3://mybucket/logs/';

S3のオペレーション別集計

SELECT count(*) AS count,
         operation,
         useragent
FROM mybucket
WHERE requestdatetime LIKE '%/Dec/2016:%'
GROUP BY  operation, useragent
ORDER BY  count DESC;
@Ajido
Copy link
Author

Ajido commented Dec 17, 2016

パーティション区切って集計範囲指定しないとお金が 💸 でも日付がパース出来ない

https://aws.amazon.com/jp/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-TimestampstimestampTimestamps

...
    RequestDateTime TIMESTAMP,
...
WITH SERDEPROPERTIES (
    'timestamp.formats' = 'dd/MMM/yyyy:HH:mm:ss Z',
....

-- 日付 → 17/Dec/2016:02:54:51 +0000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment