Skip to content

Instantly share code, notes, and snippets.

@JoeShi
Created April 25, 2019 10:17
Show Gist options
  • Save JoeShi/d374824880fa19d196ac87f153c1abcd to your computer and use it in GitHub Desktop.
Save JoeShi/d374824880fa19d196ac87f153c1abcd to your computer and use it in GitHub Desktop.

创建 HIVE 表

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  DateObject Date,
  Time STRING,
  Location STRING,
  Bytes INT,
  RequestIP STRING,
  Method STRING,
  Host STRING,
  Uri STRING,
  Status INT,
  Referrer STRING,
  OS String,
  Browser String, 
  BrowserVersion String
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
) LOCATION 's3://cn-northwest-1.elasticmapreduce.samples/cloudfront/data';

执行查询

INSERT OVERWRITE DIRECTORY 's3://<bucket-name>/hive/cloudfront_etl/' 
SELECT os, COUNT(*) count 
FROM cloudfront_logs WHERE dateobject BETWEEN '2014-07-05' AND '2014-08-05' GROUP BY os;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment