
@chrisdpa-tvx
Last active March 20, 2022 06:28
Create an Athena database, table, and query

All Your Data Does Not Belong In a Database

Businesses are machines producing mountains of data about sales, usage, customers, costs, etc. Traditionally data processing is highly centralised, with teams of staff and computers running hot and whirring, ready to process. We can do better than moving the mountain of data into the corporate data machine - so long as that machinery is light enough to be moved to the data.

Don't move the mountain - Bring the processing to the data

We've had this problem: a huge directory of files in CSV format, containing vital information for our business. But it's in CSV, it requires analysis, and you don't feel like learning sed/grep/awk today - besides, it's 2017 and no one thinks those tools are easy to use.

It would be great to get those files into a database and run SQL queries over the data. But databases are proprietary; they take your data and squirrel it away in a format no one else can read - how rude! Not only that, running databases is expensive - look at all those yachts. There are costs associated with software, compute resources, storage, administration, staffing, updates, upgrades, etc.

Database software is often free, but corporate policy can dictate the use of paid databases or restrict clients based on installation policy. Databases are "always on", taking up compute resources. They're available 24/7, which is great for transactional systems, but for scheduled reporting it's overkill. Reports run once a day, month, or quarter - so why pay for all that uptime?

At the end of the day your boss just wants to know how much the company spent on a resource for the month. This is when Athena can be extremely helpful.

Athena Introduction

It's new, it's shiny, and it's a handy tool to add to your AWS toolbox.

Athena is Amazon's recipe for running SQL queries (or any function available in Presto) over data stored in flat files - provided you store those files in their object storage service, S3. There's no need to load files into a database - just create a simple data definition and away you go.

Create an Athena "database"

First you will need to create a database that Athena uses to access your data. It's still called a database, but the data is stored in text files in S3. I'm using Boto3 and Python to automate my infrastructure; Athena is still fresh and has yet to be added to CloudFormation.

Create database command

import boto3

ath = boto3.client('athena')

ath.start_query_execution(
    QueryString='create database mangolassi',
    ResultConfiguration={'OutputLocation': 's3://mango-lassi-costings/queries/'})
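
Note that start_query_execution is asynchronous - it returns immediately with a QueryExecutionId rather than waiting for the DDL to run. A small polling helper (a sketch using boto3's get_query_execution; the function name and poll interval are my own) can wait for completion:

```python
import time

def wait_for_query(athena, query_execution_id, poll_seconds=1):
    """Poll Athena until the query reaches a terminal state, then return it."""
    while True:
        response = athena.get_query_execution(QueryExecutionId=query_execution_id)
        state = response['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            return state
        time.sleep(poll_seconds)
```

Pass it the client and the QueryExecutionId returned by start_query_execution before moving on to the next step.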

Data Definition Files

You need to tell Athena about the data you will query against. This example creates an external table that is an Athena representation of our billing and CloudFront data.

My personal preference is to use string column data types in staging tables. That way I can cast the string to the desired type as needed and get results faster - get it working, then make it right.

Prepare the external table file, billing_data.ddl, and save it in the same folder as your Python bootstrap code. This file describes comma-separated values, so tell Athena to use the OpenCSVSerde deserializer.

CREATE EXTERNAL TABLE IF NOT EXISTS
mangolassi.billing_data (
  InvoiceID string,
  PayerAccountId string,
  LinkedAccountId string,
  RecordType string,
  ProductName string,
  RateId string,
  SubscriptionId string,
  PricingPlanId string,
  UsageType string,
  Operation string,
  AvailabilityZone string,
  ReservedInstance string,
  ItemDescription string,
  UsageStartDate string,
  UsageEndDate string,
  UsageQuantity string,
  BlendedRate string,
  BlendedCost string,
  UnBlendedRate string,
  UnBlendedCost string
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '\"',
  'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://your-bucket-name/billing_line_items/';

Execute create table for billing data:

import boto3

ath = boto3.client('athena')

with open('billing_data.ddl') as ddl:
    ath.start_query_execution(
        QueryString=ddl.read(),
        ResultConfiguration={'OutputLocation': 's3://your-bucket-name/queries/'})
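
With the table in place, the string staging columns can be cast at query time, per the "get it working, then make it right" note above. A sketch of such a query (the RecordType = 'LineItem' filter is my assumption about how detailed billing rows are tagged - check your own data):

```python
# Total spend per product; cast the string staging columns on the fly.
query = """
SELECT ProductName,
       SUM(CAST(UnBlendedCost AS double)) AS total_cost
FROM mangolassi.billing_data
WHERE RecordType = 'LineItem'
GROUP BY ProductName
ORDER BY total_cost DESC
""".strip()
```

Run it with start_query_execution exactly as above.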

The table for the CloudFront data is very similar, except this data is tab-separated:

CREATE EXTERNAL TABLE IF NOT EXISTS
  mangolassi.cloudfront_data (
    rec_date string,
    rec_time string,
    x_edge_location string,
    sc_bytes string,
    c_ip string,
    cs_method string,
    cs_Host string,
    cs_uri_stem string,
    sc_status string,
    cs_Referer string,
    cs_User_Agent_ string,
    cs_uri_query string,
    cs_Cookie string,
    x_edge_result_type string,
    x_edge_request_id string,
    x_host_header string,
    cs_protocol string,
    cs_bytes string,
    time_taken string,
    x_forwarded_for string,
    ssl_protocol string,
    ssl_cipher string,
    x_edge_response_result_type string,
    cs_protocol_version string
  ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  WITH SERDEPROPERTIES (
    'serialization.format' = '\t',
    'field.delim' = '\t'
  ) LOCATION 's3://cloudfront-data-bucket/'
  TBLPROPERTIES ('has_encrypted_data'='false');

Athena uses the contents of the files in the S3 bucket LOCATION 's3://spotdatafeed/' as the data for your table testing_athena_example.testing_spotfleet_data:

spotfleet-data head xxxxxxxxxxxxx.2017-06-13-00.002.ix3h0TZJ
#Version: 1.0
#Fields: Timestamp UsageType Operation InstanceID MyBidID MyMaxPrice MarketPrice Charge Version
2017-06-13 00:24:46 UTC EU-SpotUsage:c3.2xlarge RunInstances:SV002  i-0351541d28xxxxxxx sir-zcf8xxxx    0.336 USD   0.103 USD   0.103 USD   2
2017-06-13 00:36:06 UTC EU-SpotUsage:c3.2xlarge RunInstances:SV002  i-036f477716xxxxxxx sir-6jkixxxx    0.468 USD   0.104 USD   0.104 USD   2
2017-06-13 00:24:47 UTC EU-SpotUsage:c3.2xlarge RunInstances:SV002  i-038ec39e52xxxxxxx sir-6q5gxxxx    0.336 USD   0.103 USD   0.103 USD   2
2017-06-13 00:39:45 UTC EU-SpotUsage:c3.2xlarge RunInstances:SV002  i-0448a64d03xxxxxxx sir-cqf8xxxx    0.833 USD   0.104 USD   0.104 USD   2
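
Before wiring the feed into Athena, it can help to sanity-check the record layout locally. A sketch parser (assuming the fields are tab-delimited, matching the #Fields header above):

```python
# Field names taken from the feed's "#Fields:" header line.
FIELDS = ['Timestamp', 'UsageType', 'Operation', 'InstanceID',
          'MyBidID', 'MyMaxPrice', 'MarketPrice', 'Charge', 'Version']

def parse_feed_line(line):
    """Split one tab-delimited spot data feed record into a field dict."""
    return dict(zip(FIELDS, line.rstrip('\n').split('\t')))

record = parse_feed_line(
    '2017-06-13 00:24:46 UTC\tEU-SpotUsage:c3.2xlarge\tRunInstances:SV002\t'
    'i-0351541d28xxxxxxx\tsir-zcf8xxxx\t0.336 USD\t0.103 USD\t0.103 USD\t2')
```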

Run queries against your Athena database

Athena queries files using SQL commands running on a Presto engine. Now it's easy to run SQL queries against your database:

➜  aws athena start-query-execution \
   --query-string 'SELECT * FROM testing_athena_example.testing_spotfleet_data limit 10' \
   --result-configuration 'OutputLocation=s3://aws-athena-query-results'
{
    "QueryExecutionId": "13623b03-702c-4660-9bf1-xx"
}

And see the results:

➜  aws athena get-query-results --query-execution-id 13623b03-702c-4660-9bf1-xx
{
    "ResultSet": {
        "Rows": [
            {
                "Data": [
                    {
                        "VarCharValue": "created"
                    },
                    ...
                    ...
                    ...

The result is also saved to S3:

➜  aws s3 ls aws-athena-query-results
2017-06-14 11:12:45       1345 13623b03-702c-4660-9bf1-8db8482a9922.csv
2017-06-14 11:12:45        474 13623b03-702c-4660-9bf1-8db8482a9922.csv.metadata

You can copy that file easily:

➜  aws s3 cp s3://aws-athena-query-results/13623b03-702c-4660-9bf1-8db8482a9922.csv 13623b03-702c-4660-9bf1-8db8482a9922.csv
download: s3://aws-athena-query-results/13623b03-702c-4660-9bf1-8db8482a9922.csv to ./13623b03-702c-4660-9bf1-8db8482a9922.csv

And the contents:

"created","usagetype","operation","instanceid","mybidid","mymaxprice","marketprice","charge","version"
"#Version: 1.0",,,,,,,,
"#Fields: Timestamp UsageType Operation InstanceID MyBidID MyMaxPrice MarketPrice Charge Version",,,,,,,,
"2016-09-23 12:29:16 UTC","EU-SpotUsage:m4.large","RunInstances:SV008","i-1f11x","sir-x","0.269 USD","0.016 USD","0.016 USD","1"
"2016-09-23 12:10:08 UTC","EU-SpotUsage:m4.large","RunInstances:SV008","i-358x","sir-x","0.389 USD","0.016 USD","0.016 USD","1"
"2016-09-23 12:11:31 UTC","EU-SpotUsage:m4.large","RunInstances:SV008","i-501x","sir-x","0.389 USD","0.016 USD","0.016 USD","1"
...
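
Note that the feed's own #Version/#Fields comment lines come through as data rows. When post-processing the result CSV, they're easy to filter out - a sketch with the stdlib csv module (the sample rows are abbreviated from the output above):

```python
import csv
import io

# Abbreviated sample of the downloaded result CSV.
sample = '''"created","usagetype","operation"
"#Version: 1.0",,
"2016-09-23 12:29:16 UTC","EU-SpotUsage:m4.large","RunInstances:SV008"
'''

# Drop rows whose first column is a "#..." comment carried over from the feed.
rows = [row for row in csv.DictReader(io.StringIO(sample))
        if not row['created'].startswith('#')]
```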

Athena Queries

Once you have your database and external tables defined and working it's super easy to query the data. You can use standard SQL and any of the functions or operators defined by Presto: https://prestodb.io/docs/current/functions.html
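
For example, a monthly roll-up of the billing data might combine CAST with Presto's date_parse and date_trunc functions (a sketch; the timestamp format string and the RecordType filter are my assumptions about the data):

```python
# Monthly cost totals via Presto date functions over the string columns.
query = """
SELECT date_trunc('month',
                  date_parse(UsageStartDate, '%Y-%m-%d %H:%i:%s')) AS month,
       SUM(CAST(UnBlendedCost AS double)) AS monthly_cost
FROM mangolassi.billing_data
WHERE RecordType = 'LineItem'
GROUP BY 1
ORDER BY 1
""".strip()
```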
