Skip to content

Instantly share code, notes, and snippets.

@dacort
Created July 22, 2019 16:17
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save dacort/b7f1c4dce5d180d088c245230ca2b5af to your computer and use it in GitHub Desktop.
Save dacort/b7f1c4dce5d180d088c245230ca2b5af to your computer and use it in GitHub Desktop.
Athena CTAS Update

Athena CTAS Update

An approach for updating partitions in an existing table using CTAS queries.

Overview

With the release of CTAS functionality for Athena, you're now able to create derivative tables in Athena with different data formats or S3 locations.

Sometimes, though, you want to be able to add data to a partition of an existing table. As long as that partition doesn't already exist, you can do this with Athena by using CTAS with a temporary table.

Sample Tables

For this example, we'll use the Amazon Reviews open dataset. This dataset is provided as both TSV and Paruqet datasets, but we'll use TSV as our source and show how to insert into a partitioned Parquet dataset.

Create the source table

CREATE EXTERNAL TABLE amazon_reviews_tsv(
  marketplace string,
  customer_id string,
  review_id string,
  product_id string,
  product_parent string,
  product_title string,
  product_category string,
  star_rating int,
  helpful_votes int,
  total_votes int,
  vine string,
  verified_purchase string,
  review_headline string,
  review_body string,
  review_date bigint,
  year int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION 's3://amazon-reviews-pds/tsv/'
TBLPROPERTIES ("skip.header.line.count"="1");

Run a simple query to ensure the CREATE statement worked:

SELECT * FROM amazon_reviews_tsv WHERE "$path" LIKE '%.tsv' LIMIT 10

Create our destination table

CTAS queries can create the table for you, but since we want a partitioned output table, we're going to create it ahead of time.

For this example, we're going to partition by product_category.

Note: Replace the S3 bucket with your own.

CREATE EXTERNAL TABLE amazon_reviews_parquet(
  marketplace string,
  customer_id string,
  review_id string,
  product_id string,
  product_parent string,
  product_title string,
  star_rating int,
  helpful_votes int,
  total_votes int,
  vine string,
  verified_purchase string,
  review_headline string,
  review_body string,
  review_date bigint,
  year int)
PARTITIONED BY ( `product_category` string )
STORED AS PARQUET
LOCATION 's3://<bucket>/amazon_reviews_partitioned';

OK! Now we have source TSV table and a Parquet destination table with no data in it. Let's fix that.

Converting data

Here's the process we'll follow for inserting partitioned data into our destination.

  1. Use a CTAS query to convert the TSV data to Parquet with a temporary table
    1. Note that the S3 location will be a partition inside the location above
  2. Add the new folder as a partition inside the original table
  3. Delete the temporary table

Convert data with CTAS

  • We can specify any S3 location, so we'll use a partition inside the amazon_reviews_parquet table.
  • We have to manually specify the column list because we are extracting the product_category column as the partition and it has to be last.
  • The S3 location here is hard-coded with the product category we're querying for.
CREATE TABLE tmp_ctas_001 WITH (
    format='PARQUET',
    external_location = 's3://<bucket>/amazon_reviews_partitioned/product_category=Toys/'
)
AS SELECT marketplace, customer_id, review_id, product_id, product_parent, product_title, star_rating, helpful_votes, total_votes, vine, verified_purchase, review_headline, review_body, review_date, year, product_category
FROM amazon_reviews_tsv
WHERE product_category = 'Toys'

This query will take about a minute to run and create ~750 MiB of data in your S3 bucket.

Add resulting folder as partition

Now that we've converted the data to Parquet and specified an S3 location as a partition inside the amazon_reviews_parquet table, we manually add this partition.

ALTER TABLE amazon_reviews_parquet ADD PARTITION (product_category='Toys')

Drop the tmp table

This only drops the table definition, not the underlying data.

DROP TABLE tmp_ctas_001

Query original table

You should now be able to query the original table.

SELECT * FROM amazon_reviews_parquet LIMIT 10
SELECT COUNT(*) FROM amazon_reviews_parquet
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment