AWS S3: Simple Data Mining Techniques

Recently, I was involved in a piece of work to move some fairly large on-site database tables to AWS S3.

Part of the post-upload verification included reconciling record counts and visually inspecting sample data to ensure the format was as expected.

Ideally, AWS Athena would have been the user-friendly way of achieving this; however, there were some organisational constraints on accessing additional AWS services, including Athena.

This article focuses on a few alternative methods that can be used to perform simple mining/verification of data in S3 buckets.

Methods Covered

AWS CLI command:

  • aws s3api select-object-content

Examples of executing this command from different OS shells (Windows PowerShell, Command Prompt, and Linux Bash) are also included, with the aim of demonstrating some subtle differences that cover:

  • Use of non-displayable characters in command parameters (such as tabs)
  • Piping commands and their limitations

Prerequisites

Windows 10

Install the AWS CLI, which can be downloaded from here.

Some examples using Windows 10 make use of gzip. If you are interested in running these, install Git Bash for Windows, which includes the required binaries.

Linux

Use the procedure for your specific Linux distribution to install the awscli tools. Ubuntu is used for the examples in this article. Installation on Ubuntu can be accomplished via pip:

pip3 install awscli
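
To confirm the install succeeded, check the reported version:

aws --version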

After the AWS CLI is installed, configuration with AWS security credentials is required. Quick config instructions are provided in the AWS Documentation.
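
For a minimal setup, aws configure prompts interactively for the credentials and defaults (the values below are placeholders):

ubuntu:~$ aws configure
AWS Access Key ID [None]: <your-access-key-id>
AWS Secret Access Key [None]: <your-secret-access-key>
Default region name [None]: <your-region, e.g. us-east-1>
Default output format [None]: json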

Creating the Test File (.gz)

A test file, used for the examples, was generated using Mockaroo.

The file contains 1001 mocked-up records (including the header row), with the fields shown below.

 first_name
 last_name 
 email 
 gender 
 ip_address

Fields are TAB-delimited and the file is saved as MOCK_DATA.txt.

After saving, the file was gzipped, giving the final name MOCK_DATA.txt.gz. The final .gz can be downloaded from here.

The file was uploaded to the target S3 bucket using:

aws s3 cp "MOCK_DATA.txt.gz" s3://test-bucket/
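
The upload can be confirmed by listing the object in the bucket:

aws s3 ls s3://test-bucket/MOCK_DATA.txt.gz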

Input File Field Delimiters

Windows Command Prompt

Given that the fields in the test file are tab-delimited, trying to enter a tab as part of a command at the Windows Command Prompt triggers filename auto-completion.

To get around this, some options are:

Read Delimiter from a File into a Variable

Create a text file fielddelim.txt containing a tab character enclosed in double quotes.

The contents of the file can be read and saved into a variable as follows:

c:\>set /P fielddelim=<fielddelim.txt

Variable fielddelim will resolve to a tab literal and can be referenced using %fielddelim% at the Windows Command Prompt.
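
As a quick sanity check, echo the variable with surrounding brackets; the whitespace between the double quotes is the tab read from the file:

c:\>set /P fielddelim=<fielddelim.txt
c:\>echo [%fielddelim%]
["      "]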

Disable cmd.exe File and Directory Name Completion

At the command prompt, you can disable file and directory name completion by entering the following:

c:\>cmd /f:off

Windows PowerShell

In PowerShell, a tab literal can be represented by `t (a backtick followed by t) inside a double-quoted string.
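
For example, embedding `t in a double-quoted string expands to a tab when the string is output:

PS C:\> "first_name`tlast_name"
first_name      last_name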

Linux Bash

In Bash, a tab can be produced with echo -e '\t', or with the ANSI-C quoted form $'\t'.
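
For example:

ubuntu:~$ echo -e "first_name\tlast_name"
first_name      last_name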

Continuation of Long Commands

From the Windows Command Prompt, a long command can be continued onto the next line by entering ^ at the end of the line.

For PowerShell, commands can be continued with a backtick ` at the end of a line.

In Bash, a backslash \ is used at the end of a line for continuation.

AWS CLI: select-object-content

With the file now in the target S3 bucket at s3://test-bucket/MOCK_DATA.txt.gz, we can start carrying out some verification.

The AWS CLI provides the select-object-content command (https://docs.aws.amazon.com/cli/latest/reference/s3api/select-object-content.html) for running simple SQL against bucket objects.

AWS CLI select-object-content Syntax
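
An abridged form of the syntax, with optional parameters (such as --request-progress) omitted, is:

aws s3api select-object-content
    --bucket <value>
    --key <value>
    --expression <value>
    --expression-type <value>
    --input-serialization <value>
    --output-serialization <value>
    <outfile>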

Example 1: Get Record Count

Windows Command Prompt

c:\>set /P fielddelim=<fielddelim.txt
c:\>aws s3api select-object-content ^
--bucket test-bucket ^
--key MOCK_DATA.txt.gz ^
--expression-type SQL ^
--expression "select count(*) from s3object" ^
--input-serialization CSV={"FieldDelimiter"='%fielddelim%'},"CompressionType"="GZIP" ^
--output-serialization CSV={"FieldDelimiter"='%fielddelim%'} ^
"results.csv" && type "results.csv" && del "results.csv"

PowerShell Command

PS C:\>aws s3api select-object-content `
--bucket test-bucket `
--key MOCK_DATA.txt.gz `
--expression-type SQL `
--expression "select count(*)from s3object" `
--input-serialization `
 "CSV={FieldDelimiter=`'`t`'},CompressionType='GZIP'" `
--output-serialization "CSV={FieldDelimiter=`'`t`'}" results.csv `
; if($?) {cat "results.csv"} `
; if($?) {del "results.csv"}

Bash Command

ubuntu:~$ aws s3api select-object-content \
--bucket test-bucket \
--key MOCK_DATA.txt.gz \
--expression-type SQL \
--expression "select count(*)from s3object" \
--input-serialization \
"CSV={FieldDelimiter=$(echo -e "'\t'")},CompressionType='GZIP'" \
--output-serialization "CSV={FieldDelimiter=$(echo -e "'\t'")}" \
/dev/stdout 2>&1

All the above return a record count of 1001.

Example 2: Display 10 Records

Windows Command Prompt

c:\>set /P fielddelim=<fielddelim.txt
c:\>aws s3api select-object-content ^
--bucket test-bucket ^
--key MOCK_DATA.txt.gz ^
--expression-type SQL ^
--expression "select * from s3object limit 10" ^
--input-serialization CSV={"FieldDelimiter"='%fielddelim%'},"CompressionType"="GZIP" ^
--output-serialization CSV={"FieldDelimiter"='%fielddelim%'} ^
"results.csv" && type "results.csv" && del "results.csv"

PowerShell Command

PS C:\>aws s3api select-object-content `
--bucket test-bucket `
--key MOCK_DATA.txt.gz `
--expression-type SQL `
--expression "select * from s3object limit 10" `
--input-serialization `
 "CSV={FieldDelimiter=`'`t`'},CompressionType='GZIP'" `
--output-serialization "CSV={FieldDelimiter=`'`t`'}" results.csv `
; if($?) {cat "results.csv"} `
; if($?) {del "results.csv"}

Bash Command

ubuntu:~$ aws s3api select-object-content \
--bucket test-bucket \
--key MOCK_DATA.txt.gz \
--expression-type SQL \
--expression "select * from s3object limit 10" \
--input-serialization \
"CSV={FieldDelimiter=$(echo -e "'\t'")},CompressionType='GZIP'" \
--output-serialization "CSV={FieldDelimiter=$(echo -e "'\t'")}" \
/dev/stdout 2>&1

All the above commands return:

first_name last_name email gender ip_address
Joel cedar joel@dailymail.co.uk Male 140.6.125.251
Taby zaytnee Taby@iset.com Female 49.124.33.169
injah colosol ijaja2@xmnr.com Male 235.73.125.24
Helli Jennions hjennions3@slideshare.net Female 10.90.32.40
Jeannette Gilgryst jgilgryst4@un.org Female 236.3.7.10
Billie Baake bbaake5@storify.com Male 223.249.14.251
Meier Wethey mwethey6@sourceforge.net Male 155.85.167.90
Modestine Talbot mtalbot7@comcast.net Female 230.78.193.26
Dorelia Ilchenko dilchenko8@topsy.com Female 25.50.29.33

Example 3: List first_name, email, and gender where gender is equal to 'Male' and first_name starts with 'Ja'.

Windows Command Prompt

c:\>set /P fielddelim=<fielddelim.txt
c:\>aws s3api select-object-content ^
--bucket test-bucket ^
--key MOCK_DATA.txt.gz ^
--expression-type SQL ^
--expression "select _1, _3, _4 from s3object where _4 = 'Male' and _1 LIKE 'Ja%'" ^
--input-serialization CSV={"FieldDelimiter"='%fielddelim%'},"CompressionType"="GZIP" ^
--output-serialization CSV={"FieldDelimiter"='%fielddelim%'} ^
"results.csv" && type "results.csv" && del "results.csv"

PowerShell Command

PS C:\>aws s3api select-object-content `
--bucket test-bucket `
--key MOCK_DATA.txt.gz `
--expression-type SQL `
--expression "select _1, _3, _4 from s3object where _4 = 'Male' and _1 LIKE 'Ja%'" `
--input-serialization `
 "CSV={FieldDelimiter=`'`t`'},CompressionType='GZIP'" `
--output-serialization "CSV={FieldDelimiter=`'`t`'}" results.csv `
; if($?) {cat "results.csv"} `
; if($?) {del "results.csv"}

Bash Command

ubuntu:~$ aws s3api select-object-content \
--bucket test-bucket \
--key MOCK_DATA.txt.gz \
--expression-type SQL \
--expression "select _3 from s3object where _4 = 'Male' and _1 LIKE 'Ja%'" \
--input-serialization \
"CSV={FieldDelimiter=$(echo -e "'\t'")},CompressionType='GZIP'" \
--output-serialization "CSV={FieldDelimiter=$(echo -e "'\t'")}" \
/dev/stdout 2>&1

The output of each of the above commands is identical:

first_name email gender
Jay jkerrodcc@lycos.com Male
Jayson jcaldicote2@netvibes.com Male
Jamill jcamseyea@redcross.org Male
Jasun jbarajasez@vistaprint.com Male
Jammal jsutheringtongg@java.com Male
Jamill jbalaizotmc@cargocollective.com Male
Jarrett jdurdanpe@thetimes.co.uk Male

Note: The fields referenced in the select statements correspond to the columns of the input file as follows:

_1 = first_name
_2 = last_name
_3 = email
_4 = gender
_5 = ip_address
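
As an alternative to positional references, S3 Select can use the header row for column names when FileHeaderInfo=USE is added to the input serialization. Below is a sketch of the Bash variant of Example 3 rewritten this way; note that with this setting the header row is treated as metadata and is no longer returned in the results:

ubuntu:~$ aws s3api select-object-content \
--bucket test-bucket \
--key MOCK_DATA.txt.gz \
--expression-type SQL \
--expression "select s.first_name, s.email, s.gender from s3object s where s.gender = 'Male' and s.first_name LIKE 'Ja%'" \
--input-serialization \
"CSV={FileHeaderInfo=USE,FieldDelimiter=$(echo -e "'\t'")},CompressionType='GZIP'" \
--output-serialization "CSV={FieldDelimiter=$(echo -e "'\t'")}" \
/dev/stdout 2>&1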

Further Reading

AWS Documentation

Limitations

The following is quoted from the AWS Documentation:

“Amazon S3 Select and Glacier Select queries currently do not support subqueries or joins.”

Other limits/requirements are mentioned in the documentation, which can be found at the link above.
