The lesser-known details
- Improving ingestion performance
- Why is pagination important?
- S3 as a SQL database
- Divide and conquer your files
- Mountpoint: the good, the bad and the awful
Start a tmux session, configure your region, and create your own bucket:
tmux
export AWS_DEFAULT_REGION=us-east-1
export BUCKET_NAME=s3workshop$RANDOM
echo Your bucket name is going to be $BUCKET_NAME. Take note of it.
aws s3 mb s3://$BUCKET_NAME
Get the required data to test the upload performance, and start a new pane with htop:
mkdir data
aws s3 cp s3://s3workshopdata/covers.zip data/
unzip data/covers.zip 01*.jpg 02*.jpg -d covers/
cd covers
ls . | wc -l
tmux splitw -t "$session_uuid:" -d "htop"
Test how long it takes to upload those files. Note that aws s3 cp
uploads multiple files concurrently (10 parallel requests by default):
time aws s3 cp --recursive . s3://$BUCKET_NAME/covers-simple/ > /dev/null
Now test how long it takes if we divide the files into groups (using the first three letters of the name as the prefix) and then run six uploads in parallel, trying to saturate the compute capacity:
time ls . \
| cut -c1-3 \
| uniq \
| xargs -P 6 -I '{}' \
sh -c 'aws s3 cp --recursive \
--exclude "*" --include "{}*" \
. s3://$BUCKET_NAME/covers-parallel/ > /dev/null'
Optionally, you can close the additional pane:
tmux kill-pane -a -t "$session_uuid:"
S3 operations don't support the typical Linux * wildcard for working with
files: only prefixes are directly supported in the path argument. Additionally,
object keys are not hierarchical: the / is treated just like any other
character. But it influences the response provided by the AWS CLI, as you will see:
aws s3 ls s3://$BUCKET_NAME/covers-simple # (ok, but not what you intended)
aws s3 ls s3://$BUCKET_NAME/covers-simple/ # (ok)
aws s3 ls s3://$BUCKET_NAME/covers-simple/01* # (what?)
aws s3 ls s3://$BUCKET_NAME/covers-simple/01 # (ok, surprisingly)
aws s3 ls s3://$BUCKET_NAME/covers-simple/01*.jpg # (double what?)
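Since the server side only understands literal prefixes, wildcard filtering has to happen on the client. A minimal sketch of that idea, using a made-up key list standing in for the ls output (the file names here are hypothetical):

```shell
# Hypothetical key list standing in for the `aws s3 ls` output above:
printf '%s\n' covers-simple/0101.jpg covers-simple/0201.jpg covers-simple/0102.png > keys.txt
# Emulate the intent of "01*.jpg" with grep over the listed keys:
grep -E '^covers-simple/01[^/]*\.jpg$' keys.txt
```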
Check how the most basic list query works: it returns a batch of items (1,000 max) and a pagination token.
aws s3api list-objects-v2 \
--bucket $BUCKET_NAME \
--prefix covers-simple/ \
--max-items 3
Save the token to be able to get the next batch of files:
TOKEN=$(aws s3api list-objects-v2 \
--bucket $BUCKET_NAME \
--prefix covers-simple/ \
--max-items 3 \
| tee first-page.json \
| jq .NextToken -r)
echo Next token is $TOKEN.
cat first-page.json
Use the saved token to continue with the process of getting all the names:
TOKEN=$(aws s3api list-objects-v2 \
--bucket $BUCKET_NAME \
--prefix covers-simple/ \
--max-items 3 \
--starting-token $TOKEN \
| tee second-page.json \
| jq .NextToken -r)
cat second-page.json
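The manual token juggling generalizes to a loop: fetch a page, emit it, advance the token, stop when the token runs out. The pattern can be sketched entirely locally with a toy "API" (the file, page size, and token-as-offset scheme are all made up for the demo):

```shell
# A local paginated "API": at most PAGE lines per call, plus a next-token
# (here simply the line offset), mimicking NextToken/--starting-token.
seq 1 10 > objects.txt
PAGE=3
fetch_page() {                      # $1 = starting token (offset)
  tail -n +$(( $1 + 1 )) objects.txt | head -n $PAGE
}
token=0
while true; do
  page=$(fetch_page "$token")
  [ -z "$page" ] && break           # no more pages: stop
  echo "$page"
  token=$(( token + PAGE ))         # advance the token
done
```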
How do you think this can impact the management of large numbers of files? One approach to this problem is S3 Inventory, which generates a daily report with information about all the files matching the selection criteria.
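An inventory configuration is a JSON document attached to the bucket. A sketch of what one could look like (the destination bucket ARN and the id are made up; it would be applied with aws s3api put-bucket-inventory-configuration --bucket $BUCket_NAME --id daily-covers --inventory-configuration file://inventory.json, adjusting names to your setup):

```shell
# Hypothetical S3 Inventory configuration: daily CSV report of everything
# under covers-simple/, delivered to a (made-up) reports bucket.
cat > inventory.json <<'EOF'
{
  "Id": "daily-covers",
  "IsEnabled": true,
  "IncludedObjectVersions": "Current",
  "Schedule": { "Frequency": "Daily" },
  "Filter": { "Prefix": "covers-simple/" },
  "Destination": {
    "S3BucketDestination": {
      "Bucket": "arn:aws:s3:::my-reports-bucket",
      "Format": "CSV",
      "Prefix": "inventory"
    }
  },
  "OptionalFields": [ "Size", "LastModifiedDate" ]
}
EOF
jq -r .Schedule.Frequency inventory.json  # sanity-check the JSON
```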
To further explore your data distribution, consider the S3 Storage Lens service, which automatically aggregates information and presents it in different dashboards.
Get the dataset that we will use for exploring S3 Select:
cd
wget https://github.com/uchidalab/book-dataset/raw/master/Task2/book32-listing.csv
ls -lh # Around 38MB
cat book32-listing.csv | wc -l # Approx. 200K lines
head book32-listing.csv # It is a CSV file
The file is encoded as ISO-8859-1 instead of the supported UTF-8, so we need to convert it:
file -i book32-listing.csv # Unknown encoding
iconv -f latin1 -t UTF-8 book32-listing.csv > book32-listing-utf8.csv
file -i book32-listing-utf8.csv # Now it is UTF-8
We can move it to S3:
aws s3 cp book32-listing-utf8.csv s3://$BUCKET_NAME/
Now we can query the data directly, (almost) for free! And the result will be JSON, easy to manipulate.
SQL='SELECT * FROM s3object LIMIT 10'
aws s3api select-object-content \
--bucket $BUCKET_NAME \
--key book32-listing-utf8.csv \
--expression "$SQL" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {}, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' "output.json" && \
cat output.json | jq
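S3 Select emits newline-delimited JSON, one object per record, which jq handles natively. A local sketch with made-up records shaped like output.json above:

```shell
# Hypothetical sample mimicking the shape of the S3 Select output:
cat > sample-output.json <<'EOF'
{"_1":"0001","_4":"A Tale of Two Cities","_7":"Literature"}
{"_1":"0002","_4":"SQL for Mere Mortals","_7":"Computers"}
EOF
# One object per line, so jq can stream it without slurping the whole file:
jq -r '._4' sample-output.json
```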
Of course, it is possible to use projections. But aliases cannot be referenced in the WHERE clause:
SQL='SELECT _1 AS amazon_index, _2 AS filename, _3 AS image_url,
_4 AS title, _5 AS author,_6 AS category_id, _7 AS category
FROM s3object
LIMIT 3
'
aws s3api select-object-content \
--bucket $BUCKET_NAME \
--key book32-listing-utf8.csv \
--expression "$SQL" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {}, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' "output.json" && \
cat output.json | jq
Many operations require proper configuration of the serializers, for example to ensure the parser understands that strings are quoted. The next query will fail:
SQL='SELECT count(*) FROM s3object'
aws s3api select-object-content \
--bucket $BUCKET_NAME \
--key book32-listing-utf8.csv \
--expression "$SQL" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {}, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' "output.json" # Fail
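The failure comes from quoted fields that contain record delimiters: a quoted CSV field may hold a raw newline, so a naive line-oriented parser miscounts records. A local illustration (the file content is made up):

```shell
# 4 physical lines, but only 3 CSV records (header + 2 rows), because the
# second record's "note" field spans a newline inside quotes:
cat > quoted.csv <<'EOF'
"id","note"
"1","line one
still record one"
"2","simple"
EOF
wc -l < quoted.csv
```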
But this one will work:
SQL='SELECT count(*) FROM s3object'
aws s3api select-object-content \
--bucket $BUCKET_NAME \
--key book32-listing-utf8.csv \
--expression "$SQL" \
--expression-type 'SQL' \
--input-serialization '{"CSV": { "AllowQuotedRecordDelimiter" : true }, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' "output.json" && \
cat output.json | jq
It is not possible to use GROUP BY, but you can include WHERE clauses in aggregations:
SQL='SELECT COUNT(*) AS amount
FROM s3object
WHERE _7 = '\''History'\'''
echo $SQL
aws s3api select-object-content \
--bucket $BUCKET_NAME \
--key book32-listing-utf8.csv \
--expression "$SQL" \
--expression-type 'SQL' \
--input-serialization '{"CSV": { "AllowQuotedRecordDelimiter" : true }, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' "output.json" && \
cat output.json
Other aggregation operations are available, like MIN, MAX or even SUM. But casting is mandatory, as every field is considered a string by default:
SQL='SELECT MIN(CAST(_6 as INT)) AS minimum,
MAX(CAST(_6 AS INT)) AS maximum
FROM s3object
'
aws s3api select-object-content \
--bucket $BUCKET_NAME \
--key book32-listing-utf8.csv \
--expression "$SQL" \
--expression-type 'SQL' \
--input-serialization '{"CSV": { "AllowQuotedRecordDelimiter" : true }, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' "output.json" && \
cat output.json
S3 Select works fine with compressed data, which can save up to 80% of storage costs. First, let's compress our file and upload the new version:
gzip -k book32-listing-utf8.csv
ls *.csv* -lh # Approx. 30% of the original size
aws s3 cp book32-listing-utf8.csv.gz s3://$BUCKET_NAME/
Now it is possible to query that compressed file:
SQL='SELECT MIN(CAST(_6 as INT)) AS minimum,
MAX(CAST(_6 AS INT)) AS maximum
FROM s3object
'
aws s3api select-object-content \
--bucket $BUCKET_NAME \
--key book32-listing-utf8.csv.gz \
--expression "$SQL" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {"AllowQuotedRecordDelimiter" : true }, "CompressionType": "GZIP"}' \
--output-serialization '{"JSON": {}}' "output.json" && \
cat output.json
Columnar file formats may provide better performance. Let's get a version of our file in Parquet.
aws s3 cp s3://s3workshopdata/book32-listing-utf8.parquet .
We can check the content of the file with the parq command. Apologies in advance: as you will see, I duplicated some lines during the conversion:
parq book32-listing-utf8.parquet
parq book32-listing-utf8.parquet --head 10 # Columnar format!!
Upload the file to your bucket, so you can play with it:
aws s3 cp book32-listing-utf8.parquet s3://$BUCKET_NAME/
S3 Select is perfectly compatible with Parquet, and now it understands the type and name of each field:
SQL='SELECT * FROM s3object LIMIT 3'
aws s3api select-object-content \
--bucket $BUCKET_NAME \
--key book32-listing-utf8.parquet \
--expression "$SQL" \
--expression-type 'SQL' \
--input-serialization '{"Parquet": { }, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' "output.json" && \
cat output.json | jq
An example with aggregated data:
SQL='SELECT MIN(category_id) AS minimum,
MAX(category_id) AS maximum
FROM s3object
'
aws s3api select-object-content \
--bucket $BUCKET_NAME \
--key book32-listing-utf8.parquet \
--expression "$SQL" \
--expression-type 'SQL' \
--input-serialization '{"Parquet": { }, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' "output.json" && \
cat output.json | jq
And a quite fast free-text search (sorry for the duplicates, my fault):
SQL='SELECT amazon_index, filename, image_url,
title, author, category_id, category
FROM s3object
WHERE author like '\''%Javi%'\''
'
aws s3api select-object-content \
--bucket $BUCKET_NAME \
--key book32-listing-utf8.parquet \
--expression "$SQL" \
--expression-type 'SQL' \
--input-serialization '{"Parquet": { }, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' "output.json" && \
cat output.json | jq '.amazon_index + " " + .title'
For large files, it can be very convenient to parallelize the searches:
FILE_SIZE=$(stat -c %s book32-listing-utf8.parquet) && \
echo The file is $FILE_SIZE bytes long.
for ((i = 0 ; i < 4 ; i++ ))
do
FIRST_BYTE=$(bc <<< "scale = 10; $FILE_SIZE / 4 * $i")
FIRST_BYTE=$(printf "%.*f\n" "0" "$FIRST_BYTE")
LAST_BYTE=$(bc <<< "scale = 10; $FILE_SIZE / 4 * ($i+1) + 1")
LAST_BYTE=$(printf "%.*f\n" "0" "$LAST_BYTE")
echo "Job $i: from $FIRST_BYTE to $LAST_BYTE."
SQL='SELECT amazon_index, filename, image_url,
title, author, category_id, category
FROM s3object
WHERE author like '\''%Javi%'\''
'
aws s3api select-object-content \
--scan-range "Start=$FIRST_BYTE,End=$LAST_BYTE" \
--bucket $BUCKET_NAME \
--key book32-listing-utf8.parquet \
--expression "$SQL" \
--expression-type 'SQL' \
--input-serialization '{"Parquet": { }, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' "output-$i.json" \
> /dev/null 2>&1 &
done
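The bc-based rounding in the loop above can also be done with plain shell integer arithmetic, which floors automatically. A sketch with a made-up file size:

```shell
# Split a byte range into 4 contiguous jobs using only shell arithmetic:
FILE_SIZE=1000003
JOBS=4
for (( i = 0; i < JOBS; i++ )); do
  FIRST=$(( FILE_SIZE * i / JOBS ))         # floor of FILE_SIZE*i/JOBS
  LAST=$(( FILE_SIZE * (i + 1) / JOBS ))    # next job's start byte
  echo "Job $i: from $FIRST to $LAST."
done
```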
After all the jobs complete, it is easy to gather the results. Select may include partial records in the answer, so be careful to handle potential duplicates.
cat output-*.json | sort | jq
How was the last example possible? S3 supports range retrieval in general. To check it, let's create a file with known content:
for i in {0..999999}; do printf "%09d\n" $i >> numbers.txt ; done
cat numbers.txt | wc -l
ls -lh numbers.txt
head numbers.txt
tail numbers.txt
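The printf loop takes a while in bash; seq with a printf-style format emits the same zero-padded records much faster (shown here with a smaller range so it stays quick):

```shell
# seq formats each number as a 9-digit, zero-padded record:
seq -f "%09g" 0 999 > numbers-small.txt
head -2 numbers-small.txt
wc -l < numbers-small.txt
```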
Your local operating system can extract information from the file, given that we know exactly how it is structured (each record is 10 bytes: nine digits plus a newline, so line i starts at byte i * 10):
i=0
dd if=numbers.txt skip=$(( i * 10)) count=10 status=none iflag=skip_bytes,count_bytes;
i=5000
dd if=numbers.txt skip=$(( i * 10)) count=10 status=none iflag=skip_bytes,count_bytes;
Let's copy the new file to your S3 bucket, and then repeat the previous trick using the API, thanks to range selections:
aws s3 cp numbers.txt s3://$BUCKET_NAME/
i=5000
aws s3api get-object \
--bucket $BUCKET_NAME \
--key numbers.txt \
--range bytes=$(( i * 10))-$(( i * 10 + 9 )) n.txt
cat n.txt
Since S3 became strongly consistent, it is better suited to being mounted as a filesystem. Amazon supports the Mountpoint project to do exactly that... and it works surprisingly well.
mount-s3 --version
Let's create a directory to map our bucket in our local filesystem, and another one to store cached metadata and objects:
mkdir .s3-cache
mkdir bucket
mount-s3 $BUCKET_NAME bucket --cache .s3-cache --allow-delete
Do you want to check the result? Be my guest:
ls bucket
You can even create new files, but as soon as the file descriptor is closed, you will not be able to append to it anymore:
echo "Hola" > bucket/greetings.txt
cat bucket/greetings.txt
aws s3 ls s3://$BUCKET_NAME
echo "Adios" > bucket/greetings.txt # fail
But, you know what? It is aware of how to use range selections:
i=70000
dd if=bucket/numbers.txt skip=$(( i * 10)) count=10 status=none iflag=skip_bytes,count_bytes;
And this gives us a very powerful tool if you combine it with a compression
format that supports position dictionaries. First, let's zip our old friend,
the ~/covers/ directory. And no, it is not possible to zip directly into the
bucket folder.
sudo apt install zip -y
cd covers
zip ../covers.zip *.jpg
cd ..
ls -l covers.zip
The unzip command supports direct extraction of a single file. For example:
RANDOM_FILE_NAME=$(ls covers/ |sort -R | tail -1)
echo We will play with $RANDOM_FILE_NAME.
unzip -p covers.zip $RANDOM_FILE_NAME > local-$RANDOM_FILE_NAME
ls -l local-$RANDOM_FILE_NAME
So yes, we can efficiently extract a subset of files from a zipped archive stored on S3 using mount-s3!
aws s3 cp covers.zip s3://$BUCKET_NAME/
ls bucket/*.zip
unzip -p bucket/covers.zip $RANDOM_FILE_NAME > remote-$RANDOM_FILE_NAME
ls -l *$RANDOM_FILE_NAME
And yes, with great power comes great responsibility:
rm -fr bucket/
umount bucket