S3, the lesser known features

  1. Improving ingestion performance
  2. Why is pagination important?
  3. S3 as a SQL database
  4. Divide and conquer your files
  5. Mountpoint: the good, the bad and the awful

Improving ingestion

Start a tmux session, configure your region, and create your own bucket:

tmux

export AWS_DEFAULT_REGION=us-east-1
export BUCKET_NAME=s3workshop$RANDOM
echo Your bucket name is going to be $BUCKET_NAME. Take note of it.
aws s3 mb s3://$BUCKET_NAME

Get the required data to test the upload performance and start a new pane with htop:

mkdir data
aws s3 cp s3://s3workshopdata/covers.zip data/
unzip data/covers.zip '01*.jpg' '02*.jpg' -d covers/
cd covers
ls . | wc -l
tmux splitw -d "htop"

Test how much time it takes to upload those files. Note that aws s3 cp already transfers several objects concurrently (by default the CLI uses up to 10 concurrent requests):

time aws s3 cp --recursive . s3://$BUCKET_NAME/covers-simple/ > /dev/null
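
If you want to experiment with the CLI's own parallelism, it can be tuned through its s3 configuration section. A minimal sketch, where the values are arbitrary examples to adjust to your instance size:

# Raise the number of concurrent S3 transfers used by the high-level aws s3 commands
aws configure set default.s3.max_concurrent_requests 20
# Optionally raise the multipart chunk size used for large objects
aws configure set default.s3.multipart_chunksize 16MB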

Now test how long it takes if we divide the files into groups (using the first three characters of the name as the prefix) and then upload up to six groups in parallel, trying to saturate the compute capacity:

time ls . \
  | cut -c1-3 \
  | uniq \
  | xargs -P 6 -I '{}' \
      sh -c 'aws s3 cp --recursive \
          --exclude "*" --include "{}*" \
          . s3://$BUCKET_NAME/covers-parallel/ > /dev/null' 
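
As a quick sanity check, you can verify that both prefixes ended up with the same number of objects (--summarize appends the totals to the listing):

aws s3 ls s3://$BUCKET_NAME/covers-simple/ --recursive --summarize | tail -2
aws s3 ls s3://$BUCKET_NAME/covers-parallel/ --recursive --summarize | tail -2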

Optionally, you can close the additional pane:

tmux kill-pane -a

S3 queries

The lack of Glob

S3 operations don't support the typical Linux glob (*) for selecting files: only prefixes are directly supported in the path argument. Additionally, object keys are not hierarchical: the / is treated just like any other character, although it does influence the responses returned by the AWS CLI, as you will see:

aws s3 ls s3://$BUCKET_NAME/covers-simple     # (ok, but not what you intended)
aws s3 ls s3://$BUCKET_NAME/covers-simple/    # (ok)
aws s3 ls s3://$BUCKET_NAME/covers-simple/01* # (what?)
aws s3 ls s3://$BUCKET_NAME/covers-simple/01  # (ok, surprisingly)
aws s3 ls s3://$BUCKET_NAME/covers-simple/01*.jpg # (double what?)
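
If you need glob-like selections, one workaround is to filter the listing on the client side, and another is the --exclude/--include flags of the high-level commands (shown here with --dryrun, so nothing is actually transferred). Both are sketches rather than exact equivalents of a shell glob:

# Client-side filtering: the key is the fourth column of the listing
aws s3 ls s3://$BUCKET_NAME/covers-simple/ | awk '$4 ~ /^01.*\.jpg$/ {print $4}'

# Prefix listing plus client-side include pattern, without transferring anything
aws s3 cp s3://$BUCKET_NAME/covers-simple/ . --recursive \
    --exclude "*" --include "01*.jpg" --dryrun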

How pagination works

Check how the most basic list query works: it returns a number of items (max 1000), and a pagination token.

aws s3api list-objects-v2 \
  --bucket $BUCKET_NAME \
  --prefix covers-simple/ \
  --max-items 3

Save the token to be able to get the next bunch of files:

TOKEN=$(aws s3api list-objects-v2 \
  --bucket $BUCKET_NAME \
  --prefix covers-simple/ \
  --max-items 3 \
  | tee first-page.json \
  | jq .NextToken -r)
echo Next token is $TOKEN.
cat first-page.json

Use the saved token to continue with the process of getting all the names:

TOKEN=$(aws s3api list-objects-v2 \
  --bucket $BUCKET_NAME \
  --prefix covers-simple/ \
  --max-items 3 \
  --starting-token $TOKEN \
  | tee second-page.json \
  | jq .NextToken -r)
cat second-page.json
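
To walk through every page, the same call can be repeated until no NextToken comes back. A minimal sketch, assuming jq is available (note that if you simply omit --max-items, the CLI already paginates for you behind the scenes):

TOKEN=""
while true; do
  if [ -z "$TOKEN" ]; then
    PAGE=$(aws s3api list-objects-v2 --bucket $BUCKET_NAME \
             --prefix covers-simple/ --max-items 100)
  else
    PAGE=$(aws s3api list-objects-v2 --bucket $BUCKET_NAME \
             --prefix covers-simple/ --max-items 100 \
             --starting-token "$TOKEN")
  fi
  echo "$PAGE" | jq -r '.Contents[].Key'
  TOKEN=$(echo "$PAGE" | jq -r '.NextToken // empty')
  [ -z "$TOKEN" ] && break
done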

How do you think this can impact the management of large numbers of objects? One approach to this problem is S3 Inventory: it generates a daily (or weekly) report with information about every object matching the selection criteria.
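
For reference, this is roughly what enabling an inventory report from the CLI looks like. The destination bucket name and the report id below are made up for the example, and the destination bucket also needs a policy that allows S3 to write the reports into it:

aws s3api put-bucket-inventory-configuration \
  --bucket $BUCKET_NAME \
  --id daily-covers-report \
  --inventory-configuration '{
    "Id": "daily-covers-report",
    "IsEnabled": true,
    "IncludedObjectVersions": "Current",
    "Schedule": { "Frequency": "Daily" },
    "Destination": {
      "S3BucketDestination": {
        "Bucket": "arn:aws:s3:::my-inventory-reports",
        "Format": "CSV"
      }
    },
    "OptionalFields": ["Size", "LastModifiedDate", "StorageClass"]
  }'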

To further explore your data distribution, it is recommended to take advantage of S3 Storage Lens, which automatically aggregates usage information and presents it in different dashboards.

S3 as an almost free Database

Get the dataset that we will use for exploring S3 Select:

cd
wget https://github.com/uchidalab/book-dataset/raw/master/Task2/book32-listing.csv
ls -lh # Around 38MB
cat book32-listing.csv | wc -l # Approx. 200K lines
head book32-listing.csv # It is a CSV file

The file is encoded in ISO-8859-1 instead of the supported UTF-8, so we need to convert it:

file -i book32-listing.csv # Unknown encoding
iconv -f latin1  -t UTF-8 book32-listing.csv > book32-listing-utf8.csv
file -i book32-listing-utf8.csv # Now it is UTF-8

We can move it to S3:

aws s3 cp book32-listing-utf8.csv s3://$BUCKET_NAME/

Now we can query the data directly, (almost) for free! And since the result comes back as JSON, it is easy to manipulate with jq:

SQL='SELECT * FROM s3object LIMIT 10'
aws s3api select-object-content \
    --bucket $BUCKET_NAME \
    --key book32-listing-utf8.csv \
    --expression "$SQL" \
    --expression-type 'SQL' \
    --input-serialization '{"CSV": {}, "CompressionType": "NONE"}' \
    --output-serialization '{"JSON": {}}' "output.json" && \
cat output.json | jq

Of course, it is possible to use projections, although aliases cannot be referenced in the WHERE clause:

SQL='SELECT _1 AS amazon_index, _2 AS filename, _3 AS image_url, 
            _4 AS title, _5 AS author,_6 AS category_id, _7 AS category 
     FROM s3object 
     LIMIT 3
'
aws s3api select-object-content \
    --bucket $BUCKET_NAME \
    --key book32-listing-utf8.csv \
    --expression "$SQL" \
    --expression-type 'SQL' \
    --input-serialization '{"CSV": {}, "CompressionType": "NONE"}' \
    --output-serialization '{"JSON": {}}' "output.json" && \
cat output.json | jq

Many operations require proper configuration of the serializers, for example to make the parser aware that some quoted fields contain the record delimiter (a line break). The next query will fail:

SQL='SELECT count(*) FROM s3object'
aws s3api select-object-content \
    --bucket $BUCKET_NAME \
    --key book32-listing-utf8.csv \
    --expression "$SQL" \
    --expression-type 'SQL' \
    --input-serialization '{"CSV": {}, "CompressionType": "NONE"}' \
    --output-serialization '{"JSON": {}}' "output.json" # Fail

But this one will work:

SQL='SELECT count(*) FROM s3object'
aws s3api select-object-content \
    --bucket $BUCKET_NAME \
    --key book32-listing-utf8.csv \
    --expression "$SQL" \
    --expression-type 'SQL' \
    --input-serialization '{"CSV": { "AllowQuotedRecordDelimiter" : true }, "CompressionType": "NONE"}' \
    --output-serialization '{"JSON": {}}' "output.json" && \
cat output.json | jq

It is not possible to use GROUP BY, but you can include WHERE clauses in aggregations:

SQL='SELECT COUNT(*) AS amount 
     FROM s3object 
     WHERE _7 = '\''History'\'''
echo $SQL

aws s3api select-object-content \
    --bucket $BUCKET_NAME \
    --key book32-listing-utf8.csv \
    --expression "$SQL" \
    --expression-type 'SQL' \
    --input-serialization '{"CSV": { "AllowQuotedRecordDelimiter" : true }, "CompressionType": "NONE"}' \
    --output-serialization '{"JSON": {}}' "output.json" && \
cat output.json

Other aggregation operations are available, like MIN, MAX or even SUM. But casting is mandatory, as every field is considered a string by default:

SQL='SELECT MIN(CAST(_6 as INT)) AS minimum, 
            MAX(CAST(_6 AS INT)) AS maximum 
     FROM s3object
'
aws s3api select-object-content \
    --bucket $BUCKET_NAME \
    --key book32-listing-utf8.csv \
    --expression "$SQL" \
    --expression-type 'SQL' \
    --input-serialization '{"CSV": { "AllowQuotedRecordDelimiter" : true }, "CompressionType": "NONE"}' \
    --output-serialization '{"JSON": {}}' "output.json" && \
cat output.json

S3 Select works fine with compressed data, and compression can save most of the storage cost for text files like this one. First, let's compress our file and upload the new version:

gzip -k book32-listing-utf8.csv
ls *.csv* -lh  # Approx. 30% of the original size
aws s3 cp book32-listing-utf8.csv.gz s3://$BUCKET_NAME/

Now it is possible to query that compressed file:

SQL='SELECT MIN(CAST(_6 as INT)) AS minimum, 
            MAX(CAST(_6 AS INT)) AS maximum 
     FROM s3object
'
aws s3api select-object-content \
    --bucket $BUCKET_NAME \
    --key book32-listing-utf8.csv.gz \
    --expression "$SQL" \
    --expression-type 'SQL' \
    --input-serialization '{"CSV": {"AllowQuotedRecordDelimiter" : true }, "CompressionType": "GZIP"}' \
    --output-serialization '{"JSON": {}}' "output.json" && \
cat output.json

Columnar file formats may provide better performance. Let's get a version of our file in Parquet format.

aws s3 cp s3://s3workshopdata/book32-listing-utf8.parquet .

We can check the content of the file with the parq command. (Apologies: as you will see, some rows got duplicated during the conversion.)

parq book32-listing-utf8.parquet
parq book32-listing-utf8.parquet --head 10 # Columnar format!!

Upload the file to your bucket, so you can play with it:

aws s3 cp book32-listing-utf8.parquet s3://$BUCKET_NAME/

S3 Select is perfectly compatible with Parquet, and now it understands both the name and the type of each field:

SQL='SELECT * FROM s3object LIMIT 3'
aws s3api select-object-content \
    --bucket $BUCKET_NAME \
    --key book32-listing-utf8.parquet \
    --expression "$SQL" \
    --expression-type 'SQL' \
    --input-serialization '{"Parquet": { }, "CompressionType": "NONE"}' \
    --output-serialization '{"JSON": {}}' "output.json" && \
cat output.json | jq

An example with aggregated data:

SQL='SELECT MIN(category_id) AS minimum, 
     MAX(category_id) AS maximum 
     FROM s3object
'
aws s3api select-object-content \
    --bucket $BUCKET_NAME \
    --key book32-listing-utf8.parquet \
    --expression "$SQL" \
    --expression-type 'SQL' \
    --input-serialization '{"Parquet": { }, "CompressionType": "NONE"}' \
    --output-serialization '{"JSON": {}}' "output.json" && \
cat output.json | jq

And a fairly fast free-text search (again, sorry for the duplicated rows, my fault):

SQL='SELECT amazon_index, filename, image_url, 
            title, author, category_id, category 
    FROM s3object
    WHERE author like '\''%Javi%'\''
'
aws s3api select-object-content \
    --bucket $BUCKET_NAME \
    --key book32-listing-utf8.parquet \
    --expression "$SQL" \
    --expression-type 'SQL' \
    --input-serialization '{"Parquet": { }, "CompressionType": "NONE"}' \
    --output-serialization '{"JSON": {}}' "output.json" && \
cat output.json | jq '.amazon_index + " " + .title'

For large files, it can be very convenient to parallelize the searches:

FILE_SIZE=$(stat -c %s book32-listing-utf8.parquet) && \
echo The file is $FILE_SIZE bytes long.

for ((i = 0 ; i < 4 ; i++ )) 
do 
  FIRST_BYTE=$(bc <<<  "scale = 10; $FILE_SIZE / 4 * $i")
  FIRST_BYTE=$(printf "%.*f\n" "0" "$FIRST_BYTE")
  LAST_BYTE=$(bc <<< "scale = 10; $FILE_SIZE / 4 * ($i+1) + 1")
  LAST_BYTE=$(printf "%.*f\n" "0" "$LAST_BYTE")
  echo "Job $i: from $FIRST_BYTE to $LAST_BYTE."

  SQL='SELECT amazon_index, filename, image_url, 
              title, author, category_id, category 
      FROM s3object
      WHERE author like '\''%Javi%'\''
  '
  aws s3api select-object-content \
      --scan-range "Start=$FIRST_BYTE,End=$LAST_BYTE" \
      --bucket $BUCKET_NAME \
      --key book32-listing-utf8.parquet \
      --expression "$SQL" \
      --expression-type 'SQL' \
      --input-serialization '{"Parquet": { }, "CompressionType": "NONE"}' \
      --output-serialization '{"JSON": {}}' "output-$i.json" \
      > /dev/null 2>&1 &
done
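
The four Select calls were sent to the background (note the trailing &), so wait for them to finish before touching the partial outputs:

wait
echo "All partial scans finished."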

After all the jobs have completed, it is easy to gather the results. Be aware that records may appear in more than one partial result or be cut at the range boundaries, so be careful about potential duplicates when processing the output.

cat output-*.json | sort | jq
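
If the same record shows up in two adjacent ranges, piping through sort -u drops the exact duplicates (it only removes identical lines, nothing smarter):

cat output-*.json | sort -u | jq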

Range selections

How was the last example possible? S3 supports retrieving byte ranges of an object in general. To check it, let's create a file with known content:

for i in {0..999999}; do printf "%09d\n" $i >> numbers.txt ; done
cat numbers.txt | wc -l
ls -lh numbers.txt
head numbers.txt
tail numbers.txt

Your local operating system can extract information from the file, given that we know exactly how it is structured:

i=0
dd if=numbers.txt skip=$(( i * 10)) count=10 status=none iflag=skip_bytes,count_bytes; 
i=5000
dd if=numbers.txt skip=$(( i * 10)) count=10 status=none iflag=skip_bytes,count_bytes; 

Let's copy the new file to your S3 bucket, and then repeat the previous trick through the API, thanks to range retrievals:

aws s3 cp numbers.txt s3://$BUCKET_NAME/
i=5000
aws s3api get-object \
  --bucket $BUCKET_NAME \
  --key numbers.txt \
  --range bytes=$(( i * 10))-$(( i * 10 + 9 )) n.txt
cat n.txt
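
Under the hood this is a plain HTTP Range request, so any HTTP client can do the same thing against a presigned URL (the 300-second expiry is an arbitrary choice):

URL=$(aws s3 presign s3://$BUCKET_NAME/numbers.txt --expires-in 300)
i=5000
# Ask only for the 10 bytes that contain the line we want
curl -s -r $(( i * 10 ))-$(( i * 10 + 9 )) "$URL"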

Mountpoint: treating S3 as a filesystem

Since S3 became strongly consistent, it is far more suitable to be mounted as a filesystem. Amazon maintains the Mountpoint project to do exactly that... and it works surprisingly well.

mount-s3 --version
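
In case mount-s3 is not installed, AWS distributes it as a .deb/.rpm package. The URL below is the one documented for x86_64 at the time of writing, so check the current Mountpoint documentation if it fails:

wget https://s3.amazonaws.com/mountpoint-s3-release/latest/x86_64/mount-s3.deb
sudo apt-get install -y ./mount-s3.deb
mount-s3 --version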

Let's create a directory to map our bucket in our local filesystem, and another one to store cached metadata and objects:

mkdir .s3-cache
mkdir bucket
mount-s3 $BUCKET_NAME bucket --cache .s3-cache --allow-delete

Do you want to check the result? Be my guest:

ls bucket

You can even create new files, but as soon as the file descriptor is closed, you will not be able to append to it anymore:

echo "Hola" > bucket/greetings.txt
cat bucket/greetings.txt
aws s3 ls s3://$BUCKET_NAME
echo "Adios" > bucket/greetings.txt # fail

But, you know what? It is aware of how to use range selections:

i=70000
dd if=bucket/numbers.txt skip=$(( i * 10)) count=10 status=none iflag=skip_bytes,count_bytes; 

And this gives us a very powerful tool if we combine it with an archive format that keeps an index of member offsets, like zip's central directory. First, let's zip our old friend, the ~/covers/ directory. (And no, it is not possible to zip directly into the bucket folder.)

sudo apt install zip -y
cd covers
zip ../covers.zip *.jpg
cd ..
ls -l covers.zip

The unzip command supports direct extraction of one file. For example:

RANDOM_FILE_NAME=$(ls covers/  |sort -R | tail -1)
echo We will play with $RANDOM_FILE_NAME.
unzip -p covers.zip $RANDOM_FILE_NAME > local-$RANDOM_FILE_NAME
ls -l local-$RANDOM_FILE_NAME

So yes, thanks to mount-s3's range reads we can efficiently extract a subset of files from a zipped archive stored on S3!

aws s3 cp covers.zip s3://$BUCKET_NAME/
ls bucket/*.zip
unzip -p bucket/covers.zip $RANDOM_FILE_NAME > remote-$RANDOM_FILE_NAME
ls -l *$RANDOM_FILE_NAME

And yes, with great power comes great responsibility:

rm -fr bucket/
umount bucket