@turtlemonvh
Created January 22, 2019 22:28
Counts for nested data in AWS S3
import boto3
from collections import Counter

"""
Expand a directory-like listing of S3 keys that use "/" as a separator.
Similar to `tree -L 2 prefix/` in *nix.
"""
s3 = boto3.client('s3')
bucket_name = "XXX" # s3 bucket name
starting_prefix = "YYY" # prefix to look under in the bucket
# Get the prefixes on the first level
prefixes = (
    key['Prefix']
    for key in s3.list_objects_v2(
        Bucket=bucket_name, Delimiter="/", Prefix=starting_prefix
    )['CommonPrefixes']
)
# Expand the list of prefixes with the next level into a flattened list
# Note that you can use something similar to this to continue to expand your prefixes more levels
expanded_prefixes = (
    key['Prefix']
    for prefix in prefixes
    for key in s3.list_objects_v2(
        Bucket=bucket_name, Delimiter="/", Prefix=prefix
    )['CommonPrefixes']
)
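The comment above notes that the same pattern can be repeated to expand prefixes additional levels. A minimal sketch of that generalization, assuming a boto3 S3 client is passed in (the `expand_prefixes` helper and its signature are my own, not part of the gist):

```python
# Hypothetical helper (not in the original gist): repeat the CommonPrefixes
# listing to walk `levels` further levels below the given prefixes.
def expand_prefixes(s3, bucket_name, prefixes, levels=1):
    for _ in range(levels):
        prefixes = [
            key['Prefix']
            for prefix in prefixes
            for key in s3.list_objects_v2(
                Bucket=bucket_name, Delimiter="/", Prefix=prefix
            ).get('CommonPrefixes', [])  # key is absent when a prefix has no sub-prefixes
        ]
    return prefixes
```

With `levels=1` this produces the same values as `expanded_prefixes` above, but materialized as a list so each pass can be re-iterated.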
# Count the number of times each 2nd-level value shows up.
# Helpful if your data is laid out like "{PREFIX}/{UUID}/{DATE}" and you want to see
# how many distinct UUIDs exist for each DATE.
Counter(p.split("/")[-2] for p in expanded_prefixes)
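One caveat: `list_objects_v2` returns at most 1000 results per call, so `CommonPrefixes` can be truncated for buckets with many prefixes at one level. boto3's paginator follows the continuation tokens automatically; a sketch of a paginated variant of the first-level listing (the `list_prefixes_paginated` name is mine):

```python
# Sketch using boto3's standard paginator API; collects all prefixes at one
# level even when there are more than 1000 of them.
def list_prefixes_paginated(s3, bucket_name, prefix):
    paginator = s3.get_paginator('list_objects_v2')
    return [
        key['Prefix']
        for page in paginator.paginate(Bucket=bucket_name, Delimiter="/", Prefix=prefix)
        for key in page.get('CommonPrefixes', [])
    ]
```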