@alecbw
Last active December 6, 2020 04:12
Uses S3 Select to count the rows of a CSV in S3. Up to 15x faster when run locally.
import boto3


def get_row_count_of_s3_csv(bucket_name, path):
    # With FileHeaderInfo="Use", the header line is not counted as a row
    sql_stmt = """SELECT count(*) FROM s3object """
    req = boto3.client('s3').select_object_content(
        Bucket=bucket_name,
        Key=path,
        ExpressionType="SQL",
        Expression=sql_stmt,
        InputSerialization={"CSV": {"FileHeaderInfo": "Use", "AllowQuotedRecordDelimiter": True}},
        OutputSerialization={"CSV": {}},
    )
    # The payload is an event stream; only "Records" events carry the result
    row_count = next(int(x["Records"]["Payload"]) for x in req["Payload"] if "Records" in x)
    return row_count
# Note: data transfer within AWS (e.g. Lambda <-> S3) is much faster than egress out of AWS,
# so this optimization matters less for intra-AWS use cases.
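
The response's Payload is an event stream that mixes Records events with Stats, Progress, Cont and End events, and the query output can in principle arrive split across more than one Records event. Below is a minimal sketch of a more defensive parser. The helper name `parse_select_count` and the `fake_events` list are hypothetical and exist only for illustration; the fake events imitate the shape of the real stream so the logic can be tried without AWS credentials.

```python
def parse_select_count(events):
    """Join the Records payloads of an S3 Select event stream and parse them as an int."""
    chunks = []
    for event in events:
        # Only Records events carry query output; Stats, Progress, Cont and End do not.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    if not chunks:
        raise ValueError("no Records event in S3 Select response")
    # CSV output ends with a record delimiter (newline), so strip before converting.
    return int(b"".join(chunks).decode("utf-8").strip())


# Hypothetical event stream, shaped like the events yielded by response["Payload"].
fake_events = [
    {"Records": {"Payload": b"12"}},
    {"Records": {"Payload": b"34\n"}},
    {"Stats": {"Details": {"BytesScanned": 100, "BytesProcessed": 100, "BytesReturned": 5}}},
    {"End": {}},
]
print(parse_select_count(fake_events))
```

Assuming this helper is in scope, the last two lines of `get_row_count_of_s3_csv` could be replaced by `return parse_select_count(req["Payload"])`.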