
@ervinne13
Last active April 19, 2022 10:05
A quick and easy CSV to Parquet converter from one bucket to another. It can be attached to a Lambda function that is triggered whenever an s3:PutObject event occurs on the source bucket.
from io import BytesIO
from os import environ

import boto3
import pandas as pd


def convert(bucket, key):
    # Read the source CSV straight from S3 into a DataFrame
    s3_client = boto3.client('s3', region_name=environ['REGION'])
    s3_object = s3_client.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(s3_object['Body'])

    # Parquet requires string column names
    df.columns = df.columns.astype(str)

    target_bucket = f"{bucket}-parquet"
    target_key = f"{key.split('-')[0]}.parquet"

    # Serialize to Parquet in memory, then upload the buffer to the target bucket
    parquet_out_buffer = BytesIO()
    df.to_parquet(parquet_out_buffer, index=False, engine='fastparquet')

    s3_res = boto3.resource('s3')
    s3_res.Object(target_bucket, target_key).put(Body=parquet_out_buffer.getvalue())

    return {
        'bucket': target_bucket,
        'key': target_key
    }
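
A minimal sketch of how this could be wired into a Lambda entry point, assuming the function is subscribed to the source bucket's s3:PutObject notifications. The handler name and the event parsing follow the standard S3 event notification shape; this is illustrative, not part of the original gist.

from urllib.parse import unquote_plus


def lambda_handler(event, context):
    # An S3 event notification can carry multiple records
    results = []
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys in S3 events are URL-encoded (e.g. spaces arrive as '+')
        key = unquote_plus(record['s3']['object']['key'])
        results.append(convert(bucket, key))
    return results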
@ervinne13 (Author)

By the way, create a REGION environment variable in your Lambda function's configuration.
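
If that variable is missing, the environ['REGION'] lookup raises a KeyError, so a defensive fallback can help. A minimal sketch (not part of the original gist) that falls back to the AWS_REGION variable the Lambda runtime sets automatically:

import boto3
from os import environ

# Prefer the custom REGION variable, fall back to the one the Lambda runtime provides
region = environ.get('REGION') or environ.get('AWS_REGION')
s3_client = boto3.client('s3', region_name=region)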

@ervinne13 (Author)

Also using fastparquet instead of pyarrow, since pyarrow pushes the deployment package over the 50 MB limit even with auto layers on.
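
For reference, a minimal dependency sketch under that size constraint (illustrative only; boto3 is already bundled with the Lambda Python runtime, so only the data libraries need to be packaged):

# requirements.txt (illustrative)
pandas
fastparquet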
