@Mlawrence95
Created July 27, 2020 22:54
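Given a CSV file that's inside a tar.gz file on AWS S3, read it into a Pandas DataFrame without saving the archive to local disk or extracting its other members. Note that the gzip stream is still read over the network as tarfile scans the archive for the requested member.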
# checked against python 3.7.3, pandas 0.24.2, s3fs 0.4.2
import tarfile
import io
import s3fs
import pandas as pd
tar_path = "s3://my-bucket/debug.tar.gz"  # path in s3
metadata_path = "debug/metadata.csv" # path inside of the tar file
s3 = s3fs.S3FileSystem()
# this is in my experience, but it does work!
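with s3.open(tar_path, 'rb') as debug_tar:
    # wrap the S3 file object so tarfile decompresses it on the fly
    with tarfile.open(mode='r:gz', fileobj=debug_tar) as tar:
        # read only the requested member into memory as bytes
        csv_contents = tar.extractfile(metadata_path).read()

# parse the in-memory bytes as a CSV
df = pd.read_csv(io.BytesIO(csv_contents), encoding='utf8')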
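The snippet above assumes the member exists and is a regular file. If it is missing, `tar.extractfile` raises `KeyError`, and for a directory entry it returns `None`, so `.read()` fails with an `AttributeError`. Below is a minimal sketch of a more defensive variant. The helper names `read_member_bytes` and `build_demo_archive` are hypothetical (not part of the original gist), and the demo uses a small in-memory archive in place of S3 so that it runs without credentials.

```python
import io
import tarfile


def read_member_bytes(fileobj, member_path):
    """Return the raw bytes of one regular file inside a gzip-compressed tar archive.

    `fileobj` is any seekable binary file object, e.g. the one returned by
    `s3fs.S3FileSystem().open(tar_path, 'rb')` or an `io.BytesIO`.
    """
    with tarfile.open(mode="r:gz", fileobj=fileobj) as tar:
        try:
            # getmember scans the archive headers and raises KeyError if absent
            member = tar.getmember(member_path)
        except KeyError:
            raise FileNotFoundError(f"{member_path!r} is not in the archive") from None
        extracted = tar.extractfile(member)
        if extracted is None:
            # extractfile returns None for directories and other non-file entries
            raise ValueError(f"{member_path!r} is not a regular file")
        return extracted.read()


def build_demo_archive():
    """Build a tiny tar.gz in memory (hypothetical stand-in for the S3 object)."""
    payload = b"id,name\n1,alpha\n2,beta\n"
    buffer = io.BytesIO()
    with tarfile.open(mode="w:gz", fileobj=buffer) as tar:
        info = tarfile.TarInfo(name="debug/metadata.csv")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)  # rewind so the archive can be read from the start
    return buffer


csv_bytes = read_member_bytes(build_demo_archive(), "debug/metadata.csv")
print(csv_bytes.decode("utf8"))
```

With the gist's own names, the same helper could be used as `csv_bytes = read_member_bytes(s3.open(tar_path, 'rb'), metadata_path)` followed by `df = pd.read_csv(io.BytesIO(csv_bytes), encoding='utf8')`.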