@iamaziz
Last active November 22, 2022 14:45
Read CSV files from a tar.gz archive in S3 into pandas DataFrames without untarring or downloading (using s3fs, tarfile, io, and pandas)
# -- read csv files from tar.gz in S3 with s3fs and tarfile (https://s3fs.readthedocs.io/en/latest/)
import io
import tarfile

import pandas as pd
import s3fs

bucket = 'mybucket'
key = 'mycompressed_csv_files.tar.gz'

fs = s3fs.S3FileSystem()
with fs.open(f'{bucket}/{key}', 'rb') as f:
    # tarfile needs the handle via fileobj=; passing it positionally
    # (the `name` parameter) raises TypeError for non-path objects.
    with tarfile.open(fileobj=f, mode='r:gz') as tar:
        csv_files = [m.name for m in tar.getmembers() if m.name.endswith('.csv')]
        csv_file = csv_files[0]  # here we read the first csv file only
        csv_contents = tar.extractfile(csv_file).read()

df = pd.read_csv(io.BytesIO(csv_contents), encoding='utf8')
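The same pattern extends to reading every CSV member of the archive, not just the first. A minimal local sketch, using an in-memory `io.BytesIO` archive in place of the s3fs handle (tarfile only needs a file-like object, so the S3 file from the snippet above would slot in the same way; the member names and data here are made up for illustration):

```python
import io
import tarfile

import pandas as pd

# Build a tar.gz in memory with two CSV members (stands in for the S3 object).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w:gz') as tar:
    for name, text in [('a.csv', 'x,y\n1,2\n'), ('b.csv', 'x,y\n3,4\n')]:
        data = text.encode('utf8')
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

# Read every .csv member into a dict of DataFrames; `buf` would be the
# s3fs file handle in the original snippet.
dfs = {}
with tarfile.open(fileobj=buf, mode='r:gz') as tar:
    for member in tar.getmembers():
        if member.name.endswith('.csv'):
            dfs[member.name] = pd.read_csv(tar.extractfile(member))

print(sorted(dfs))  # ['a.csv', 'b.csv']
```

`tar.extractfile(member)` returns a file-like object, so it can go straight into `pd.read_csv` without the intermediate `.read()` / `io.BytesIO` round-trip.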
@avriiil

avriiil commented Jul 21, 2021

Thanks for sharing this gist.
I'm getting a TypeError: expected str, bytes or os.PathLike object, not S3File. Does this work for you?
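That TypeError is most likely what `tarfile.open` raises when the S3File object is passed as the first positional argument (the `name` parameter, which must be a path); passing it as `fileobj=` avoids it. A local reproduction, assuming any file-like object in place of the S3 handle (here the error message names `BytesIO` rather than `S3File`, but the mechanism is the same):

```python
import io
import tarfile

# Any file-like object stands in for the s3fs S3File handle.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w:gz') as tar:
    payload = b'x\n1\n'
    info = tarfile.TarInfo(name='data.csv')
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

# Passing the handle positionally hits the `name` parameter -> TypeError.
raised = False
try:
    tarfile.open(buf, 'r:gz')
except TypeError:
    raised = True

# Passing it as fileobj= works.
buf.seek(0)
with tarfile.open(fileobj=buf, mode='r:gz') as tar:
    names = tar.getnames()  # ['data.csv']
```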

@iamaziz
Author

iamaziz commented Jul 21, 2021

Hey @rrpelgrim, it's been a while since I've used this, but it was working at the time. The new awswrangler package from AWS might be a better option: https://github.com/awslabs/aws-data-wrangler

@avriiil

avriiil commented Jul 22, 2021

Thanks for the tip, taking a look now.

@avriiil

avriiil commented Jul 22, 2021

@iamaziz - any chance you could point me in the right direction within awswrangler? The wr.s3.read_csv doesn't read the .tgz compressed file... would really appreciate it 🙏
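I can't say whether `wr.s3.read_csv` ever grew `.tgz` support, but a fallback that sidesteps awswrangler entirely is to pull the raw archive bytes (via boto3 or s3fs) and hand them to tarfile. A sketch, with the S3 fetch left as a comment since `bucket`/`key` are placeholders:

```python
import io
import tarfile

import pandas as pd

def read_csvs_from_targz(raw: bytes) -> dict:
    """Return {member_name: DataFrame} for every .csv inside a .tar.gz blob."""
    dfs = {}
    with tarfile.open(fileobj=io.BytesIO(raw), mode='r:gz') as tar:
        for member in tar.getmembers():
            if member.name.endswith('.csv'):
                dfs[member.name] = pd.read_csv(tar.extractfile(member))
    return dfs

# With boto3, the raw bytes would come from S3, e.g.:
#   raw = boto3.client('s3').get_object(Bucket=bucket, Key=key)['Body'].read()
#   dfs = read_csvs_from_targz(raw)
```

Note this buffers the whole archive in memory, which is fine for modest files but not for multi-GB archives.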

@asterios-pantousas

Thank you very much for sharing this. Much appreciated!
