@iamaziz
Last active November 22, 2022 14:45
Read csv files from tar.gz in S3 into pandas dataframes without untar or download (using S3FS, tarfile, io, and pandas)
# -- read csv files from tar.gz in S3 with S3FS and tarfile (https://s3fs.readthedocs.io/en/latest/)
import io
import tarfile

import pandas as pd
import s3fs

bucket = 'mybucket'
key = 'mycompressed_csv_files.tar.gz'

fs = s3fs.S3FileSystem()
# open the archive as a binary file-like object; nothing is written to local disk
with fs.open(f'{bucket}/{key}', 'rb') as f:
    # tarfile.open() expects a file object via `fileobj`, not as the first (name) argument
    with tarfile.open(fileobj=f, mode='r:gz') as tar:
        csv_files = [m.name for m in tar.getmembers() if m.name.endswith('.csv')]
        csv_file = csv_files[0]  # here we read first csv file only
        csv_contents = tar.extractfile(csv_file).read()

# parse the extracted bytes in memory
df = pd.read_csv(io.BytesIO(csv_contents), encoding='utf8')
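The snippet above reads only the first csv file. To read every csv member, one option is a small generator like the sketch below. The names `iter_csv_members` and `_demo_archive` are hypothetical and not part of the gist, and the in-memory demo archive exists only so the sketch runs without S3 access. It uses just the standard library and yields raw bytes, which can then be parsed the same way as above.

```python
import io
import tarfile


def iter_csv_members(fileobj):
    """Yield (member_name, raw_bytes) for each .csv file in a gzipped tar archive.

    `fileobj` is any binary file-like object (hypothetical helper, not from the gist).
    """
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            # skip directories and non-csv members
            if member.isfile() and member.name.endswith(".csv"):
                yield member.name, tar.extractfile(member).read()


def _demo_archive():
    """Build a small in-memory tar.gz that stands in for the S3 object."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        files = [("a.csv", "x,y\n1,2\n"), ("notes.txt", "skip me"), ("b.csv", "x,y\n3,4\n")]
        for name, text in files:
            data = text.encode("utf8")
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


# map each csv member name to its raw bytes
csv_bytes = dict(iter_csv_members(_demo_archive()))
```

With S3, the same helper should accept the object returned by `fs.open(f'{bucket}/{key}', 'rb')`, and each payload can be turned into a dataframe with `pd.read_csv(io.BytesIO(data), encoding='utf8')`. Note that this holds every csv in memory at once, so it suits archives of modest size.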
@asterios-pantousas

Thank you very much for sharing. Much appreciated!