@john-science
Last active May 27, 2021 16:47
Reading & Writing GZIP Files Faster in Python


I have been testing various ways to read and write text files with GZIP in Python. Most of the results were uninteresting, but two seemed worth sharing.

Writing GZIP files

If you have a big list of strings to write to a file, you might be tempted to do:

import gzip

f = gzip.open(out_path, 'wb')
for line in lines:
    f.write(line)
f.close()

But it turns out that it is 10-20% faster to do:

import gzip

f = gzip.open(out_path, 'wb')
try:
    f.writelines(lines)
finally:
    f.close()
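
As an aside, the same thing can be written with a context manager, which closes the file even if an exception is raised (a minimal sketch, using the same out_path and lines as above):

import gzip

# the with-block closes the file automatically, even on error
with gzip.open(out_path, 'wb') as f:
    f.writelines(lines)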

Reading GZIP files

If you have a big GZIP file to read (text, not binary), you might be tempted to read it like:

import gzip
f = gzip.open(in_path, 'rb')
for line in f.readlines():
    pass  # do stuff with line
f.close()

But it turns out it can be up to 3 times faster to read it like:

import gzip
import io

gz = gzip.open(in_path, 'rb')
f = io.BufferedReader(gz)
for line in f.readlines():
    pass  # do stuff with line
gz.close()
@markjay4k

It might even be quicker to do

for line in f:

instead of

for line in f.readlines()

You should save the time spent loading the result of f.readlines() into memory.
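
For illustration, a minimal sketch combining this suggestion with the BufferedReader approach above (same in_path as before):

import gzip
import io

# iterate the buffered reader directly; lines are streamed one at a
# time instead of being loaded into a single list by readlines()
with gzip.open(in_path, 'rb') as gz:
    f = io.BufferedReader(gz)
    for line in f:
        pass  # do stuff with line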

@mschmo

mschmo commented May 22, 2018

Hello, can you provide some more information on the methods you used to gather those benchmark results?

  1. What version of Python?
  2. What was the content of lines?

I ran a quick test with Python 3.6 where lines was 100k records long, and iterating over lines with write() actually turned out to be more performant:

import gzip
import time
from statistics import mean, stdev

LINES = [b'I am a test line' for _ in range(100_000)]


def test_iter_write():
    f = gzip.open('./test_iter_write.txt.gz', 'wb')
    for line in LINES:
        f.write(line)
    f.close()

def test_writelines():
    f = gzip.open('./test_writelines.txt.gz', 'wb')
    try:
        f.writelines(LINES)
    finally:
        f.close()


if __name__ == '__main__':
    stats = {}
    funcs = (test_iter_write, test_writelines)
    for _ in range(10):
        for func in funcs:
            t = time.process_time()
            func()
            elapsed_time = time.process_time() - t
            trials = stats.setdefault(func.__name__, [])
            trials.append(elapsed_time)
    for func in funcs:
        f_name = func.__name__
        stat = stats[f_name]
        print(f'{f_name}: avg = {mean(stat):.6f} | stdev = {stdev(stat):.6f}')
$ python bench_gzip_write.py
test_iter_write: avg = 0.221160 | stdev = 0.010759
test_writelines: avg = 0.231999 | stdev = 0.011253
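
As a side note, here is a minimal alternative sketch for the same measurement using the standard timeit module, which handles the repetition and temporarily disables garbage collection while timing (the function names are the ones defined above):

import timeit

# timeit disables GC during timing and repeats the measurement for you
for func in (test_iter_write, test_writelines):
    times = timeit.repeat(func, number=1, repeat=10)
    print(f'{func.__name__}: best = {min(times):.6f}s')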

@soulmachine

You need line.encode("utf-8") to encode str to bytes, since the file is opened in binary ('wb') mode.
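
For example, a minimal sketch of both options, assuming lines holds str objects and out_path is as above:

import gzip

# option 1: encode each str to bytes for the binary-mode file
with gzip.open(out_path, 'wb') as f:
    f.writelines(line.encode('utf-8') for line in lines)

# option 2 (Python 3): open in text mode and let gzip encode for you
with gzip.open(out_path, 'wt', encoding='utf-8') as f:
    f.writelines(lines)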

@jsookikian

How big are the files that you are running this test on? I am working with pretty large files (20 GB and up) and wanted to try this out to see if it makes a difference.

@songsongyoon

What are "out_path" and "in_path"??
I typed (r"C:\Users---------------")

but Python kept saying "SyntaxError: invalid syntax"

Do you know what the reason is?

@john-science
Author

for line in f:
instead of
for line in f.readlines()

Ah, yes. That's a good point.

I wrote this 3 years ago, and I wonder if at the time I was still using Python 2. In Python 3 I certainly agree with you. Perhaps the benchmarks on this page need to be re-run.

@john-science
Author

what is "out_path" and "in_path"??

Those are paths to the file I am trying to GZIP (out_path) or GUNZIP (in_path).
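
For example (hypothetical Windows paths; assign them to variables before calling gzip.open):

# hypothetical example paths -- use your own
in_path = r"C:\Users\me\data\big_file.txt.gz"   # existing GZIP file to read
out_path = r"C:\Users\me\data\output.txt.gz"    # GZIP file to write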

@songsongyoon

thanks!

@hevp

hevp commented May 27, 2021

If you are writing lines coming from different processes in your script, you can use a buffered writer as well:

import gzip
import io

with gzip.open(out_path, 'wb') as f:
    bw = io.BufferedWriter(f)
    for line in lines:
        bw.write(line)
    bw.flush()  # flush the buffer before the gzip file closes

This will speed up writing files significantly.
