@cathalgarvey
Created April 5, 2017 12:09
How to Read Lines from GZIP-Compressed Files in Go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"log"
	"os"
)

// GZLines iterates over lines of a file that's gzip-compressed.
// Iterating lines of an io.Reader is one of those things that Go
// makes needlessly complex.
func GZLines(filename string) (chan []byte, chan error, error) {
	rawf, err := os.Open(filename)
	if err != nil {
		return nil, nil, err
	}
	rawContents, err := gzip.NewReader(rawf)
	if err != nil {
		rawf.Close() // Don't leak the file when the gzip header is invalid.
		return nil, nil, err
	}

	contents := bufio.NewScanner(rawContents)
	cbuffer := make([]byte, 0, bufio.MaxScanTokenSize)
	// Raise the token limit from the 64 KiB default to 50 times that.
	// A longer line still stops the scan with bufio.ErrTooLong.
	contents.Buffer(cbuffer, bufio.MaxScanTokenSize*50)

	ch := make(chan []byte)
	errs := make(chan error)
	go func(ch chan []byte, errs chan error, contents *bufio.Scanner) {
		defer func(ch chan []byte, errs chan error) {
			rawContents.Close()
			rawf.Close()
			close(ch)
			close(errs)
		}(ch, errs)
		var (
			err error
		)
		for contents.Scan() {
			// Bytes() is only valid until the next Scan(), so send a copy.
			ch <- append([]byte(nil), contents.Bytes()...)
		}
		if err = contents.Err(); err != nil {
			errs <- err
			return
		}
	}(ch, errs, contents)
	return ch, errs, nil
}

func main() {
	if len(os.Args) < 2 {
		log.Fatalf("usage: %s FILE.gz", os.Args[0])
	}
	fmt.Printf("Called on: %+v\n", os.Args)
	lines, errors, err := GZLines(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	done := make(chan struct{})
	go func(errs chan error) {
		defer close(done)
		// A closed errs channel yields nil, which means "no error".
		if err := <-errs; err != nil {
			log.Fatal(err)
		}
	}(errors)
	for foo := range lines {
		fmt.Printf("%+v\n", string(foo))
	}
	<-done // Wait so that a late scan error is reported before exiting.
}
@dolmen

dolmen commented Apr 15, 2017

My review:

  • give some air to your code: use empty lines
  • move os.Open out of GZLines, and change GZLines to accept an io.Reader: this will make GZLines much more reusable
  • line 25 (the contents.Buffer call): very long lines (longer than bufio.MaxScanTokenSize*50) will still stop the scanner, with bufio.ErrTooLong
  • line 33: remove the err declaration, and use := at line 39 instead (the contents.Err() check)
  • main: do not use a goroutine. Instead use a select block to read from either lines or errors.

@seebs

seebs commented Sep 25, 2018

I can't find input data for which the explicit buffer allocation seems to make a difference, unless you're hitting input data with lines over 64k. But it shouldn't matter whether you're doing this to gzipped or uncompressed data.

@lovasoa

lovasoa commented Nov 17, 2020

I made a new version with the suggested changes:

https://gist.github.com/lovasoa/38a207ecdefa1d60225403a644800818
