Skip to content

Instantly share code, notes, and snippets.

@miku

miku/.gitignore Secret

Last active March 28, 2024 09:33
Show Gist options
  • Save miku/738f361c8156264626c74f9b717927ff to your computer and use it in GitHub Desktop.
Save miku/738f361c8156264626c74f9b717927ff to your computer and use it in GitHub Desktop.
Fast, parallel filters in Go
top-1m.csv.zip
Notes.md

Fast, parallel filters in Go

Lightning Talk, Leipzig Gophers, 2020-12-18, 19:00 CET, https://golangleipzig.space

The Filter

From Software Tools, 1976, Chapter 2 ("Filters"):

By obvious analogy to electronics (or plumbing) we call such programs filter, because they make useful changes to a stream of data passing through.

The examples in the book use a Ratfor ("Rational FORTRAN") - which is quite readable.

Also:

The trick of getting most filters right is to find an orderly way of recognizing the components of the input stream.

Examples that chapter mentions:

  • wordcount (words)
  • entab, detab
  • overstrike (the overstrike looks for backspaces in typewriter text ...)

  • text compression with run length encoding (runs of one or more identical characters), compress, expand

On page 45, we find a quite modern term:

[...] This is what is known as "defensive programming". It costs next to nothing in source text or execution time, yet it reduces the chance of the program going wild should an important control variable somehow be damaged.

Some general design principle for args handling:

An ususual argument is given some reasonable interpretation whenever possible, and a harmless interpretation otherwise.

On page 63, finally:

Now for the payoff. We can use charcount in series with translit to provide all sorts of useful information.

translit ¬@n | charcount

We call this contruction a pipeline.

Snippets from the summary:

By pushing information about particular devices as far our to the edges as possible, we expand the range of programs that can freely cooperate.

[...] Once you learn that you can isolate and adapt by introducing filters, you begin to think more freely in terms of combining existing programs instead of writing new ones. You overcome much of the temptation to build a whole new package; instead you adapt pieces that already exist. You become, in short, more of a tool user.

The other chapters of the book are concerned with implementations of text patterns (grep), editing (ed), formatting (roff) (TeX was still in the future), macro processing, Ratfor-Fortran translator.

It also is about "clean code", in that it is concerned with approriate data structures, defensive programming, cohesion (reason for being an entity of its own), short subroutines, readability - which make software more robust (as you have an easier time understanding the components); and finally, composability on the user level (i.e. pipes and filters).

Principles:

  • keep it simple
  • build it in stages
  • let someone else do the hard part

Tiny library: parallel

Problem: Process larger amounts of data (e.g. 10k or 10M records), with a filter.

  • components coupled by a byte stream reminds me of the io interfaces

I wanted to have a minimal interface for

  • reading a stream
  • applying code in parallel
  • writing results

Decouple form (how) and content (what)

  • specify tranformation function
  • do not care about how and when it is executed

Example noop transformation:

func Noop(b []byte) ([]byte, error) {
    return b, nil
}

Usage

Design is minimalistic (but could be reduced further, I guess):

p := parallel.NewProcessor(os.Stdin, os.Stdout, Noop)
if err := p.Run(); err != nil {
    log.Fatal(err)
}

Have a processor struct that encapsulates streams and function.

type Processor struct {
    BatchSize       int
    RecordSeparator byte
    NumWorkers      int
    SkipEmptyLines  bool
    r               io.Reader
    w               io.Writer
    f               TransformerFunc
}

Let p.Run() setup channels, reading and batching.

Examples

UPPERCASE EVERYTHING.

package main

import (
	"bytes"
	"log"
	"os"

	"github.com/miku/parallel"
)

func main() {
	p := parallel.NewProcessor(os.Stdin, os.Stdout, func(p []byte) ([]byte, error) {
		return bytes.ToUpper(p), nil
	})
	if err := p.Run(); err != nil {
		log.Fatal(err)
	}
}

Running:

$ go run xu.go < xu.go | head -7
PACKAGE MAIN
IMPORT (
        "BYTES"
        "LOG"
        "OS"
        "GITHUB.COM/MIKU/PARALLEL"
)

Fetch links in parallel (batch size of 1, many parallel workers). We can use a generic fetch function.

func Fetch(link string) (*FetchResult, error) {
	start := time.Now()
	resp, err := client.Get(link)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	n, err := io.Copy(ioutil.Discard, resp.Body)
	if err != nil {
		return nil, err
	}
	elapsed := time.Since(start)
	return &FetchResult{
		URL:        link,
		StatusCode: resp.StatusCode,
		Length:     n,
		Elapsed:    fmt.Sprintf("%0.2f", elapsed.Seconds()),
	}, nil
}

Then adapt input, output and error handling in the parallel processor.

func main() {
	p := parallel.NewProcessor(strings.NewReader(input), os.Stdout, func(b []byte) ([]byte, error) {
		link := string(bytes.TrimSpace(b))
		if len(link) == 0 {
			return nil, nil
		}
		r, err := Fetch(link)
		if err != nil {
			return nil, nil
		}
		return MarshalEnd(r, []byte("\n"))
	})
	p.BatchSize = 1
	p.NumWorkers = 128
	if err := p.Run(); err != nil {
		log.Fatal(err)
	}
}

Hello, Pipes!

$ unzip -p top-1m.csv.zip | cut -d , -f 2 | shuf -n 10 | go run xf.go
{"url":"http://sancharika.org","status":200,"length":341,"elapsed":"0.48"}
{"url":"http://kmhd.link","status":200,"length":10677,"elapsed":"0.48"}
{"url":"http://allcracksoft.com","status":200,"length":32186,"elapsed":"0.58"}
{"url":"http://medow.club","status":200,"length":44375,"elapsed":"0.61"}
{"url":"http://reciclapet.com.ec","status":200,"length":616,"elapsed":"0.63"}
{"url":"http://twoscotsabroad.com","status":200,"length":138299,"elapsed":"0.68"}
{"url":"http://projectcasting.com","status":200,"length":96820,"elapsed":"0.98"}
{"url":"http://naim.guru","status":200,"length":68670,"elapsed":"1.93"}

Implementation

View raw

(Sorry about that, but we can’t show files that are this big right now.)

View raw

(Sorry about that, but we can’t show files that are this big right now.)

View raw

(Sorry about that, but we can’t show files that are this big right now.)

View raw

(Sorry about that, but we can’t show files that are this big right now.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment