miku/.gitignore Secret

## .gitignore
top-1m.csv.zip
Notes.md

## 20200821_163807_01-1024x768.jpg

      
    Raw
  

              20200821_163807_01-1024x768.jpg
            
          
## 20200821_163807_01-1024x768_m.jpg

      
    Raw
  

              20200821_163807_01-1024x768_m.jpg
            
          
## 20200821_163807_01-1024x768_s.jpg

      
    Raw
  

              20200821_163807_01-1024x768_s.jpg
            
          
## README.md

      
    Raw
  

              README.md
            
          
    Fast, parallel filters in Go


Lightning Talk, Leipzig Gophers, 2020-12-18, 19:00 CET, https://golangleipzig.space

The Filter

From Software Tools, 1976, Chapter 2 ("Filters"):

By obvious analogy to electronics (or plumbing) we call such programs filter,
because they make useful changes to a stream of data passing through.


The examples in the book use a Ratfor ("Rational FORTRAN") - which is quite readable.
Also:

The trick of getting most filters right is to find an orderly way of
recognizing the components of the input stream.

Examples that chapter mentions:

wordcount (words)
entab, detab
overstrike (the overstrike looks for backspaces in typewriter text ...)


text compression with run length encoding (runs of one or more identical characters), compress, expand

On page 45, we find a quite modern term:

[...] This is what is known as "defensive programming". It costs next to
nothing in source text or execution time, yet it reduces the chance of the
program going wild should an important control variable somehow be damaged.


encryption (xor,
Diffie-Hellman just
came out in Nov '76 - although born in May
'75),
crypt
character transliteration, translit, lives on today in tr ('74, McIlroy)

Some general design principle for args handling:

An ususual argument is given some reasonable interpretation whenever
possible, and a harmless interpretation otherwise.

On page 63, finally:

Now for the payoff. We can use charcount in series with translit to provide all sorts of useful information.

translit ¬@n | charcount


We call this contruction a pipeline.

Snippets from the summary:

By pushing information about particular devices as far our to the edges as
possible, we expand the range of programs that can freely cooperate.


[...] Once you learn that you can isolate and adapt by introducing filters,
you begin to think more freely in terms of combining existing programs
instead of writing new ones. You overcome much of the temptation to build a
whole new package; instead you adapt pieces that already exist. You become,
in short, more of a tool user.

The other chapters of the book are concerned with implementations of text patterns (grep),
editing (ed), formatting (roff) (TeX was still in the
future), macro processing, Ratfor-Fortran translator.
It also is about "clean code", in that it is concerned with approriate data
structures, defensive programming, cohesion (reason for being an entity of its
own), short subroutines, readability - which make software more robust (as you
have an easier time understanding the components); and finally, composability
on the user level (i.e. pipes and filters).
Principles:

keep it simple
build it in stages
let someone else do the hard part

Tiny library: parallel

Problem: Process larger amounts of data (e.g. 10k or 10M records), with a filter.

components coupled by a byte stream reminds me of the
io interfaces

I wanted to have a minimal interface for

reading a stream
applying code in parallel
writing results

Decouple form (how) and content (what)


specify tranformation function
do not care about how and when it is executed

Example noop transformation:
func Noop(b []byte) ([]byte, error) {
    return b, nil
}
Usage

Design is minimalistic (but could be reduced further, I guess):
p := parallel.NewProcessor(os.Stdin, os.Stdout, Noop)
if err := p.Run(); err != nil {
    log.Fatal(err)
}
Have a processor struct that encapsulates streams and function.
type Processor struct {
    BatchSize       int
    RecordSeparator byte
    NumWorkers      int
    SkipEmptyLines  bool
    r               io.Reader
    w               io.Writer
    f               TransformerFunc
}
Let p.Run() setup channels, reading and batching.
Examples

UPPERCASE EVERYTHING.
package main

import (
	"bytes"
	"log"
	"os"

	"github.com/miku/parallel"
)

func main() {
	p := parallel.NewProcessor(os.Stdin, os.Stdout, func(p []byte) ([]byte, error) {
		return bytes.ToUpper(p), nil
	})
	if err := p.Run(); err != nil {
		log.Fatal(err)
	}
}
Running:
$ go run xu.go < xu.go | head -7
PACKAGE MAIN
IMPORT (
        "BYTES"
        "LOG"
        "OS"
        "GITHUB.COM/MIKU/PARALLEL"
)
Fetch links in parallel (batch size of 1, many parallel workers). We can use a generic fetch function.
func Fetch(link string) (*FetchResult, error) {
	start := time.Now()
	resp, err := client.Get(link)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	n, err := io.Copy(ioutil.Discard, resp.Body)
	if err != nil {
		return nil, err
	}
	elapsed := time.Since(start)
	return &FetchResult{
		URL:        link,
		StatusCode: resp.StatusCode,
		Length:     n,
		Elapsed:    fmt.Sprintf("%0.2f", elapsed.Seconds()),
	}, nil
}
Then adapt input, output and error handling in the parallel processor.
func main() {
	p := parallel.NewProcessor(strings.NewReader(input), os.Stdout, func(b []byte) ([]byte, error) {
		link := string(bytes.TrimSpace(b))
		if len(link) == 0 {
			return nil, nil
		}
		r, err := Fetch(link)
		if err != nil {
			return nil, nil
		}
		return MarshalEnd(r, []byte("\n"))
	})
	p.BatchSize = 1
	p.NumWorkers = 128
	if err := p.Run(); err != nil {
		log.Fatal(err)
	}
}
Hello, Pipes!
$ unzip -p top-1m.csv.zip | cut -d , -f 2 | shuf -n 10 | go run xf.go
{"url":"http://sancharika.org","status":200,"length":341,"elapsed":"0.48"}
{"url":"http://kmhd.link","status":200,"length":10677,"elapsed":"0.48"}
{"url":"http://allcracksoft.com","status":200,"length":32186,"elapsed":"0.58"}
{"url":"http://medow.club","status":200,"length":44375,"elapsed":"0.61"}
{"url":"http://reciclapet.com.ec","status":200,"length":616,"elapsed":"0.63"}
{"url":"http://twoscotsabroad.com","status":200,"length":138299,"elapsed":"0.68"}
{"url":"http://projectcasting.com","status":200,"length":96820,"elapsed":"0.98"}
{"url":"http://naim.guru","status":200,"length":68670,"elapsed":"1.93"}

Implementation


uses batching for performance
https://github.com/miku/parallel


## a5439249eca36e985c4d3dae69a8ff95.png

      
    Raw
  

              a5439249eca36e985c4d3dae69a8ff95.png
            
          
## banner.png

      
    Raw
  

              banner.png
            
          
## go.mod

      
    Raw
  

              go.mod
            
          
            View raw
              (Sorry about that, but we can’t show files that are this big right now.)
        
    
## go.sum

      
    Raw
  

              go.sum
            
          
            View raw
              (Sorry about that, but we can’t show files that are this big right now.)
        
    
## gophertools.png

      
    Raw
  

              gophertools.png
            
          
## xf.go

      
    Raw
  

              xf.go
            
          
            View raw
              (Sorry about that, but we can’t show files that are this big right now.)
        
    
## xu.go

      
    Raw
  

              xu.go
            
          
            View raw
              (Sorry about that, but we can’t show files that are this big right now.)