A data deduplication example with Go

(Fuzzy) Matching with command line tools and Go

Lightning Talk at Leipzig Gophers #17 2021-04-20 19:00 CEST

Problem

Build a graph data structure from semi-structured data.

Specifically, a citation graph:

Finding outbound references is easier (we only need to look at a paper and its citations); finding inbound references requires knowing about many publications and which papers each of them cites.

A very asymmetric problem.

Here is a sample of inbound references for the 1978 CSP paper.

Inputs

  • a catalog of publications (150M, 800GB)
  • a list of references of widely varying completeness (from structured to string, 2B+, 1TB)

Options

  • A: use a big data framework, like hadoop, spark, flink, ...
  • B: hack your way through with command line tools

How about, ... B?

  • may run on a single machine
  • data may be easier to inspect
  • if your compute cluster is slow, it may even be faster

Oldie (2014), but goldie: Command-line Tools can be 235x Faster than your Hadoop Cluster (still discussed in 2018, 2019)

I found Taco Bell Programming (2010) there (and reddit screams: "Taco Bell Programming (this needs to be up top on /r/programming more often)" (2018, 488+)), ...

The more I write code and design systems, the more I understand that many times, you can achieve the desired functionality simply with clever reconfigurations of the basic Unix tool set.

Welcome to 2021

2021: TB is the new GB

I'm impressionable, so maybe that "235x" article appealed to me.

One especially under-used approach for data processing is using standard shell tools and commands.

Ok, but what and how?

  • extr-awk-t, tr-ansform, tr-ansform, join, join, go!

Extract

  • we need identifiers, e.g. DOI, ARXIV IDs, PUBMED IDs, ...
  • we can run a fuzzy matching process to group similar things and find connections that way

Extract

Given a JSON lines file with structured but partial data.

{
  "biblio": {
    "container_name": "IEEE Transactions on Pattern Analysis and Machine Intelligence",
    "contrib_raw_names": [
      "M Ben-Ezra",
      "S K Nayar"
    ],
    "issue": "6",
    "pages": "689-698",
    "title": "Motion-based motion deblurring",
    "unstructured": "M. Ben-Ezra and S. K. Nayar. Motion-based motion deblurring. ... ",
    "volume": "26",
    "year": 2004
  },
  "index": 0,
  "key": "b0",
  "ref_source": "grobid",
  "release_ident": "26qgat7mzrerjacrlsz3gdmcgy",
  "release_year": 2014,
  "work_ident": "aaaoe2wcbvdjthnv36dlqgkray"
}

We extract a TSV of id, attribute and doc with a zstdcat - tr - jq - sed - awk - Go - sort - zstd pipeline (abbreviated):

$ zstdcat -T0 {input} |
    LC_ALL=C tr -d '\t' |
    parallel -j {n} --block 10M --pipe "jq -rc 'select ... | @tsv'" |
    LC_ALL=C sed 's/\\\\/\\/g' |
    LC_ALL=C awk -F $'\t' -v OFS='\t' '$2=tolower($2)' |
    skate-to-doi -B -S -f 2 |
    LC_ALL=C sort -S 30% --parallel 6 -T {tmpdir} -k2,2 |
    zstd -c -T0 > {output}
  • sort is important; thankfully we have LC_ALL=C and -S (buffer size) - otherwise this would be infeasible.
  • shoutout: zstd

The little Go tool in-between is a DOI cleanup tool.

$ cat fixtures/doi.tsv
1,xx 10.123/12,2
A,2,3
4,10.123/123 abc,12

$ cat fixtures/doi.tsv | skate-to-doi -d , -f 2
1,10.123/12,2
A,,3
4,10.123/123,12

100M lines with cleaned DOIs in 75s (kind of slow, but fast enough).
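
For illustration, here is a minimal sketch of what such a cleanup filter could look like in Go (this is not the actual skate-to-doi source; the hard-coded delimiter, column index and the simplified DOI regex are assumptions):

package main

import (
    "bufio"
    "fmt"
    "os"
    "regexp"
    "strings"
)

// A simplified DOI pattern; real-world DOI matching needs more care.
var doiPat = regexp.MustCompile(`10[.][0-9]{2,6}/[^ ]+`)

func main() {
    var (
        delim = ","                      // assumed default for -d
        field = 2                        // assumed default for -f (1-indexed)
        br    = bufio.NewReader(os.Stdin)
        bw    = bufio.NewWriter(os.Stdout)
    )
    defer bw.Flush()
    for {
        line, err := br.ReadString('\n')
        if err != nil {
            break
        }
        cols := strings.Split(strings.TrimRight(line, "\n"), delim)
        if field <= len(cols) {
            // Keep only the bare DOI; empty the column if no DOI is found.
            cols[field-1] = doiPat.FindString(cols[field-1])
        }
        fmt.Fprintln(bw, strings.Join(cols, delim))
    }
}

The point is mostly that a tiny, single-purpose Go filter drops neatly into the middle of a shell pipeline.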

Joining and Grouping

Once we have two sorted files, joining them becomes a one-liner again.

      A                    B
-----------------    -----------------
ID    KEY1    DOC    ID    KEY1    DOC
ID    KEY1    DOC    ID    KEY2    DOC
ID    KEY2    DOC    ID    KEY3    DOC
ID    KEY2    DOC    ID    KEY4    DOC
ID    KEY2    DOC    ID    KEY5    DOC

For a first pass, we used join(1):

join - join lines of two files on a common field:

And yes:

Important: FILE1 and FILE2 must be sorted on the join fields.

We wanted to do a bit more for each "cluster" of documents (e.g. generate a new document that represents one edge in the citation graph, or some verification of sorts, ...).

Enter Go

We wanted to not just join on the key, but generate a small document that represents the edge between two papers.

So we built a small utility that works like join but allows us to attach additional computation to the group.

      A                    B
-----------------    -----------------
ID    KEY1    DOC    ID    KEY1    DOC |_____ f ....
ID    KEY1    DOC                      |

ID    KEY2    DOC    ID    KEY2    DOC |
ID    KEY2    DOC                      |_____ f ....
ID    KEY2    DOC                      |

ID    KEY3    DOC    ID    KEY3    DOC |_____ f ....

      ....
  • where f is some function, e.g. conversion to an edge

A minimal join-like type

  • working title zipkey

A type that takes two readers, a function to extract a key (currently it must be the same for both readers) and a function to pass the identifier group to (like a callback).

zipper := zipkey.New(r, s, keyFunc, groupFunc)
zipper.Run()

The grouper determines the output.

If you want to stretch terminology a bit, the keyFunc is the mapper, groupFunc the reducer; it's just not distributed at all (but still fast enough).
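
As a usage sketch (assuming the zipkey package shown under "Basic types" below lives at some importable path; the file names, TSV layout and the edge output format here are invented for illustration):

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "os"
    "strings"

    "example.org/zipkey" // assumed import path for the package below
)

func main() {
    a, err := os.Open("a.tsv") // both files sorted by the key column
    if err != nil {
        log.Fatal(err)
    }
    defer a.Close()
    b, err := os.Open("b.tsv")
    if err != nil {
        log.Fatal(err)
    }
    defer b.Close()
    // keyFunc: the join key lives in the second, tab-separated column.
    keyFunc := func(line string) (string, error) {
        parts := strings.Split(line, "\t")
        if len(parts) < 2 {
            return "", fmt.Errorf("short line: %s", line)
        }
        return parts[1], nil
    }
    // groupFunc: emit one (hypothetical) edge document per matched key.
    groupFunc := func(g *zipkey.Group) error {
        if len(g.G0) == 0 || len(g.G1) == 0 {
            return nil // key only present on one side, nothing to join
        }
        return json.NewEncoder(os.Stdout).Encode(map[string]interface{}{
            "key":     g.Key,
            "sources": len(g.G0),
            "targets": len(g.G1),
        })
    }
    zipper := zipkey.New(a, b, keyFunc, groupFunc)
    if err := zipper.Run(); err != nil {
        log.Fatal(err)
    }
}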

Basic types

package zipkey

import (
    "bufio"
    "io"
)

// Group groups items by key and will contain the complete records (e.g. line)
// for further processing.
type Group struct {
    Key string
    G0  []string
    G1  []string
}

type (
    keyFunc   func(string) (string, error)
    groupFunc func(*Group) error
)

// ZipRun reads records (separated by sep) from two readers, extracts a key
// from each record with a keyFunc and collects records from the two streams
// into a Group. A callback groupFunc can be registered, which allows
// customizing the processing of the group. Current limitation: both streams
// need to use the same keyFunc.
type ZipRun struct {
    r0, r1 *bufio.Reader
    kf     keyFunc
    gf     groupFunc
    sep    byte
}

// New creates a new, ready-to-run ZipRun value.
func New(r0, r1 io.Reader, kf keyFunc, gf groupFunc) *ZipRun {
    return &ZipRun{
        r0:  bufio.NewReader(r0),
        r1:  bufio.NewReader(r1),
        kf:  kf,
        gf:  gf,
        sep: '\n',
    }
}

And running it is mostly a bigger switch statement:

// Run starts reading from both readers. The process stops if one reader is
// exhausted or a read from either reader fails.
func (c *ZipRun) Run() error {
    var (
        k0, k1, c0, c1 string // key: k0, k1; current line: c0, c1
        done           bool
        err            error
        lineKey        = func(r *bufio.Reader) (line, key string, err error) {
            if line, err = r.ReadString(c.sep); err != nil {
                return
            }
            key, err = c.kf(line)
            return
        }
    )
    for {
        if done {
            break
        }
        switch {
        case k0 == "" || k0 < k1: ...
        case k1 == "" || k0 > k1: ...
        case k0 == k1: ...
        }
    }
    return nil
}
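
The elided cases follow the standard pattern for merging two key-sorted streams: advance the side with the smaller key, and on equal keys collect the whole group from both sides before invoking the callback. A self-contained sketch of that logic over two already sorted in-memory slices (not the actual skate implementation, error handling omitted):

// mergeGroups illustrates the logic behind the switch in Run, using the
// Group type from above: both inputs are sorted by key, the side with the
// smaller key is advanced, and on equal keys all records sharing that key
// are collected from both sides into one Group.
func mergeGroups(a, b []string, kf func(string) string, gf func(Group)) {
    var i, j int
    for i < len(a) && j < len(b) {
        ka, kb := kf(a[i]), kf(b[j])
        switch {
        case ka < kb:
            i++ // a is behind, advance a
        case ka > kb:
            j++ // b is behind, advance b
        default: // keys match, collect the whole group from both sides
            g := Group{Key: ka}
            for i < len(a) && kf(a[i]) == ka {
                g.G0 = append(g.G0, a[i])
                i++
            }
            for j < len(b) && kf(b[j]) == ka {
                g.G1 = append(g.G1, b[j])
                j++
            }
            gf(g)
        }
    }
}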

Final chapter: Fuzzy matching

If we do not have identifiers, we may still be able to generate a key of sorts.

  • e.g. from normalized titles

A small command line tool:

  • skate-derive-key does what the above "tr - jq - sed ..." pipeline did, but it computes some "key"

From there, the rest of the pipeline looks the same. There is an important verification step to identify false positives (it is done in the grouper).
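
A hedged example of how such a key could be derived (this is not the actual skate-derive-key algorithm, just one plausible normalization):

package main

import (
    "fmt"
    "strings"
    "unicode"
)

// titleKey derives a fuzzy matching key from a title: lowercase it and keep
// only letters and digits, so differences in punctuation, spacing and casing
// collapse onto the same key.
func titleKey(title string) string {
    var b strings.Builder
    for _, r := range strings.ToLower(title) {
        if unicode.IsLetter(r) || unicode.IsDigit(r) {
            b.WriteRune(r)
        }
    }
    return b.String()
}

func main() {
    fmt.Println(titleKey("Motion-based motion deblurring"))  // motionbasedmotiondeblurring
    fmt.Println(titleKey("Motion Based Motion Deblurring.")) // motionbasedmotiondeblurring
}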

Performance

For fuzzy matching, we identify around 40M clusters (docs that map to the same key). I haven't benchmarked zipkey itself and did not bother to make it parallel, because it was fast enough as it is.

Future

  • hope to open source these tools in Q2 2021
  • include graph data into scholar.archive.org, a search engine for scholarly documents