miku/.gitignore

## .gitignore
golangleipzig.cdx
golangleipzig.warc.gz

## README.md

      
    Raw
  

              README.md
            
          
    Seeking Data


Lightning Talk 2021-05-25 19:00 CEST, Leipzig Gophers #18, martin.czygan@gmail.com

A revelation


many years ago, I would write data
processing tools and they were kind of slow


Probably many reasons, but mainly:

I did not have a good baseline performance number

Somehow, you can get trapped into layers of software, which can distract from
the underlying I/O or data issues.

What follows are a few notions and techniques when I found useful over the
years when working with data or writing tools

Data plays a role

Peter
Norvig
(2008), emphasizing data over algorithms:

The key here is that no matter how agile you are as coders, and I understand
that you’re all great, data is going to be more agile than code. Because
you’ve got to write the code yourself, but the data you can leverage

Personal comment: Sometimes, data gets buried under layers of code; and while
code is essential, try to not bury the data too deep.
Some counter examples I came across:

writing tens of thousands of lines of code to basically manage a dataset of
50MB (queries took multiple seconds) - 50MB fit nicely into RAM; you could
iterate over all the records all the time, even - it would probably be no
problem
introducing layers of abstractions where a single function would have done
the job - we always want generic things; Go helped me to not fear not
having the most generic code right from the start (e.g. via structural typing)

Know your latencies

A classic table:

https://gist.github.com/hellerbarde/2843375


Latency numbers every programmer should know


Timing everything

At times I timed everything, e.g. using:
$ time pcstat file.gz
...

real    0m0.006s
user    0m0.001s
sys     0m0.004s

Which brings to mind rule 2:

Rule 2. Measure. Don't tune for speed until you've measured, and even then
don't unless one part of the code overwhelms the rest.


useful for long running processer or alternative implementations

Programming like it is 1975

The operating system (e.g. Linux) is your friend.

So what’s wrong with 1975 programming?


The really short answer is that computers do not have two kinds of storage any more.


[...]


And people program this way: They have variables in “memory” and move data to
and from “disk”.

The author goes on and compares the approaches of squid and varnish (used as cache
by millions of sites).

Varnish doesn’t really try to control what is cached in RAM and what is not,
the kernel has code and hardware support to do a good job at that, and it
does a good job.

Personal project:

we used a persistent version of memcached, called
memcachedb, which used e.g.
BerkeleyDB - an embedded key value
store (redis vs mc)
we would have to insert 150M objects into memcachedb regularly, and it would take around 12h or more

We only really needed an HTTP endpoint serving blobs given a key. The data was
available in a single file anyway, so I wrote
microblob - which allows you to serve data
from a single file over HTTP.
Especially new-line delimited JSON.
$ ls -lah
total 166G
drwxr-xr-x.  3 daemon daemon 4.0K Apr 15 11:54 .
drwxr-xr-x. 23 root   root   4.0K Mar  3  2017 ..
-rw-r--r--.  1 daemon daemon 166G Apr 28 12:57 date-2021-04-12.ldj
drwxr-xr-x.  2 daemon daemon  48K Apr 28 12:56 date-2021-04-12.ldj.28ed2061.db

$ du -hs date-2021-04-12.ldj.28ed2061.db/
2.8G    date-2021-04-12.ldj.28ed2061.db/

$ ls -lah date-2021-04-12.ldj.28ed2061.db | head
total 2.8G
drwxr-xr-x. 2 daemon daemon   48K Apr 28 12:56 .
drwxr-xr-x. 3 daemon daemon  4.0K Apr 15 11:54 ..
-rw-r--r--. 1 daemon daemon  553K Apr 15 13:07 030743.ldb
-rw-r--r--. 1 daemon daemon  2.1M Apr 15 14:05 057194.ldb
-rw-r--r--. 1 daemon daemon  2.1M Apr 15 14:11 059944.ldb
-rw-r--r--. 1 daemon daemon  2.1M Apr 15 14:11 059945.ldb
-rw-r--r--. 1 daemon daemon  2.1M Apr 15 14:11 059947.ldb
-rw-r--r--. 1 daemon daemon  2.1M Apr 15 14:11 059948.ldb
-rw-r--r--. 1 daemon daemon  2.1M Apr 15 14:11 059949.ldb

Microblob uses a Go implementation of LevelDB for its key to (offset, length)
mapping. The index overhead is less than 2%.


But how to know how well the OS caches a file?

Enter mincore system
call (since Linux 2.3, with varying
semantics), which

returns a vector that indicates whether pages of the calling process's
virtual memory are resident in core (RAM), and so will not cause a disk
access (page fault) if referenced.

Command line tools:

fincore
pcstat (page cache status)

Example:
$ pcstat date-2021-04-12.ldj
+---------------------+----------------+------------+-----------+---------+
| Name                | Size (bytes)   | Pages      | Cached    | Percent |
|---------------------+----------------+------------+-----------+---------|
| date-2021-04-12.ldj | 177924127318   | 43438508   | 27429139  | 063.145 |
+---------------------+----------------+------------+-----------+---------+

$ free -g
              total        used        free      shared  buff/cache   available
Mem:            125           1           0           0         123         124
Swap:             1           0           1

Plus: Almost the complete index is in memory as well.
In summary: 63% of the file could be served w/o disk access. No explicit code
to move data between disk and RAM, just seek and read. We can trade memory for
access time, basically scale up and down.
Insert time (or index time) went from 12+ to 1+ hour.
Less than 800 lines of code including config file handling, live stats (expvar
and more), debug backend.
There is limited update handling: You can only append, but there is no
compaction implemented.
Large Scale Data at Rest


An interesting way to cope with web-scale data.

WARC (web archive) format

You can create these yourself, with wget since 1.14.
$ wget --mirror --warc-cdx golangleipzig --warc-file golangleipzig https://golangleipzig.space/

This wraps a site into a single compressed archive and will generate an index file as well.
$ ls -lah golang*
-rw-rw-r-- 1 tir tir  19K May 25 14:25 golangleipzig.cdx
-rw-rw-r-- 1 tir tir 6.2M May 25 14:25 golangleipzig.warc.gz

The index is a line oriented file with a couple of fields:
 CDX a b a m s k r M V g u
https://golangleipzig.space/ 20210525122546 https://golangleipzig.space/ text/html 200 ...
https://golangleipzig.space/apple-touch-icon.png 20210525122546 https://golangleipzig.s...
https://golangleipzig.space/favicon-32x32.png 20210525122546 https://golangleipzig.spac...
https://golangleipzig.space/favicon-16x16.png 20210525122546 https://golangleipzig.spac...
https://golangleipzig.space/site.webmanifest 20210525122546 https://golangleipzig.space...

Depending on the number of sites you archive, this index can become quite large; possibly many TB.
How to lookup such an index? A couple of ideas:

sort content in the index (e.g. by URL)
use another level of indirection, and create an index for the index
make use of the fact, that a multiple gzip files can be concatenated

Back of the envelope calculation:

Sorting is expensive, but a one-time operation. Assuming we can get an index to
be 0.1% of the data set size and we add two levels, then per PB, we could end
up with 0.01% index sizes, or 100G index size per PB.
The indices are sorted, so given e.g. 1M entries in a secondary index, it
should not take more than 20 seeks (binary search) to find the chunk of the first index. OS
caching applies here as well, as other methods to reduce this number by keeping
parts of this index in memory.
The read from the larger index is sequential again, seek and read.
Here, we exploit the streaming nature of gzip (other algorithms work as well,
e.g. zstd).

GZIP(slice 1) + GZIP(slice 2) + ... + GZIP(slice N) ~ GZIP(slice 1 + slice 2 + ...)

Summary: Large scale data lookups with common access patterns can be made
feasible with limited resources. Official name (in the context of WARCs):
zipnum.
Additional notes:

The earliest search algorithm — binary search — was first
mentioned by
John Mauchly more than six decades ago, 25 years before the advent of
relational databases [Mau46, Cod70, Knu98]


When only considering the number of comparisons needed to find akey inside an
index, binary search on a sorted array is the optimal searchalgorithm,
provably requiring the minimal possible number of comparisons.

But also: What's wrong with binary search
Making the unseekable seekable

A new stargz is born.
Recently, I came across CRFS:

CRFS is a read-only FUSE filesystem that lets you mount a container image,
served directly from a container registry

It also exploits the streaming nature of gzip in order to reduce I/O and
improve performance. Originally
built for Go test and build infrastructure.

A follow-up project is extended stargz, or
estargz.


Standard-Compatible Extensions to Tar.gz Layers for Lazy Pulling Container Images

Key observation:

76% of container start time is pulling data
only 6.4% of the image data is read

And a couple of solutions already exist for this problem (apart from minimizing
or caching images).
The basic idea is to combine various techniques to improve performance:

make tar or tar.gz files seekable, while keeping them compatible
use a plugin in you cloud (containerd plugin, stargz-snapshotter)
use HTTP Range Requests RFC7233
use a local profiler sandbox to record, which files are accessed by a particular workload

More on this:

FOSDEM21: Build and Run Containers With Lazy
Pulling
OCI spec proposal: opencontainers/image-spec#815

Wrap Up


there is much more to data storage and access, we may cover in later meetups
(this year we wanted to look into various data storage systems, e.g. etcd,
and classic approaches from
https://www.databass.dev/)

Some aspects not included:

caching
compression tradeoffs
search techniques (a find: index/suffixarray) - (note to self: use case 100K patterns matched at once)
query languages
parallelizable workloads
...


## envelope.jpg

      
    Raw
  

              envelope.jpg
            
          
## envelope_30.jpg

      
    Raw
  

              envelope_30.jpg
            
          
## envelope_50.jpg

      
    Raw
  

              envelope_50.jpg
            
          
## floppy_icon.gif

      
    Raw
  

              floppy_icon.gif
            
          
## latency.png

      
    Raw
  

              latency.png
            
          
## Makefile
SHELL := /bin/bash

golangleipzig.warc.gz:
	wget --content-on-error --mirror --warc-cdx golangleipzig --warc-file golangleipzig --no-warc-keep-log https://golangleipzig.space/ > /dev/null || true
	# We only want the WARC file.
	rm -f index.html
	rm -rf golangleipzig.space

.PHONY: clean
clean:
	rm -f golangleipzig.warc.gz
	rm -f golangleipzig.cdx
	rm -f index.html
	rm -rf golangleipzig.space


## microblob.png

      
    Raw
  

              microblob.png
            
          
## seek.gif

      
    Raw
  

              seek.gif
            
          
## stargz.png

      
    Raw
  

              stargz.png
            
          
## crfs.png

      
    Raw
  

              crfs.png
	SHELL := /bin/bash

	golangleipzig.warc.gz:
	wget --content-on-error --mirror --warc-cdx golangleipzig --warc-file golangleipzig --no-warc-keep-log https://golangleipzig.space/ > /dev/null \|\| true
	# We only want the WARC file.
	rm -f index.html
	rm -rf golangleipzig.space

	.PHONY: clean
	clean:
	rm -f golangleipzig.warc.gz
	rm -f golangleipzig.cdx
	rm -f index.html
	rm -rf golangleipzig.space