Skip to content

Instantly share code, notes, and snippets.

Last active Jan 17, 2022
What would you like to do?
Lightning talk, Notes on Storage

Seeking Data

Lightning Talk 2021-05-25 19:00 CEST, Leipzig Gophers #18,

A revelation

  • many years ago, I would write data processing tools and they were kind of slow

Probably many reasons, but mainly:

  • I did not have a good baseline performance number

Somehow, you can get trapped into layers of software, which can distract from the underlying I/O or data issues.

What follows are a few notions and techniques when I found useful over the years when working with data or writing tools

Data plays a role

Peter Norvig (2008), emphasizing data over algorithms:

The key here is that no matter how agile you are as coders, and I understand that you’re all great, data is going to be more agile than code. Because you’ve got to write the code yourself, but the data you can leverage

Personal comment: Sometimes, data gets buried under layers of code; and while code is essential, try to not bury the data too deep.

Some counter examples I came across:

  • writing tens of thousands of lines of code to basically manage a dataset of 50MB (queries took multiple seconds) - 50MB fit nicely into RAM; you could iterate over all the records all the time, even - it would probably be no problem
  • introducing layers of abstractions where a single function would have done the job - we always want generic things; Go helped me to not fear not having the most generic code right from the start (e.g. via structural typing)

Know your latencies

A classic table:

Latency numbers every programmer should know

Timing everything

At times I timed everything, e.g. using:

$ time pcstat file.gz

real    0m0.006s
user    0m0.001s
sys     0m0.004s

Which brings to mind rule 2:

Rule 2. Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.

  • useful for long running processer or alternative implementations

Programming like it is 1975

The operating system (e.g. Linux) is your friend.

The really short answer is that computers do not have two kinds of storage any more.


And people program this way: They have variables in “memory” and move data to and from “disk”.

The author goes on and compares the approaches of squid and varnish (used as cache by millions of sites).

Varnish doesn’t really try to control what is cached in RAM and what is not, the kernel has code and hardware support to do a good job at that, and it does a good job.

Personal project:

  • we used a persistent version of memcached, called memcachedb, which used e.g. BerkeleyDB - an embedded key value store (redis vs mc)
  • we would have to insert 150M objects into memcachedb regularly, and it would take around 12h or more

We only really needed an HTTP endpoint serving blobs given a key. The data was available in a single file anyway, so I wrote microblob - which allows you to serve data from a single file over HTTP.

Especially new-line delimited JSON.

$ ls -lah
total 166G
drwxr-xr-x.  3 daemon daemon 4.0K Apr 15 11:54 .
drwxr-xr-x. 23 root   root   4.0K Mar  3  2017 ..
-rw-r--r--.  1 daemon daemon 166G Apr 28 12:57 date-2021-04-12.ldj
drwxr-xr-x.  2 daemon daemon  48K Apr 28 12:56 date-2021-04-12.ldj.28ed2061.db

$ du -hs date-2021-04-12.ldj.28ed2061.db/
2.8G    date-2021-04-12.ldj.28ed2061.db/

$ ls -lah date-2021-04-12.ldj.28ed2061.db | head
total 2.8G
drwxr-xr-x. 2 daemon daemon   48K Apr 28 12:56 .
drwxr-xr-x. 3 daemon daemon  4.0K Apr 15 11:54 ..
-rw-r--r--. 1 daemon daemon  553K Apr 15 13:07 030743.ldb
-rw-r--r--. 1 daemon daemon  2.1M Apr 15 14:05 057194.ldb
-rw-r--r--. 1 daemon daemon  2.1M Apr 15 14:11 059944.ldb
-rw-r--r--. 1 daemon daemon  2.1M Apr 15 14:11 059945.ldb
-rw-r--r--. 1 daemon daemon  2.1M Apr 15 14:11 059947.ldb
-rw-r--r--. 1 daemon daemon  2.1M Apr 15 14:11 059948.ldb
-rw-r--r--. 1 daemon daemon  2.1M Apr 15 14:11 059949.ldb

Microblob uses a Go implementation of LevelDB for its key to (offset, length) mapping. The index overhead is less than 2%.

  • But how to know how well the OS caches a file?

Enter mincore system call (since Linux 2.3, with varying semantics), which

returns a vector that indicates whether pages of the calling process's virtual memory are resident in core (RAM), and so will not cause a disk access (page fault) if referenced.

Command line tools:


$ pcstat date-2021-04-12.ldj
| Name                | Size (bytes)   | Pages      | Cached    | Percent |
| date-2021-04-12.ldj | 177924127318   | 43438508   | 27429139  | 063.145 |

$ free -g
              total        used        free      shared  buff/cache   available
Mem:            125           1           0           0         123         124
Swap:             1           0           1

Plus: Almost the complete index is in memory as well.

In summary: 63% of the file could be served w/o disk access. No explicit code to move data between disk and RAM, just seek and read. We can trade memory for access time, basically scale up and down.

Insert time (or index time) went from 12+ to 1+ hour.

Less than 800 lines of code including config file handling, live stats (expvar and more), debug backend.

There is limited update handling: You can only append, but there is no compaction implemented.

Large Scale Data at Rest

An interesting way to cope with web-scale data.

  • WARC (web archive) format

You can create these yourself, with wget since 1.14.

$ wget --mirror --warc-cdx golangleipzig --warc-file golangleipzig

This wraps a site into a single compressed archive and will generate an index file as well.

$ ls -lah golang*
-rw-rw-r-- 1 tir tir  19K May 25 14:25 golangleipzig.cdx
-rw-rw-r-- 1 tir tir 6.2M May 25 14:25 golangleipzig.warc.gz

The index is a line oriented file with a couple of fields:

 CDX a b a m s k r M V g u 20210525122546 text/html 200 ... 20210525122546 https://golangleipzig.s... 20210525122546 https://golangleipzig.spac... 20210525122546 https://golangleipzig.spac... 20210525122546

Depending on the number of sites you archive, this index can become quite large; possibly many TB.

How to lookup such an index? A couple of ideas:

Back of the envelope calculation:

Sorting is expensive, but a one-time operation. Assuming we can get an index to be 0.1% of the data set size and we add two levels, then per PB, we could end up with 0.01% index sizes, or 100G index size per PB.

The indices are sorted, so given e.g. 1M entries in a secondary index, it should not take more than 20 seeks (binary search) to find the chunk of the first index. OS caching applies here as well, as other methods to reduce this number by keeping parts of this index in memory.

The read from the larger index is sequential again, seek and read.

Here, we exploit the streaming nature of gzip (other algorithms work as well, e.g. zstd).

  • GZIP(slice 1) + GZIP(slice 2) + ... + GZIP(slice N) ~ GZIP(slice 1 + slice 2 + ...)

Summary: Large scale data lookups with common access patterns can be made feasible with limited resources. Official name (in the context of WARCs): zipnum.

Additional notes:

The earliest search algorithm — binary search — was first mentioned by John Mauchly more than six decades ago, 25 years before the advent of relational databases [Mau46, Cod70, Knu98]

When only considering the number of comparisons needed to find akey inside an index, binary search on a sorted array is the optimal searchalgorithm, provably requiring the minimal possible number of comparisons.

But also: What's wrong with binary search

Making the unseekable seekable

A new stargz is born.

Recently, I came across CRFS:

CRFS is a read-only FUSE filesystem that lets you mount a container image, served directly from a container registry

It also exploits the streaming nature of gzip in order to reduce I/O and improve performance. Originally built for Go test and build infrastructure.

A follow-up project is extended stargz, or estargz.

Standard-Compatible Extensions to Tar.gz Layers for Lazy Pulling Container Images

Key observation:

  • 76% of container start time is pulling data
  • only 6.4% of the image data is read

And a couple of solutions already exist for this problem (apart from minimizing or caching images).

The basic idea is to combine various techniques to improve performance:

  • make tar or tar.gz files seekable, while keeping them compatible
  • use a plugin in you cloud (containerd plugin, stargz-snapshotter)
  • use HTTP Range Requests RFC7233
  • use a local profiler sandbox to record, which files are accessed by a particular workload

More on this:

Wrap Up

  • there is much more to data storage and access, we may cover in later meetups (this year we wanted to look into various data storage systems, e.g. etcd, and classic approaches from

Some aspects not included:

  • caching
  • compression tradeoffs
  • search techniques (a find: index/suffixarray) - (note to self: use case 100K patterns matched at once)
  • query languages
  • parallelizable workloads
  • ...
SHELL := /bin/bash
wget --content-on-error --mirror --warc-cdx golangleipzig --warc-file golangleipzig --no-warc-keep-log > /dev/null || true
# We only want the WARC file.
rm -f index.html
rm -rf
.PHONY: clean
rm -f golangleipzig.warc.gz
rm -f golangleipzig.cdx
rm -f index.html
rm -rf
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment