- many years ago, I would write data processing tools and they were kind of slow
Probably many reasons, but mainly:
- I did not have a good baseline performance number
Somehow, you can get trapped into layers of software, which can distract from the underlying I/O or data issues.
What follows are a few notions and techniques that I found useful over the years when working with data or writing tools.
Data plays a role
The key here is that no matter how agile you are as a coder, and I understand that you’re all great, data is going to be more agile than code. You have to write the code yourself, but the data you can leverage.
Personal comment: Sometimes, data gets buried under layers of code; and while code is essential, try to not bury the data too deep.
Some counter examples I came across:
- writing tens of thousands of lines of code to basically manage a dataset of 50MB (queries took multiple seconds) - 50MB fits nicely into RAM; you could iterate over all the records all the time, even - it would probably be no problem
- introducing layers of abstraction where a single function would have done the job - we always want generic things; Go helped me not to fear less-than-generic code right from the start (e.g. via structural typing)
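To illustrate the structural typing point: you can start with a concrete type and introduce an interface only when a second implementation actually appears, since any type with the right method set satisfies the interface implicitly. A small hypothetical sketch (the names `Record`, `Matcher`, and `Filter` are made up for illustration):

```go
package main

import (
	"fmt"
	"strings"
)

// Record is a concrete type; we start without any abstraction.
type Record struct {
	ID   string
	Data string
}

// Matcher is defined at the point of use; any type with a Match
// method satisfies it implicitly (structural typing), so the
// abstraction can be introduced later, only when needed.
type Matcher interface {
	Match(r Record) bool
}

// SubstringMatcher is one concrete implementation.
type SubstringMatcher struct{ Term string }

func (m SubstringMatcher) Match(r Record) bool {
	return strings.Contains(r.Data, m.Term)
}

// Filter works against the small interface, not a concrete type.
func Filter(records []Record, m Matcher) (out []Record) {
	for _, r := range records {
		if m.Match(r) {
			out = append(out, r)
		}
	}
	return
}

func main() {
	records := []Record{{"1", "hello data"}, {"2", "hello code"}}
	fmt.Println(Filter(records, SubstringMatcher{Term: "data"}))
}
```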
Know your latencies
A classic table:
Latency numbers every programmer should know
At times I timed everything, e.g. using:
$ time pcstat file.gz
...

real    0m0.006s
user    0m0.001s
sys     0m0.004s
Which brings to mind rule 2:
Rule 2. Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.
- useful for long-running processes or for comparing alternative implementations
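Inside a Go program, the same measuring habit can be sketched with the standard library: `time.Since` for one-off wall-clock timings, and `testing.Benchmark` for comparing alternative implementations (the helper `timed` is a made-up name):

```go
package main

import (
	"fmt"
	"strings"
	"testing"
	"time"
)

// timed runs f once and returns the elapsed wall-clock time,
// similar to prefixing a command with `time` on the shell.
func timed(f func()) time.Duration {
	start := time.Now()
	f()
	return time.Since(start)
}

func main() {
	// One-off measurement of a single run.
	d := timed(func() { strings.Repeat("x", 1<<20) })
	fmt.Println("single run:", d)

	// For comparing alternative implementations, testing.Benchmark
	// calls the function repeatedly until timings stabilize.
	r := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			strings.Repeat("x", 1<<10)
		}
	})
	fmt.Println("per op:", r.NsPerOp(), "ns")
}
```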
Programming like it is 1975
The operating system (e.g. Linux) is your friend.
The really short answer is that computers do not have two kinds of storage any more.
And people program this way: They have variables in “memory” and move data to and from “disk”.
The author (Poul-Henning Kamp) goes on to compare the approaches of Squid and Varnish (used as a cache by millions of sites).
Varnish doesn’t really try to control what is cached in RAM and what is not, the kernel has code and hardware support to do a good job at that, and it does a good job.
- we used a persistent version of memcached called memcachedb, which used BerkeleyDB, an embedded key-value store, under the hood (redis vs mc)
- we would have to insert 150M objects into memcachedb regularly, and it would take around 12 hours or more
We only really needed an HTTP endpoint serving blobs given a key. The data was available in a single file anyway, so I wrote microblob - which allows you to serve data from a single file over HTTP.
Especially newline-delimited JSON.
$ ls -lah
total 166G
drwxr-xr-x.  3 daemon daemon 4.0K Apr 15 11:54 .
drwxr-xr-x. 23 root   root   4.0K Mar  3  2017 ..
-rw-r--r--.  1 daemon daemon 166G Apr 28 12:57 date-2021-04-12.ldj
drwxr-xr-x.  2 daemon daemon  48K Apr 28 12:56 date-2021-04-12.ldj.28ed2061.db

$ du -hs date-2021-04-12.ldj.28ed2061.db/
2.8G    date-2021-04-12.ldj.28ed2061.db/

$ ls -lah date-2021-04-12.ldj.28ed2061.db | head
total 2.8G
drwxr-xr-x. 2 daemon daemon  48K Apr 28 12:56 .
drwxr-xr-x. 3 daemon daemon 4.0K Apr 15 11:54 ..
-rw-r--r--. 1 daemon daemon 553K Apr 15 13:07 030743.ldb
-rw-r--r--. 1 daemon daemon 2.1M Apr 15 14:05 057194.ldb
-rw-r--r--. 1 daemon daemon 2.1M Apr 15 14:11 059944.ldb
-rw-r--r--. 1 daemon daemon 2.1M Apr 15 14:11 059945.ldb
-rw-r--r--. 1 daemon daemon 2.1M Apr 15 14:11 059947.ldb
-rw-r--r--. 1 daemon daemon 2.1M Apr 15 14:11 059948.ldb
-rw-r--r--. 1 daemon daemon 2.1M Apr 15 14:11 059949.ldb
Microblob uses a Go implementation of LevelDB for its key to (offset, length) mapping. The index overhead is less than 2%.
- But how do we know how well the OS caches a file?
The mincore(2) system call returns a vector that indicates whether pages of the calling process's virtual memory are resident in core (RAM), and so will not cause a disk access (page fault) if referenced.
Command line tools:
$ pcstat date-2021-04-12.ldj
+---------------------+----------------+------------+-----------+---------+
| Name                | Size (bytes)   | Pages      | Cached    | Percent |
|---------------------+----------------+------------+-----------+---------|
| date-2021-04-12.ldj | 177924127318   | 43438508   | 27429139  | 063.145 |
+---------------------+----------------+------------+-----------+---------+

$ free -g
              total        used        free      shared  buff/cache   available
Mem:            125           1           0           0         123         124
Swap:             1           0           1
Plus: Almost the complete index is in memory as well.
In summary: 63% of the file could be served without disk access. No explicit code moves data between disk and RAM; we just seek and read. We can trade memory for access time, and basically scale up and down.
Insert time (or index time) went from 12+ hours to a bit over 1 hour.
Less than 800 lines of code including config file handling, live stats (expvar and more), debug backend.
Update handling is limited: you can only append; there is no compaction implemented.
Large Scale Data at Rest
An interesting way to cope with web-scale data.
- WARC (web archive) format
You can create these yourself with wget (since version 1.14).
$ wget --mirror --warc-cdx golangleipzig --warc-file golangleipzig https://golangleipzig.space/
This wraps a site into a single compressed archive and will generate an index file as well.
$ ls -lah golang*
-rw-rw-r-- 1 tir tir  19K May 25 14:25 golangleipzig.cdx
-rw-rw-r-- 1 tir tir 6.2M May 25 14:25 golangleipzig.warc.gz
The index is a line-oriented file with a couple of fields per entry:
 CDX a b a m s k r M V g u
https://golangleipzig.space/ 20210525122546 https://golangleipzig.space/ text/html 200 ...
https://golangleipzig.space/apple-touch-icon.png 20210525122546 https://golangleipzig.s...
https://golangleipzig.space/favicon-32x32.png 20210525122546 https://golangleipzig.spac...
https://golangleipzig.space/favicon-16x16.png 20210525122546 https://golangleipzig.spac...
https://golangleipzig.space/site.webmanifest 20210525122546 https://golangleipzig.space...
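Since the index is just space-separated lines, parsing it is straightforward. A hedged sketch (the type `CDXRecord`, the function name, and the assumed field order of url, timestamp, original, mimetype, status follow the sample above; real CDX files declare their layout in the header line):

```go
package main

import (
	"fmt"
	"strings"
)

// CDXRecord holds a few fields of one CDX index line. The field
// layout here is an assumption based on the sample; consult the
// header line of a real CDX file for the actual layout.
type CDXRecord struct {
	URL       string
	Timestamp string
	Original  string
	MimeType  string
	Status    string
}

// parseCDXLine splits a space-separated CDX line; minimal sketch,
// no validation beyond a field count check.
func parseCDXLine(line string) (CDXRecord, error) {
	fields := strings.Fields(line)
	if len(fields) < 5 {
		return CDXRecord{}, fmt.Errorf("short CDX line: %q", line)
	}
	return CDXRecord{
		URL:       fields[0],
		Timestamp: fields[1],
		Original:  fields[2],
		MimeType:  fields[3],
		Status:    fields[4],
	}, nil
}

func main() {
	line := "https://golangleipzig.space/ 20210525122546 https://golangleipzig.space/ text/html 200 -"
	r, err := parseCDXLine(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(r.URL, r.Timestamp, r.Status)
}
```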
Depending on the number of sites you archive, this index can become quite large; possibly many TB.
How to lookup such an index? A couple of ideas:
- sort content in the index (e.g. by URL)
- use another level of indirection, and create an index for the index
- make use of the fact that multiple gzip files can be concatenated
Back of the envelope calculation:
Sorting is expensive, but a one-time operation. Assuming we can get an index down to 0.1% of the dataset size, that is 1TB of index per PB of data; a second level of indirection at 0.1% of that shrinks the part we need to search first to roughly 1GB per PB.
The indices are sorted, so given e.g. 1M entries in a secondary index, it should not take more than 20 seeks (binary search) to find the right chunk of the first index. OS caching applies here as well, as do other methods that reduce this number by keeping parts of the index in memory.
The read from the larger index is sequential again, seek and read.
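The two-level lookup described above can be sketched with `sort.Search` from the standard library. Assume the secondary index is a sorted in-memory slice mapping the first key of each chunk of the big on-disk index to that chunk's byte offset (the type `ChunkRef` and the function `findChunk` are made-up names for this sketch):

```go
package main

import (
	"fmt"
	"sort"
)

// ChunkRef points from the first key of a chunk of the (large,
// sorted) primary index to that chunk's byte offset on disk.
type ChunkRef struct {
	FirstKey string
	Offset   int64
}

// findChunk returns the chunk that may contain key: the last chunk
// whose FirstKey is <= key. refs must be sorted by FirstKey; the
// binary search needs about log2(len(refs)) comparisons.
func findChunk(refs []ChunkRef, key string) (ChunkRef, bool) {
	// Index of the first chunk with FirstKey > key.
	i := sort.Search(len(refs), func(i int) bool {
		return refs[i].FirstKey > key
	})
	if i == 0 {
		return ChunkRef{}, false // key sorts before all chunks
	}
	return refs[i-1], true
}

func main() {
	refs := []ChunkRef{
		{"https://a.example/", 0},
		{"https://m.example/", 1 << 20},
		{"https://t.example/", 2 << 20},
	}
	if c, ok := findChunk(refs, "https://p.example/"); ok {
		// Next step: seek to c.Offset in the primary index and
		// scan that chunk sequentially.
		fmt.Println("chunk at offset", c.Offset)
	}
}
```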
Here, we exploit the streaming nature of gzip (other algorithms work as well, e.g. zstd).
- GZIP(slice 1) + GZIP(slice 2) + ... + GZIP(slice N) ~ GZIP(slice 1 + slice 2 + ...)
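This concatenation property is easy to verify with Go's `compress/gzip`, whose reader transparently handles multiple gzip members in one stream (the helper names `gz` and `gunzipAll` are made up for this sketch):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gz compresses a slice independently, producing one gzip member.
func gz(p []byte) []byte {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(p)
	w.Close()
	return buf.Bytes()
}

// gunzipAll decompresses a stream of concatenated gzip members;
// gzip.Reader handles multiple members transparently by default
// (its "multistream" mode).
func gunzipAll(p []byte) ([]byte, error) {
	r, err := gzip.NewReader(bytes.NewReader(p))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	// GZIP(slice 1) + GZIP(slice 2) decompresses to slice 1 + slice 2.
	cat := append(gz([]byte("hello ")), gz([]byte("world"))...)
	out, err := gunzipAll(cat)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", out) // hello world
}
```

Because each slice is an independent gzip member, a reader can seek directly to a chunk boundary and start decompressing there, without touching the preceding data.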
Summary: Large scale data lookups with common access patterns can be made feasible with limited resources. Official name (in the context of WARCs): zipnum.
The earliest search algorithm — binary search — was first mentioned by John Mauchly more than six decades ago, 25 years before the advent of relational databases [Mau46, Cod70, Knu98]
When only considering the number of comparisons needed to find a key inside an index, binary search on a sorted array is the optimal search algorithm, provably requiring the minimal possible number of comparisons.
But also: What's wrong with binary search
Making the unseekable seekable
A new stargz is born.
Recently, I came across CRFS:
CRFS is a read-only FUSE filesystem that lets you mount a container image, served directly from a container registry
It also exploits the streaming nature of gzip in order to reduce I/O and improve performance. Originally built for Go test and build infrastructure.
A follow-up project is extended stargz, or estargz.
Standard-Compatible Extensions to Tar.gz Layers for Lazy Pulling Container Images
- 76% of container start time is pulling data
- only 6.4% of the image data is read
A couple of solutions already exist for this problem (apart from minimizing or caching images).
The basic idea is to combine various techniques to improve performance:
- make tar or tar.gz files seekable, while keeping them compatible
- use a plugin in your container runtime (containerd plugin, stargz-snapshotter)
- use HTTP range requests (RFC 7233)
- use a local profiler sandbox to record which files are accessed by a particular workload
- FOSDEM21: Build and Run Containers With Lazy Pulling
- OCI spec proposal: opencontainers/image-spec#815
- there is much more to data storage and access, which we may cover in later meetups (this year we wanted to look into various data storage systems, e.g. etcd, and classic approaches from https://www.databass.dev/)
Some aspects not included:
- compression tradeoffs
- search techniques (a find: index/suffixarray) - (note to self: use case 100K patterns matched at once)
- query languages
- parallelizable workloads