Skip to content

Instantly share code, notes, and snippets.

@christoofar
Last active April 21, 2024 23:44
Show Gist options
  • Save christoofar/29e8a7edda716642c11934dfba170c3c to your computer and use it in GitHub Desktop.
Save christoofar/29e8a7edda716642c11934dfba170c3c to your computer and use it in GitHub Desktop.
I seriously dislike `io.Reader`, `io.ReadCloser`, `io.Writer`...

TLDR version: People who were obsessed* about simplifying things created a coding pattern that is overtly complex to do a chore that is brain-dead simple.

* Developers drowning in object taxonomies, mainly.

So, anyway...

One thing that really grinds my gears while writing safexz is the ByteReader antipattern. Consider this: image

IIRC this pattern first appeared in Smalltalk or Objective-C then found its way over to Java into (what I call) a Nastypattern™:

import java.io.IOException;

public interface ByteReader {
    byte[] readBytes() throws IOException;
}

public class SimpleByteReader implements ByteReader {
    private byte[] source;
    private int currentPosition = 0;

    public SimpleByteReader(byte[] source) {
        this.source = source;
    }

    @Override
    public byte[] readBytes() throws IOException {
        if (currentPosition >= source.length) {
            return null; // EOF
        }
        byte[] bytesRead = new byte[1024]; // Or any other size buffer
        int bytesToRead = Math.min(bytesRead.length, source.length - currentPosition);
        System.arraycopy(source, currentPosition, bytesRead, 0, bytesToRead);
        currentPosition += bytesToRead;
        return bytesRead;
    }
}

This is just maddening to understand at first glance. And what about this thing is "simple"? For starters: the line calling Math.min which is trying to find the safe point to read (realistically, it's a write) from. I really like how Lasse Collin (of liblzma fame) approaches this pattern. He gives you a C macro that sets up a data structure for you to work with, which you can call like this:

image

What's in? lzma_stream? This:

image

Granted, there are more internal fields on this structure (and that's where the triple indirect pointer to the XZ backdoor was located, which was placed there during _init, but these simple fields at the top are all that you use to move a stream through the compress/decompress cycle. It's much easier for me to understand.

In a stream processor you generally have some block of working storage to do "work" and you have an input tray and an output tray, like a desk. Things come into the IN tray, you work on them, and you put the results in the OUT tray. The only things that matter are the pointers (and in the case of byte buffers, it's just an index indicating now much free space is inside the input tray), directing whoever is feeding stuff into the IN tray if it's safe to add more work, similary a pointer (index) is given to whoever is taking things from the OUT tray.

You don't even need the total_in and total_out fields either. Those are just there for convenience in case you're running this code on something weird that doesn't give you enough registers to track your progress.

Think of it like this... how did big institutions like the IRS manage to process millions of taxpayer records with almost no database technology, on computers with less than 2MB of RAM?

image

The data structure Lasse Collin wrote gives you an idea how. It wasn't until the very late 1970s that there was enough memory inside computers to store a significant amount of state. The only code that people wrote were stream processing jobs. You would collect data from the environment with sensors or humans tapping away at punch card machines, and magtape was the only medium that had a significant storage size. A 6250-bpi reel of tape, when formatted, could hold about 180MB of data:

image

Since XML and JSON did not exist yet, files were written with positional records. Here's an example:

05200MARY               JOHNSTON               2159091212     041677
061001234 MARKET ST          PHILA                 PA         19107
06325DIST    A         BAL 00030025
06326LSTR  111488    CR   00001645

The first 5 bytes denotes a record type. Somewhere, someone decided how many bytes {FIRSTNAME} gets and its position on the 05200 - CUSTOMER NAME PHONE BDAY record. Each one of these lines used to be punched on to cards, and when video was cheap enough to adopt in the 1970s, clerks would not have to look at a record list or set a punch card machine to output the data in the correct position on a record. Convenient data entry forms were made to key the information into the files a speedy way.

Your streamer program you might be writing is going to upsert into these records, mixed together like this. This ancient video from 1981 describes how that was done.

Computers shared data by reading from copies of these tapes mailed all over the globe. It was already known that tapes were mostly storing 00, so to create tapes where most of the bytes on it had sigificant values, they would create packed and unpacked versions. Packing is just this:

05200MARY*JOHNSTON*2159091212*041677
061001234 MARKET ST*PHILA*PA*19107
06325DIST*A**BAL 00030025
06326LSTR*111488*CR***00001645

But for data like CITYNAME, STATENAME, this data would still repeat like mad throughout the dataset as well as any other significant data that was repetitive in nature. (Fun fact: this exact stlye of packing flat files is what EDI is, which is still used today and it's pushed over TLS connections instead of huge reels of tape).

That's were the origins of DEFLATE up to LZMA come from. Positional recordkeeping generates tons of whitespace and repetitve data, lending itself very well to compaction. Instead of mailing a hefty 14 tapes of bank account information from one bank to another, you could feed all the master record tapes through an LZ77 job to pack everything into one tape and mail it for less than $5.

People like xz in particular because it's great at finding any arbitary binary patterns (think about all the 7-bit ASCII text out there being stored on computers today as a double-byte, leaving huge amounts of 00 in the data! or 32-bit machine code that leaves out lots of high opcodes, hence more 00), and text is only a subset of binary encoding, so programs and machine data that is repetitve would also be found and packed, not just the whitespace.

Go channels should have been the default pattern

Let's take a look at what I did to call lzma directly:

		go func(receive chan []byte, sender chan []byte) {
			decompressChanStream(receive, sender)
		}(input, output)

and here is how decompressChanStream is defined:

func decompressChanStream(in <-chan []byte, out chan<- []byte) {

Essentially, the in channel can be thought of as a computer tape that's loaded full of LZMA-compressed data (in reality this source can be from anywhere). When I start the job, decompressed data will begin to emerge from the out channel. When the input tape runs out, close()ing the input channel is the signal Lasse's liblzma needs to know the input is completed and there will be no more data coming, so it's time to wrap it up and finish the last writes.

With this simple pattern you can then do whatever you'd like with the output stream. Maybe you need to filter and disperse individual records, or broadcast the data, or put it on the screen or into a file. I don't have to care what you do with what comes out of it.

Data will continue to come out of the out channel until LZMA's work is done, then it will close the out channel.

By ranging over the out channel like this: image

I can now copy all the bytes coming out into a file (or to anywhere). I can come back to this code 30 years later and still understand it and not need any comments, because the physical act of connecting inputs and outputs is a basic human understanding, and it's a foundation pattern of stream processing. A foundation that's existed since the ENIAC, which was also a stream-processing computer.

I hate the way Java does this, and C# is similarly nasty. The only thing ByteReader and ByteWriter patterns do is encourage top-level coders to write surfaces over them to hide them from users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment