fletchmakes/pico8-spritesheet-data-compression.md Secret

## pico8-spritesheet-data-compression.md

      
    Raw
  

              pico8-spritesheet-data-compression.md
            
          
    PICO-8 Spritesheet Expansion and Data Compression

Introduction

This entire adventure started when I wanted a bit more than the 128x128 px spritesheet that PICO-8 allows us to have. Knowing that PICO-8 has general purpose RAM that we can use, I thought to myself: "What's stopping me from learning topics about data compression and storing spritesheets as a string, then during _init() loading them into RAM?" So that's what I set out to do.
Ultimately, using clever and timely memcpy() calls, we should be able to support up to 5 spritesheets (1 in cartridge data, 4 in the region 0x8000-0xffff) in PICO-8 with minimal runtime CPU resources. The tradeoff is that we'll use the cartridge's character count and compressed size to help store the spritesheets, so we'll want to try and compress the data as tightly as possible (lossless).
Know Your Limits

There are some things we should keep in mind before we dive into the really sweaty topics of data compression:

PICO-8 has 16^* colors, represented in hex by 0x0 -> 0xf. That means each color takes up 4 bits of space. [1]
PICO-8's spritesheet is 128x128 pixels, which means we need to store 8192 bytes of data as compressed as possible. [2]
The character set that PICO-8 uses, P8SCII, has 240 printable characters and 16 control codes. For our purposes, we'll only use the 240 printable characters since we are trying to copy our spritesheets into a string. [3]
Since P8SCII has 256 characters, we can identify each byte of information (8 bits) as a single character. This will be extremely useful for us, since it is functionally the same as a base256 encoding.

^*I know there are more colors, but for our purposes we ignore the secret palette
RLE Compression

The first step is a Run-Length Encoding compression. Basically, instead of storing each pixel separately, we store a number which indicates the amount of adjacent identical pixels, and then the color that pixel was. So, if there were 4 black pixels in a row, instead of storing 0000 0000 0000 0000 we could store 0100 0000 (number first, then color). You can read more about RLE here: https://en.wikipedia.org/wiki/Run-length_encoding
This compression algorithm performs best with limited palettes, like PICO-8! Even better if it's just 1-bit, since there will be longer runs of the same color pixel.
So, we know that we want to store the number of adjacent equivalent pixels, as well as the color of the pixel itself. My first attempt at the solution to this problem looked like this:
local spritesheet = '80,8a,16c,40, ...'

I could split the string into tokens along commas with split(), then grab two substrings. The last character of the token would indicate the color of the pixel, and the digits prior to the last character indicated the number of repetitions. While simple to implement, this wasn't good enough, and a full spritesheet would result in a large string (still better than storing the raw data however!). We can do better.
Mixing RLE Compression with base256 encoding

If we get very clever with how we structure our data, we can get away with even more compression gains. Consider the fact that we just need to store two pieces of data: the number of repetitions, and the color.
The color can be represented with 4 bits. By choosing carefully, we can conclude that a good number of bits to use for the repetitions would be 12. Why? Because 12+4 is 16, which is conveniently divisible by 8 and thus we can take advantage of a base256 encoding via P8SCII.
It also gives us an effective maximum of 4095 repetitions before we need to start a new repetition grouping, which is exceedingly unlikely (unless half the encoded spritesheet is black). A roomy upper bound of repetitions and the possibility for a base256 encoding is too good to pass up.
If we were to structure our data then, it might look like this:


After RLE compressing our data, we can then take each 16-bit word and divide it like this in order to do the base256 encoding:


That's all fine and dandy, but the astute among you might realize something: there is a non-zero chance that either char1 or char2 will be one of the 16 P8SCII control characters, which we'd like to avoid.
Swerving to avoid the potholes

We need to find a way to guarantee ourselves that both 8-bit words will be greater than (or equal to) 16. That way, we'll always have printable characters with which to work with.
After a lot of thinking, I came up with a solution that I think is reasonable. Instead of storing 12 bits of repetition data, we store only 10 bits. This limits our repetitions to a max of 1023 before starting a new grouping, but this is still a very large ceiling that is unlikely to be hit. What do we do with our newly freed 2 bits? We will always set them to 1 when we are encoding data, and then ignore those bits when we are decoding.
Well then, which bit positions should be always on? Take a look at this:


When we consider the 8-bit words then, it'll look like this:


You'll notice that both char1 and char2 are greater than or equal to 16! That's a huge win for us - we can store our data in a way that it will ALWAYS be able to be converted into a printable set of characters with a minimal loss in data. What does this mean for us? It means that every set of pixel repetitions from 1 to 1023 can be stored with two characters in our encoded string.
Time to get programming

UNDER CONSTRUCTION

[1] https://pico-8.fandom.com/wiki/Palette

[2] https://pico-8.fandom.com/wiki/Memory#Sprite_sheet

[3] https://pico-8.fandom.com/wiki/P8SCII