@focalintent
Last active August 29, 2015 14:26

On scale8, correctness, and cycle counting on avr

FastLED is about more than just pushing data out to leds as quickly as possible. It is also about providing the fastest supporting functions (math, color manipulation, etc...) while still remaining accurate. Mark has done an amazing job with many of the math functions, including fast trig functions that are ~98% accurate, and the core of the scaling functions. These scaling functions are what allow us to do things like the non-destructive, zero-overhead scaling and color adjustments in the library.

scale8 - the background

The basic scaling function in FastLED is scale8. It lets you scale an 8-bit value by another 8-bit value. It's a percentage of sorts, but from 0-255 instead of 0-100. So, for example, scale8(64,128) will give you 32. Scaling is something that gets used a lot in led work, so it is important that it be fast. The C version of scale8 takes advantage of some properties of binary math to do its magic (while avoiding division). Let's take a look:

uint8_t scale8(uint8_t i, fract8 scale) {
    return (i*scale)>>8;
}

When you multiply two 8-bit values together, you get a sixteen bit result. It turns out that the high 8 bits of that result are the scaled value. For example, from our call above, scale8(64,128): 64*128 is 8192, which in hex is 0x2000. Take the high 8 bits of that, 0x20, and you have 32 as the result. Pretty nifty, no? This works with decimal numbers as well. Say you want to halve a value, such as 128. Halving is like taking 50%, so you multiply 128 by 50 - which gives you 6400 - then divide that by 100 (because, remember, 50% is 0.5) to get back to 64. Same thing, just base 10 vs base 2.
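If you want to poke at this on a desktop machine, here's a tiny standalone sketch of the same math (the uint16_t cast just keeps the multiply well-defined off-AVR, and fract8 is replaced with a plain uint8_t so it compiles without FastLED):

#include <stdint.h>
#include <stdio.h>

// The C scale8 from above, reproduced so this sketch stands alone.
static uint8_t scale8(uint8_t i, uint8_t scale) {
    return ((uint16_t)i * scale) >> 8;
}

int main(void) {
    unsigned product = 64 * 128;                   // 8192, or 0x2000
    printf("64 * 128 = %u (0x%04X)\n", product, product);
    printf("high byte = %u\n", product >> 8);      // 32 - the scaled value
    printf("scale8(64,128) = %u\n", (unsigned)scale8(64, 128));
    return 0;
}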

So, let's look at what avr-gcc does with the above function. It is pretty smart: it aggressively chooses to inline it (N.B. we do our own flagging of functions in FastLED to encourage always inlining, as much as we can). Here's what the asm for a call to scale8 ends up looking like (the value for i has already been loaded into register r18, and the value for scale has been loaded into register r19):

mul r18, r19   ; multiply i by scale, result lands in r1 (high byte) and r0 (low byte)
movw r24, r0   ; copy the 16-bit result into r24/r25
eor r1,r1      ; re-zero gcc's zero register

This isn't that bad! Here's the breakdown of what it is doing. First, it does a 2 cycle multiply of the two 8-bit values. Since avr registers are only 8-bit, the avr mul opcode stores its result in registers r0 (low byte) and r1 (high byte). The next instruction, movw, is a single cycle instruction that moves a pair of registers to another pair of registers, in this case moving r0 and r1 to r24 and r25. Finally, avr-gcc uses the r1 register as a "zero" register, meaning it always expects the value in that register to be zero, so the third instruction here, the eor r1,r1, zeros out the r1 register again (the mul just clobbered it with the high byte of the result).

4 cycles, which is actually as fast as FastLED's asm version of scale8. However, it uses an extra register, since we're never going to use the low byte of the result that lands in r24. That register is wasted, and keeping registers available is part of keeping code fast. So, FastLED's scale8 on avr is an asm block that looks like this instead (note, code tweaked lightly):

mul %0, %1   ; %0 maps to i, %1 maps to scale
mov %2, r1   ; %2 maps to the return value for scale8
clr r1       ; restore the r1 register

Still 4 cycles, but now it uses one less register. If we aren't going to use the original value of i again at the call site, the compiler may opt to map %0 and %2 to the same register.

What if you're doing multiple scale8 calls in a row? Well, unfortunately, avr-gcc insists on putting a clr r1 after every mul it emits. This means that if we had three calls to the C version of scale8 back to back (say, when scaling a CRGB object), the guts of each call would be 4 cycles. Like this:

mul r18, r19
movw r24, r0
eor r1,r1
mul r17, r19
movw r22, r0
eor r1, r1
mul r16, r19
movw r20, r0
eor r1, r1

So that's 12 cycles total. However, if we know we're going to be doing multiple multiplications in a row, we can optimize things a little bit. There's a function in FastLED called scale8_LEAVING_R1_DIRTY that leaves out the cleanup of the r1 register. It means the equivalent of the above in FastLED (when using scale8_LEAVING_R1_DIRTY for the first two scales) is:

mul r18, r19
mov r24, r1   ; grab the high byte before the next mul overwrites r1
mul r17, r19
mov r23, r1
mul r16, r19
mov r22, r1
eor r1,r1     ; only one cleanup at the end

Now we're looking at 10 cycles. Saving 2 cycles doesn't sound like a lot, until you start adding things up. Let's say you make a pass at scaling down all of your leds before drawing the new frame (this is a nice way to get fading trails), and you have 200 leds. Now we're talking about 2400 cycles vs 2000 cycles. 400 cycles on a 16MHz avr is 25µs. We just gave you an extra 25µs per frame. Again, that doesn't sound like a lot, but if you have complex animations that are doing a lot of value scaling all over the place, these savings add up quickly.
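For a concrete picture of that fade pass, here's a sketch of what it might look like in plain C (using the scale8 from the top of this writeup; NUM_LEDS, fade_all, and the bare-bones CRGB stand-in are just names made up for this example):

#include <stdint.h>

#define NUM_LEDS 200

// Minimal stand-in for FastLED's CRGB struct, just for this sketch.
typedef struct { uint8_t r, g, b; } CRGB;
static CRGB leds[NUM_LEDS];

static uint8_t scale8(uint8_t i, uint8_t scale) {
    return ((uint16_t)i * scale) >> 8;
}

// Scale every channel of every led by 'keep' (out of 256) each frame.
// Call this before drawing the new frame to get fading trails.
void fade_all(uint8_t keep) {
    for (int n = 0; n < NUM_LEDS; ++n) {
        leds[n].r = scale8(leds[n].r, keep);  // three scale8 calls per led,
        leds[n].g = scale8(leds[n].g, keep);  // which is exactly where those
        leds[n].b = scale8(leds[n].b, keep);  // extra per-call cycles add up
    }
}

Something like fade_all(192) each frame knocks every channel down to roughly 75% of its previous value, and with 200 leds that's 600 scale8 calls per frame - which is why shaving cycles off each one matters.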

scale8 - the problem

Alas, there is a problem with scale8 that many FastLED users have run into. Because we are trying to keep things in 8-bit values, the scale is 0-255, and the intuition is that scale8(x,255) for any value of x would not change the value of x (because 255 should be max, and be like scaling by 1, right?). However, that isn't how it works out. For example, with FastLED, scale8(255,255) gives you a value of 254, not 255. What the hell, right? Well, let's dig into the math on this one.

255 * 255 is 65,025, which in hex is 0xFE01. The top 8 bits of that value are 0xFE, which is 254. Whoops. What's happening here? Well, with the math that we're doing, you really want the max value to be 256, not 255. However, 256 is a 16-bit value; it doesn't fit into an 8-bit slot, and using it would greatly increase the amount of work done. By how much, you ask? Well, let's say we changed the type of scale to be uint16_t, so that we could have 256 be a scaling value. Now the output looks like this:

mul r20, r18
movw r24, r0
mul r20, r19
add r25, r0
eor r1, r1

Now that's 7 cycles, plus the extra register for a 16-bit value of scale. It also means that you have to use a 16-bit value for scaling everywhere - increasing memory usage, increasing the amount of time spent saving/loading data from memory, and turning every operation done with your scaling value into a 16-bit operation instead of an 8-bit one. It's a decision that could have a pretty nasty cascading effect. (To put the original gap in base 10 terms, it's like the percentage range being 0-99% instead of 0-100%, which means even at the top of the scale it will always trim some off.)
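For reference, the hypothetical 16-bit-scale variant that the 7-cycle listing above would be compiled from looks something like this (a sketch for illustration only, not anything in FastLED; scale8_wide is a made-up name):

uint8_t scale8_wide(uint8_t i, uint16_t scale) {
    // scale runs 0-256 here, so 256 means "leave i unchanged"
    return ((uint16_t)i * scale) >> 8;
}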

So, Mark and I have been digging into solving this. We want to get rid of the 'gap' that exists when scaling by 255. We could special-case when scale is 255. However, that would add a conditional, which would add a bunch of clocks:

cpi r19, 0xFF    ; test the value in r19 against 255
breq _skip       ; skip the scaling if our scaling value is 255
mul r20, r19
mov r16, r1      ; the scaled result is the high byte, in r1
eor r1,r1
rjmp _done
_skip:
mov r16, r20
_done:

Now it's 8 cycles if scale != 255 and 4 cycles if it is. Ugly. Also, it turns out that fixing things this way would only solve the problem when scaling by 255. The problem exists at most scale values, actually; it's just most visible with 255.
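To see that the gap isn't specific to the 255 case, here's a quick desktop check (using the plain C scale8 from the top of this writeup); it verifies that a full-brightness 255 input always comes out one step low, whatever the scale:

#include <stdint.h>
#include <stdio.h>

static uint8_t scale8(uint8_t i, uint8_t scale) {
    return ((uint16_t)i * scale) >> 8;
}

int main(void) {
    // scale8(255, s) works out to s - 1 for every non-zero s, not s.
    for (unsigned s = 1; s <= 255; ++s) {
        if (scale8(255, (uint8_t)s) != s - 1) {
            printf("unexpected result at scale %u\n", s);
            return 1;
        }
    }
    printf("scale8(255, s) == s - 1 for every s from 1 to 255\n");
    return 0;
}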

scale8 - the fix

So, how to fix this? From a high-level math perspective, the "right" way to fix this (if we were damning 8-bit math and performance) would be return (i * (scale+1)) >> 8;. So, what does this look like coming out of avr-gcc? Assuming scale is in r24 and i is in r20:

ldi r25, 0x00     ; "promote" scale to a 16 bit value
adiw r24, 0x01    ; now add 1 to the 2 byte value in registers r24/r25 (adiw names the low register of the pair)
mul r20, r24
movw r18, r0
mul r20, r25
add r19, r0
eor r1,r1

Now we're up to 10 cycles. Going from 4 cycles to 10 isn't really appealing to Mark and me. The 16-bit promotion of scale isn't ideal, so I started playing with different ways of doing the same kind of thing by hand in asm, and with different ways of looking at the problem.

adding one more

So, the first thing I tried was ways to 'add one more i' to the result (since i * (scale+1) is just i*scale + i). First, though, a little sidebar into 16-bit arithmetic on 8-bit cpu cores. Recall from above that when we multiply two 8-bit numbers, we get a 16-bit number, and now we're talking about wanting to add one more 8-bit number into that. When you only have 8-bit math operations, this ends up being a 2 step operation. First, let's look at adding two 16-bit numbers. Let's say we're adding the 16-bit value stored in registers r10/r11 (with the low 8 bits being in r10) to the 16-bit value stored in registers r12/r13 (and storing the result in r12/r13):

add r12, r10
adc r13, r11

The first operation is our normal 8-bit addition. However, in addition to storing the 8-bit result of that addition into r12, it will also set the carry flag to 1 or 0 based on whether or not the addition of r12 & r10 would be greater than 255 (the maximum result of adding 2 8bit values together is 510, or 0x1FE in hex). The second operation is also an 8-bit addition, but it will also add in the value in the carry flag. Great. Now, how about if we want to add an 8 bit value into a 16 bit value? Well, an 8-bit value is just a 16 bit value where the high 8 bits are 0. So, we can do that with this on avr (assuming we want to add the 8-bit value in r16 to the 16-bit value in r12/r13):

add r12, r16
adc r13, __zero_reg__ ; add just the carry into the high byte

Great, right? Now we can just do:

mul r18,r19           ; i in r18, scale in r19
add r0, r18           ; add i into the low byte, possibly setting the carry flag
adc r1, __zero_reg__  ; ummm...
mov r20, r1           ; save the high 8 bits in r20.
eor r1,r1             ; clear r1

Only 6 cycles. That's MUCH better than the 10 cycles we were getting above, AND we get a more accurate result. Yes, it's an increase of 2 cycles per operation, but this is where one starts making fun tradeoffs: correctness or speed? (Mark and I are still deciding how we want to roll this into FastLED - whether to unilaterally decide in favor of correctness, at the cost of 1-2 cycles per scale8 call, or to provide a setting flag so people who care more about getting every last clock cycle, and are ok with the gapping, can have that.)
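In C terms, what that sequence computes is roughly the following (just a sketch of the math for illustration; scale8_corrected is a made-up name, and the real version stays in asm):

// i*scale plus one more i is i*(scale+1); the high byte of that is the corrected result.
uint8_t scale8_corrected(uint8_t i, uint8_t scale) {
    return ((uint16_t)i * scale + i) >> 8;
}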

Except... there's a problem. Remember above? avr-gcc uses r1 as its zero register. That means that our add with carry is actually adc r1,r1 - doubling the value in r1 (plus the carry). This isn't what we want AT ALL. (Those who follow me on twitter may have noticed me venting expletives in gcc's general direction right about at the point where I hit this.)

a new zero?

Ok, so that was irritating. The first possible solution was to sacrifice an extra register and make it a new zero. That'll teach gcc.

clr r21      ; going to use r21 as our new zero
mul r18,r19  ; i in r18, scale in r19
add r0, r18
adc r1, r21
mov r20, r1
clr r1

Hey, that works. However, it uses an extra register, and because we need to clear that register we're now at 7 cycles. Maybe with multiple calls in a row gcc would be smart enough to realize that we're re-using this register. Still, I really hate the idea of using any more registers than we have to. Could there be another way around this? It turns out, there is.

who needs zero?

Since we know that, for the adc, we're going to be adding a 0, it really means that what we want to do is increase the value in r1 by 1 if the carry flag has been set. Hey, there are other ways we can do this. Like:

mul r18, r19    ; i in r18, scale in r19
add r0, r18     ; do our add, optionally setting the carry flag
brcc _skip      ; if the carry flag is clear, branch to skip the next opcode
inc r1         ; if we're here, the carry flag was set, increment r1
_skip:
mov r20, r1
clr r1

Still 7 cycles, but now we aren't wasting an extra register. On the flip side, though, because of how this is done there's no chance for gcc to be smart enough to save us the cycle spent setting up an extra zero register by re-using it across calls.

abusing math

So, feeling I had hit the end of the road for playing with adding an extra i in, I went back to thinking about the two pieces of this. One is that we want to increment scale by 1 before doing the math, so we get the result that users would be expecting - I want the pre-shift result of scale8(255,255) to be 0xFFxx (honestly, I don't care what the low 8 bits are). But hey, if I'm multiplying something by 256, that's the equivalent of an 8-bit shift left, and since that would be immediately followed by an 8-bit shift right, the two cancel out - the answer is just i. Maybe there's another route?

So, now we know that we want to increment scale by 1, and we know that we don't want to bother doing anything to i if that incremented scale is 256 (i.e. scale was 255). Finally, I'm going to abuse the fact that we almost never want to use the pre-scaled value of a register afterwards (the pre-scaled value is still safe in memory until something overwrites it there), so I'm going to explicitly make use of that here and re-use i's register for the output of scale8 (note: FastLED does this internally already, but it's been useful up to this point to keep the destination separate to make clearer what was going on).

inc r19      ; i in r18, scale in r19
breq _done   ; if scale was 255, it is now 0 after the inc; if so, skip to the end
mul r18, r19 ; now multiply i by our adjusted scale value
mov r18, r1  ; move our result back into i
clr r1
_done:

Now, this is promising. When scale is < 255, it will run through in 6 cycles. Even better, when scale == 255, it will run through in 3. Except it still has some issues that I'm worried about. First, it clobbers the value of scale. Knowing what I know about how scaling is used in a variety of places, the value of scale will often be used for multiple scale8 calls in a row (perhaps hundreds if you're in a loop). Blowing that value away means it needs to be reloaded, either by using an extra register to keep a 'clean' copy of scale and moving it in again (1 cycle per call to scale8) or, worse, by re-reading it from memory every time (2 cycles per call to scale8). So now we're up to 4-7 cycles, or worse, 5-8, depending on how gcc is feeling.

Drat.

Back to the drawing board.

getting carried away

"but then, in the midst of my preparations for hari kiri, it came to me" --Chris Knight

After swearing to Mark that I was walking away from this, that I was going to accept one of our ugly 7+ cycle solutions, and that I was going to get some food, I got an idea. Convention is that add and adc work on the same pairs of registers. However, it is just that: convention. Convention is also that they are adjacent operations. Usually. I started digging through the avr opcode reference, this time looking at which operations touched the condition codes.

I also went back to my new-zero solution. What if I could use a register for 0 without using an extra register? I could. The four registers regularly being used are r0 and r1 (because that's where mul always puts its results) and, let's say, r18 (i) and r19 (scale) (I've mentioned before, but will mention again: r18/r19 are just what I'm using for examples here; it could be any pairing, usually compiler decided). Ok, well, scale I don't want to clobber, for reasons I've already mentioned above. However, the register that the result is going into? What if I could use that? I quickly sketched out an ordering:

mul r18, r19    ; i in r18, scale in r19
add r0, r18     ; add i to r0, possibly setting the carry flag
ldi r0, 0x00    ; load the immediate 0 into r0 (note, this does _not_ touch any flags)
adc r1, r0      ; yay!  our adc with a 0 so we're just adding the carry flag
mov r18, r1     ; save our result back in r18
clr r1          ; restore the 0 register

However, there's another way to write the above. We could use our destination register (in this case r18) as our zero register as well, since we're going to overwrite it anyway, right? (It also turns out we have to: ldi can only target registers r16-r31, so it couldn't load into r0 in the first place.) How about this:

mul r18, r19    ; i in r18, scale in r19
add r0, r18     ; add i to r0, possibly setting the carry flag
ldi r18, 0x00   ; load the immediate 0 into r18 (note, this does _not_ touch any flags)
adc r1, r18     ; yay!  our adc with a 0 so we're just adding the carry flag
mov r18, r1     ; save our result back in r18
clr r1          ; restore the 0 register

7 cycles, no clobbering scale, no wasted registers. However, there's one last twist left, and I mean that literally. adc adds the value of its first register and the value of its second register and the value of the carry flag and puts the result in the first register. Remember what you were taught in math? a + b === b + a - the order of the operands doesn't matter. I can swap the order of arguments to adc and get rid of that mov, letting adc do the mov for me:

Ladies and gentlemen of all genders, allow me to introduce to you the future of a corrected scale8 on avr:

mul %[i], %[scale] ; multiply i by scale, result lands in r1:r0
add r0, %[i]       ; add i to r0, possibly setting the carry flag
ldi %[i], 0x00     ; load the immediate 0 into i (note, this does _not_ touch any flags)
adc %[i], r1       ; walking and chewing gum at the same time
clr __zero_reg__   ; restore the 0 register

6 cycles, constant, no clobbering scale, no using extra registers. Or to put it another way, correctness for only 2 extra cycles.
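For the curious, here's roughly how that sequence might get wrapped up as a GCC inline asm function (a sketch under assumptions, not the actual FastLED source - in particular the constraint letters, the clobber list, and the name scale8_fixed are my own guesses):

#include <stdint.h>

static inline uint8_t scale8_fixed(uint8_t i, uint8_t scale)
{
    asm volatile(
        "mul %[i], %[scale]  \n\t"   // i * scale -> r1:r0
        "add r0, %[i]        \n\t"   // add one more i, possibly setting carry
        "ldi %[i], 0x00      \n\t"   // zero i's register (no flags touched)
        "adc %[i], r1        \n\t"   // i = high byte + carry
        "clr __zero_reg__    \n\t"   // restore gcc's zero register (r1)
        : [i] "+d" (i)               // "+d": ldi needs an upper register (r16-r31)
        : [scale] "r" (scale)
        : "r0", "r1");
    return i;
}

Called in a loop, scale stays untouched in whatever register gcc picked for it, which is exactly the property the last few sections were chasing.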

boom

ps. Dear gcc - next can we make the zero register be a register that isn't hard coded to get clobbered by an opcode like, say, multiply?

@kriegsman

And in the end, "all it took" to improve this was both of us thinking about it for over a year, and then a sixteen hour blitz.

BOOM indeed.
