focalintent/gist:6a936de0502c98bbba6c

## gistfile1.md

      
Comparing the guts of loops between C-style loops and C++ range-based loops:
Right now, the runtime of this:

    CRGBArray<100> leds;
    ...
    for(CRGB & pixel : leds) { pixel = CRGB::Black; }

is about 10% slower than the runtime of this:

    CRGB leds[NUM_LEDS];
    ...
    for(int i = 0; i < NUM_LEDS; i++) { leds[i] = CRGB::Black; }

I wanted to understand why, so I went digging through the asm output. The cause appears to be in the heart of the two loops.
Here's the asm for the range loop:

    216a:       4283            cmp     r3, r0
    216c:       d005            beq.n   217a <test1()+0x22>

    216e:       2100            movs    r1, #0
    2170:       7019            strb    r1, [r3, #0]
    2172:       7059            strb    r1, [r3, #1]
    2174:       7099            strb    r1, [r3, #2]
    2176:       191b            adds    r3, r3, r4

    2178:       e7f7            b.n     216a <test1()+0x12>

And here's the asm for the for loop:

    2192:       2100            movs    r1, #0
    2194:       7019            strb    r1, [r3, #0]
    2196:       7059            strb    r1, [r3, #1]
    2198:       7099            strb    r1, [r3, #2]
    219a:       3303            adds    r3, #3

    219c:       4283            cmp     r3, r0
    219e:       d1f8            bne.n   2192 <test3()+0xa>

The body of the loop is the same in both: load an immediate 0, store that 0 into the three bytes of the pixel, then advance the pointer (the range loop adds a stride held in r4, the C loop adds the immediate #3). After that the two diverge. The code with the iterators does an unconditional jump back to the top, where the comparison and the conditional branch out of the loop live. The loop without the iterators effectively becomes a do {} while() loop: the comparison and conditional branch sit at the bottom, so each iteration executes one branch instruction instead of two.
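To make the two shapes concrete, here is a minimal C++ sketch of the same clearing loop written both ways. `Pixel` is a hypothetical three-byte stand-in for `CRGB`, and both function names are made up for illustration. The comments describe the loop shapes seen in the asm above; what a given compiler actually emits for this source depends on its version and flags.

```cpp
#include <cstdint>

// Hypothetical stand-in for CRGB: three one-byte channels.
struct Pixel {
    std::uint8_t r, g, b;
};

// Test at the top: the shape of the range loop's asm.
// Each iteration runs a compare, a conditional branch that is normally
// not taken, and an unconditional branch back up to the compare.
inline void clear_test_at_top(Pixel* p, Pixel* end) {
    while (p != end) {
        p->r = 0;
        p->g = 0;
        p->b = 0;
        ++p;
    }
}

// Test at the bottom: the shape of the constant-count C loop's asm.
// Each iteration runs a compare and a single conditional branch.
// Precondition (assumed here): the range is not empty, because the
// body runs once before the first comparison.
inline void clear_test_at_bottom(Pixel* p, Pixel* end) {
    do {
        p->r = 0;
        p->g = 0;
        p->b = 0;
        ++p;
    } while (p != end);
}
```

The do/while form is only safe when at least one iteration is guaranteed, which is why a count known at compile time makes it easy for the compiler to pick that shape.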
Why the difference? It turns out that in the second case the compiler is making an optimization because NUM_LEDS is a constant, presumably because a known, non-zero count guarantees the body runs at least once. If I change my test code so that in both cases the thing being iterated over is passed in (a CRGBSet& for the range loop, and a CRGB* and an int for the C loop), then the numbers are reversed: the C loop code is now about 10% slower. Unsurprisingly, the asm at the heart of the range loop is unchanged, but the asm at the heart of the C loop now looks like this:

    218a:       428a            cmp     r2, r1
    218c:       da06            bge.n   219c <test3(CRGB*, int)+0x1c>

    218e:       2500            movs    r5, #0
    2190:       701d            strb    r5, [r3, #0]
    2192:       705d            strb    r5, [r3, #1]
    2194:       709d            strb    r5, [r3, #2]
    2196:       3201            adds    r2, #1
    2198:       3303            adds    r3, #3

    219a:       e7f6            b.n     218a <test3(CRGB*, int)+0xa>

Notice that the C loop now has one more instruction than the range loop. That's because it increments both an integer counter (r2) and the data pointer (r3) on every iteration. When one thinks about it, keeping a running pointer instead of recomputing the address from the index each time is a nice bit of optimization on the part of the compiler, but it still leaves the C loop one cycle per iteration slower than the C++ range loop.
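The difference in bookkeeping can be sketched in C++ as follows. This illustrates what the asm appears to be doing, not the compiler's actual intermediate form. `Pixel` is a hypothetical stand-in for `CRGB`, and all three function names are invented for this example.

```cpp
#include <cstdint>

// Hypothetical stand-in for CRGB: three one-byte channels.
struct Pixel {
    std::uint8_t r, g, b;
};

// The C loop as written: the only visible loop variable is i.
inline void clear_indexed(Pixel* leds, int n) {
    for (int i = 0; i < n; i++) {
        leds[i] = Pixel{0, 0, 0};
    }
}

// Roughly what the asm for the C loop does: the address is no longer
// computed from i each time, but i is still kept for the exit test,
// so two variables advance per iteration.
inline void clear_counter_and_pointer(Pixel* leds, int n) {
    Pixel* p = leds;
    int i = 0;
    while (i < n) {
        *p = Pixel{0, 0, 0};
        i++;  // corresponds to: adds r2, #1
        p++;  // corresponds to: adds r3, #3
    }
}

// The range loop's equivalent: the pointer is the only thing that
// advances, and the exit test compares it against the end pointer.
inline void clear_pointer_only(Pixel* first, Pixel* last) {
    for (Pixel* p = first; p != last; ++p) {
        *p = Pixel{0, 0, 0};
    }
}
```

A compiler is free to drop the counter and compare pointers instead, so the extra instruction is a property of this particular build rather than of C loops in general.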
C++ range loop wins again, at least for this code :)