Comparing the guts of C-style loops and C++ range-based for loops:
Right now - the runtime of this:
CRGBArray<100> leds;
...
for(CRGB & pixel : leds) { pixel = CRGB::Black; }
is about 10% slower than the runtime of this:
CRGB leds[NUM_LEDS];
...
for(int i = 0; i < NUM_LEDS; i++) { leds[i] = CRGB::Black; }
I wanted to understand why - so I went digging through the asm output. It appears the cause is in the heart of the two loops.
Here's the asm for the range loop:
216a: 4283 cmp r3, r0
216c: d005 beq.n 217a <test1()+0x22>
216e: 2100 movs r1, #0
2170: 7019 strb r1, [r3, #0]
2172: 7059 strb r1, [r3, #1]
2174: 7099 strb r1, [r3, #2]
2176: 191b adds r3, r3, r4
2178: e7f7 b.n 216a <test1()+0x12>
and here's the asm for the for loop:
2192: 2100 movs r1, #0
2194: 7019 strb r1, [r3, #0]
2196: 7059 strb r1, [r3, #1]
2198: 7099 strb r1, [r3, #2]
219a: 3303 adds r3, #3
219c: 4283 cmp r3, r0
219e: d1f8 bne.n 2192 <test3()+0xa>
The body of the loop is the same in both: load an immediate 0, store that 0 into the three bytes of the pixel, then advance the pointer. That's where the two diverge. The code with the iterators does an unconditional jump back to the top, where the comparison and the conditional jump out of the loop happen. The loop without the iterators effectively gets to be a do {} while() loop, with a single conditional branch at the bottom.
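To make the two shapes concrete, here is a rough C++ sketch of what each asm listing corresponds to. Pixel is a hypothetical 3-byte stand-in for CRGB, and the function names are made up for illustration. A compiler is free to turn either form into the other, so treat this as a picture of the control flow rather than a way to force a particular output.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 3-byte RGB pixel (not FastLED's CRGB).
struct Pixel { std::uint8_t r, g, b; };

// Shape of the range loop's asm: test at the top, unconditional branch back.
inline void clear_test_at_top(Pixel* p, Pixel* end) {
    while (p != end) {      // cmp + beq out of the loop
        p->r = 0;           // strb
        p->g = 0;           // strb
        p->b = 0;           // strb
        ++p;                // adds
    }                       // b.n back up to the test
}

// Shape of the constant-count loop's asm: no test on entry, one conditional
// branch at the bottom. Only valid when there is at least one pixel.
inline void clear_test_at_bottom(Pixel* p, Pixel* end) {
    assert(p != end);       // precondition: the range is not empty
    do {
        p->r = 0;
        p->g = 0;
        p->b = 0;
        ++p;                // adds
    } while (p != end);     // cmp + bne back to the top
}
```

Both functions clear the same range; the second one just saves a branch per pass, at the cost of requiring a non-empty range.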
Why the difference? It turns out that in the second case the compiler is making an optimization because NUM_LEDS is a constant: it knows the loop runs at least once, so it can skip the test on entry, and it can compare the pointer against a fixed end address instead of counting i. If I change my test code so that in both cases what's being iterated over is passed in (a CRGBSet& for the range loop case, and a CRGB* and int for the C loop case), then the numbers are reversed: the C loop code is now 10% slower. Unsurprisingly, the asm code at the heart of the range loop is unchanged, but the asm code at the heart of the C loop now looks like this:
218a: 428a cmp r2, r1
218c: da06 bge.n 219c <test3(CRGB*, int)+0x1c>
218e: 2500 movs r5, #0
2190: 701d strb r5, [r3, #0]
2192: 705d strb r5, [r3, #1]
2194: 709d strb r5, [r3, #2]
2196: 3201 adds r2, #1
2198: 3303 adds r3, #3
219a: e7f6 b.n 218a <test3(CRGB*, int)+0xa>
Notice how the C loop now has an extra instruction relative to the range loop. That's because the C loop increments both the integer index and the data pointer on every pass. When one thinks about it, turning leds[i] into a moving pointer is a nice bit of optimization on the part of the compiler, but it still ends up one cycle per iteration slower than the C++ range loop.
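The difference can be sketched in source form too. This is a minimal illustration, again using a hypothetical Pixel struct in place of CRGB and made-up function names; it is not the test code from this post, and whether a given compiler emits exactly the asm above for it is an assumption.

```cpp
#include <cstdint>

// Hypothetical stand-in for a 3-byte RGB pixel (not FastLED's CRGB).
struct Pixel { std::uint8_t r, g, b; };

// C-style loop with a runtime count. The compiler can replace leds[i] with a
// moving pointer, but i is still needed for the i < n test, so two values
// get incremented per pass.
inline void clear_indexed(Pixel* leds, int n) {
    for (int i = 0; i < n; i++) {
        leds[i] = Pixel{0, 0, 0};
    }
}

// Pointer-range loop, which is roughly what a range-based for expands to:
// the end pointer is computed once and only the pointer advances.
inline void clear_pointer_range(Pixel* leds, int n) {
    if (n <= 0) return;              // nothing to clear
    Pixel* const end = leds + n;     // computed once, outside the loop
    for (Pixel* p = leds; p != end; ++p) {
        *p = Pixel{0, 0, 0};
    }
}
```

Writing the C loop in the second form by hand is one way to give a pointer-and-count interface the same single-increment loop body.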
C++ range loop wins again, at least for this code :)