@thesps
Last active December 10, 2018 11:49
Tables of performance, for given block partition factors and different resource reuse factors, for the MaxPooling PR

fastmachinelearning/hls4ml#117

Input images are 32x32x1, pool size is 2x2x1, giving output images of 16x16x1 (256 pixels). Configurations which give the desired behaviour of II = reuse are in bold.
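
For reference, the layer under test is plain 2x2, stride-2 max pooling; a minimal Python sketch of the functional behaviour (not the hls4ml implementation itself):

```python
def max_pool_2x2(image):
    """2x2, stride-2 max pooling over a 2-D image (list of rows),
    e.g. 32x32 -> 16x16 as in the tables below."""
    h, w = len(image), len(image[0])
    return [[max(image[i][j], image[i][j + 1],
                 image[i + 1][j], image[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]
```

In the HLS version this loop nest is pipelined, and "II = reuse" means the pipeline's initiation interval matches the requested resource reuse factor.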

There is a nearly reliable trend in the choice of block partition factor. For reuse <= 16 (partitioning only in the 1st image dimension), a block partition factor of 16 / reuse for both the input and output image gives sensible performance. For reuse >= 32, the partition factor for the 1st dimension is set to 1, and I instead scan over the factors for the second image dimension. Here sensible performance is obtained with image_dimension / (reuse / 16), or with 16 / ((reuse / 16) / 2) for both images. For reuse = 128, for example, good performance is obtained with block factor 32 / (128 / 16) = 4 for the input image and 16 / (128 / 16) = 2 for the output image, or 16 / ((128 / 16) / 2) = 4 for both. Either way, the calculation needs to change slightly when also partitioning in the second dimension.
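
The rule of thumb can be written out explicitly. This is a hypothetical helper (not part of hls4ml) encoding the two schemes, assuming the 32x32 input and 16x16 output of this test:

```python
def partition_factors(reuse, scheme=1):
    """(input, output) block partition factors for a given reuse factor.
    For reuse <= 16 the factors apply to the 1st image dimension;
    for reuse >= 32 the 1st dimension gets factor 1 and these factors
    apply to the 2nd dimension instead."""
    if reuse <= 16:
        f = 16 // reuse
        return (f, f) if scheme == 1 else (f, max(f // 2, 1))
    r = reuse // 16
    if scheme == 1:
        f = 16 // max(r // 2, 1)
        return (f, f)
    return (32 // r, 16 // r)  # input dimension is 32 wide, output is 16
```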

These are the block partition factors applied for each reuse factor, for the two schemes which give good performance, in the form (input partition factor, output partition factor). For reuse >= 32 the factor is applied to the second image dimension, with the first dimension partitioned with factor 1.

| reuse | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|---|---|---|
| Partition Scheme 1 | 16, 16 | 8, 8 | 4, 4 | 2, 2 | 1, 1 | 16, 16 | 8, 8 | 4, 4 | 2, 2 |
| Partition Scheme 2 | 16, 8 | 8, 4 | 4, 2 | 2, 1 | 1, 1 | 16, 8 | 8, 4 | 4, 2 | 2, 1 |

With these schemes, I get the following trend in LUT usage (plot: LUTs vs reuse). Note that above reuse=64 the resource usage starts to increase again, albeit slowly.
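
In the HLS source these factors correspond to block `ARRAY_PARTITION` pragmas on the input and output arrays. A sketch of how the pragma text could be generated; the array names `data` and `res` are placeholders, not necessarily the names hls4ml uses:

```python
def partition_pragmas(in_factor, out_factor, dim):
    """Build ARRAY_PARTITION pragma lines for the input/output arrays.
    `dim` selects which array dimension the block partition applies to."""
    return [
        f"#pragma HLS ARRAY_PARTITION variable=data block factor={in_factor} dim={dim}",
        f"#pragma HLS ARRAY_PARTITION variable=res block factor={out_factor} dim={dim}",
    ]
```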

Reuse 1:

| In Partition Factor | Out Partition Factor | Latency | II | HLS LUT | HLS FF |
|---|---|---|---|---|---|
| 1 | 1 | 18 | 16 | 32438 | 20772 |
| 1 | 2 | 18 | 16 | 32534 | 20772 |
| 1 | 4 | 18 | 16 | 33174 | 20772 |
| 1 | 8 | 18 | 16 | 33558 | 20772 |
| 1 | 16 | 17 | 16 | 29718 | 20516 |
| 2 | 1 | 10 | 8 | 32644 | 20756 |
| 2 | 2 | 10 | 8 | 32644 | 20756 |
| 2 | 4 | 10 | 8 | 33284 | 20756 |
| 2 | 8 | 10 | 8 | 33668 | 20756 |
| 2 | 16 | 9 | 8 | 29828 | 20244 |
| 4 | 1 | 10 | 8 | 33924 | 20756 |
| 4 | 2 | 6 | 4 | 34530 | 20748 |
| 4 | 4 | 6 | 4 | 34530 | 20748 |
| 4 | 8 | 6 | 4 | 34914 | 20748 |
| 4 | 16 | 5 | 4 | 31074 | 19724 |
| 8 | 1 | 10 | 8 | 34692 | 20756 |
| 8 | 2 | 6 | 4 | 35298 | 20748 |
| 8 | 4 | 4 | 2 | 35658 | 20746 |
| 8 | 8 | 4 | 2 | 35658 | 20746 |
| 8 | 16 | 3 | 2 | 31800 | 18696 |
| 16 | 1 | 10 | 8 | 27012 | 20756 |
| 16 | 2 | 6 | 4 | 27618 | 20748 |
| 16 | 4 | 4 | 2 | 27978 | 20746 |
| 16 | 8 | 3 | 1 | 24072 | 20744 |
| 16 | 16 | 2 | 1 | 24072 | 16646 |
| 32 | 1 | 9 | 8 | 63880 | 38164 |
| 32 | 2 | 5 | 4 | 64486 | 38156 |
| 32 | 4 | 3 | 2 | 64828 | 38152 |
| 32 | 8 | 2 | 1 | 60940 | 38150 |
| 32 | 16 | 1 | 1 | 60934 | 34050 |

Reuse 2:

| In Partition Factor | Out Partition Factor | Latency | II | HLS LUT | HLS FF |
|---|---|---|---|---|---|
| 1 | 1 | 18 | 16 | 19750 | 10404 |
| 1 | 2 | 18 | 16 | 20502 | 10404 |
| 1 | 4 | 18 | 16 | 21142 | 10404 |
| 1 | 8 | 18 | 16 | 21526 | 10404 |
| 1 | 16 | 17 | 16 | 17686 | 12196 |
| 2 | 1 | 10 | 8 | 20068 | 10388 |
| 2 | 2 | 10 | 8 | 20068 | 10388 |
| 2 | 4 | 10 | 8 | 21252 | 10388 |
| 2 | 8 | 10 | 8 | 21636 | 10388 |
| 2 | 16 | 9 | 8 | 17796 | 11924 |
| 4 | 1 | 10 | 8 | 21700 | 11412 |
| 4 | 2 | 6 | 4 | 21730 | 10380 |
| 4 | 4 | 6 | 4 | 21730 | 10380 |
| 4 | 8 | 6 | 4 | 22882 | 10380 |
| 4 | 16 | 5 | 4 | 19042 | 11404 |
| 8 | 1 | 10 | 8 | 22564 | 11924 |
| 8 | 2 | 6 | 4 | 22882 | 11404 |
| 8 | 4 | 4 | 2 | 21706 | 10378 |
| 8 | 8 | 4 | 2 | 21706 | 10378 |
| 8 | 16 | 3 | 2 | 19768 | 10376 |
| 16 | 1 | 10 | 8 | 22564 | 11924 |
| 16 | 2 | 7 | 5 | 23278 | 11918 |
| 16 | 4 | 5 | 3 | 23620 | 11402 |
| 16 | 8 | 4 | 2 | 19786 | 10378 |
| 16 | 16 | 3 | 2 | 19768 | 10376 |
| 32 | 1 | 9 | 8 | 59432 | 37524 |
| 32 | 2 | 6 | 5 | 60146 | 37518 |
| 32 | 4 | 4 | 3 | 60506 | 37002 |
| 32 | 8 | 3 | 2 | 56636 | 35976 |
| 32 | 16 | 2 | 2 | 56654 | 35976 |

Reuse 4:

| In Partition Factor | Out Partition Factor | Latency | II | HLS LUT | HLS FF |
|---|---|---|---|---|---|
| 1 | 1 | 18 | 16 | 13462 | 5220 |
| 1 | 2 | 18 | 16 | 13942 | 5220 |
| 1 | 4 | 18 | 16 | 15126 | 5220 |
| 1 | 8 | 18 | 16 | 15510 | 5220 |
| 1 | 16 | 17 | 16 | 11670 | 8036 |
| 2 | 1 | 10 | 8 | 13668 | 5204 |
| 2 | 2 | 10 | 8 | 13668 | 5204 |
| 2 | 4 | 10 | 8 | 14468 | 5204 |
| 2 | 8 | 10 | 8 | 15620 | 5204 |
| 2 | 16 | 9 | 8 | 11780 | 7764 |
| 4 | 1 | 10 | 8 | 15524 | 6740 |
| 4 | 2 | 6 | 4 | 14754 | 5196 |
| 4 | 4 | 6 | 4 | 14754 | 5196 |
| 4 | 8 | 6 | 4 | 14946 | 5196 |
| 4 | 16 | 5 | 4 | 13026 | 7244 |
| 8 | 1 | 10 | 8 | 21748 | 15188 |
| 8 | 2 | 8 | 6 | 22264 | 14672 |
| 8 | 4 | 7 | 5 | 22062 | 13902 |
| 8 | 8 | 6 | 4 | 20130 | 13388 |
| 8 | 16 | 5 | 4 | 19170 | 15436 |
| 16 | 1 | 10 | 8 | 15524 | 6740 |
| 16 | 2 | 8 | 6 | 16312 | 6736 |
| 16 | 4 | 7 | 5 | 16878 | 5710 |
| 16 | 8 | 6 | 4 | 13026 | 5196 |
| 16 | 16 | 5 | 4 | 13026 | 7244 |
| 32 | 1 | 9 | 8 | 52392 | 36436 |
| 32 | 2 | 7 | 6 | 53180 | 36432 |
| 32 | 4 | 6 | 5 | 53746 | 35406 |
| 32 | 8 | 5 | 4 | 49894 | 34892 |
| 32 | 16 | 4 | 4 | 49894 | 36940 |

Reuse 8:

| In Partition Factor | Out Partition Factor | Latency | II | HLS LUT | HLS FF |
|---|---|---|---|---|---|
| 1 | 1 | 18 | 16 | 10262 | 2628 |
| 1 | 2 | 18 | 16 | 10550 | 2628 |
| 1 | 4 | 18 | 16 | 11350 | 2628 |
| 1 | 8 | 18 | 16 | 12502 | 2628 |
| 1 | 16 | 17 | 16 | 8662 | 5956 |
| 2 | 1 | 10 | 8 | 10180 | 2612 |
| 2 | 2 | 10 | 8 | 10180 | 2612 |
| 2 | 4 | 10 | 8 | 10500 | 2612 |
| 2 | 8 | 10 | 8 | 10692 | 2612 |
| 2 | 16 | 9 | 8 | 8772 | 5684 |
| 4 | 1 | 10 | 8 | 15684 | 10804 |
| 4 | 2 | 10 | 8 | 16004 | 10804 |
| 4 | 4 | 10 | 8 | 16100 | 10804 |
| 4 | 8 | 10 | 8 | 15236 | 10804 |
| 4 | 16 | 9 | 8 | 14276 | 13876 |
| 8 | 1 | 10 | 8 | 17476 | 14900 |
| 8 | 2 | 10 | 8 | 17796 | 14900 |
| 8 | 4 | 10 | 8 | 17988 | 14900 |
| 8 | 8 | 10 | 8 | 16548 | 14900 |
| 8 | 16 | 9 | 8 | 16068 | 17972 |
| 16 | 1 | 10 | 8 | 10180 | 2612 |
| 16 | 2 | 10 | 8 | 10500 | 2612 |
| 16 | 4 | 10 | 8 | 10692 | 2612 |
| 16 | 8 | 10 | 8 | 8772 | 2612 |
| 16 | 16 | 9 | 8 | 8772 | 5684 |
| 32 | 1 | 9 | 8 | 47048 | 34356 |
| 32 | 2 | 9 | 8 | 47368 | 34356 |
| 32 | 4 | 9 | 8 | 47560 | 34356 |
| 32 | 8 | 9 | 8 | 45640 | 34356 |
| 32 | 16 | 8 | 8 | 45640 | 37428 |

Reuse 16:

| In Partition Factor | Out Partition Factor | Latency | II | HLS LUT | HLS FF |
|---|---|---|---|---|---|
| 1 | 1 | 18 | 16 | 8518 | 1332 |
| 1 | 2 | 18 | 16 | 8566 | 1332 |
| 1 | 4 | 18 | 16 | 8886 | 1332 |
| 1 | 8 | 18 | 16 | 9078 | 1332 |
| 1 | 16 | 17 | 16 | 7158 | 4916 |
| 2 | 1 | 18 | 16 | 11718 | 9524 |
| 2 | 2 | 18 | 16 | 11766 | 9524 |
| 2 | 4 | 18 | 16 | 12086 | 9524 |
| 2 | 8 | 18 | 16 | 12278 | 9524 |
| 2 | 16 | 17 | 16 | 10358 | 13108 |
| 4 | 1 | 18 | 16 | 13766 | 13620 |
| 4 | 2 | 18 | 16 | 13814 | 13620 |
| 4 | 4 | 18 | 16 | 14134 | 13620 |
| 4 | 8 | 18 | 16 | 14326 | 13620 |
| 4 | 16 | 17 | 16 | 12406 | 17204 |
| 8 | 1 | 18 | 16 | 14918 | 15668 |
| 8 | 2 | 18 | 16 | 14966 | 15668 |
| 8 | 4 | 18 | 16 | 15286 | 15668 |
| 8 | 8 | 18 | 16 | 15478 | 15668 |
| 8 | 16 | 17 | 16 | 13558 | 19252 |
| 16 | 1 | 18 | 16 | 8518 | 1332 |
| 16 | 2 | 18 | 16 | 8566 | 1332 |
| 16 | 4 | 18 | 16 | 8886 | 1332 |
| 16 | 8 | 18 | 16 | 9078 | 1332 |
| 16 | 16 | 17 | 16 | 7158 | 4916 |
| 32 | 1 | 17 | 16 | 45386 | 34100 |
| 32 | 2 | 17 | 16 | 45434 | 34100 |
| 32 | 4 | 17 | 16 | 45754 | 34100 |
| 32 | 8 | 17 | 16 | 45946 | 34100 |
| 32 | 16 | 16 | 16 | 44026 | 37684 |

Above reuse=16, I partition the first image dimension with block partition factor 1, and the factors in the tables now refer to the block partition factor on the second dimension.
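
To make "block partition factor" concrete: with factor F, HLS splits the array along the chosen dimension into F contiguous blocks, each implemented as a separate memory. A sketch of the element-to-block mapping for one dimension:

```python
def block_partition(elems, factor):
    """Split a 1-D array into `factor` contiguous blocks, mirroring how
    a block ARRAY_PARTITION assigns elements to separate memories."""
    n = len(elems)
    size = -(-n // factor)  # ceil(n / factor)
    return [elems[i * size:(i + 1) * size] for i in range(factor)]
```

So with factor 2 on a 16-wide dimension, elements 0-7 and 8-15 land in separate memories that can be accessed in parallel.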

Reuse 32:

| In Partition Factor | Out Partition Factor | Latency | II | HLS LUT | HLS FF |
|---|---|---|---|---|---|
| 1 | 1 | 515 | 516 | 7861 | 631 |
| 1 | 2 | 515 | 516 | 7850 | 631 |
| 1 | 4 | 515 | 516 | 7860 | 631 |
| 1 | 8 | 515 | 516 | 7880 | 631 |
| 1 | 16 | 515 | 516 | 8048 | 631 |
| 2 | 1 | 258 | 256 | 9050 | 1228 |
| 2 | 2 | 258 | 256 | 9050 | 1228 |
| 2 | 4 | 258 | 256 | 9114 | 1228 |
| 2 | 8 | 258 | 256 | 9146 | 1228 |
| 2 | 16 | 258 | 256 | 9194 | 1228 |
| 4 | 1 | 131 | 128 | 7822 | 1036 |
| 4 | 2 | 130 | 128 | 7838 | 1036 |
| 4 | 4 | 130 | 128 | 7838 | 1036 |
| 4 | 8 | 130 | 128 | 7918 | 1036 |
| 4 | 16 | 130 | 128 | 7966 | 1036 |
| 8 | 1 | 131 | 128 | 8518 | 3180 |
| 8 | 2 | 67 | 64 | 7302 | 1036 |
| 8 | 4 | 66 | 64 | 7262 | 1036 |
| 8 | 8 | 66 | 64 | 7262 | 1036 |
| 8 | 16 | 66 | 64 | 7430 | 1036 |
| 16 | 1 | 131 | 128 | 9234 | 4652 |
| 16 | 2 | 68 | 65 | 8566 | 3566 |
| 16 | 4 | 36 | 33 | 7662 | 1550 |
| 16 | 8 | 35 | 32 | 7702 | 1484 |
| 16 | 16 | 35 | 32 | 7702 | 1484 |
| 32 | 1 | 130 | 128 | 12274 | 12076 |
| 32 | 2 | 67 | 65 | 11606 | 10990 |
| 32 | 4 | 35 | 33 | 10702 | 8974 |
| 32 | 8 | 34 | 32 | 10742 | 8908 |
| 32 | 16 | 34 | 32 | 10742 | 8908 |

Reuse 64:

| In Partition Factor | Out Partition Factor | Latency | II | HLS LUT | HLS FF |
|---|---|---|---|---|---|
| 1 | 1 | 515 | 516 | 7861 | 631 |
| 1 | 2 | 515 | 516 | 7850 | 631 |
| 1 | 4 | 515 | 516 | 7860 | 631 |
| 1 | 8 | 515 | 516 | 7880 | 631 |
| 1 | 16 | 515 | 516 | 8048 | 631 |
| 2 | 1 | 258 | 256 | 8650 | 904 |
| 2 | 2 | 258 | 256 | 8650 | 904 |
| 2 | 4 | 258 | 256 | 8690 | 904 |
| 2 | 8 | 258 | 256 | 8770 | 904 |
| 2 | 16 | 258 | 256 | 8818 | 904 |
| 4 | 1 | 131 | 128 | 7422 | 712 |
| 4 | 2 | 130 | 128 | 7402 | 712 |
| 4 | 4 | 130 | 128 | 7402 | 712 |
| 4 | 8 | 130 | 128 | 7422 | 712 |
| 4 | 16 | 130 | 128 | 7590 | 712 |
| 8 | 1 | 131 | 128 | 8482 | 3000 |
| 8 | 2 | 68 | 65 | 7174 | 874 |
| 8 | 4 | 67 | 64 | 7126 | 840 |
| 8 | 8 | 67 | 64 | 7126 | 840 |
| 8 | 16 | 67 | 64 | 7294 | 840 |
| 16 | 1 | 131 | 128 | 11002 | 11176 |
| 16 | 2 | 69 | 66 | 9966 | 9132 |
| 16 | 4 | 68 | 65 | 9990 | 9066 |
| 16 | 8 | 67 | 64 | 10030 | 9032 |
| 16 | 16 | 67 | 64 | 10030 | 9032 |
| 32 | 1 | 130 | 128 | 12250 | 14888 |
| 32 | 2 | 68 | 66 | 11214 | 12844 |
| 32 | 4 | 67 | 65 | 11238 | 12778 |
| 32 | 8 | 66 | 64 | 11278 | 12744 |
| 32 | 16 | 66 | 64 | 11278 | 12744 |

I have also checked, with reuse = 64, setting the block partition factor to 2 on dimension 1 of both the input and output image; the section of the table giving the best performance is below. Where factors 8, 8 on dimension 2 gave the best performance with factors 1, 1 on dimension 1, factors 4, 4 on dimension 2 are now best with factors 2, 2 on dimension 1: hinting that it is just as effective to slice the array into blocks that don't span the entire axis.

| In Partition Factor | Out Partition Factor | Latency | II | HLS LUT | HLS FF |
|---|---|---|---|---|---|
| 4 | 1 | 68 | 65 | 7174 | 874 |
| 4 | 2 | 67 | 64 | 7126 | 840 |
| 4 | 4 | 67 | 64 | 7126 | 840 |
| 4 | 8 | 67 | 64 | 7294 | 840 |

Reuse 128:

| In Partition Factor | Out Partition Factor | Latency | II | HLS LUT | HLS FF |
|---|---|---|---|---|---|
| 1 | 1 | 515 | 516 | 7861 | 631 |
| 1 | 2 | 515 | 516 | 7850 | 631 |
| 1 | 4 | 515 | 516 | 7860 | 631 |
| 1 | 8 | 515 | 516 | 7880 | 631 |
| 1 | 16 | 515 | 516 | 8048 | 631 |
| 2 | 1 | 258 | 256 | 8432 | 742 |
| 2 | 2 | 258 | 256 | 8432 | 742 |
| 2 | 4 | 258 | 256 | 8442 | 742 |
| 2 | 8 | 258 | 256 | 8462 | 742 |
| 2 | 16 | 258 | 256 | 8630 | 742 |
| 4 | 1 | 131 | 128 | 7324 | 614 |
| 4 | 2 | 131 | 128 | 7334 | 614 |
| 4 | 4 | 131 | 128 | 7334 | 614 |
| 4 | 8 | 131 | 128 | 7354 | 614 |
| 4 | 16 | 131 | 128 | 7522 | 614 |
| 8 | 1 | 131 | 128 | 10404 | 8934 |
| 8 | 2 | 131 | 128 | 10414 | 8934 |
| 8 | 4 | 131 | 128 | 10434 | 8934 |
| 8 | 8 | 131 | 128 | 10658 | 8934 |
| 8 | 16 | 131 | 128 | 10662 | 8934 |
| 16 | 1 | 131 | 128 | 11208 | 12902 |
| 16 | 2 | 131 | 128 | 11218 | 12902 |
| 16 | 4 | 131 | 128 | 11238 | 12902 |
| 16 | 8 | 131 | 128 | 11406 | 12902 |
| 16 | 16 | 131 | 128 | 11406 | 12902 |
| 32 | 1 | 130 | 128 | 12020 | 14758 |
| 32 | 2 | 130 | 128 | 12030 | 14758 |
| 32 | 4 | 130 | 128 | 12050 | 14758 |
| 32 | 8 | 130 | 128 | 12218 | 14758 |
| 32 | 16 | 130 | 128 | 12218 | 14758 |

Reuse 256:

| In Partition Factor | Out Partition Factor | Latency | II | HLS LUT | HLS FF |
|---|---|---|---|---|---|
| 1 | 1 | 515 | 516 | 7861 | 631 |
| 1 | 2 | 515 | 516 | 7850 | 631 |
| 1 | 4 | 515 | 516 | 7860 | 631 |
| 1 | 8 | 515 | 516 | 7880 | 631 |
| 1 | 16 | 515 | 516 | 8048 | 631 |
| 2 | 1 | 259 | 256 | 8409 | 693 |
| 2 | 2 | 259 | 256 | 8398 | 693 |
| 2 | 4 | 259 | 256 | 8408 | 693 |
| 2 | 8 | 259 | 256 | 8428 | 693 |
| 2 | 16 | 259 | 256 | 8596 | 693 |
| 4 | 1 | 259 | 256 | 11597 | 8981 |
| 4 | 2 | 259 | 256 | 11586 | 8981 |
| 4 | 4 | 259 | 256 | 11596 | 8981 |
| 4 | 8 | 259 | 256 | 11616 | 8981 |
| 4 | 16 | 259 | 256 | 11784 | 8981 |
| 8 | 1 | 259 | 256 | 12461 | 13013 |
| 8 | 2 | 259 | 256 | 12450 | 13013 |
| 8 | 4 | 259 | 256 | 12460 | 13013 |
| 8 | 8 | 259 | 256 | 12480 | 13013 |
| 8 | 16 | 259 | 256 | 12648 | 13013 |
| 16 | 1 | 259 | 256 | 12533 | 15029 |
| 16 | 2 | 259 | 256 | 12522 | 15029 |
| 16 | 4 | 259 | 256 | 12532 | 15029 |
| 16 | 8 | 259 | 256 | 12552 | 15029 |
| 16 | 16 | 259 | 256 | 12720 | 15029 |
| 32 | 1 | 258 | 256 | 13229 | 15957 |
| 32 | 2 | 258 | 256 | 13218 | 15957 |
| 32 | 4 | 258 | 256 | 13228 | 15957 |
| 32 | 8 | 258 | 256 | 13248 | 15957 |
| 32 | 16 | 258 | 256 | 13416 | 15957 |