Skip to content

Instantly share code, notes, and snippets.

@AndreVallestero
Last active October 27, 2022 20:40
Show Gist options
  • Save AndreVallestero/ccea42e83011e9cd45452107090e2867 to your computer and use it in GitHub Desktop.
Save AndreVallestero/ccea42e83011e9cd45452107090e2867 to your computer and use it in GitHub Desktop.
PROPELLER: Pr​ofile Guided ​Op​timizing​ L​arge Scale L​LVM-based ​R​elinker
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><style>body{margin-left:0;margin-right:0;margin-top:0}#bN015htcoyT__google-cache-hdr{background:#f8f9fa;font:13px arial,sans-serif;text-align:left;color:#202124;border:0;margin:0;border-bottom:1px solid #dadce0;line-height:16px;padding:16px 28px 24px 28px}#bN015htcoyT__google-cache-hdr *{display:inline;font:inherit;text-align:inherit;color:inherit;line-height:inherit;background:none;border:0;margin:0;padding:0;letter-spacing:0}#bN015htcoyT__google-cache-hdr a{text-decoration:none;color:#1a0dab}#bN015htcoyT__google-cache-hdr a:hover{text-decoration:underline}#bN015htcoyT__google-cache-hdr a:visited{color:#681da8}#bN015htcoyT__google-cache-hdr div{display:block;margin-top:4px}#bN015htcoyT__google-cache-hdr b{font-weight:bold;display:inline-block;direction:ltr}</style></head>
<body class="vsc-initialized" vlink="blue" link="blue" bgcolor="#ffffff"><div id="bN015htcoyT__google-cache-hdr">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="Producer" content="Skia/PDF m79">
<title>PROPELLER: Pr​ofile Guided ​Op​timizing​ L​arge Scale L​LVM-based ​R​elinker</title>
<table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="1"><b>Page 1</b></a></font></td></tr></tbody></table><font size="5" face="Times"><span style="font-size:36px;font-family:Times">
<div style="position:absolute;top:285;left:336"><nobr>PROPELLER:</nobr></div>
<div style="position:absolute;top:343;left:125"><nobr><b>Pr</b>​ofile Guided ​<b>Op</b>​timizing​ <b>L</b>​arge Scale</nobr></div>
<div style="position:absolute;top:397;left:270"><nobr><b>L</b>​LVM-based ​<b>R</b>​elinker</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:27px;font-family:Times">
<div style="position:absolute;top:508;left:108"><nobr>Background</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:580;left:108"><nobr>We recently evaluated Facebook’s BOLT, a Post Link Optimizer framework, on large google</nobr></div>
</span></font>
<font size="2" face="Times"><span style="font-size:7px;font-family:Times">
<div style="position:absolute;top:579;left:514"><nobr>1</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:602;left:108"><nobr>benchmarks and noticed that it improves key performance metrics of these benchmarks by 2%</nobr></div>
<div style="position:absolute;top:625;left:108"><nobr>to 6%, which is pretty impressive as this is over and above a baseline binary already heavily</nobr></div>
<div style="position:absolute;top:647;left:108"><nobr>optimized with ThinLTO + PGO. Furthermore, BOLT is also able to improve the performance of</nobr></div>
<div style="position:absolute;top:670;left:108"><nobr>binaries optimized via ​<font color="#1155cc"><a href="https://reviews.llvm.org/D54175">Context-Sensitive PGO​</a></font>.</nobr></div>
<div style="position:absolute;top:670;left:481"><nobr>While ThinLTO + PGO is also profile guided</nobr></div>
<div style="position:absolute;top:692;left:108"><nobr>and does very aggressive performance optimizations, there is more room for performance</nobr></div>
<div style="position:absolute;top:715;left:108"><nobr>improvements due to profile approximations while applying the transformations. BOLT uses</nobr></div>
<div style="position:absolute;top:737;left:108"><nobr>exact profiles from the final binary and is able to fill the gaps left by ThinLTO + PGO. The</nobr></div>
<div style="position:absolute;top:760;left:108"><nobr>performance improvements due to BOLT come from basic block layout, function reordering and</nobr></div>
<div style="position:absolute;top:782;left:108"><nobr>function splitting.</nobr></div>
<div style="position:absolute;top:827;left:108"><nobr>While BOLT does an excellent job of squeezing extra performance from highly optimized</nobr></div>
<div style="position:absolute;top:850;left:108"><nobr>binaries with optimizations such as code layout, it has these major issues:</nobr></div>
<div style="position:absolute;top:895;left:135"><nobr>1. It does not take advantage of distributed build systems.</nobr></div>
<div style="position:absolute;top:917;left:135"><nobr>2. It has scalability issues and to rewrite a binary with a ~300M text segment size:</nobr></div>
<div style="position:absolute;top:940;left:189"><nobr>a. Memory foot-print is 70G.</nobr></div>
<div style="position:absolute;top:962;left:189"><nobr>b. It takes more than 10 minutes to rewrite the binary.</nobr></div>
<div style="position:absolute;top:1007;left:108"><nobr>Similar to Full LTO, BOLT’s design is monolithic as it disassembles the original binary, optimizes</nobr></div>
<div style="position:absolute;top:1030;left:108"><nobr>and rewrites the final binary in one process. This limits the scalability of BOLT and the memory</nobr></div>
<div style="position:absolute;top:1052;left:108"><nobr>and time overhead shoots up quickly for large binaries.</nobr></div>
<div style="position:absolute;top:1097;left:108"><nobr>Inspired by the performance gains and to address the scalability issue of BOLT, we went about</nobr></div>
<div style="position:absolute;top:1120;left:108"><nobr>designing a scalable infrastructure that can perform BOLT-like post-link optimizations. In this</nobr></div>
<div style="position:absolute;top:1142;left:108"><nobr>RFC, we discuss our system, “Propeller”, which can perform profile guided link time binary</nobr></div>
</span></font>
<font size="2" face="Times"><span style="font-size:7px;font-family:Times">
<div style="position:absolute;top:1202;left:108"><nobr>1 <font style="font-size:12px">A Post Link optimizer optimizes a binary or a shared object directly based on its run-time profile without</font></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:12px;font-family:Times">
<div style="position:absolute;top:1220;left:108"><nobr>re-invoking the compiler. </nobr></div>
</span></font>
<div style="position:absolute;top:1363;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="2"><b>Page 2</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:1472;left:108"><nobr>optimizations in a scalable way and is friendly to distributed build systems. Our system</nobr></div>
<div style="position:absolute;top:1494;left:108"><nobr>leverages the existing capabilities of the compiler tool-chain and is not a stand alone tool. Like</nobr></div>
<div style="position:absolute;top:1517;left:108"><nobr>BOLT, our system boosts the performance of optimized binaries via link-time optimizations</nobr></div>
<div style="position:absolute;top:1539;left:108"><nobr>using accurate profiles of the binary. We discuss the Propeller system and show how to do the</nobr></div>
<div style="position:absolute;top:1562;left:108"><nobr>whole program basic block layout using Propeller.</nobr></div>
<div style="position:absolute;top:1607;left:108"><nobr>Propeller does whole program basic block layout at link time via basic block sections. We have</nobr></div>
<div style="position:absolute;top:1629;left:108"><nobr>added support for having each basic block in its own section which allows the linker to do</nobr></div>
<div style="position:absolute;top:1652;left:108"><nobr>arbitrary reorderings of basic blocks to achieve any desired fine-grain code layout which</nobr></div>
<div style="position:absolute;top:1674;left:108"><nobr>includes block layout, function splitting and function reordering.</nobr></div>
<div style="position:absolute;top:1719;left:108"><nobr>Our experiments on large real-world applications and SPEC with code layout show that</nobr></div>
<div style="position:absolute;top:1742;left:108"><nobr>Propeller can optimize as effectively as BOLT, with just 20% of its memory footprint and time</nobr></div>
<div style="position:absolute;top:1764;left:108"><nobr>overhead.</nobr></div>
<div style="position:absolute;top:1809;left:108"><nobr>An LLVM branch with propeller patches is available in the git repository here:</nobr></div>
</span></font>
<font size="3" face="Times" color="#1155cc"><span style="font-size:14px;font-family:Times;color:#1155cc">
<div style="position:absolute;top:1832;left:108"><nobr><a href="https://github.com/google/llvm-propeller/">https://github.com/google/llvm-propeller/ </a><font color="#000000">We will upload patches for review for the various</font></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:1854;left:108"><nobr>elements.</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:27px;font-family:Times">
<div style="position:absolute;top:1907;left:108"><nobr>Quick Start - How to optimize further with Propeller?</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:1978;left:108"><nobr><b># git clone and build repo&nbsp;</b></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:18px;font-family:Times">
<div style="position:absolute;top:2000;left:108"><nobr><b>$</b>​ <b><font style="font-size:14px">cd $LLVM_DIR &amp;&amp; </font></b><font style="font-size:14px">​</font><b><font style="font-size:12px">git clone </font></b><font style="font-size:12px">​</font><b><font style="font-size:12px"><a href="https://github.com/google/llvm-propeller.git">https://github.com/google/llvm-propeller.git</a></font><font style="font-size:14px"><a href="https://github.com/google/llvm-propeller.git"></a>&nbsp;</font></b></nobr></div>
<div style="position:absolute;top:2030;left:108"><nobr><b>$</b>​ <b><font style="font-size:14px">mkdir $BUILD_DIR &amp;&amp; cd $BUILD_DIR&nbsp;</font></b></nobr></div>
<div style="position:absolute;top:2059;left:108"><nobr><b>$</b>​ <b><font style="font-size:14px">cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \&nbsp;</font></b></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:2088;left:128"><nobr><b>$LLVM_DIR/llvm-propeller/llvm&nbsp;</b></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:18px;font-family:Times">
<div style="position:absolute;top:2111;left:108"><nobr><b>$</b>​ <b><font style="font-size:14px">ninja&nbsp;</font></b></nobr></div>
<div style="position:absolute;top:2140;left:108"><nobr><b>$</b>​ <b><font style="font-size:14px">export PATH=$BUILD_DIR/bin:$PATH&nbsp;</font></b></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:2169;left:108"><nobr><b>&nbsp;</b></nobr></div>
<div style="position:absolute;top:2192;left:108"><nobr><b># Let’s Propeller-optimize the following program:&nbsp;</b></nobr></div>
</span></font>
<div style="position:absolute;top:2551;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="3"><b>Page 3</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:2921;left:815"><nobr><b>&nbsp;</b></nobr></div>
<div style="position:absolute;top:2943;left:108"><nobr><b>&nbsp;</b></nobr></div>
<div style="position:absolute;top:2966;left:108"><nobr><b># Step 1: Build the peak optimized binary with an additional flag.&nbsp;</b></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:18px;font-family:Times">
<div style="position:absolute;top:2988;left:108"><nobr><b>$</b>​ <b><font style="font-size:14px">clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels \&nbsp;</font></b></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:3017;left:128"><nobr><b>-fuse-ld=lld&nbsp;</b></nobr></div>
<div style="position:absolute;top:3040;left:108"><nobr><b>&nbsp;</b></nobr></div>
<div style="position:absolute;top:3062;left:108"><nobr><b># Step 2: Profile the binary, only one side of the branch is executed.&nbsp;</b></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:18px;font-family:Times">
<div style="position:absolute;top:3085;left:108"><nobr><b>$</b>​ <b><font style="font-size:14px">perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 &gt;&amp; \&nbsp;</font></b></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:3114;left:118"><nobr><b>/dev/null&nbsp;</b></nobr></div>
<div style="position:absolute;top:3137;left:108"><nobr><b>&nbsp;</b></nobr></div>
<div style="position:absolute;top:3159;left:108"><nobr><b># Step 3: Convert the profiles using the tool provided&nbsp;</b></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:18px;font-family:Times">
<div style="position:absolute;top:3182;left:108"><nobr><b>$</b>​ <b><font style="font-size:14px">$LLVM_DIR/llvm-propeller/create_llvm_prof --format=propeller \&nbsp;</font></b></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:3211;left:128"><nobr><b>--binary=./a.out.labels --profile=perf.data --out=perf.propeller&nbsp;</b></nobr></div>
<div style="position:absolute;top:3233;left:108"><nobr><b>&nbsp;</b></nobr></div>
<div style="position:absolute;top:3256;left:108"><nobr><b># Step 4: Re-Optimize with Propeller, repeat Step 1 with propeller&nbsp; &nbsp;</b></nobr></div>
<div style="position:absolute;top:3278;left:108"><nobr><b># flag changed.&nbsp;</b></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:18px;font-family:Times">
<div style="position:absolute;top:3301;left:108"><nobr><b>$</b>​ <b><font style="font-size:14px">clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller \&nbsp;</font></b></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:3330;left:118"><nobr><b>-fuse-ld=lld&nbsp;</b></nobr></div>
<div style="position:absolute;top:3353;left:108"><nobr><b>&nbsp;</b></nobr></div>
<div style="position:absolute;top:3376;left:108"><nobr>In Step 4, the optimized bit code can be used if it is saved in Step1 as Propeller is active only</nobr></div>
<div style="position:absolute;top:3399;left:108"><nobr>during compile backend and link.</nobr></div>
<div style="position:absolute;top:3444;left:108"><nobr>The optimized binary has a different layout of the basic blocks in main to keep the executed</nobr></div>
<div style="position:absolute;top:3466;left:108"><nobr>blocks together and split the cold blocks. </nobr></div>
</span></font>
<div style="position:absolute;top:3739;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="4"><b>Page 4</b></a></font></td></tr></tbody></table></div><font size="4" face="Times"><span style="font-size:21px;font-family:Times">
<div style="position:absolute;top:3875;left:108"><nobr>Details of the various Steps to optimize with Propeller</nobr></div>
</span></font>
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343">
<div style="position:absolute;top:3931;left:108"><nobr>Step1 : Build the peak-optimized binary with an additional flag</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:18px;font-family:Times">
<div style="position:absolute;top:3987;left:108"><nobr><b>$</b>​ <b><font style="font-size:14px">clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels&nbsp;</font></b></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:4016;left:108"><nobr><b>&nbsp;</b></nobr></div>
<div style="position:absolute;top:4040;left:108"><nobr>This step builds an -O2 version, kept simple for illustration but could be using PGO and to</nobr></div>
<div style="position:absolute;top:4063;left:108"><nobr>obtain a highly optimized binary.</nobr></div>
<div style="position:absolute;top:4108;left:108"><nobr>The flag -fpropeller-label invokes the following flags:</nobr></div>
<div style="position:absolute;top:4151;left:135"><nobr><b>1. -fbasicblock-sections=labels&nbsp;</b></nobr></div>
<div style="position:absolute;top:4174;left:135"><nobr><b>2. -funique-internal-funcnames&nbsp;</b></nobr></div>
<div style="position:absolute;top:4196;left:135"><nobr><b>3. --Wl,--lto-basicblock-sections=labels (with LTO)&nbsp;</b></nobr></div>
<div style="position:absolute;top:4243;left:108"><nobr>These flags build the binary with labels for every basic block and unique names for all internal</nobr></div>
<div style="position:absolute;top:4265;left:108"><nobr>linkage functions.</nobr></div>
<div style="position:absolute;top:4333;left:108"><nobr>A quick inspection of the symbol table for the binary shows the following basic blocks sorted by</nobr></div>
<div style="position:absolute;top:4355;left:108"><nobr>virtual address for main and callee:</nobr></div>
<div style="position:absolute;top:4399;left:108"><nobr><b>00000000002010f0 T main&nbsp;</b></nobr></div>
<div style="position:absolute;top:4421;left:108"><nobr><b>0000000000201110 t a.BB.main&nbsp;</b></nobr></div>
<div style="position:absolute;top:4444;left:108"><nobr><b>0000000000201124 t aa.BB.main&nbsp;</b></nobr></div>
<div style="position:absolute;top:4466;left:108"><nobr><b>0000000000201150 T _Z6calleeb&nbsp;</b></nobr></div>
<div style="position:absolute;top:4489;left:108"><nobr><b>0000000000201156 t a.BB._Z6calleeb&nbsp;</b></nobr></div>
<div style="position:absolute;top:4511;left:108"><nobr><b>0000000000201167 t aa.BB._Z6calleeb&nbsp;</b></nobr></div>
<div style="position:absolute;top:4534;left:108"><nobr><b>0000000000201176 t aaa.BB._Z6calleeb&nbsp;</b></nobr></div>
<div style="position:absolute;top:4580;left:108"><nobr>The basic block names (unary) are chosen to allow tail merging of strings and keep strtab bloats</nobr></div>
<div style="position:absolute;top:4603;left:108"><nobr>minimal while retaining the name of the original function.</nobr></div>
<div style="position:absolute;top:4648;left:108"><nobr>Further, it is more efficient to split this step into two and save the optimized high level LLVM IR</nobr></div>
<div style="position:absolute;top:4670;left:108"><nobr>as it can be directly used in Step 4.</nobr></div>
</span></font>
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343">
<div style="position:absolute;top:4717;left:108"><nobr>Step 2: Obtain PMU profiles with LBR sampling</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:4751;left:108"><nobr>Notice that the binary is created to run the loop an arbitrary number of times. Sample the binary</nobr></div>
<div style="position:absolute;top:4774;left:108"><nobr>with a huge trip count to get a meaningful number of profiles, with the ​<font color="#1155cc"><a href="https://perf.wiki.kernel.org/index.php/Tutorial">perf</a></font></nobr></div>
<div style="position:absolute;top:4796;left:108"><nobr>(​<font color="#1155cc"><a href="https://perf.wiki.kernel.org/index.php/Tutorial">https://perf.wiki.kernel.org/index.php/Tutorial​</a></font>) utility: </nobr></div>
</span></font>
<div style="position:absolute;top:4927;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="5"><b>Page 5</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:18px;font-family:Times">
<div style="position:absolute;top:5057;left:108"><nobr><b>$</b>​ <b><font style="font-size:14px">perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 &gt;&amp; \&nbsp;</font></b></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:5086;left:118"><nobr><b>/dev/null&nbsp;</b></nobr></div>
<div style="position:absolute;top:5133;left:108"><nobr>The profiles only exercise one side of the branch in the loop, the true branch, with even number</nobr></div>
<div style="position:absolute;top:5155;left:108"><nobr>of arguments.</nobr></div>
</span></font>
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343">
<div style="position:absolute;top:5202;left:108"><nobr>Step 3: Convert the profiles with the tool provided</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:18px;font-family:Times">
<div style="position:absolute;top:5257;left:108"><nobr><b>$</b>​ <b><font style="font-size:14px">$LLVM_DIR/llvm-propeller/create_llvm_prof --format=propeller \&nbsp;</font></b></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:5287;left:128"><nobr><b>--binary=./a.out.labels --profile=perf.data --out=perf.propeller&nbsp;</b></nobr></div>
<div style="position:absolute;top:5333;left:108"><nobr>The create_llvm_prof tool is the same tool used by AutoFDO to convert profiles and we have</nobr></div>
<div style="position:absolute;top:5355;left:108"><nobr>extended it for Propeller. The patched tool is part of the git repo and we use it to convert the</nobr></div>
<div style="position:absolute;top:5378;left:108"><nobr>raw PMU profiles into a format used by Propeller.</nobr></div>
<div style="position:absolute;top:5423;left:108"><nobr>The conversion does the following. The symbol table from the optimized binary is used to</nobr></div>
<div style="position:absolute;top:5445;left:108"><nobr>associate profile samples (virtual addresses) to basic blocks. The textual representation shows</nobr></div>
<div style="position:absolute;top:5468;left:108"><nobr>the execution profiles for all the basic blocks using their respective labels which makes it easy</nobr></div>
<div style="position:absolute;top:5490;left:108"><nobr>for the propeller optimizations to annotate CFGs.</nobr></div>
</span></font>
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343">
<div style="position:absolute;top:5537;left:108"><nobr>Step 4 : Re-optimize with Propeller</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:18px;font-family:Times">
<div style="position:absolute;top:5593;left:108"><nobr><b>$</b>​ <b><font style="font-size:14px">clang++ -O2 sample.cc -fpropeller-optimize=perf.propeller \&nbsp;</font></b></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:5622;left:128"><nobr><b>-fuse-ld=lld&nbsp;</b></nobr></div>
<div style="position:absolute;top:5668;left:108"><nobr>Here the .cc file can be replaced with optimized .ll file from Step 1 if cached. We are working on</nobr></div>
<div style="position:absolute;top:5691;left:108"><nobr>a simple patch to easily bypass the LLVM optimization passes when the IR is detected to be</nobr></div>
<div style="position:absolute;top:5713;left:108"><nobr>optimized. The optimization options are exactly the same as in Step 1 except the Propeller</nobr></div>
<div style="position:absolute;top:5736;left:108"><nobr>flags.</nobr></div>
<div style="position:absolute;top:5781;left:108"><nobr>The flag -fpropeller-optimze=perf.propeller invokes the following flags:</nobr></div>
<div style="position:absolute;top:5824;left:135"><nobr><b>1. -fbasicblock-sections=perf.propeller&nbsp;&nbsp;</b></nobr></div>
<div style="position:absolute;top:5847;left:135"><nobr><b>2. -funique-internal-funcnames&nbsp;</b></nobr></div>
<div style="position:absolute;top:5869;left:135"><nobr><b>3. -Wl,--propeller=perf.propeller&nbsp;</b></nobr></div>
<div style="position:absolute;top:5892;left:135"><nobr><b>4. -Wl,--optimize-bb-jumps&nbsp;</b></nobr></div>
<div style="position:absolute;top:5914;left:135"><nobr><b>5. -Wl,--lto-basicblock-sections=perf.propeller (with LTO)&nbsp;</b></nobr></div>
<div style="position:absolute;top:5961;left:108"><nobr>The ​<i>“-fbasicblock-sections=” </i><font style="font-size:15px">​</font>flag generates basic block sections for functions which have</nobr></div>
<div style="position:absolute;top:5983;left:108"><nobr>samples as indicated in perf.propeller. Basic block sections are explained in detail later. The</nobr></div>
</span></font>
<div style="position:absolute;top:6115;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="6"><b>Page 6</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:6224;left:108"><nobr><i>“-funique-internal-funcnames” </i>flag generates unique names for functions with internal-linkage.</nobr></div>
<div style="position:absolute;top:6246;left:108"><nobr>The ​<i>“-Wl,--propeller=” </i>and ​<i>“-Wl,--optimize-bb-jumps” </i>flags trigger propeller optimizations and</nobr></div>
<div style="position:absolute;top:6269;left:108"><nobr>linker relaxation at link time.</nobr></div>
<div style="position:absolute;top:6314;left:108"><nobr>The propeller interface in the linker constructs the CFG and annotates it with profiles. The block</nobr></div>
<div style="position:absolute;top:6336;left:108"><nobr>reordering algorithm discovers the optimal layout of basic blocks and the linker reorders it to</nobr></div>
<div style="position:absolute;top:6359;left:108"><nobr>form the final binary. Debug Information and CFI for basic block sections are created ​with</nobr></div>
<div style="position:absolute;top:6381;left:108"><nobr>appropriate relocations so that the linker can create correct debuginfo and CFI for reordered</nobr></div>
<div style="position:absolute;top:6404;left:108"><nobr>functions.</nobr></div>
<div style="position:absolute;top:6426;left:108"><nobr>Add the option ​<i>“-Wl,--propeller-keep-named-symbols” </i>to the above build and notice how the</nobr></div>
<div style="position:absolute;top:6449;left:108"><nobr>final binary has a different ordering of basic blocks where the executed ones are kept close</nobr></div>
<div style="position:absolute;top:6471;left:108"><nobr>together:</nobr></div>
<div style="position:absolute;top:6538;left:108"><nobr><b>0000000000201000 T main&nbsp;</b></nobr></div>
<div style="position:absolute;top:6560;left:108"><nobr><b>0000000000201018 t a.BB.main&nbsp;</b></nobr></div>
<div style="position:absolute;top:6583;left:108"><nobr><b>0000000000201030 T _Z6calleeb&nbsp;</b></nobr></div>
<div style="position:absolute;top:6605;left:108"><nobr><b>0000000000201036 t a.BB._Z6calleeb&nbsp;</b></nobr></div>
<div style="position:absolute;top:6628;left:108"><nobr><b>0000000000201045 t aaa.BB._Z6calleeb&nbsp;</b></nobr></div>
<div style="position:absolute;top:6650;left:108"><nobr><b>000000000020104e t aa.BB.main&nbsp;</b></nobr></div>
<div style="position:absolute;top:6673;left:108"><nobr><b>0000000000201057 t aa.BB._Z6calleeb&nbsp;</b></nobr></div>
<div style="position:absolute;top:6695;left:108"><nobr><b>&nbsp;</b></nobr></div>
<div style="position:absolute;top:6719;left:108"><nobr>By default, without option ​<b>-Wl,--propeller-keep-named-symbols, </b>all hot basic block</nobr></div>
<div style="position:absolute;top:6741;left:108"><nobr>section symbols are deleted as they are part of the main function and the first basic block of</nobr></div>
<div style="position:absolute;top:6764;left:108"><nobr>every function partition (cold partition) is retained.</nobr></div>
<div style="position:absolute;top:6809;left:108"><nobr>Function ​<i>main </i>has been split and the basic blocks ​<i>aa.BB.main </i>and ​<i>aa.BB._Z6calleeb </i>which are</nobr></div>
<div style="position:absolute;top:6831;left:108"><nobr>cold have been moved out and the executed basic blocks are kept together.</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:27px;font-family:Times">
<div style="position:absolute;top:6884;left:108"><nobr>BOLT versus Propeller</nobr></div>
</span></font>
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343">
<div style="position:absolute;top:6949;left:108"><nobr>Design</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:7006;left:135"><nobr>1. The input to BOLT is an optimized binary and its hardware PMU profiles. BOLT</nobr></div>
<div style="position:absolute;top:7028;left:162"><nobr>disassembles the binary, optimizes, and writes out a new optimized binary. Propeller</nobr></div>
<div style="position:absolute;top:7051;left:162"><nobr>takes a different approach and re-links the binary from cached IR objects to achieve</nobr></div>
<div style="position:absolute;top:7073;left:162"><nobr>scalability. The input to Propeller is the optimized LLVM bit-code object files and the</nobr></div>
<div style="position:absolute;top:7096;left:162"><nobr>hardware profiles, and Propeller re-generates the native object files by invoking the</nobr></div>
<div style="position:absolute;top:7118;left:162"><nobr>compiler backends, in parallel or distributed, and re-links the object files to form the new</nobr></div>
<div style="position:absolute;top:7141;left:162"><nobr>optimized binary. </nobr></div>
</span></font>
<div style="position:absolute;top:7303;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="7"><b>Page 7</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:7412;left:135"><nobr>2. BOLT is a stand-alone LLVM based tool whereas Propeller is implemented as part of the</nobr></div>
<div style="position:absolute;top:7434;left:162"><nobr>existing LLVM tool-chain. Propeller leverages the tool chain’s existing capabilities for</nobr></div>
<div style="position:absolute;top:7457;left:162"><nobr>backend compilations and linking whereas BOLT uses LLVM APIs for disassembly,</nobr></div>
<div style="position:absolute;top:7479;left:162"><nobr>parsing debug information, and linking (e.g., JIT API to do the final link) to re-build the</nobr></div>
<div style="position:absolute;top:7502;left:162"><nobr>binary post optimizations. </nobr></div>
<div style="position:absolute;top:7524;left:135"><nobr>3. BOLT requires that all static relocations be retained in the final binary to aid</nobr></div>
<div style="position:absolute;top:7547;left:162"><nobr>dis-assembly whereas Propeller requires basic block labels to be emitted in the symbol</nobr></div>
<div style="position:absolute;top:7569;left:162"><nobr>table. Both static relocations and basic block labels increase the size of the final binary</nobr></div>
<div style="position:absolute;top:7592;left:162"><nobr>but do not affect performance as they are not loaded onto memory during execution.</nobr></div>
</span></font>
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343">
<div style="position:absolute;top:7661;left:108"><nobr>Scalability</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:7718;left:135"><nobr>1. BOLT does not work well with distributed systems as it directly works on the input binary</nobr></div>
<div style="position:absolute;top:7740;left:162"><nobr>in a single process. Propeller is designed to play well with distributed systems and</nobr></div>
<div style="position:absolute;top:7763;left:162"><nobr>achieves this by splitting the optimization process into two parts, a distributed part and a</nobr></div>
<div style="position:absolute;top:7785;left:162"><nobr>whole-program part. </nobr></div>
<div style="position:absolute;top:7808;left:135"><nobr>2. Propeller handles incremental builds much better than BOLT. Small source changes,</nobr></div>
<div style="position:absolute;top:7830;left:162"><nobr>which do not affect the profile information significantly, only require re-compiling the</nobr></div>
<div style="position:absolute;top:7853;left:162"><nobr>affected IR objects, and then re-linking with the existing profiles to form the Propeller</nobr></div>
<div style="position:absolute;top:7875;left:162"><nobr>optimized binary. With BOLT, even small source changes invalidate the profile</nobr></div>
<div style="position:absolute;top:7898;left:162"><nobr>information, an error if the binary’s build id does not match the profile. Hence, the binary</nobr></div>
<div style="position:absolute;top:7920;left:162"><nobr>corresponding to the source change must be re-built and re-profiled. More importantly,</nobr></div>
<div style="position:absolute;top:7943;left:162"><nobr>the heavyweight process of disassembly, analysis, and rewrite by BOLT must be</nobr></div>
<div style="position:absolute;top:7965;left:162"><nobr>repeated.</nobr></div>
</span></font>
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343">
<div style="position:absolute;top:8034;left:108"><nobr>Debug Information Handling</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:8091;left:108"><nobr>BOLT must parse DWARF debug information (DebugInfo) and Call Frame Information (CFI) and</nobr></div>
<div style="position:absolute;top:8114;left:108"><nobr>rewrite them according to the optimizations applied. Since BOLT parses CFI, it is able to rewrite</nobr></div>
<div style="position:absolute;top:8136;left:108"><nobr>the CFI descriptors compactly after the optimizations are applied. With Propeller, the changes</nobr></div>
<div style="position:absolute;top:8159;left:108"><nobr>needed for DebugInfo and CFI are made by the compiler with relocations created for the linker</nobr></div>
<div style="position:absolute;top:8181;left:108"><nobr>to patch. The linker does not have to parse DebugInfo and CFI and this is consistent with how</nobr></div>
<div style="position:absolute;top:8204;left:108"><nobr>the LLVM tool-chain fixes DebugInfo and CFI with function sections. The CFI information is</nobr></div>
<div style="position:absolute;top:8226;left:108"><nobr>created by the compiler with a very conservative assumption that no two basic blocks from the</nobr></div>
<div style="position:absolute;top:8249;left:108"><nobr>same function will be adjacent. This makes CFI unnecessarily bloated in the final binary. </nobr></div>
</span></font>
<div style="position:absolute;top:8491;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="8"><b>Page 8</b></a></font></td></tr></tbody></table></div><font size="4" face="Times"><span style="font-size:27px;font-family:Times">
<div style="position:absolute;top:8701;left:108"><nobr>Can this be done with PGO itself in the compiler?</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:8751;left:108"><nobr><i><b>TLDR: No, because of lack of precise context and path sensitive profiles.</b></i></nobr></div>
<div style="position:absolute;top:8796;left:108"><nobr>This question can be rephrased as how does BOLT/Propeller squeeze more performance from</nobr></div>
<div style="position:absolute;top:8818;left:108"><nobr>an already highly optimized PGO+LTO binary?</nobr></div>
<div style="position:absolute;top:8863;left:108"><nobr>Binaries built for peak performance use either instrumented PGO or AutoFDO, along with</nobr></div>
<div style="position:absolute;top:8886;left:108"><nobr>LTO/ThinLTO for cross-module optimizations. It has been noticed that there are lots of</nobr></div>
<div style="position:absolute;top:8908;left:108"><nobr>instances where profile information is either inaccurate or absent due to the various</nobr></div>
<div style="position:absolute;top:8931;left:108"><nobr>transformations applied. While the profile counts are precise at the moment they are read, the</nobr></div>
<div style="position:absolute;top:8953;left:108"><nobr>compiler cannot maintain the precision of the profiles after some transformations. The profiles</nobr></div>
<div style="position:absolute;top:8976;left:108"><nobr>are inaccurate due to lack of context sensitive and path sensitive information and also due to</nobr></div>
<div style="position:absolute;top:8998;left:108"><nobr>cross-module interactions. Examples of such profile inconsistencies include but not limited to</nobr></div>
<div style="position:absolute;top:9021;left:108"><nobr>the following:</nobr></div>
<div style="position:absolute;top:9066;left:135"><nobr>1. A hot function foo not inlined in the instrumented binary does not contain the context</nobr></div>
<div style="position:absolute;top:9088;left:162"><nobr>sensitive profile of each of its call sites and the basic-block ordering in every PGO inlined</nobr></div>
<div style="position:absolute;top:9111;left:162"><nobr>call-site will end up being similar.</nobr></div>
<div style="position:absolute;top:9133;left:135"><nobr>2. A function foo defined in module A will not have its profile updated and corrected when it</nobr></div>
<div style="position:absolute;top:9156;left:162"><nobr>is inlined in module B.</nobr></div>
<div style="position:absolute;top:9178;left:135"><nobr>3. With instrumentation based profiling, the instrumentation pass that introduces counters</nobr></div>
<div style="position:absolute;top:9201;left:162"><nobr>that accumulate edge profiles happens early and a later pass that introduces new</nobr></div>
<div style="position:absolute;top:9223;left:162"><nobr>branches or additional control-flow is also introducing potential loss of profile information</nobr></div>
<div style="position:absolute;top:9246;left:162"><nobr>due to path sensitivity. </nobr></div>
<div style="position:absolute;top:9268;left:135"><nobr>4. Also, with instrumentation based profiling, the instrumentation pass happens before the</nobr></div>
<div style="position:absolute;top:9291;left:162"><nobr>second inlining pass and any function inlined in the second pass will not have context</nobr></div>
<div style="position:absolute;top:9313;left:162"><nobr>sensitive profile information collected.</nobr></div>
<div style="position:absolute;top:9336;left:135"><nobr>5. Loop optimization passes like loop peeling, loop rotation and loop unrolling introduce</nobr></div>
<div style="position:absolute;top:9358;left:162"><nobr>new branches whose probabilities cannot be accurately determined and are guessed</nobr></div>
<div style="position:absolute;top:9381;left:162"><nobr>using aggressive heuristics.</nobr></div>
<div style="position:absolute;top:9403;left:135"><nobr>6. Profile updates after applying jump threading optimizations are made using heuristics.</nobr></div>
<div style="position:absolute;top:9426;left:135"><nobr>7. Profile updates after branch lowering of logical “OR (||)” and “AND (&amp;&amp;)” instructions are</nobr></div>
<div style="position:absolute;top:9448;left:162"><nobr>also made using heuristics.</nobr></div>
<div style="position:absolute;top:9493;left:108"><nobr>Recently, Context-Sensitive PGO (CSPGO) was invented to augment regular PGO with context</nobr></div>
<div style="position:absolute;top:9516;left:108"><nobr>sensitive profiles and study the effects on performance. CSPGO adds another round of profiling</nobr></div>
<div style="position:absolute;top:9538;left:108"><nobr>to PGO by instrumenting the optimized PGO binary built with the first set of profiles. Just like in</nobr></div>
</span></font>
<div style="position:absolute;top:9679;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="9"><b>Page 9</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:9788;left:108"><nobr>PGO, the first set of profiles still drives the inlining decisions which makes the second set of</nobr></div>
<div style="position:absolute;top:9810;left:108"><nobr>profiles much more precise with respect to context sensitive information. Experiments show that</nobr></div>
<div style="position:absolute;top:9833;left:108"><nobr>CSPGO can further improve the performance of PGO optimized binaries by a few percent which</nobr></div>
<div style="position:absolute;top:9855;left:108"><nobr>is strong evidence that PGO drops performance on the table due to the lack of precise profiles.</nobr></div>
<div style="position:absolute;top:9878;left:108"><nobr>Note that while CSPGO addresses the problem of lack of context sensitive profiles, it still is</nobr></div>
<div style="position:absolute;top:9900;left:108"><nobr>subjected to imprecise path sensitive profile information like PGO.</nobr></div>
<div style="position:absolute;top:9945;left:108"><nobr>Both BOLT and Propeller re-profile the peak optimized binary and use the profiles to directly fix</nobr></div>
<div style="position:absolute;top:9968;left:108"><nobr>the binary. Since the compiler is not invoked, BOLT/Propeller are always operating with</nobr></div>
<div style="position:absolute;top:9990;left:108"><nobr>accurate profiles and are able to fix some of the inaccuracies in some of the optimizations</nobr></div>
<div style="position:absolute;top:10013;left:108"><nobr>performed by PGO like basic block layout. Context-Sensitive PGO (CSPGO) aims to address</nobr></div>
<div style="position:absolute;top:10035;left:108"><nobr>the exact same problem and does well in improving the performance of PGO binaries further but</nobr></div>
<div style="position:absolute;top:10058;left:108"><nobr>our experiments show that BOLT and Propeller are even able to further optimize CSPGO</nobr></div>
<div style="position:absolute;top:10080;left:108"><nobr>binaries.</nobr></div>
<div style="position:absolute;top:10125;left:108"><nobr>Notice that Propeller does not need to re-invoke the high level IR optimizations as it tries to</nobr></div>
<div style="position:absolute;top:10148;left:108"><nobr>re-layout the final code. Propeller can also bypass MIR but complete support for serializing MIR</nobr></div>
<div style="position:absolute;top:10170;left:108"><nobr>is not available yet. We are looking at adding this support to improve the efficiency of Propeller.</nobr></div>
<div style="position:absolute;top:10193;left:108"><nobr>Propeller needs to re-run CFI instruction insertion and code generation in the backend as the</nobr></div>
<div style="position:absolute;top:10215;left:108"><nobr>optimization pass uses basic block sections which needs better CFI handling.</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:27px;font-family:Times">
<div style="position:absolute;top:10268;left:108"><nobr>Basic Block Sections</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:10340;left:108"><nobr>Modern compilers support compiling with function and data sections via options</nobr></div>
<div style="position:absolute;top:10363;left:108"><nobr>-ffunction-sections and -fdata-sections, respectively, which places each function and data item</nobr></div>
<div style="position:absolute;top:10385;left:108"><nobr>in a separate section in the native object file. This allows the linker to do many link time</nobr></div>
<div style="position:absolute;top:10408;left:108"><nobr>optimizations like dead data and code elimination, identical code folding, and function</nobr></div>
<div style="position:absolute;top:10430;left:108"><nobr>reordering. Without function sections, performing these optimizations at link time is significantly</nobr></div>
<div style="position:absolute;top:10453;left:108"><nobr>harder. Linkers operate at the granularity of sections, and this paper introduces basic block</nobr></div>
<div style="position:absolute;top:10475;left:108"><nobr>sections which compiles every basic block into its own section. With this support, arbitrary</nobr></div>
<div style="position:absolute;top:10498;left:108"><nobr>reordering of basic blocks at link time is feasible. We introduce a new compiler option,</nobr></div>
<div style="position:absolute;top:10520;left:108"><nobr>-fbasicblock-sections, which places every basic block in a unique ELF text section in the object</nobr></div>
<div style="position:absolute;top:10543;left:108"><nobr>file along with a symbol labelling the basic block. The linker can then order the basic block</nobr></div>
<div style="position:absolute;top:10565;left:108"><nobr>sections in any arbitrary sequence which when done correctly can encapsulate block layout,</nobr></div>
<div style="position:absolute;top:10588;left:108"><nobr>function layout and function splitting optimizations. However, there are a couple of challenges to</nobr></div>
<div style="position:absolute;top:10610;left:108"><nobr>be addressed for this to be feasible:</nobr></div>
<div style="position:absolute;top:10655;left:135"><nobr>1. The compiler must not allow any implicit fall-through between any two adjacent basic</nobr></div>
<div style="position:absolute;top:10678;left:162"><nobr>blocks as they could be reordered at link time to be non-adjacent. In other words, the</nobr></div>
<div style="position:absolute;top:10700;left:162"><nobr>compiler must make a fall-through between adjacent basic blocks explicit by retaining</nobr></div>
<div style="position:absolute;top:10723;left:162"><nobr>the direct jump instruction that jumps to the next basic block. These branches can only</nobr></div>
</span></font>
<div style="position:absolute;top:10867;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="10"><b>Page 10</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:10976;left:162"><nobr>be removed later in the linking phase after the final ordering is performed as determined</nobr></div>
<div style="position:absolute;top:10998;left:162"><nobr>by Propeller. </nobr></div>
<div style="position:absolute;top:11021;left:135"><nobr>2. Each additional section added to an object file bloats its size by tens of bytes. The</nobr></div>
<div style="position:absolute;top:11043;left:162"><nobr>number of basic blocks can be potentially very large compared to the size of functions</nobr></div>
<div style="position:absolute;top:11066;left:162"><nobr>and can bloat object sizes significantly. For instance, the clang binary contains 1.5M</nobr></div>
<div style="position:absolute;top:11088;left:162"><nobr>basic blocks from approximately 700K functions. </nobr></div>
<div style="position:absolute;top:11111;left:135"><nobr>3. All inter-basic block branch targets would now need to be resolved by the linker as they</nobr></div>
<div style="position:absolute;top:11133;left:162"><nobr>cannot be calculated during compile time. This is done using static relocations which</nobr></div>
<div style="position:absolute;top:11156;left:162"><nobr>bloats the size of the object files. Further, the compiler tries to use short branch</nobr></div>
<div style="position:absolute;top:11178;left:162"><nobr>instructions on some ISAs for branch offsets that can be accommodated in one byte.</nobr></div>
<div style="position:absolute;top:11201;left:162"><nobr>This is not possible with basic block sections as the offset is not determined at compile</nobr></div>
<div style="position:absolute;top:11223;left:162"><nobr>time, and long branch instructions have to be used everywhere. </nobr></div>
<div style="position:absolute;top:11246;left:135"><nobr>4. Debug Information (DebugInfo) and Call Frame Information (CFI) emission needs</nobr></div>
<div style="position:absolute;top:11268;left:162"><nobr>special handling with basic block sections. DebugInfo needs to be emitted with more</nobr></div>
<div style="position:absolute;top:11291;left:162"><nobr>relocations as basic block sections can break a function into potentially several disjoint</nobr></div>
<div style="position:absolute;top:11313;left:162"><nobr>pieces, and CFI needs to be emitted per basic block. This also bloats the object file and</nobr></div>
<div style="position:absolute;top:11336;left:162"><nobr>binary sizes significantly.</nobr></div>
<div style="position:absolute;top:11358;left:135"><nobr>5. The linker needs to perform a relaxation pass on all the branch instructions after laying</nobr></div>
<div style="position:absolute;top:11381;left:162"><nobr>out basic blocks. This relaxation removes explicit fall-through branches between</nobr></div>
<div style="position:absolute;top:11403;left:162"><nobr>adjacent basic blocks and shrinks jump instructions whose offsets can be</nobr></div>
<div style="position:absolute;top:11426;left:162"><nobr>accommodated in smaller equivalent instructions.</nobr></div>
<div style="position:absolute;top:11426;left:570"><nobr>Side note: RISC <font style="font-size:13px">ISAs generally</font></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:13px;font-family:Times">
<div style="position:absolute;top:11426;left:697"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11426;left:737"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11426;left:809"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11448;left:162"><nobr>support only small offsets, and in order to jump to a distant place, you have to construct an</nobr></div>
<div style="position:absolute;top:11448;left:216"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11448;left:250"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11448;left:293"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11448;left:351"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11448;left:382"><nobr>&nbsp; &nbsp;</nobr></div>
<div style="position:absolute;top:11448;left:441"><nobr>&nbsp; &nbsp;</nobr></div>
<div style="position:absolute;top:11448;left:501"><nobr>&nbsp; &nbsp; &nbsp;</nobr></div>
<div style="position:absolute;top:11448;left:587"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11448;left:633"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11448;left:663"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11448;left:700"><nobr>&nbsp; &nbsp;</nobr></div>
<div style="position:absolute;top:11448;left:788"><nobr>&nbsp; &nbsp;</nobr></div>
<div style="position:absolute;top:11469;left:162"><nobr>address in a register (using a few instructions) and then jump to that location. The proposed</nobr></div>
<div style="position:absolute;top:11469;left:218"><nobr>&nbsp; &nbsp; &nbsp;</nobr></div>
<div style="position:absolute;top:11469;left:305"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11469;left:353"><nobr>&nbsp; &nbsp;</nobr></div>
<div style="position:absolute;top:11469;left:395"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11469;left:487"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11469;left:517"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11469;left:552"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11469;left:591"><nobr>&nbsp; &nbsp;</nobr></div>
<div style="position:absolute;top:11469;left:640"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11469;left:704"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11469;left:739"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11469;left:809"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11490;left:162"><nobr>scheme (emitting most generic instruction sequence for a branch and let the linker to relax)</nobr></div>
<div style="position:absolute;top:11490;left:218"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11490;left:286"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11490;left:327"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11490;left:383"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11490;left:463"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11490;left:536"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11490;left:560"><nobr>&nbsp; &nbsp;</nobr></div>
<div style="position:absolute;top:11490;left:627"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11490;left:658"><nobr>&nbsp; &nbsp;</nobr></div>
<div style="position:absolute;top:11490;left:706"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11490;left:748"><nobr>&nbsp; &nbsp;</nobr></div>
<div style="position:absolute;top:11490;left:809"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:11512;left:162"><nobr>would work for RISC without having to create thunks.</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:11556;left:108"><nobr>Propeller overcomes the size bloats incurred by creating basic block sections on demand.</nobr></div>
<div style="position:absolute;top:11579;left:108"><nobr>Propeller does not create basic block sections for the set of functions that did not correspond to</nobr></div>
<div style="position:absolute;top:11601;left:108"><nobr>any sample in the profile. This helps in greatly limiting the number of basic blocks for which</nobr></div>
<div style="position:absolute;top:11624;left:108"><nobr>sections are created, and our experiments show that for our large benchmarks more than 90 %</nobr></div>
<div style="position:absolute;top:11646;left:108"><nobr>of basic blocks were excluded. Call Frame Information (CFI), which is very useful for stack</nobr></div>
<div style="position:absolute;top:11669;left:108"><nobr>unwinding, poses a significant challenge with basic block sections. Unlike debug information,</nobr></div>
<div style="position:absolute;top:11691;left:108"><nobr>CFI does not support non-contiguous ranges and a lot of the information needs to be duplicated,</nobr></div>
<div style="position:absolute;top:11714;left:108"><nobr>leading to size bloats in object files and the final binary. Our experiments show that CFI is</nobr></div>
<div style="position:absolute;top:11736;left:108"><nobr>responsible for the largest overall size bloats (10% on average). We later describe some</nobr></div>
<div style="position:absolute;top:11759;left:108"><nobr>techniques used to keep this minimal and propose changes to the DWARF format to reduce the</nobr></div>
<div style="position:absolute;top:11781;left:108"><nobr>CFI related size bloat even further. In spite of the challenges above, basic block sections allow</nobr></div>
<div style="position:absolute;top:11804;left:108"><nobr>Propeller to do code layout optimizations in a scalable manner with much less memory and time</nobr></div>
<div style="position:absolute;top:11826;left:108"><nobr>overheads when compared to BOLT. The cost of creating basic block sections can be</nobr></div>
<div style="position:absolute;top:11849;left:108"><nobr>distributed/parallelized as this happens in the backends when each IR object file is assembled</nobr></div>
<div style="position:absolute;top:11871;left:108"><nobr>into the native object file. With this feature, the analysis for determining the right basic block</nobr></div>
<div style="position:absolute;top:11894;left:108"><nobr>sequence and reordering the blocks can be done more efficiently by the linker. </nobr></div>
</span></font>
<div style="position:absolute;top:12055;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="11"><b>Page 11</b></a></font></td></tr></tbody></table></div><font size="4" face="Times"><span style="font-size:21px;font-family:Times">
<div style="position:absolute;top:12191;left:108"><nobr>Labeling Basic Blocks</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:12255;left:108"><nobr>Propeller labels every basic block with a unique symbol as this allows easy mapping of virtual</nobr></div>
<div style="position:absolute;top:12277;left:108"><nobr>addresses from PMU profiles back to the corresponding basic blocks. Since the number of basic</nobr></div>
<div style="position:absolute;top:12300;left:108"><nobr>blocks is large, the labeling bloats the symbol table sizes and the string table sizes significantly.</nobr></div>
<div style="position:absolute;top:12322;left:108"><nobr>While the binary size does increase this does not affect performance as the symbol table is not</nobr></div>
<div style="position:absolute;top:12345;left:108"><nobr>loaded in memory during run-time. The string table size bloat is however kept very minimal</nobr></div>
<div style="position:absolute;top:12367;left:108"><nobr>using a unary naming scheme that uses string suffix compression. The basic blocks for function</nobr></div>
<div style="position:absolute;top:12390;left:108"><nobr>foo are named "a.bb.foo", "aa.bb.foo", . . . This turns out to be very good for string table sizes</nobr></div>
<div style="position:absolute;top:12412;left:108"><nobr>and the bloat in the string table size for a very large binary is only 8 %.</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:21px;font-family:Times">
<div style="position:absolute;top:12484;left:108"><nobr>Linker Relaxation after reordering basic blocks</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:12526;left:108"><nobr>After the linker has reordered the basic block sections according to the desired sequence, it</nobr></div>
<div style="position:absolute;top:12549;left:108"><nobr>runs a relaxation pass to optimize jump instructions. Currently, the compiler emits the long form</nobr></div>
<div style="position:absolute;top:12571;left:108"><nobr>of all jump instructions . AMD64 ISA supports variants of jump instructions with one byte offset</nobr></div>
<div style="position:absolute;top:12594;left:108"><nobr>or a four byte offset. The compiler generates jump instructions with R_X86_64 32-bit PC</nobr></div>
<div style="position:absolute;top:12616;left:108"><nobr>relative relocations. We would like to use a new relocation type for these jump instructions as</nobr></div>
<div style="position:absolute;top:12639;left:108"><nobr>it makes it easy and accurate while relaxing these instructions.</nobr></div>
<div style="position:absolute;top:12684;left:108"><nobr>The relaxation pass does three things: </nobr></div>
<div style="position:absolute;top:12729;left:108"><nobr>First, it deletes all explicit fall-through direct jump instructions between adjacent basic blocks.</nobr></div>
<div style="position:absolute;top:12751;left:108"><nobr>This is done by discarding the tail of the basic block section. </nobr></div>
<div style="position:absolute;top:12796;left:108"><nobr>Second, it shrinks jump instructions whose offsets can fit in a smaller equivalent jump</nobr></div>
<div style="position:absolute;top:12819;left:108"><nobr>instruction. The AMD64 ISA supports variants of jump instructions with either a one byte offset</nobr></div>
<div style="position:absolute;top:12841;left:108"><nobr>or a four byte offset. Shorter jump instructions are preferred where possible to reduce code</nobr></div>
<div style="position:absolute;top:12864;left:108"><nobr>size. The jump instructions are shrunk by using jump relocations. Jump relocations are used to</nobr></div>
<div style="position:absolute;top:12886;left:108"><nobr>modify the opcode of the jump instruction. Jump relocations contain three values, instruction</nobr></div>
<div style="position:absolute;top:12909;left:108"><nobr>offset, jump type and size. There is a one to one correspondence between jump relocations and</nobr></div>
<div style="position:absolute;top:12931;left:108"><nobr>jump instructions. While writing this jump instruction out to the final binary, the linker uses the</nobr></div>
<div style="position:absolute;top:12954;left:108"><nobr>jump relocation to determine the opcode and the size of the modified jump instruction. These</nobr></div>
<div style="position:absolute;top:12976;left:108"><nobr>new relocations are required because the input object files are memory-mapped without write</nobr></div>
<div style="position:absolute;top:12999;left:108"><nobr>permissions and directly modifying the object files requires copying these sections. Copying a</nobr></div>
<div style="position:absolute;top:13021;left:108"><nobr>large number of basic block sections significantly bloats memory and we have invented jump</nobr></div>
<div style="position:absolute;top:13044;left:108"><nobr>relocations as a simple solution to avoid this bloat. </nobr></div>
<div style="position:absolute;top:13089;left:108"><nobr>Third, the relaxation pass replaces a two jump instruction sequence, JX and JY , with just one</nobr></div>
<div style="position:absolute;top:13111;left:108"><nobr>jump instruction if JX is a conditional branch whose target is the adjacent basic block and JY is</nobr></div>
</span></font>
<div style="position:absolute;top:13243;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="12"><b>Page 12</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:13352;left:108"><nobr>a direct branch. This can be done by replacing the two jump instructions with a single jump</nobr></div>
<div style="position:absolute;top:13374;left:108"><nobr>instruction which has as the inverse op-code of JX as its op-code, and the target of JY as its</nobr></div>
<div style="position:absolute;top:13397;left:108"><nobr>branch target. This change is also made using a jump relocation at the appropriate code offset</nobr></div>
<div style="position:absolute;top:13419;left:108"><nobr>in that section</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:21px;font-family:Times">
<div style="position:absolute;top:13469;left:108"><nobr>Handling Exceptions</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:13533;left:108"><nobr>Basic blocks that are involved in exception handling, that is, they either throw or catch an</nobr></div>
<div style="position:absolute;top:13555;left:108"><nobr>exception, are grouped together and placed in a single section. Unique sections are not created</nobr></div>
<div style="position:absolute;top:13578;left:108"><nobr>for such basic blocks. This is because the exception table is constructed using basic block</nobr></div>
<div style="position:absolute;top:13600;left:108"><nobr>offsets and creating unique sections for such basic blocks would require more static relocations.</nobr></div>
<div style="position:absolute;top:13623;left:108"><nobr>Creating sections for exception handling basic blocks will be part of future work. Benchmarks</nobr></div>
<div style="position:absolute;top:13645;left:108"><nobr>which spend a significant amount of time handling exceptions might not get the optimization</nobr></div>
<div style="position:absolute;top:13668;left:108"><nobr>benefits of Propeller.</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:21px;font-family:Times">
<div style="position:absolute;top:13717;left:108"><nobr>Updating DebugInfo and CFI</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:13782;left:108"><nobr>Generating correct debug information (DebugInfo) and Call Frame Information (CFI) with basic</nobr></div>
<div style="position:absolute;top:13804;left:108"><nobr>block sections is challenging. Since basic blocks coming from different functions can be</nobr></div>
<div style="position:absolute;top:13827;left:108"><nobr>arbitrarily reordered and mixed together, we must appropriately update the DebugInfo and CFI.</nobr></div>
<div style="position:absolute;top:13872;left:108"><nobr>DebugInfo is easier compared to CFI as we can leverage the DW_AT_ranges tag which allows</nobr></div>
<div style="position:absolute;top:13894;left:108"><nobr>description of a possibly non-contiguous range of addresses occupied by an entity. Thus, every</nobr></div>
<div style="position:absolute;top:13917;left:108"><nobr>basic block section forces a separate entry in DW_AT_ranges, plus two relocations pointing to</nobr></div>
<div style="position:absolute;top:13939;left:108"><nobr>symbols at the start and end of the basic block, respectively. DebugInfo will bloat object file</nobr></div>
<div style="position:absolute;top:13962;left:108"><nobr>sizes further with basic block sections. </nobr></div>
<div style="position:absolute;top:14007;left:108"><nobr>On the other hand, CFI doesn’t provide any easy way to specify non-contiguous range of</nobr></div>
<div style="position:absolute;top:14029;left:108"><nobr>addresses occupied by a function – the DWARF standard explicitly requires emitting separate</nobr></div>
<div style="position:absolute;top:14052;left:108"><nobr>CFI Frame Descriptor Entries for each contiguous fragment of a function. Thus, the CFI</nobr></div>
<div style="position:absolute;top:14074;left:108"><nobr>information for all callee-saved registers (possibly including the frame pointer, if necessary)</nobr></div>
<div style="position:absolute;top:14097;left:108"><nobr>have to be emitted along with redefining the Call Frame Address (CFA), viz. where the current</nobr></div>
<div style="position:absolute;top:14119;left:108"><nobr>frame starts. </nobr></div>
<div style="position:absolute;top:14164;left:108"><nobr>This causes a significant bloat of the .eh_frame sections, which is partially mitigated by</nobr></div>
<div style="position:absolute;top:14187;left:108"><nobr>de-duplicating common CFI instructions to the CFI Common Information Entry. We only</nobr></div>
<div style="position:absolute;top:14209;left:108"><nobr>de-duplicate CFI instructions with offset 0 from the beginning of the CFI frame, i.e. those that</nobr></div>
<div style="position:absolute;top:14232;left:108"><nobr>describe the CFI state before entering the frame. </nobr></div>
<div style="position:absolute;top:14277;left:108"><nobr>Having support for non-contiguous ranges in CFI would significantly minimize the size</nobr></div>
<div style="position:absolute;top:14299;left:108"><nobr>overheads and complexity of supporting basic block sections. </nobr></div>
</span></font>
<div style="position:absolute;top:14431;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="13"><b>Page 13</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:14562;left:108"><nobr>To allow easy basic block rewriting in the linker (e.g. removing unnecessary fall-through jumps),</nobr></div>
<div style="position:absolute;top:14585;left:108"><nobr>we force relocations against symbols and not sections. Moreover, in cases where the range is</nobr></div>
<div style="position:absolute;top:14607;left:108"><nobr>represented in DWARF as start and length, we defer the length calculation to the link stage, by</nobr></div>
<div style="position:absolute;top:14630;left:108"><nobr>emitting an appropriate SIZE relocation instead of hardcoding the length directly in the object file</nobr></div>
<div style="position:absolute;top:14652;left:108"><nobr>by the compiler.</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:21px;font-family:Times">
<div style="position:absolute;top:14724;left:108"><nobr>Example</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:14796;left:116"><nobr>bbsection_example.cc&nbsp;</nobr></div>
<div style="position:absolute;top:14819;left:116"><nobr>--------------------&nbsp;</nobr></div>
<div style="position:absolute;top:14841;left:116"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:14864;left:116"><nobr>__attribute__((noinline))&nbsp;</nobr></div>
<div style="position:absolute;top:14886;left:116"><nobr>int baz(int count) {&nbsp;</nobr></div>
<div style="position:absolute;top:14909;left:136"><nobr>if (count % 2 == 0)&nbsp;</nobr></div>
<div style="position:absolute;top:14931;left:155"><nobr>return bar();&nbsp;</nobr></div>
<div style="position:absolute;top:14954;left:136"><nobr>else&nbsp;</nobr></div>
<div style="position:absolute;top:14976;left:155"><nobr>return foo();&nbsp;</nobr></div>
<div style="position:absolute;top:14999;left:116"><nobr>}&nbsp;</nobr></div>
<div style="position:absolute;top:15021;left:116"><nobr>&nbsp;</nobr></div>
<div style="position:absolute;top:15044;left:116"><nobr>int main(int argc, char *argv[]) {&nbsp;</nobr></div>
<div style="position:absolute;top:15066;left:136"><nobr>return baz(argc);&nbsp;</nobr></div>
<div style="position:absolute;top:15089;left:116"><nobr>}</nobr></div>
<div style="position:absolute;top:15166;left:108"><nobr>In the above example, function baz will have a couple of basic blocks. Generate basic block</nobr></div>
<div style="position:absolute;top:15189;left:108"><nobr>sections using:</nobr></div>
<div style="position:absolute;top:15264;left:116"><nobr>$ clang -fbasicblock-sections=all bbsection_example.cc -S</nobr></div>
<div style="position:absolute;top:15342;left:108"><nobr>and inspecting the assembly file:</nobr></div>
</span></font>
<font size="3" face="Times" color="#ff0000"><span style="font-size:14px;font-family:Times;color:#ff0000">
<div style="position:absolute;top:15418;left:170"><nobr>.section</nobr></div>
<div style="position:absolute;top:15418;left:278"><nobr>.text._Z3bazi,"ax",@progbits</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:15441;left:170"><nobr><b>.globl _Z3bazi </b></nobr></div>
<div style="position:absolute;top:15441;left:362"><nobr><b># -- Begin function _Z3bazi</b></nobr></div>
<div style="position:absolute;top:15463;left:170"><nobr><b>.p2align</b></nobr></div>
<div style="position:absolute;top:15463;left:278"><nobr><b>4, 0x90 </b></nobr></div>
</span></font>
<div style="position:absolute;top:15619;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="14"><b>Page 14</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:15737;left:170"><nobr><b>.type _Z3bazi,@function</b></nobr></div>
<div style="position:absolute;top:15759;left:116"><nobr><b>_Z3bazi: </b></nobr></div>
<div style="position:absolute;top:15759;left:328"><nobr><b># @_Z3bazi</b></nobr></div>
<div style="position:absolute;top:15782;left:170"><nobr>.cfi_startproc</nobr></div>
<div style="position:absolute;top:15804;left:170"><nobr>pushq %rbp</nobr></div>
<div style="position:absolute;top:15827;left:170"><nobr>.cfi_def_cfa_offset 16</nobr></div>
<div style="position:absolute;top:15849;left:170"><nobr>.cfi_offset %rbp, -16</nobr></div>
<div style="position:absolute;top:15872;left:170"><nobr>movq %rsp, %rbp</nobr></div>
<div style="position:absolute;top:15894;left:170"><nobr>.cfi_def_cfa_register %rbp</nobr></div>
<div style="position:absolute;top:15917;left:170"><nobr>subq $16, %rsp</nobr></div>
<div style="position:absolute;top:15939;left:171"><nobr>....</nobr></div>
</span></font>
<font size="3" face="Times" color="#0000ff"><span style="font-size:14px;font-family:Times;color:#0000ff">
<div style="position:absolute;top:15962;left:170"><nobr>jne</nobr></div>
<div style="position:absolute;top:15962;left:224"><nobr>aa.BB._Z3bazi</nobr></div>
<div style="position:absolute;top:15984;left:170"><nobr>jmp</nobr></div>
<div style="position:absolute;top:15984;left:224"><nobr>a.BB._Z3bazi</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:16007;left:170"><nobr>.cfi_endproc</nobr></div>
</span></font>
<font size="3" face="Times" color="#ff0000"><span style="font-size:14px;font-family:Times;color:#ff0000">
<div style="position:absolute;top:16029;left:170"><nobr>.section</nobr></div>
<div style="position:absolute;top:16029;left:278"><nobr>.text,"ax",@progbits,unique,1</nobr></div>
<div style="position:absolute;top:16052;left:116"><nobr>a.BB._Z3bazi: </nobr></div>
<div style="position:absolute;top:16052;left:343"><nobr># %if.then</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:16074;left:170"><nobr>.cfi_startproc</nobr></div>
<div style="position:absolute;top:16097;left:170"><nobr>.cfi_def_cfa %rbp, 16</nobr></div>
<div style="position:absolute;top:16119;left:170"><nobr>.cfi_offset %rbp, -16</nobr></div>
<div style="position:absolute;top:16142;left:171"><nobr>...</nobr></div>
<div style="position:absolute;top:16164;left:170"><nobr>jmp</nobr></div>
<div style="position:absolute;top:16164;left:224"><nobr>aaa.BB._Z3bazi</nobr></div>
<div style="position:absolute;top:16187;left:170"><nobr>.size a.BB._Z3bazi, .Ltmp0-a.BB._Z3bazi</nobr></div>
<div style="position:absolute;top:16209;left:170"><nobr>.cfi_endproc</nobr></div>
</span></font>
<font size="3" face="Times" color="#ff0000"><span style="font-size:14px;font-family:Times;color:#ff0000">
<div style="position:absolute;top:16232;left:170"><nobr>.section</nobr></div>
<div style="position:absolute;top:16232;left:278"><nobr>.text,"ax",@progbits,unique,2</nobr></div>
<div style="position:absolute;top:16254;left:116"><nobr>aa.BB._Z3bazi: </nobr></div>
<div style="position:absolute;top:16254;left:352"><nobr># %if.else</nobr></div>
<div style="position:absolute;top:16277;left:157"><nobr>​ <font color="#000000">.cfi_startproc</font></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:16299;left:166"><nobr>...</nobr></div>
</span></font>
<font size="3" face="Times" color="#0000ff"><span style="font-size:14px;font-family:Times;color:#0000ff">
<div style="position:absolute;top:16322;left:170"><nobr>jmp</nobr></div>
<div style="position:absolute;top:16322;left:224"><nobr>aaa.BB._Z3bazi</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:16344;left:166"><nobr>.cfi_endproc</nobr></div>
</span></font>
<font size="3" face="Times" color="#ff0000"><span style="font-size:14px;font-family:Times;color:#ff0000">
<div style="position:absolute;top:16367;left:170"><nobr>.section</nobr></div>
<div style="position:absolute;top:16367;left:278"><nobr>.text,"ax",@progbits,unique,3</nobr></div>
<div style="position:absolute;top:16389;left:116"><nobr>aaa.BB._Z3bazi: </nobr></div>
<div style="position:absolute;top:16389;left:361"><nobr># %return</nobr></div>
<div style="position:absolute;top:16412;left:166"><nobr>​<font color="#000000">.cfi_startproc</font></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:16434;left:171"><nobr>…</nobr></div>
<div style="position:absolute;top:16457;left:171"><nobr>ret</nobr></div>
<div style="position:absolute;top:16479;left:171"><nobr>.cfi_endproc</nobr></div>
<div style="position:absolute;top:16502;left:170"><nobr>.size _Z3bazi, .Lfunc_end2-_Z3bazi</nobr></div>
<div style="position:absolute;top:16578;left:135"><nobr>1. Four basic blocks are present for function baz (_Z3bazi) and four section directives</nobr></div>
<div style="position:absolute;top:16601;left:162"><nobr>separate them into different “.text.” sections.</nobr></div>
<div style="position:absolute;top:16623;left:135"><nobr>2. The entry basic block is used to represent the function and hence is part of the function</nobr></div>
<div style="position:absolute;top:16646;left:162"><nobr>section “.text._Z3bazi”. </nobr></div>
</span></font>
<div style="position:absolute;top:16807;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="15"><b>Page 15</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:16916;left:135"><nobr>3. The other three basic blocks “a.BB._Z3bazi”, “aa.BB._Z3bazi” and “aaa.BB._Z3bazi” are</nobr></div>
<div style="position:absolute;top:16938;left:162"><nobr>in unnamed ELF sections. The sections are unnamed to reduce the object size bloat</nobr></div>
<div style="position:absolute;top:16961;left:162"><nobr>from naming these sections. A separate option called -funique-bb-section-names</nobr></div>
<div style="position:absolute;top:16983;left:162"><nobr>generates the long names for the sections.</nobr></div>
<div style="position:absolute;top:17006;left:135"><nobr>4. The sections however would be prefixed as “.text”, “.text.hot” or “.text.unlikely” if profile</nobr></div>
<div style="position:absolute;top:17028;left:162"><nobr>information for the basic blocks is available. This is to allow grouping of hot basic blocks</nobr></div>
<div style="position:absolute;top:17051;left:162"><nobr>by the linker if needed.</nobr></div>
<div style="position:absolute;top:17073;left:135"><nobr>5. A fall-through branch between basic blocks “_Z3bazi” and “a.BB._Z3bazi” is converted</nobr></div>
<div style="position:absolute;top:17096;left:162"><nobr>into an explicit unconditional direct jump instruction, marked in blue. This is needed to</nobr></div>
<div style="position:absolute;top:17118;left:162"><nobr>allow these two basic blocks to be floated independently in the address space.</nobr></div>
<div style="position:absolute;top:17141;left:135"><nobr>6. All the basic blocks with sections created are labelled as “[a]+.BB._Z3bazi”. This does</nobr></div>
<div style="position:absolute;top:17163;left:162"><nobr>bloat the symbol table but keeps the strtab bloat very small due to suffix compression.</nobr></div>
<div style="position:absolute;top:17186;left:135"><nobr>7. Each basic block section is placed in its own unique CFI FDE. We repeat all CFI</nobr></div>
<div style="position:absolute;top:17208;left:162"><nobr>instructions needed to restore states of registers and frame address and have a</nobr></div>
<div style="position:absolute;top:17231;left:162"><nobr>de-duplication pass to reduce the overhead of CFI information. Since CFI does not</nobr></div>
<div style="position:absolute;top:17253;left:162"><nobr>support address ranges like DWARF debug info does, we have to pay a significant</nobr></div>
<div style="position:absolute;top:17276;left:162"><nobr>penalty for size bloat.</nobr></div>
<div style="position:absolute;top:17321;left:108"><nobr>Now, let us look the object file for this function:</nobr></div>
<div style="position:absolute;top:17396;left:116"><nobr>$ clang -fbasicblock-sections=all bbsection_example.cc -S</nobr></div>
<div style="position:absolute;top:17419;left:116"><nobr>$ objdump -dr bbsection_example.o&nbsp;</nobr></div>
<div style="position:absolute;top:17474;left:108"><nobr>Inspecting the object file contents for function baz (_Z3bazi) :</nobr></div>
<div style="position:absolute;top:17527;left:116"><nobr><b>Disassembly of section .text._Z3bazi:&nbsp;</b></nobr></div>
<div style="position:absolute;top:17549;left:116"><nobr><b>&nbsp;</b></nobr></div>
<div style="position:absolute;top:17572;left:116"><nobr><b>0000000000000000 &lt;_Z3bazi&gt;:&nbsp;</b></nobr></div>
<div style="position:absolute;top:17594;left:146"><nobr>0: 55 </nobr></div>
<div style="position:absolute;top:17594;left:432"><nobr>push %rbp&nbsp;</nobr></div>
<div style="position:absolute;top:17617;left:146"><nobr>1: 48 89 e5 </nobr></div>
<div style="position:absolute;top:17617;left:432"><nobr>mov %rsp,%rbp&nbsp;</nobr></div>
<div style="position:absolute;top:17639;left:146"><nobr>4: 48 83 ec 10 </nobr></div>
<div style="position:absolute;top:17639;left:432"><nobr>sub $0x10,%rsp&nbsp;</nobr></div>
<div style="position:absolute;top:17662;left:146"><nobr>8: 89 7d f8 </nobr></div>
<div style="position:absolute;top:17662;left:432"><nobr>mov %edi,-0x8(%rbp)&nbsp;</nobr></div>
<div style="position:absolute;top:17684;left:146"><nobr>b: 8b 45 f8 </nobr></div>
<div style="position:absolute;top:17684;left:432"><nobr>mov -0x8(%rbp),%eax&nbsp;</nobr></div>
<div style="position:absolute;top:17707;left:146"><nobr>e: 99 </nobr></div>
<div style="position:absolute;top:17707;left:432"><nobr>cltd&nbsp; &nbsp;</nobr></div>
<div style="position:absolute;top:17729;left:146"><nobr>f: b9 02 00 00 00 </nobr></div>
<div style="position:absolute;top:17729;left:432"><nobr>mov $0x2,%ecx&nbsp;</nobr></div>
<div style="position:absolute;top:17752;left:136"><nobr>14: f7 f9 </nobr></div>
<div style="position:absolute;top:17752;left:432"><nobr>idiv %ecx&nbsp;</nobr></div>
<div style="position:absolute;top:17774;left:136"><nobr>16: 83 fa 00 </nobr></div>
<div style="position:absolute;top:17774;left:432"><nobr>cmp $0x0,%edx&nbsp;</nobr></div>
</span></font>
<font size="3" face="Times" color="#0000ff"><span style="font-size:14px;font-family:Times;color:#0000ff">
<div style="position:absolute;top:17797;left:136"><nobr>19: 0f 85 00 00 00 00 jne 1f &lt;_Z3bazi+0x1f&gt;&nbsp;</nobr></div>
<div style="position:absolute;top:17819;left:353"><nobr>1b: R_X86_64_PC32 aa.BB._Z3bazi-0x4&nbsp;</nobr></div>
<div style="position:absolute;top:17842;left:136"><nobr>1f: e9 00 00 00 00 </nobr></div>
<div style="position:absolute;top:17842;left:432"><nobr>jmpq 24 &lt;_Z3bazi+0x24&gt;&nbsp;</nobr></div>
</span></font>
<div style="position:absolute;top:17995;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="16"><b>Page 16</b></a></font></td></tr></tbody></table></div><font size="3" face="Times" color="#0000ff"><span style="font-size:14px;font-family:Times;color:#0000ff">
<div style="position:absolute;top:18112;left:353"><nobr>20: R_X86_64_PLT32 a.BB._Z3bazi-0x4&nbsp;</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:18189;left:135"><nobr>● At offset 19, a relocation has now been created for the conditional jump instruction and</nobr></div>
<div style="position:absolute;top:18212;left:162"><nobr>is a 6 byte instruction. These additional relocations bloat up object file sizes.</nobr></div>
<div style="position:absolute;top:18234;left:135"><nobr>● The linker will relax and shrink branch instructions if the jump is closer.</nobr></div>
<div style="position:absolute;top:18279;left:108"><nobr>In the symbol table below, note that the basic block symbols are all local.</nobr></div>
<div style="position:absolute;top:18333;left:116"><nobr>Symbol table '.symtab' contains 12 entries:</nobr></div>
<div style="position:absolute;top:18356;left:130"><nobr>Num: Value </nobr></div>
<div style="position:absolute;top:18356;left:275"><nobr>Size Type Bind Vis Ndx Name</nobr></div>
<div style="position:absolute;top:18378;left:139"><nobr>0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND </nobr></div>
<div style="position:absolute;top:18401;left:139"><nobr>1: 0000000000000000 0 FILE LOCAL DEFAULT ABS section_example.cc</nobr></div>
<div style="position:absolute;top:18423;left:139"><nobr>7: 0000000000000000 13 NOTYPE LOCAL DEFAULT 5 a.BB._Z3bazi</nobr></div>
<div style="position:absolute;top:18446;left:139"><nobr>8: 0000000000000000 13 NOTYPE LOCAL DEFAULT 7 aa.BB._Z3bazi</nobr></div>
<div style="position:absolute;top:18468;left:139"><nobr>9: 0000000000000000 9 NOTYPE LOCAL DEFAULT 9 aaa.BB._Z3bazi</nobr></div>
<div style="position:absolute;top:18491;left:134"><nobr>10: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND _Z3barv</nobr></div>
<div style="position:absolute;top:18513;left:134"><nobr>11: 0000000000000000 36 FUNC GLOBAL DEFAULT 3 _Z3bazi</nobr></div>
<div style="position:absolute;top:18536;left:134"><nobr>12: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND _Z3foov</nobr></div>
<div style="position:absolute;top:18558;left:134"><nobr>13: 0000000000000000 36 FUNC GLOBAL DEFAULT 10 main</nobr></div>
<div style="position:absolute;top:18612;left:108"><nobr>Finally, let us look at the binary where we use lld and a symbol ordering file to place each basic</nobr></div>
<div style="position:absolute;top:18635;left:108"><nobr>block of baz at non-contiguous locations.</nobr></div>
<div style="position:absolute;top:18710;left:117"><nobr>$ cat &gt; ./symbol_file.txt&nbsp;</nobr></div>
<div style="position:absolute;top:18733;left:117"><nobr>aa.BB._Z3bazi&nbsp;</nobr></div>
<div style="position:absolute;top:18755;left:117"><nobr>main&nbsp;</nobr></div>
<div style="position:absolute;top:18778;left:117"><nobr>_Z3bazi&nbsp;</nobr></div>
<div style="position:absolute;top:18800;left:117"><nobr>_Z3barv&nbsp;</nobr></div>
<div style="position:absolute;top:18823;left:117"><nobr>a.BB._Z3bazi&nbsp;</nobr></div>
<div style="position:absolute;top:18845;left:117"><nobr>_Z3foov&nbsp;</nobr></div>
<div style="position:absolute;top:18868;left:117"><nobr>aaa.BB._Z3bazi&nbsp;</nobr></div>
<div style="position:absolute;top:18890;left:117"><nobr>$ clang -fbasicblock-sections=all bbsection_example.cc \&nbsp;</nobr></div>
<div style="position:absolute;top:18913;left:147"><nobr>-Wl,--symbol-ordering-file,./symbol_file.txt</nobr></div>
<div style="position:absolute;top:18935;left:117"><nobr>$ nm -n a.out&nbsp;</nobr></div>
<div style="position:absolute;top:18990;left:108"><nobr>The symbols sorted by address are as follows:</nobr></div>
<div style="position:absolute;top:19044;left:116"><nobr>0000000000201000 t aa.BB._Z3bazi </nobr></div>
</span></font>
<div style="position:absolute;top:19183;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="17"><b>Page 17</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:19301;left:116"><nobr>0000000000201010 T main</nobr></div>
<div style="position:absolute;top:19323;left:116"><nobr>0000000000201040 T _Z3bazi</nobr></div>
<div style="position:absolute;top:19346;left:116"><nobr>0000000000201060 T _Z3barv</nobr></div>
<div style="position:absolute;top:19368;left:116"><nobr>0000000000201068 t a.BB._Z3bazi</nobr></div>
<div style="position:absolute;top:19391;left:116"><nobr>0000000000201080 T _Z3foov</nobr></div>
<div style="position:absolute;top:19413;left:116"><nobr>0000000000201088 t aaa.BB._Z3bazi</nobr></div>
<div style="position:absolute;top:19490;left:108"><nobr>Notice how each basic block is now located as specified in the symbol ordering file. The basic</nobr></div>
<div style="position:absolute;top:19512;left:108"><nobr>block section symbols are useful while debugging if the basic block is not contiguous with the</nobr></div>
<div style="position:absolute;top:19535;left:108"><nobr>original section. The linker will delete a basic block label if the previous basic block for that</nobr></div>
<div style="position:absolute;top:19557;left:108"><nobr>label also corresponds to the same function. It is expected that a function would be partitioned</nobr></div>
<div style="position:absolute;top:19580;left:108"><nobr>in 2 ways at most and hence the remaining labels will be deleted by the linker in the interests of</nobr></div>
<div style="position:absolute;top:19602;left:108"><nobr>binary size.</nobr></div>
<div style="position:absolute;top:19647;left:108"><nobr>Also, another problem exists with static functions which have local linkage. Static function name</nobr></div>
<div style="position:absolute;top:19670;left:108"><nobr>symbols could be duplicated in the final binary when definitions from more than one module are</nobr></div>
<div style="position:absolute;top:19692;left:108"><nobr>linked. This causes problems with disambiguating the symbol name to the instance. We solved</nobr></div>
<div style="position:absolute;top:19715;left:108"><nobr>this problem by renaming static functions to include the module names too so that collisions</nobr></div>
<div style="position:absolute;top:19737;left:108"><nobr>among these instances are rare.</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:27px;font-family:Times">
<div style="position:absolute;top:19813;left:108"><nobr>The Layout Algorithm</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:19862;left:108"><nobr>Propeller performs intra-function basic block reordering based on the extended TSP model</nobr></div>
<div style="position:absolute;top:19885;left:108"><nobr>(​<font color="#1155cc"><a href="https://arxiv.org/abs/1809.04676">https://arxiv.org/abs/1809.04676​</a></font>)</nobr></div>
<div style="position:absolute;top:19885;left:385"><nobr>and</nobr></div>
<div style="position:absolute;top:19885;left:443"><nobr>function</nobr></div>
<div style="position:absolute;top:19885;left:531"><nobr>reordering</nobr></div>
<div style="position:absolute;top:19885;left:636"><nobr>based</nobr></div>
<div style="position:absolute;top:19885;left:710"><nobr>on</nobr></div>
<div style="position:absolute;top:19885;left:758"><nobr>HFSort</nobr></div>
<div style="position:absolute;top:19907;left:108"><nobr>(​<font color="#1155cc"><a href="https://dl.acm.org/citation.cfm?id=3049858">https://dl.acm.org/citation.cfm?id=3049858​</a></font><a href="https://dl.acm.org/citation.cfm?id=3049858"></a>). Propeller is also capable of splitting functions.</nobr></div>
<div style="position:absolute;top:19952;left:108"><nobr>The high-level description of these algorithms are as follows.</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:21px;font-family:Times">
<div style="position:absolute;top:20002;left:108"><nobr>ExtTSP Basic Block Reordering Algorithm</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:20043;left:108"><nobr>The ExtTSP (extended TSP) metric provides a score for every ordering of basic blocks in a</nobr></div>
<div style="position:absolute;top:20066;left:108"><nobr>function, by combining the gains from fall-throughs and short jumps.</nobr></div>
<div style="position:absolute;top:20111;left:108"><nobr>Given an ordering of the basic blocks, for a function f, the ExtTSP score is computed as follows. </nobr></div>
</span></font>
<div style="position:absolute;top:20371;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="18"><b>Page 18</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:20644;left:108"><nobr>In short, it computes a weighted sum of frequencies of all edges in the control flow graph. Each</nobr></div>
<div style="position:absolute;top:20667;left:108"><nobr>edge gets its weight depending on whether the given ordering makes the edge a fallthrough, a</nobr></div>
<div style="position:absolute;top:20689;left:108"><nobr>short forward jump, or a short backward jump.</nobr></div>
<div style="position:absolute;top:20734;left:108"><nobr>Although this problem is NP-hard like the regular TSP, an iterative greedy basic-block-chaining</nobr></div>
<div style="position:absolute;top:20757;left:108"><nobr>algorithm is used to find a close to optimal solution. This algorithm is described as follows.</nobr></div>
<div style="position:absolute;top:20802;left:108"><nobr>Starting with one basic block sequence (BB chain) for every basic block, the algorithm iteratively</nobr></div>
<div style="position:absolute;top:20824;left:108"><nobr>joins BB chains together in order to maximize the extended TSP score of all the chains.</nobr></div>
<div style="position:absolute;top:20847;left:108"><nobr>Initially, it finds all ​<i>mutually-forced </i>edges in the profiled cfg. These are all the edges which are --</nobr></div>
<div style="position:absolute;top:20869;left:108"><nobr>based on the profile -- the only (executed) outgoing edge from their source node and the only</nobr></div>
<div style="position:absolute;top:20892;left:108"><nobr>(executed) incoming edges to their sink nodes. Next, the source and sink of all mutually-forced</nobr></div>
<div style="position:absolute;top:20914;left:108"><nobr>edges are attached together as fallthrough edges.</nobr></div>
<div style="position:absolute;top:20959;left:108"><nobr>Then, at every iteration, the algorithm tries to merge a pair of BB chains which leads to the</nobr></div>
<div style="position:absolute;top:20982;left:108"><nobr>highest gain in the ExtTSP score. The algorithm extends the search space by considering</nobr></div>
<div style="position:absolute;top:21004;left:108"><nobr>splitting short (less than 128 bytes in binary size) BB chains into two chains and then merging</nobr></div>
<div style="position:absolute;top:21027;left:108"><nobr>these two chains with the other chain in four (additional) ways. After every merge, the new</nobr></div>
<div style="position:absolute;top:21049;left:108"><nobr>merge gains are updated. The algorithm repeats joining BB chains until no additional can be</nobr></div>
<div style="position:absolute;top:21072;left:108"><nobr>achieved. At this step, it sorts all the existing chains in decreasing order of their execution</nobr></div>
<div style="position:absolute;top:21094;left:108"><nobr>density, i.e., the total profiled frequency of the chain divided by its binary size.</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:21px;font-family:Times">
<div style="position:absolute;top:21166;left:108"><nobr>HFSort Function Reordering Algorithm</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:21208;left:108"><nobr>The HFSort function reordering algorithm computes an optimal ordering. It goes through all</nobr></div>
<div style="position:absolute;top:21230;left:108"><nobr>functions in decreasing order of their execution density and for each one, finds its ​<i>most likely</i></nobr></div>
<div style="position:absolute;top:21253;left:108"><nobr><i>caller </i><font style="font-size:15px">​</font>(the function which calls it the most) and places the caller’s cluster right before the callee’s</nobr></div>
<div style="position:absolute;top:21275;left:108"><nobr>cluster. After all functions are considered, the HFSort algorithm orders all function clusters in</nobr></div>
<div style="position:absolute;top:21298;left:108"><nobr>decreasing order of their total execution density.</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:21px;font-family:Times">
<div style="position:absolute;top:21347;left:108"><nobr>Function Splitting</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:21389;left:108"><nobr>After BB reordering and function reordering, for every function, Propeller goes through the BB</nobr></div>
<div style="position:absolute;top:21411;left:108"><nobr>chains in the computed BB order for that function, and for each chain, places the basic blocks of</nobr></div>
</span></font>
<div style="position:absolute;top:21559;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="19"><b>Page 19</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:21668;left:108"><nobr>that chain at the hot or cold part of the layout depending on whether the chain’s total frequency</nobr></div>
<div style="position:absolute;top:21690;left:108"><nobr>is zero or not.</nobr></div>
<div style="position:absolute;top:21735;left:108"><nobr>The ExtTSP BB reordering algorithm orders the BB chains decreasing order of their execution</nobr></div>
<div style="position:absolute;top:21758;left:108"><nobr>density. As a result, cold basic blocks are placed at the end of the computed BB layout of every</nobr></div>
<div style="position:absolute;top:21780;left:108"><nobr>function.</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:21px;font-family:Times">
<div style="position:absolute;top:21852;left:108"><nobr>Ordering overhead</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:21894;left:108"><nobr>The total time complexity of BB reordering is O(V​<font style="font-size:7px">3​</font>E), where V is the number of (hot) basic</nobr></div>
<div style="position:absolute;top:21916;left:108"><nobr>blocks in a function and E is the total number of (profiled) edges in the function. This time</nobr></div>
<div style="position:absolute;top:21939;left:108"><nobr>complexity is derived as follows:</nobr></div>
<div style="position:absolute;top:21984;left:108"><nobr>Merging of BB chains can happen at most V times. Every time two BB chains are merged</nobr></div>
<div style="position:absolute;top:22006;left:108"><nobr>together, we must recompute the gains from merging the newly created chain with every other</nobr></div>
<div style="position:absolute;top:22029;left:108"><nobr>existing BB chain. A crude analysis shows that the time complexity of the gain recomputation</nobr></div>
<div style="position:absolute;top:22051;left:108"><nobr>phase is at most O(V​<font style="font-size:7px">2​</font>E). Multiplying this by V gives the claimed running time.</nobr></div>
<div style="position:absolute;top:22096;left:108"><nobr>The time complexity of the HFSort function reordering stage is O(N lg N + C), where N is the</nobr></div>
<div style="position:absolute;top:22119;left:108"><nobr>total number of profiled functions and C is the number of profiled call edges between functions.</nobr></div>
<div style="position:absolute;top:22164;left:108"><nobr>In practice the total ordering overhead is about 6 extra seconds of linker time for building clang</nobr></div>
<div style="position:absolute;top:22186;left:108"><nobr>and about 2 seconds for Search1.</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:27px;font-family:Times">
<div style="position:absolute;top:22239;left:108"><nobr>Experiments</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:22311;left:108"><nobr>The baseline benchmark is built with peak optimizations, PGO and ThinLTO, enabled. Also,</nobr></div>
<div style="position:absolute;top:22334;left:108"><nobr>Propeller has been used in two configurations. The first configuration, called "List", only creates</nobr></div>
<div style="position:absolute;top:22356;left:108"><nobr>basic block sections for functions with samples, this is the default configuration used with</nobr></div>
<div style="position:absolute;top:22379;left:108"><nobr>-fpropeller-optimize flag. The second configuration, called "all", creates basic blocks for all</nobr></div>
<div style="position:absolute;top:22401;left:108"><nobr>functions, enabled with -fbasicblock-sections=all. All the benchmarks were built for the X86_64</nobr></div>
<div style="position:absolute;top:22424;left:108"><nobr>platform on a 18 core Intel machine with 192 G of RAM. The SPEC benchmarks were also run</nobr></div>
<div style="position:absolute;top:22446;left:108"><nobr>on this machine and the large benchmarks have their own harnesses for performance</nobr></div>
<div style="position:absolute;top:22469;left:108"><nobr>measurement.</nobr></div>
<div style="position:absolute;top:22514;left:108"><nobr>BOLT built from upstream sources failed to optimize our large benchmarks mainly because it</nobr></div>
<div style="position:absolute;top:22536;left:108"><nobr>could not disassemble binaries when read-only data was mixed with code, or when multiple text</nobr></div>
<div style="position:absolute;top:22559;left:108"><nobr>sections were used. We have implemented a patch with BOLT to exclude optimizing functions it</nobr></div>
<div style="position:absolute;top:22581;left:108"><nobr>could not understand, this is done by using trampolines to jump between optimized and original</nobr></div>
</span></font>
<div style="position:absolute;top:22747;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="20"><b>Page 20</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:22856;left:108"><nobr>code. These problematic functions were not hot and did not directly affect performance. With</nobr></div>
<div style="position:absolute;top:22878;left:108"><nobr>this fix, BOLT is able to optimize all of our benchmarks.</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:21px;font-family:Times">
<div style="position:absolute;top:22928;left:108"><nobr>Memory Overhead</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:22969;left:108"><nobr>The memory overhead for the monolithic (non-parallel) action is compared here. BOLT does the</nobr></div>
<div style="position:absolute;top:22992;left:108"><nobr>disassembly, analysis and rewrite in one process and it is this overhead which is reported here.</nobr></div>
<div style="position:absolute;top:23014;left:108"><nobr>For Propeller and baseline, this overhead corresponds to the final link step. For both Propeller</nobr></div>
<div style="position:absolute;top:23037;left:108"><nobr>and BOLT, the memory needed to do profile conversion is not shown as this depends on</nobr></div>
<div style="position:absolute;top:23059;left:108"><nobr>several factors like the duration of profiling and the size of the profile. BOLT and Propeller’s</nobr></div>
<div style="position:absolute;top:23082;left:108"><nobr>profile conversion process is independent and has not been considered for comparison</nobr></div>
<div style="position:absolute;top:23136;left:132"><nobr>Benchmarks</nobr></div>
<div style="position:absolute;top:23136;left:285"><nobr>Baseline</nobr></div>
<div style="position:absolute;top:23136;left:443"><nobr>List</nobr></div>
<div style="position:absolute;top:23136;left:586"><nobr>All</nobr></div>
<div style="position:absolute;top:23136;left:713"><nobr>BOLT</nobr></div>
<div style="position:absolute;top:23173;left:146"><nobr>Search1</nobr></div>
<div style="position:absolute;top:23173;left:297"><nobr>6.8 G</nobr></div>
<div style="position:absolute;top:23173;left:432"><nobr>14.6 G</nobr></div>
<div style="position:absolute;top:23173;left:572"><nobr>34.9 G</nobr></div>
<div style="position:absolute;top:23173;left:718"><nobr>70 G</nobr></div>
<div style="position:absolute;top:23210;left:146"><nobr>Search2</nobr></div>
<div style="position:absolute;top:23210;left:297"><nobr>6.7 G</nobr></div>
<div style="position:absolute;top:23210;left:432"><nobr>11.7 G</nobr></div>
<div style="position:absolute;top:23210;left:572"><nobr>31.9 G</nobr></div>
<div style="position:absolute;top:23210;left:711"><nobr>64.7 G</nobr></div>
<div style="position:absolute;top:23247;left:149"><nobr>Storage</nobr></div>
<div style="position:absolute;top:23247;left:297"><nobr>4.6 G</nobr></div>
<div style="position:absolute;top:23247;left:437"><nobr>8.6 G</nobr></div>
<div style="position:absolute;top:23247;left:572"><nobr>17.8 G</nobr></div>
<div style="position:absolute;top:23247;left:711"><nobr>26.4 G</nobr></div>
<div style="position:absolute;top:23284;left:155"><nobr>Clang</nobr></div>
<div style="position:absolute;top:23284;left:297"><nobr>0.4 G</nobr></div>
<div style="position:absolute;top:23284;left:437"><nobr>2.9 G</nobr></div>
<div style="position:absolute;top:23284;left:576"><nobr>6.4 G</nobr></div>
<div style="position:absolute;top:23284;left:711"><nobr>33.9 G</nobr></div>
<div style="position:absolute;top:23322;left:135"><nobr>SPEC [min]</nobr></div>
<div style="position:absolute;top:23342;left:143"><nobr>(505.mcf)</nobr></div>
<div style="position:absolute;top:23322;left:298"><nobr>32 M</nobr></div>
<div style="position:absolute;top:23322;left:438"><nobr>36 M</nobr></div>
<div style="position:absolute;top:23322;left:577"><nobr>37 M</nobr></div>
<div style="position:absolute;top:23322;left:711"><nobr>30 MB</nobr></div>
<div style="position:absolute;top:23379;left:133"><nobr>SPEC [max]</nobr></div>
<div style="position:absolute;top:23399;left:116"><nobr>(523.xalancbmk)</nobr></div>
<div style="position:absolute;top:23379;left:294"><nobr>111 M</nobr></div>
<div style="position:absolute;top:23379;left:433"><nobr>161 M</nobr></div>
<div style="position:absolute;top:23379;left:573"><nobr>260 M</nobr></div>
<div style="position:absolute;top:23379;left:712"><nobr>822 M</nobr></div>
<div style="position:absolute;top:23451;left:108"><nobr>With Propeller, it is clear that the "List" configuration which creates basic block sections on</nobr></div>
<div style="position:absolute;top:23473;left:108"><nobr>demand does significantly better than "All". "List" is lower by more than 50 % compared to "All".</nobr></div>
<div style="position:absolute;top:23496;left:108"><nobr>This is primarily because a large percentage of the code is cold for large benchmarks (more</nobr></div>
<div style="position:absolute;top:23518;left:108"><nobr>than 80 %) and the "List" configuration eliminates the overhead of creating basic block sections</nobr></div>
<div style="position:absolute;top:23541;left:108"><nobr>for the cold parts. The code layout optimization ignores functions without samples and hence,</nobr></div>
<div style="position:absolute;top:23563;left:108"><nobr>this does not affect performance.</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:21px;font-family:Times">
<div style="position:absolute;top:23635;left:108"><nobr>Time Overhead</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:23677;left:108"><nobr>The time overheads correspond to the same actions for which memory overheads are</nobr></div>
<div style="position:absolute;top:23700;left:108"><nobr>presented.</nobr></div>
<div style="position:absolute;top:23745;left:108"><nobr>The data shows that Propeller, "list" configuration in particular, is much faster than BOLT in</nobr></div>
<div style="position:absolute;top:23767;left:108"><nobr>optimizing binaries but is slower than baseline. The additional time needed by Propeller is due</nobr></div>
<div style="position:absolute;top:23790;left:108"><nobr>to two reasons. First, the link action constructs the Inter-procedural control-flow graph (iCFG)</nobr></div>
</span></font>
<div style="position:absolute;top:23935;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="21"><b>Page 21</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:24044;left:108"><nobr>and runs the layout algorithm to discover an optimized ordering of basic blocks. Second, the</nobr></div>
<div style="position:absolute;top:24066;left:108"><nobr>input object file sizes are larger than the baseline due to basic block sections which impacts</nobr></div>
<div style="position:absolute;top:24089;left:108"><nobr>memory overhead and link time. Reducing CFI information bloat as discussed would reduce the</nobr></div>
<div style="position:absolute;top:24111;left:108"><nobr>link time considerably, by 10 % when building with "All" configuration</nobr></div>
<div style="position:absolute;top:24188;left:132"><nobr>Benchmarks</nobr></div>
<div style="position:absolute;top:24188;left:285"><nobr>Baseline</nobr></div>
<div style="position:absolute;top:24188;left:443"><nobr>List</nobr></div>
<div style="position:absolute;top:24188;left:586"><nobr>All</nobr></div>
<div style="position:absolute;top:24188;left:713"><nobr>BOLT</nobr></div>
<div style="position:absolute;top:24225;left:146"><nobr>Search1</nobr></div>
<div style="position:absolute;top:24225;left:302"><nobr>28 s</nobr></div>
<div style="position:absolute;top:24225;left:441"><nobr>87 s</nobr></div>
<div style="position:absolute;top:24225;left:576"><nobr>365 s</nobr></div>
<div style="position:absolute;top:24225;left:716"><nobr>675 s</nobr></div>
<div style="position:absolute;top:24262;left:146"><nobr>Search2</nobr></div>
<div style="position:absolute;top:24262;left:306"><nobr>7 s</nobr></div>
<div style="position:absolute;top:24262;left:437"><nobr>204 s</nobr></div>
<div style="position:absolute;top:24262;left:576"><nobr>565 s</nobr></div>
<div style="position:absolute;top:24262;left:716"><nobr>504 s</nobr></div>
<div style="position:absolute;top:24299;left:149"><nobr>Storage</nobr></div>
<div style="position:absolute;top:24299;left:302"><nobr>12 s</nobr></div>
<div style="position:absolute;top:24299;left:441"><nobr>49 s</nobr></div>
<div style="position:absolute;top:24299;left:576"><nobr>188 s</nobr></div>
<div style="position:absolute;top:24299;left:716"><nobr>500 s</nobr></div>
<div style="position:absolute;top:24336;left:155"><nobr>Clang</nobr></div>
<div style="position:absolute;top:24336;left:306"><nobr>5 s</nobr></div>
<div style="position:absolute;top:24336;left:441"><nobr>31 s</nobr></div>
<div style="position:absolute;top:24336;left:581"><nobr>53 s</nobr></div>
<div style="position:absolute;top:24336;left:716"><nobr>148 s</nobr></div>
<div style="position:absolute;top:24373;left:135"><nobr>SPEC [min]</nobr></div>
<div style="position:absolute;top:24394;left:118"><nobr>(531.deepsjeng)</nobr></div>
<div style="position:absolute;top:24373;left:295"><nobr>0.08 s</nobr></div>
<div style="position:absolute;top:24373;left:434"><nobr>0.29 s</nobr></div>
<div style="position:absolute;top:24373;left:574"><nobr>0.30 s</nobr></div>
<div style="position:absolute;top:24373;left:713"><nobr>0.98 s</nobr></div>
<div style="position:absolute;top:24431;left:133"><nobr>SPEC [max]</nobr></div>
<div style="position:absolute;top:24451;left:116"><nobr>(523.xalancbmk)</nobr></div>
<div style="position:absolute;top:24431;left:295"><nobr>0.31 s</nobr></div>
<div style="position:absolute;top:24431;left:434"><nobr>0.93 s</nobr></div>
<div style="position:absolute;top:24431;left:574"><nobr>2.08 s</nobr></div>
<div style="position:absolute;top:24431;left:713"><nobr>4.35 s</nobr></div>
<div style="position:absolute;top:24570;left:108"><nobr>An alternate design was considered where each module constructs a subset of the CFG and</nobr></div>
<div style="position:absolute;top:24593;left:108"><nobr>performs an intraprocedural code layout as part of the back-end action. While this would have</nobr></div>
<div style="position:absolute;top:24615;left:108"><nobr>been faster as it could be done in parallel or distributed, code layout would be limited to</nobr></div>
<div style="position:absolute;top:24638;left:108"><nobr>intra-procedural decisions. </nobr></div>
<div style="position:absolute;top:24683;left:108"><nobr>The time taken by Propeller and BOLT to profile the benchmark and convert the profiles was</nobr></div>
<div style="position:absolute;top:24705;left:108"><nobr>measured separately. For the Search1 benchmark, we profiled for 30 seconds, and the raw</nobr></div>
<div style="position:absolute;top:24728;left:108"><nobr>PMU profile size was ∼ 525M. Propeller needed a minute to convert this profile into its custom</nobr></div>
<div style="position:absolute;top:24750;left:108"><nobr>format, whereas, BOLT needed more than 4 minutes</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:21px;font-family:Times">
<div style="position:absolute;top:24822;left:108"><nobr>Object and Binary Size Bloats with Propeller</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:24886;left:108"><nobr>This section presents data on the bloats to native object file sizes and the final optimized binary</nobr></div>
<div style="position:absolute;top:24909;left:108"><nobr>file sizes with Propeller. This data also shows the bloats to the peak optimized binary from</nobr></div>
<div style="position:absolute;top:24931;left:108"><nobr>labelling basic blocks with unique symbols. The SPEC benchmark sizes are not shown as they</nobr></div>
<div style="position:absolute;top:24954;left:108"><nobr>are much smaller in comparison. </nobr></div>
</span></font>
<div style="position:absolute;top:25123;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="22"><b>Page 22</b></a></font></td></tr></tbody></table></div><font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343">
<div style="position:absolute;top:25256;left:108"><nobr>Object File Sizes Bloat</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:25313;left:108"><nobr>The table shows the increase in object files sizes over baseline with the two Propeller</nobr></div>
<div style="position:absolute;top:25335;left:108"><nobr>configurations, “List” and “All”. The figures also show the increase from Call Frame Information</nobr></div>
<div style="position:absolute;top:25358;left:108"><nobr>(CFI) alone which ends up in the .eh_frame sections in the object files. CFI alone is responsible</nobr></div>
<div style="position:absolute;top:25380;left:108"><nobr>for 10 % of the bloat with "All" configuration and, as mentioned in 5.2, reducing CFI sizes would</nobr></div>
<div style="position:absolute;top:25403;left:108"><nobr>help a lot. Creating basic block sections on demand is the scalable approach to keep the bloats</nobr></div>
<div style="position:absolute;top:25425;left:108"><nobr>tractable.</nobr></div>
<div style="position:absolute;top:25547;left:120"><nobr>Benchmarks</nobr></div>
<div style="position:absolute;top:25547;left:289"><nobr>Baseline</nobr></div>
<div style="position:absolute;top:25547;left:488"><nobr>List</nobr></div>
<div style="position:absolute;top:25547;left:689"><nobr>All</nobr></div>
<div style="position:absolute;top:25584;left:241"><nobr>.ehframe</nobr></div>
<div style="position:absolute;top:25584;left:350"><nobr>Total</nobr></div>
<div style="position:absolute;top:25584;left:425"><nobr>.ehframe</nobr></div>
<div style="position:absolute;top:25584;left:524"><nobr>Total</nobr></div>
<div style="position:absolute;top:25584;left:609"><nobr>.ehframe</nobr></div>
<div style="position:absolute;top:25584;left:735"><nobr>Total</nobr></div>
<div style="position:absolute;top:25621;left:136"><nobr>Search1</nobr></div>
<div style="position:absolute;top:25621;left:249"><nobr>83.3 M</nobr></div>
<div style="position:absolute;top:25621;left:349"><nobr>2.5 G</nobr></div>
<div style="position:absolute;top:25621;left:435"><nobr>117 M</nobr></div>
<div style="position:absolute;top:25621;left:519"><nobr>3.64 G</nobr></div>
<div style="position:absolute;top:25621;left:619"><nobr>996 M</nobr></div>
<div style="position:absolute;top:25621;left:729"><nobr>8.73 G</nobr></div>
<div style="position:absolute;top:25658;left:136"><nobr>Search2</nobr></div>
<div style="position:absolute;top:25658;left:249"><nobr>83.5 M</nobr></div>
<div style="position:absolute;top:25658;left:349"><nobr>2.5 G</nobr></div>
<div style="position:absolute;top:25658;left:435"><nobr>135 M</nobr></div>
<div style="position:absolute;top:25658;left:523"><nobr>3.7 G</nobr></div>
<div style="position:absolute;top:25658;left:619"><nobr>990 M</nobr></div>
<div style="position:absolute;top:25658;left:734"><nobr>8.7 G</nobr></div>
<div style="position:absolute;top:25695;left:137"><nobr>Storage</nobr></div>
<div style="position:absolute;top:25695;left:249"><nobr>24.4 M</nobr></div>
<div style="position:absolute;top:25695;left:349"><nobr>1.3 G</nobr></div>
<div style="position:absolute;top:25695;left:440"><nobr>45 M</nobr></div>
<div style="position:absolute;top:25695;left:523"><nobr>2.3 G</nobr></div>
<div style="position:absolute;top:25695;left:619"><nobr>165 M</nobr></div>
<div style="position:absolute;top:25695;left:734"><nobr>4.9 G</nobr></div>
<div style="position:absolute;top:25732;left:145"><nobr>Clang</nobr></div>
<div style="position:absolute;top:25732;left:249"><nobr>10.2 M</nobr></div>
<div style="position:absolute;top:25732;left:345"><nobr>351 M</nobr></div>
<div style="position:absolute;top:25732;left:433"><nobr>94.8 M</nobr></div>
<div style="position:absolute;top:25732;left:518"><nobr>64.7 M</nobr></div>
<div style="position:absolute;top:25732;left:619"><nobr>243 M</nobr></div>
<div style="position:absolute;top:25732;left:734"><nobr>1.4 G</nobr></div>
</span></font>
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343">
<div style="position:absolute;top:25884;left:108"><nobr>Input Binary File Size Bloat</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:25941;left:108"><nobr>The table below shows the increase in the binary sizes of the labelled Propeller binary that is</nobr></div>
<div style="position:absolute;top:25963;left:108"><nobr>used to collect profiles. It also shows the bloat in the BOLT binary that is used as input to BOLT</nobr></div>
<div style="position:absolute;top:25986;left:108"><nobr>with static relocations enabled. With Propeller, the bloat is mainly due to the symbol table which</nobr></div>
<div style="position:absolute;top:26008;left:108"><nobr>now accommodates an entry for every basic block symbol. The string table ("strtab") is also</nobr></div>
<div style="position:absolute;top:26031;left:108"><nobr>bloated (∼ 8% for large binaries) due to the extra symbol names but the unary naming scheme</nobr></div>
<div style="position:absolute;top:26053;left:108"><nobr>keeps this small. However, these bloats do not affect performance as the symbol table and the</nobr></div>
<div style="position:absolute;top:26076;left:108"><nobr>string table are not loaded onto memory. For BOLT, the bloat mainly comes from saving all the</nobr></div>
<div style="position:absolute;top:26098;left:108"><nobr>static relocations in the final binary, which is very useful to disassemble the binary, and does not</nobr></div>
<div style="position:absolute;top:26121;left:108"><nobr>affect performance. </nobr></div>
</span></font>
<div style="position:absolute;top:26311;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="23"><b>Page 23</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:26451;left:65"><nobr>Benchmarks</nobr></div>
<div style="position:absolute;top:26451;left:245"><nobr>Baseline</nobr></div>
<div style="position:absolute;top:26451;left:476"><nobr>Labels</nobr></div>
<div style="position:absolute;top:26451;left:712"><nobr>BOLT</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:12px;font-family:Times">
<div style="position:absolute;top:26489;left:187"><nobr>sym +</nobr></div>
<div style="position:absolute;top:26507;left:186"><nobr>str tab</nobr></div>
<div style="position:absolute;top:26489;left:257"><nobr>relocs</nobr></div>
<div style="position:absolute;top:26489;left:331"><nobr>Total</nobr></div>
<div style="position:absolute;top:26489;left:398"><nobr>sym +</nobr></div>
<div style="position:absolute;top:26507;left:397"><nobr>str tab</nobr></div>
<div style="position:absolute;top:26489;left:475"><nobr>relocs</nobr></div>
<div style="position:absolute;top:26489;left:560"><nobr>Total</nobr></div>
<div style="position:absolute;top:26489;left:629"><nobr>sym +</nobr></div>
<div style="position:absolute;top:26507;left:628"><nobr>str tab</nobr></div>
<div style="position:absolute;top:26489;left:701"><nobr>relocs</nobr></div>
<div style="position:absolute;top:26489;left:789"><nobr>Total</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:26541;left:81"><nobr>Search1</nobr></div>
<div style="position:absolute;top:26541;left:183"><nobr>251 M</nobr></div>
<div style="position:absolute;top:26541;left:271"><nobr>0</nobr></div>
<div style="position:absolute;top:26541;left:327"><nobr>1.7 G</nobr></div>
<div style="position:absolute;top:26541;left:396"><nobr>579 M</nobr></div>
<div style="position:absolute;top:26541;left:491"><nobr>0</nobr></div>
<div style="position:absolute;top:26541;left:557"><nobr>2.2 G</nobr></div>
<div style="position:absolute;top:26541;left:626"><nobr>253 M</nobr></div>
<div style="position:absolute;top:26541;left:698"><nobr>253 M</nobr></div>
<div style="position:absolute;top:26541;left:792"><nobr>2 G</nobr></div>
<div style="position:absolute;top:26578;left:81"><nobr>Search2</nobr></div>
<div style="position:absolute;top:26578;left:183"><nobr>252 M</nobr></div>
<div style="position:absolute;top:26578;left:271"><nobr>0</nobr></div>
<div style="position:absolute;top:26578;left:327"><nobr>1.7 G</nobr></div>
<div style="position:absolute;top:26578;left:396"><nobr>577 M</nobr></div>
<div style="position:absolute;top:26578;left:491"><nobr>0</nobr></div>
<div style="position:absolute;top:26578;left:557"><nobr>2.2 G</nobr></div>
<div style="position:absolute;top:26578;left:626"><nobr>257 M</nobr></div>
<div style="position:absolute;top:26578;left:698"><nobr>263 M</nobr></div>
<div style="position:absolute;top:26578;left:792"><nobr>2 G</nobr></div>
<div style="position:absolute;top:26616;left:82"><nobr>Storage</nobr></div>
<div style="position:absolute;top:26616;left:188"><nobr>65 M</nobr></div>
<div style="position:absolute;top:26616;left:271"><nobr>0</nobr></div>
<div style="position:absolute;top:26616;left:327"><nobr>1.1 G</nobr></div>
<div style="position:absolute;top:26616;left:396"><nobr>209 M</nobr></div>
<div style="position:absolute;top:26616;left:491"><nobr>0</nobr></div>
<div style="position:absolute;top:26616;left:557"><nobr>1.5 G</nobr></div>
<div style="position:absolute;top:26616;left:630"><nobr>68 M</nobr></div>
<div style="position:absolute;top:26616;left:698"><nobr>114 M</nobr></div>
<div style="position:absolute;top:26616;left:785"><nobr>1.2 G</nobr></div>
<div style="position:absolute;top:26653;left:90"><nobr>Clang</nobr></div>
<div style="position:absolute;top:26653;left:188"><nobr>15 M</nobr></div>
<div style="position:absolute;top:26653;left:271"><nobr>0</nobr></div>
<div style="position:absolute;top:26653;left:329"><nobr>89 M</nobr></div>
<div style="position:absolute;top:26653;left:401"><nobr>80 M</nobr></div>
<div style="position:absolute;top:26653;left:491"><nobr>0</nobr></div>
<div style="position:absolute;top:26653;left:554"><nobr>154 M</nobr></div>
<div style="position:absolute;top:26653;left:630"><nobr>15 M</nobr></div>
<div style="position:absolute;top:26653;left:702"><nobr>34 M</nobr></div>
<div style="position:absolute;top:26653;left:783"><nobr>124 M</nobr></div>
</span></font>
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343">
<div style="position:absolute;top:26796;left:108"><nobr>Final Optimized Binary Size Bloat</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:26853;left:108"><nobr>The table below shows the final optimized binary sizes with the Propeller configurations. BOLT’s</nobr></div>
<div style="position:absolute;top:26875;left:108"><nobr>binary sizes for Search1 and Storage benchmark are smaller than expected as there was a seg.</nobr></div>
<div style="position:absolute;top:26898;left:108"><nobr>fault with BOLT when trying to build with DebugInfo.</nobr></div>
<div style="position:absolute;top:26943;left:108"><nobr>With Propeller, the bloat in final binary sizes is mainly due to CFI information. This overhead</nobr></div>
<div style="position:absolute;top:26965;left:108"><nobr>makes a very compelling case for CFI to support dis-contiguous ranges like DebugInfo and</nobr></div>
<div style="position:absolute;top:26988;left:108"><nobr>almost all of this overhead can be eliminated with that support. There is no overhead due to</nobr></div>
<div style="position:absolute;top:27010;left:108"><nobr>basic block labels as unnecessary label symbols are deleted and only the first basic block of</nobr></div>
<div style="position:absolute;top:27033;left:108"><nobr>every function part that has been split is retained for debugging.</nobr></div>
<div style="position:absolute;top:27109;left:61"><nobr>Benchmarks</nobr></div>
<div style="position:absolute;top:27109;left:214"><nobr>Baseline</nobr></div>
<div style="position:absolute;top:27109;left:398"><nobr>List</nobr></div>
<div style="position:absolute;top:27109;left:573"><nobr>All</nobr></div>
<div style="position:absolute;top:27109;left:745"><nobr>BOLT</nobr></div>
<div style="position:absolute;top:27147;left:174"><nobr>.ehframe</nobr></div>
<div style="position:absolute;top:27147;left:270"><nobr>Total</nobr></div>
<div style="position:absolute;top:27147;left:340"><nobr>.ehframe</nobr></div>
<div style="position:absolute;top:27147;left:439"><nobr>Total</nobr></div>
<div style="position:absolute;top:27147;left:513"><nobr>.ehframe</nobr></div>
<div style="position:absolute;top:27147;left:613"><nobr>Total</nobr></div>
<div style="position:absolute;top:27147;left:693"><nobr>.ehframe</nobr></div>
<div style="position:absolute;top:27147;left:807"><nobr>Total</nobr></div>
<div style="position:absolute;top:27184;left:77"><nobr>Search1</nobr></div>
<div style="position:absolute;top:27184;left:181"><nobr>63.7 M</nobr></div>
<div style="position:absolute;top:27184;left:268"><nobr>1.7 G</nobr></div>
<div style="position:absolute;top:27184;left:348"><nobr>63.7 M</nobr></div>
<div style="position:absolute;top:27184;left:437"><nobr>1.9 G</nobr></div>
<div style="position:absolute;top:27184;left:515"><nobr>469.2 M</nobr></div>
<div style="position:absolute;top:27184;left:611"><nobr>2.7 G</nobr></div>
<div style="position:absolute;top:27184;left:708"><nobr>99 M</nobr></div>
<div style="position:absolute;top:27184;left:804"><nobr>1.1 G</nobr></div>
<div style="position:absolute;top:27221;left:77"><nobr>Search2</nobr></div>
<div style="position:absolute;top:27221;left:181"><nobr>63.9 M</nobr></div>
<div style="position:absolute;top:27221;left:268"><nobr>1.7 G</nobr></div>
<div style="position:absolute;top:27221;left:348"><nobr>72.1 M</nobr></div>
<div style="position:absolute;top:27221;left:437"><nobr>1.9 G</nobr></div>
<div style="position:absolute;top:27221;left:515"><nobr>466.4 M</nobr></div>
<div style="position:absolute;top:27221;left:611"><nobr>2.7 G</nobr></div>
<div style="position:absolute;top:27221;left:703"><nobr>116 M</nobr></div>
<div style="position:absolute;top:27221;left:804"><nobr>2.4 G</nobr></div>
<div style="position:absolute;top:27258;left:78"><nobr>Storage</nobr></div>
<div style="position:absolute;top:27258;left:181"><nobr>18.2 M</nobr></div>
<div style="position:absolute;top:27258;left:268"><nobr>1.1 G</nobr></div>
<div style="position:absolute;top:27258;left:348"><nobr>23.3 M</nobr></div>
<div style="position:absolute;top:27258;left:437"><nobr>1.4 G</nobr></div>
<div style="position:absolute;top:27258;left:515"><nobr>199.2 M</nobr></div>
<div style="position:absolute;top:27258;left:611"><nobr>1.9 G</nobr></div>
<div style="position:absolute;top:27258;left:708"><nobr>32 M</nobr></div>
<div style="position:absolute;top:27258;left:795"><nobr>717.8 M</nobr></div>
<div style="position:absolute;top:27295;left:86"><nobr>Clang</nobr></div>
<div style="position:absolute;top:27295;left:192"><nobr>7 M</nobr></div>
<div style="position:absolute;top:27295;left:259"><nobr>123.6 M</nobr></div>
<div style="position:absolute;top:27295;left:348"><nobr>46.4 M</nobr></div>
<div style="position:absolute;top:27295;left:428"><nobr>153.6 M</nobr></div>
<div style="position:absolute;top:27295;left:515"><nobr>121.6 M</nobr></div>
<div style="position:absolute;top:27295;left:609"><nobr>270 M</nobr></div>
<div style="position:absolute;top:27295;left:701"><nobr>11.3 M</nobr></div>
<div style="position:absolute;top:27295;left:795"><nobr>244.3 M </nobr></div>
</span></font>
<div style="position:absolute;top:27499;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="24"><b>Page 24</b></a></font></td></tr></tbody></table></div><font size="4" face="Times"><span style="font-size:21px;font-family:Times">
<div style="position:absolute;top:27635;left:108"><nobr>Performance Experiments</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:27699;left:108"><nobr>We could build Storage with BOLT but we could not get it to run with BOLT correctly. For code</nobr></div>
<div style="position:absolute;top:27721;left:108"><nobr>layout optimizations, both BOLT and Propeller show very similar speed-ups as the same</nobr></div>
<div style="position:absolute;top:27744;left:108"><nobr>algorithm has been used to re-order the code. </nobr></div>
<div style="position:absolute;top:27789;left:108"><nobr>The table below shows the performance numbers of the large benchmarks with Propeller and</nobr></div>
<div style="position:absolute;top:27811;left:108"><nobr>BOLT comparing it to the baseline binary. The performance wins for these large benchmarks</nobr></div>
<div style="position:absolute;top:27834;left:108"><nobr>with Propeller are impressive, ThinLTO gave similar speed-ups in performance over a</nobr></div>
<div style="position:absolute;top:27856;left:108"><nobr>non-ThinLTO baseline. Further, the "List" configuration does as well as the "All" configuration in</nobr></div>
<div style="position:absolute;top:27879;left:108"><nobr>performance. It is clear from these results that using the "List" configuration keeps the memory</nobr></div>
<div style="position:absolute;top:27901;left:108"><nobr>and time overheads low without losing out on performance. BOLT’s slow-down of the Search1</nobr></div>
<div style="position:absolute;top:27924;left:108"><nobr>benchmark is primarily due to text huge pages getting disabled with BOLT as it keeps the</nobr></div>
<div style="position:absolute;top:27946;left:108"><nobr>optimized code in a separate segment. We ran an experiment to measure BOLT’s performance</nobr></div>
<div style="position:absolute;top:27969;left:108"><nobr>improvement over the baseline binary when hugepages were explicitly disabled for both. In this</nobr></div>
<div style="position:absolute;top:27991;left:108"><nobr>case, BOLT’s performance improvements over the baseline binary were closer, though still</nobr></div>
<div style="position:absolute;top:28014;left:108"><nobr>smaller, to Propeller’s improvements</nobr></div>
</span></font>
<font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343">
<div style="position:absolute;top:28060;left:108"><nobr>Large Benchmarks</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:28169;left:132"><nobr>Benchmarks</nobr></div>
<div style="position:absolute;top:28169;left:295"><nobr>Metric</nobr></div>
<div style="position:absolute;top:28149;left:534"><nobr>% Improvements</nobr></div>
<div style="position:absolute;top:28186;left:493"><nobr>Propeller</nobr></div>
<div style="position:absolute;top:28186;left:713"><nobr>BOLT</nobr></div>
<div style="position:absolute;top:28223;left:443"><nobr>List</nobr></div>
<div style="position:absolute;top:28223;left:586"><nobr>All</nobr></div>
<div style="position:absolute;top:28260;left:146"><nobr>Search1</nobr></div>
<div style="position:absolute;top:28260;left:299"><nobr>QPS</nobr></div>
<div style="position:absolute;top:28260;left:435"><nobr>2.6 %</nobr></div>
<div style="position:absolute;top:28260;left:575"><nobr>2.6 %</nobr></div>
<div style="position:absolute;top:28260;left:714"><nobr>2.0 %</nobr></div>
<div style="position:absolute;top:28297;left:146"><nobr>Search2</nobr></div>
<div style="position:absolute;top:28297;left:299"><nobr>QPS</nobr></div>
<div style="position:absolute;top:28297;left:435"><nobr>5.8 %</nobr></div>
<div style="position:absolute;top:28297;left:575"><nobr>5.9 %</nobr></div>
<div style="position:absolute;top:28297;left:712"><nobr>SEGV</nobr></div>
<div style="position:absolute;top:28335;left:149"><nobr>Storage</nobr></div>
<div style="position:absolute;top:28335;left:288"><nobr>Latency</nobr></div>
<div style="position:absolute;top:28335;left:442"><nobr>6 %</nobr></div>
<div style="position:absolute;top:28335;left:582"><nobr>6 %</nobr></div>
<div style="position:absolute;top:28335;left:712"><nobr>SEGV</nobr></div>
<div style="position:absolute;top:28372;left:155"><nobr>Clang</nobr></div>
<div style="position:absolute;top:28372;left:263"><nobr>Bootstrap time</nobr></div>
<div style="position:absolute;top:28372;left:442"><nobr>9 %</nobr></div>
<div style="position:absolute;top:28372;left:582"><nobr>9 %</nobr></div>
<div style="position:absolute;top:28372;left:721"><nobr>9 % </nobr></div>
</span></font>
<div style="position:absolute;top:28687;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="25"><b>Page 25</b></a></font></td></tr></tbody></table></div><font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343">
<div style="position:absolute;top:28920;left:108"><nobr>SPEC benchmarks and PMU data</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:28999;left:108"><nobr>The figures below show the PMU data of all the benchmarks for the events : icache missses,</nobr></div>
<div style="position:absolute;top:29022;left:108"><nobr>branches not taken and cycles. The percentage difference in the counts of each event relative</nobr></div>
<div style="position:absolute;top:29044;left:108"><nobr>to the baseline is shown. For SPEC, BOLT and Propeller have a similar regression effect on the</nobr></div>
<div style="position:absolute;top:29067;left:108"><nobr>cycle count. On average, BOLT, Propeller-List, and Propeller-All increase the cycle count by</nobr></div>
<div style="position:absolute;top:29089;left:108"><nobr>1.0%, 1.5%, and 1.6%, respectively. Across all 8 benchmarks, Bolt wins over Propeller by more</nobr></div>
<div style="position:absolute;top:29112;left:108"><nobr>than 0.5% on 5 benchmarks (perlbench, x264, deepsjeng, leela, and xz), while Propeller-List</nobr></div>
<div style="position:absolute;top:29134;left:108"><nobr>wins on 1 (xalancbmk). The code layout optimization tends to increase the number of non_taken</nobr></div>
<div style="position:absolute;top:29157;left:108"><nobr>branches which consistently increases on all benchmarks, with BOLT and Propeller being very</nobr></div>
<div style="position:absolute;top:29179;left:108"><nobr>similar. The icache miss data, Figure 5c also show a lower number of misses with Propeller and</nobr></div>
<div style="position:absolute;top:29202;left:108"><nobr>BOLT compared to baseline, which is an expected outcome from the optimization. However, the</nobr></div>
<div style="position:absolute;top:29224;left:108"><nobr>dynamic instruction counts increase on 4/8 SPEC benchmarks with mcf being the highest. This</nobr></div>
<div style="position:absolute;top:29247;left:108"><nobr>is because changing the code layout could replace an implicit fall through with a direct branch.</nobr></div>
<div style="position:absolute;top:29292;left:108"><nobr>For SPEC benchmarks, even though the code layout optimization seems to be doing the right</nobr></div>
<div style="position:absolute;top:29314;left:108"><nobr>things, the performance numbers show no improvements on most benchmarks and regressions</nobr></div>
<div style="position:absolute;top:29337;left:108"><nobr>on a couple. We did a top-down analysis on these benchmarks and looked at the % of issue</nobr></div>
<div style="position:absolute;top:29359;left:108"><nobr>slots wasted waiting on front-end events, which is a metric for their frontend boundness. For</nobr></div>
<div style="position:absolute;top:29382;left:108"><nobr>SPEC benchmarks, this value is about 4% − 12%, whereas for the large benchmarks it is more</nobr></div>
<div style="position:absolute;top:29404;left:108"><nobr>than 20% and up to 38% for clang. Code layout optimizations reduce front-end stalls and hence</nobr></div>
<div style="position:absolute;top:29427;left:108"><nobr>tend to positively impact benchmarks that are more front-end bound. </nobr></div>
</span></font>
<div style="position:absolute;top:29875;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="26"><b>Page 26</b></a></font></td></tr></tbody></table></div><font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343">
<div style="position:absolute;top:30008;left:108"><nobr>Icache Data</nobr></div>
<div style="position:absolute;top:30447;left:108"><nobr>Branches Not taken Data </nobr></div>
</span></font>
<div style="position:absolute;top:31063;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="27"><b>Page 27</b></a></font></td></tr></tbody></table></div><font size="3" face="Times" color="#434343"><span style="font-size:18px;font-family:Times;color:#434343">
<div style="position:absolute;top:31196;left:108"><nobr>Cycles Data</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:27px;font-family:Times">
<div style="position:absolute;top:31660;left:108"><nobr>Conclusion</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:31732;left:108"><nobr>Propeller is a new framework for post link optimizations that has been built for scalability and</nobr></div>
<div style="position:absolute;top:31755;left:108"><nobr>has lower memory and time overheads compared to the state-of-the-art, Facebook’s BOLT.</nobr></div>
<div style="position:absolute;top:31777;left:108"><nobr>Propeller re-links the binary from optimized LLVM IR and achieves scalability by distributing the</nobr></div>
<div style="position:absolute;top:31800;left:108"><nobr>heavy weight optimization tasks to the compiler back-ends which can be executed in parallel</nobr></div>
<div style="position:absolute;top:31822;left:108"><nobr>with several threads or distributed across several machines. Propeller introduces basic block</nobr></div>
<div style="position:absolute;top:31845;left:108"><nobr>sections which keeps the actual link light-weight and hence the memory and time overheads</nobr></div>
<div style="position:absolute;top:31867;left:108"><nobr>smaller. Propeller has been implemented as a part of the LLVM framework and experimental</nobr></div>
<div style="position:absolute;top:31890;left:108"><nobr>results show that Propeller produces similar performance gains to BOLT but with much less</nobr></div>
<div style="position:absolute;top:31912;left:108"><nobr>memory and time overheads.</nobr></div>
</span></font>
<font size="4" face="Times"><span style="font-size:27px;font-family:Times">
<div style="position:absolute;top:32010;left:108"><nobr>References</nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:32082;left:135"><nobr>1. Facebook BOLT presentation:</nobr></div>
</span></font>
<font size="3" face="Times" color="#1155cc"><span style="font-size:14px;font-family:Times;color:#1155cc">
<div style="position:absolute;top:32104;left:162"><nobr><a href="https://llvm.org/devmtg/2016-03/Presentations/BOLT_EuroLLVM_2016.pdf">https://llvm.org/devmtg/2016-03/Presentations/BOLT_EuroLLVM_2016.pdf</a></nobr></div>
</span></font>
<div style="position:absolute;top:32251;left:0"><hr><table width="100%" border="0"><tbody><tr><td bgcolor="eeeeee" align="right"><font face="arial,sans-serif"><a name="28"><b>Page 28</b></a></font></td></tr></tbody></table></div><font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:32360;left:135"><nobr>2. Facebook BOLT video: ​<a href="https://www.youtube.com/watch?v=gw3iDO3By5Y"></a><font color="#1155cc"><a href="https://www.youtube.com/watch?v=gw3iDO3By5Y">https://www.youtube.com/watch?v=gw3iDO3By5Y</a></font></nobr></div>
<div style="position:absolute;top:32382;left:135"><nobr>3. BOLT blog post:</nobr></div>
</span></font>
<font size="3" face="Times" color="#1155cc"><span style="font-size:14px;font-family:Times;color:#1155cc">
<div style="position:absolute;top:32405;left:162"><nobr><a href="https://code.facebook.com/posts/605721433136474/accelerate-large-scale-applications-with-bolt/">https://code.facebook.com/posts/605721433136474/accelerate-large-scale-applications-</a></nobr></div>
<div style="position:absolute;top:32427;left:162"><nobr><a href="https://code.facebook.com/posts/605721433136474/accelerate-large-scale-applications-with-bolt/">with-bolt/</a></nobr></div>
</span></font>
<font size="3" face="Times"><span style="font-size:14px;font-family:Times">
<div style="position:absolute;top:32450;left:135"><nobr>4. Improved Basic Block Reordering: ​<font color="#1155cc"><a href="https://arxiv.org/abs/1809.04676">https://arxiv.org/abs/1809.04676</a></font></nobr></div>
<div style="position:absolute;top:32472;left:135"><nobr>5. Optimizing function placement for large-scale data-center applications:</nobr></div>
</span></font>
<font size="3" face="Times" color="#1155cc"><span style="font-size:14px;font-family:Times;color:#1155cc">
<div style="position:absolute;top:32495;left:162"><nobr><a href="https://dl.acm.org/citation.cfm?id=3049858">https://dl.acm.org/citation.cfm?id=3049858</a></nobr></div>
</span></font>
</div></body></html>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment