tanstar/FastPolly.md

## FastPolly.md

      
    Raw
  

              FastPolly.md
            
          
    FastPolly: Reducing LLVM-Polly Compile-Time Overhead

(A GSoC 2013 project proposal)
Personal Information:


Name: Star Tan
University: Peking University, China
Status: Graduate student in Computer Science
Email: tanmx_star@yeah.net

1. Summary

LLVM-Polly is a promising polyhedral optimizer for data-locality and parallelism, which takes advantage of multi-cores, cache hierarchies, short vector instructions as well as dedicated accelerators. However, Polly analysis and optimization can lead to significant compile-time overhead, which makes it much less attractive for LLVM users. I argue that maintaining fast compile-time when Polly is enabled is very important, especially if we want to think of enabling Polly in default. Based on this assumption, this project targets to reduce Polly compile-time overhead.
2. Motivation

LLVM is an amazing open-source project, which has been widely used in C/C++ compilers, high-level synthesis compilers, virtual machines, optimizing tools, etc. As a graduate student, I am going to work on compiler analysis and optimization, especially on vectorization and parallelization. I find Polly is a very useful and powerful polyhedral optimizer. I would like to use this tool and contribute to this project.
When I was using Polly, I found that Polly optimization can lead to significant compile-time overhead. On average, Polly optimization will increase the compile-time by 393% for PolyBench benchmarks and by 53% for MediaBench benchmarks compared. That means if you want to gain from Polly, you have to pay 4 times extra compile-time overhead. Even if you do not want to gain much from Polly, you still have to pay 53% compile-time overhead. Such expensive compile-time overhead would make the Polly much less attractive to LLVM users.
In this project, I try to reduce Polly compile-time overhead by revising Polly passes.  For this purpose, I firstly try to find out where the compile-time overhead comes from. When Polly optimizes a program, it takes the following steps:

step1 - Polly canonicalization: prepare basic information and complete some transformations.
step2 - LLVM IR to Polly description: detect valid scops and translates them into polyhedral representation.
step3 - Polly optimization: analyze and optimize polyhedral scops.
step4 - Polly description to LLVM-IR: translates the polyhedral description back into LLVM-IR.

In attached table 1 and 2, pBasic shows the overhead of loading the LLVMPolly.so; pNoGen shows the overhead of step1 and step2; pNoOpt shows the overhead of step1, step2 and step4; and pOpt shows the total overhead of step1~step4. The total compile-time overhead of Polly can be divided into three parts:


canonicalization
code generation
optimization


PolyBench
13%-1%=12%
248%-13%=235%
393%-248%=145%


MediaBench
9%-1%= 8%
43%- 9%= 34%
53%- 43%= 10%


Based on these results, I plan to revise all canonicalization passes, code generation passes and optimization passes in Polly. Then, I will try my best to reduce the compile-time overhead for each part of them.
In the following 12 weeks, my schedule for this project includes the following stages:

Stage1 (1 week): set up a Polly performance tester and integrate the automation testing environment.
Stage2 (1 week): collect and analyze the detailed profiling information for Polly compile process;
Stage3 (3 weeks): revise and improve some expensive Polly passes or LLVM passes that Polly depends on;
Stage4 (3 weeks): revise canonicalization passes of Polly and try to remove most of canonicalization passes;
Stage5 (1 week): revise the pass ordering problem and let Polly bail out if it cannot benefit programs;
Stage6 (1 week): revise and improve Polly code generation passes;
Stage7 (1 week): revise and improve Polly optimization passes;
Stage8 (1 week): revise other parts of Polly.

3. Project Details

Stage1 -- Set up a Polly performance tester to track the compile time. [Week 1]

The first important work is to pick criteria to evaluate my work in this project. This project targets to reduce compile-time overhead, but at the same time it must not degrade the performance. To simplify the testing, I will use the number of scops that optimized by Polly as the performance criteria. As a result, our Polly performance tester should contains two parts:

Code performance: the number of scops optimized by Polly should not be degraded.
Compile-time Overhead: both the total compile-time overhead and the percentage of each Polly pass are both important.

Furthermore, I currently use some scripts to collect the compile-time statistics, but it still requires some tricky and manual work. My plan is to set up an automation testing environment and integrate it into Polly.
Stage2 -- Collect and analyze detailed compile-time profiling information. [Week 2]

Detailed compile-time profiling information is critical for our further work. Besides those preliminary data presented in table 1~5, I am going to collect the further profiling information based on PolyBench and MediaBench. As shown in table 5, there are 28 extra passes when enabling Polly in opt. I will investigate the compile-time overhead of each Polly pass and each LLVM pass that Polly depends on. Furthermore, some hot functions should be identified in this stage.
Stage3 -- Revise and improve top ten expensive passes. [Week 3-5]

Table 3 and table 4 have show some expensive passes when optimizing programs randomly selected from PolyBench and MediaBench. Results show that some Polly passes dominate the whole compile-time overhead. Especially when optimizing doitgen from PolyBench as shown in table 3, the top three passes account for 81% of total compile-time overhead. According to Amdahl's low, I should pay special attention to these Polly passes. I plan to revise and improve top ten expensive passes in three weeks.
Week 3: revise the "Polly - Calculate dependence" pass. Since this pass leads to significant compile-time overhead in both PolyBench and MediaBench, I will investigate the details of this pass and try to reduce the compile-time overhead.

Week 4: revise the "Polly - Detect static control parts (SCoPs)" and "Polly - Create polyhedral description of Scops" pass. As shown in table 3 and table 4, these two passes are expensive. However, they are also very important to Polly because all Polly optimizations are based on scops detected in these two passes. I will step into the details of the two passes and make sure no performance loss happens no matter how much compile-time overhead is reduced.

Week 5: revise other top ten expensive Polly passes. For example, "Combine redundant instructions" pass runs several times because of Polly and it can causes a large compile-time overhead. I will try to eliminate or reduce the run of this pass.
In this stage, I will also step into some existing bugs related to those expensive passes discussed above. For example, the bug llvm.org/PR15643 is related to the "Polly - Calculate dependence" pass. Although it has brought some discussions in LLVM/Polly mail list, there is still no feasible solution. I will try my best to fix up such bugs when I step into the details of those passes.
Stage4 -- Revising canonicalization passes. [Week 6-8]

Polly relies on some canonicalization passes to simplify the subsequent analysis and optimization. Canonicalization passes include loop-simplify, region-simplify, Induction variable canonicalization and block independent. For example, region-simplify pass is run to simplify the region to single entry and single exit edge before -polly-detect. However, such approach will introduce unnecessary modifications that increase compile time even in the cases where Polly cannot optimize the code.
A first step is to remove -region-simplify pass. For this purpose, I have modified the scops detection pass and code generation pass to allow scops with multiple entry edges and multiple exit edges. Details can be referred to patch files  r179673, r179586, r179159, r179158, r179157 and r178530. This simple work shows that removing unnecessary canonicalization passes is very useful and effective, but it also requires some careful revising in many parts of Polly.
Reducing the compile-time overhead of canonicalization passes is very important, especially at the time when users compile programs that Polly cannot optimize. A challenge work is to integrate Polly into the -O3 pass chain without the need to have any additional canonicalization passes. In this project, I will try my best to remove most of expensive canonicalization passes in three weeks:
Week 6: Investigate the details of each canonicalization pass, including PromoteMemoryToRegisterPass, CFGSimplificationPass, ReassociatePass, LoopRotatePass, InstructionCombiningPass, IndVarSimplifyPass, CodePreparationPass and LoopSimplifyPass. I will collect and analyze the profiling information for each canonicalization pass.

Week 7: Remove some expensive canonicalization passes, such as -region-simplify and -polly-indvars. I have successfully removed -region-simplify. Another work is to remove -polly-indvars with the support of -polly-codegen-scev. However, -polly-codegen-scev produces a lot cleaner LLVM-IR at code generation time, which may take more time to generate but less time to be cleaned up. I will investigate the details of this problem.

Week 8: Revise some important canonicalization passes, such as "Canonicalize natural loops" pass, which seems to be also a hot pass in table 3.
Stage5 -- Revise the pass ordering problem and let Polly bail out early. [Week 9]

To reduce the compile-time overhead when Polly cannot benefit programs, I will revise the pass ordering problem discussed in previous LLVM/Polly mail list. Furthermore, I will try to let Polly boil out early if it cannot benefit programs. For example, Polly scops can be detected and checked before entering canonicalization passes and optimization passes. If no valid scop is detected, then Polly can immediately bail out without any canonicalization or optimization overhead. However, it requires further investigation because scop detection also depend on some other passes.
Stage6 -- revise and improve Polly code generation passes. [Week 10]

Results in table 1 show that Polly code generation passes are very expensive. Currently Polly can choose to use ISL library or Cloog library to generate LLVM IR. Table 3 shows that the Cloog library can leads to 37% of total compile-time overhead.  Our results also show that the ISL library can leads to about 24% of total compile-time overhead. My plan is to profile and revise all passes used for code generation and then try to improve some hot passes. Another work is to compare the performance and compile-time overhead between the two code generation library.
Stage7 -- revise and improve Polly optimization passes. [Week 11]

Polly contains a lot of optimization passes, such as dependence simplify, schedule optimization, and various loop transformation. Table 3 shows that the "Polly - Optimize schedule of SCoP" pass can be very expensive.  In this week, I will investigate those Polly optimization passes and try my best to remove some expensive transformations.
Stage8 -- Improve other parts. [Week 12]

To be investigated.
4. Profit for LLVM users and Polly users

This project can benefit both LLVM users and Polly users. For LLVM users who care more about compile-time overhead, our project enables Polly to provide extra performance gains within very little extra compile-time overhead. For Polly users who care more about code quality, this project will significantly reduce the compile-time overhead without performance loss.
Specifically, the goal of this project is that Polly can significantly optimize PolyBench benchmarks, while at the same time it does not lead to prohibitively large compile-time overhead for MediaBench programs.
5. Why LLVM-Polly and Why me?

I am currently a graduate student in Computer Science Department of Peking University. My research work is focused on advanced compiler optimizations for parallel computing system. I am familiar with C/C++/Java programming with more than 4 years  experience in software development. Prior to this project, I have some basic experience in GCC/LLVM compiler development. Based on GCC compiler, I developed an improved software pipelining algorithm based on the classic Swing Modulo Scheduling (SMS). Based on LLVM compiler, I also participated in LLVM-Polly project and committed some patch files such as r179673, r179586, r179159, r179158, r179157, r178530. I always hope to contribute to LLVM and Polly project, and I am confident that my work in this project can benefit both LLVM users and Polly users.
[Attachments]

Our evaluation is based on Intel Pentium Dual CPU T2390(1.86GHz) with 2GB DDR2 memory. Each benchmark is run multiple times and data are collected using ministat (https://github.com/codahale/ministat).  Polly is built with Cloog in Release+Assert mode. All Polly, LLVM and Cloog source code are checked out from SVN at April 24, 2013.
Table 1 and table 2 show the compile-time overhead of Polly. Five cases are tested:
(alias pollycc="clang -O3 -load LLVMPolly.so -mllvm -polly)

clang:  clang -O3
pBasic: clang -O3 -load LLVMPolly.so
pNoGen: pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
pNoOpt: pollycc -O3 -mllvm -polly-optimizer=none
pOpt:   pollycc -O3

Table 3 and table 4 show the compile-time overhead of top 15 passes in polly-opt. we first generate LLVM IR files by the command "clang -O3 -emit-llvm XX.c -o XX.ll", then we run opt to collect the compile time of each pass using the command "opt -basicaa  -load LLVMPolly.so -O3 -polly -polly-codegen-scev XX.ll -S -o XX.opt.ll -time-passes"
Table 5 shows the extra passes when Polly is enabled in optimizing LLVM IR files. Information is collected using the command "opt -basicaa  -load LLVMPolly.so -O3 -polly -polly-codegen-scev XX.ll -S -o XX.opt.ll -debug-pass=Structure"
Table 1: Compile time for PolyBench (Seconds, each benchmark is run 10 times)


clang
pBasic
pNoGen
pNoOpt
pOpt
pBasic
pNoGen
pNoOpt
pOpt


2mm
0.152
0.159
0.171
0.324
0.725
4.73%
12.49%
112.69%
376.5%


atax
0.138
0.134
0.145
0.207
0.313
0.00%
0.00%
49.06%
125.8%


covariance
0.150
0.152
0.153
0.356
0.770
1.27%
1.87%
137.72%
414.4%


gemver
0.156
0.159
0.172
0.267
0.393
1.60%
10.37%
71.19%
152.0%


instrument
0.106
0.108
0.112
0.123
0.122
0.00%
5.84%
15.82%
14.5%


ludcmp
0.157
0.160
0.201
1.076
1.318
2.04%
27.52%
585.41%
739.2%


3mm
0.153
0.156
0.183
0.413
1.044
1.96%
19.42%
170.37%
582.5%


bicg
0.124
0.127
0.135
0.198
0.283
1.93%
8.76%
58.92%
127.3%


doitgen
0.149
0.151
0.164
0.333
0.897
0.00%
10.19%
122.86%
501.3%


gesummv
0.122
0.128
0.134
0.200
0.294
4.49%
9.48%
63.32%
140.0%


jacobi
0.144
0.151
0.159
0.391
0.849
0.00%
10.25%
170.91%
488.2%


seidel
0.134
0.135
0.146
0.630
0.916
0.00%
9.35%
371.13%
584.7%


adi
0.159
0.162
0.184
1.438
1.849
1.76%
15.19%
802.39%
1061%


correlation
0.158
0.160
0.180
0.339
0.634
1.08%
14.12%
114.88%
301.3%


gemm
0.141
0.143
0.158
0.242
0.448
1.78%
12.01%
72.07%
218.2%


grams
0.133
0.135
0.151
0.307
0.414
0.00%
13.37%
130.58%
210.9%


lu
0.142
0.144
0.158
0.316
0.394
1.69%
11.42%
122.41%
177.9%


average


1.26%
13.22%
248.47%
393.8%


Table 2: Compile time for MediaBench (Seconds, each benchmark is run 5 times)


clang
pBasic
pNoGen
pNoOpt
pOpt
pBasic
pNoGen
pNoOpt
pOpt


adpcm
0.179
0.182
0.192
0.199
0.198
0.00%
0.00%
11.15%
10.26%


g721
0.519
0.523
0.587
0.726
0.739
0.00%
12.98%
39.79%
42.30%


gsm
2.779
2.875
3.018
4.616
4.629
3.45%
8.58%
66.09%
66.55%


jpeg
12.54
12.61
13.90
22.59
24.41
0.00%
10.89%
80.15%
94.66%


mpeg2
2.953
2.97
3.338
3.553
4.491
0.00%
13.05%
20.33%
52.10%


average


0.69%
9.10%
43.50%
53.17%


Table 3: Compile time percent of top 15 opt passes for doitgen.ll (from PolyBench)


Time
Percent
Pass name


0.1506
36.9%
Execute Cloog code generation


0.1497
36.7%
Polly - Calculate dependences


0.0308
7.5%
Polly - Optimize schedule of SCoP


0.0127
3.1%
Induction Variable Simplification


0.0080
2.1%
Polly - Detect static control parts (SCoPs)


0.0042
1.0%
Natural Loop Information


0.0040
1.0%
Polly - Create polyhedral description of Scops


0.0029
0.7%
Combine redundant instructions


0.0029
0.7%
Polly - Create LLVM-IR from SCoPs


0.0022
0.5%
Combine redundant instructions


0.0019
0.5%
Combine redundant instructions


0.0018
0.4%
Global Value Numbering


0.0017
0.4%
Jump Threading


0.0016
0.4%
Canonicalize natural loops


0.0014
0.3%
Canonicalize natural loops


0.0013
0.3%
Combine redundant instructions


0.4080
100%
TOTAL EXECUTION TIME


Table 4: Compile time percent of top 15 opt passes for jmemmgr.ll (from MediaBench/JPEG)


Time
Percent
Pass name


0.0179
9.6%
Polly - Calculate dependences


0.0128
6.9%
Polly - Detect static control parts (SCoPs)


0.0128
6.9%
Global Value Numbering


0.0110
5.9%
Execute Cloog code generation


0.0073
3.9%
Combine redundant instructions


0.0070
3.7%
Combine redundant instructions


0.0064
3.4%
Combine redundant instructions


0.0058
3.1%
Combine redundant instructions


0.0058
3.1%
Combine redundant instructions


0.0056
3.0%
Polly - Create polyhedral description of Scops


0.0056
3.0%
Combine redundant instructions


0.0052
2.8%
Induction Variable Simplification


0.0047
2.5%
Combine redundant instructions


0.0039
2.1%
Function Integration/Inlining


0.0038
2.1%
Polly - Create LLVM-IR from SCoPs


0.1760
100%
TOTAL EXECUTION TIME


Table 5: Extra passes when enabling Polly in optimizing doitgen.ll (from PolyBench)
Promote Memory to Register
Combine redundant instructions
Simplify the CFG
Tail Call Elimination
Simplify the CFG
Reassociate expressions 
Dominator Tree Construction
Natural Loop Information
Loop Pass Manager
  Canonicalize natural loops
  Loop-Closed SSA Form Pass
  Rotate Loops
Combine redundant instructions
Scalar Evolution Analysis
Polly - Prepare code for polly
Post-Dominator Tree Construction
Dominance Frontier Construction
Detect single entry single exit regions
Scalar Evolution Analysis
Polly - Detect static control parts (SCoPs)
Polly - Create independent blocks
Polly - Analyse the LLVM-IR in the detected regions
Region Pass Manager 
  Polly - Create polyhedral description of Scops
  Polly - Calculate dependences
  Polly - Optimize schedule of SCoP
  Execute Cloog code generation                                                                                                                          
  Polly - Create LLVM-IR from SCoPs
	canonicalization	code generation	optimization
PolyBench	13%-1%=12%	248%-13%=235%	393%-248%=145%
MediaBench	9%-1%= 8%	43%- 9%= 34%	53%- 43%= 10%

	clang	pBasic	pNoGen	pNoOpt	pOpt	pBasic	pNoGen	pNoOpt	pOpt
2mm	0.152	0.159	0.171	0.324	0.725	4.73%	12.49%	112.69%	376.5%
atax	0.138	0.134	0.145	0.207	0.313	0.00%	0.00%	49.06%	125.8%
covariance	0.150	0.152	0.153	0.356	0.770	1.27%	1.87%	137.72%	414.4%
gemver	0.156	0.159	0.172	0.267	0.393	1.60%	10.37%	71.19%	152.0%
instrument	0.106	0.108	0.112	0.123	0.122	0.00%	5.84%	15.82%	14.5%
ludcmp	0.157	0.160	0.201	1.076	1.318	2.04%	27.52%	585.41%	739.2%
3mm	0.153	0.156	0.183	0.413	1.044	1.96%	19.42%	170.37%	582.5%
bicg	0.124	0.127	0.135	0.198	0.283	1.93%	8.76%	58.92%	127.3%
doitgen	0.149	0.151	0.164	0.333	0.897	0.00%	10.19%	122.86%	501.3%
gesummv	0.122	0.128	0.134	0.200	0.294	4.49%	9.48%	63.32%	140.0%
jacobi	0.144	0.151	0.159	0.391	0.849	0.00%	10.25%	170.91%	488.2%
seidel	0.134	0.135	0.146	0.630	0.916	0.00%	9.35%	371.13%	584.7%
adi	0.159	0.162	0.184	1.438	1.849	1.76%	15.19%	802.39%	1061%
correlation	0.158	0.160	0.180	0.339	0.634	1.08%	14.12%	114.88%	301.3%
gemm	0.141	0.143	0.158	0.242	0.448	1.78%	12.01%	72.07%	218.2%
grams	0.133	0.135	0.151	0.307	0.414	0.00%	13.37%	130.58%	210.9%
lu	0.142	0.144	0.158	0.316	0.394	1.69%	11.42%	122.41%	177.9%
average						1.26%	13.22%	248.47%	393.8%
	clang	pBasic	pNoGen	pNoOpt	pOpt	pBasic	pNoGen	pNoOpt	pOpt
adpcm	0.179	0.182	0.192	0.199	0.198	0.00%	0.00%	11.15%	10.26%
g721	0.519	0.523	0.587	0.726	0.739	0.00%	12.98%	39.79%	42.30%
gsm	2.779	2.875	3.018	4.616	4.629	3.45%	8.58%	66.09%	66.55%
jpeg	12.54	12.61	13.90	22.59	24.41	0.00%	10.89%	80.15%	94.66%
mpeg2	2.953	2.97	3.338	3.553	4.491	0.00%	13.05%	20.33%	52.10%
average						0.69%	9.10%	43.50%	53.17%
Time	Percent	Pass name
0.1506	36.9%	Execute Cloog code generation
0.1497	36.7%	Polly - Calculate dependences
0.0308	7.5%	Polly - Optimize schedule of SCoP
0.0127	3.1%	Induction Variable Simplification
0.0080	2.1%	Polly - Detect static control parts (SCoPs)
0.0042	1.0%	Natural Loop Information
0.0040	1.0%	Polly - Create polyhedral description of Scops
0.0029	0.7%	Combine redundant instructions
0.0029	0.7%	Polly - Create LLVM-IR from SCoPs
0.0022	0.5%	Combine redundant instructions
0.0019	0.5%	Combine redundant instructions
0.0018	0.4%	Global Value Numbering
0.0017	0.4%	Jump Threading
0.0016	0.4%	Canonicalize natural loops
0.0014	0.3%	Canonicalize natural loops
0.0013	0.3%	Combine redundant instructions
0.4080	100%	TOTAL EXECUTION TIME
Time	Percent	Pass name
0.0179	9.6%	Polly - Calculate dependences
0.0128	6.9%	Polly - Detect static control parts (SCoPs)
0.0128	6.9%	Global Value Numbering
0.0110	5.9%	Execute Cloog code generation
0.0073	3.9%	Combine redundant instructions
0.0070	3.7%	Combine redundant instructions
0.0064	3.4%	Combine redundant instructions
0.0058	3.1%	Combine redundant instructions
0.0058	3.1%	Combine redundant instructions
0.0056	3.0%	Polly - Create polyhedral description of Scops
0.0056	3.0%	Combine redundant instructions
0.0052	2.8%	Induction Variable Simplification
0.0047	2.5%	Combine redundant instructions
0.0039	2.1%	Function Integration/Inlining
0.0038	2.1%	Polly - Create LLVM-IR from SCoPs
0.1760	100%	TOTAL EXECUTION TIME