tycho/GSoC 2010 Proposal.markdown

## GSoC 2010 Proposal.markdown

      
Abstract

Unladen Swallow offers some magnificent performance improvements over the stock CPython interpreter. Unfortunately, the Just-in-Time compiler used by Unladen Swallow currently incurs a hefty memory overhead, which needs to be reduced before Unladen Swallow can be merged into CPython.

Goals

The overall goal is to eliminate wasteful allocations, memory leaks, and generally poor heap management.
Ideally, we want:

No memory leaks
No memory usage over CPython when running with -Xjit=never
No more than 10% memory usage over CPython with -Xjit=whenhot
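
As a rough illustration, the memory targets above can be written down as a small check. This is only a sketch: the names `GOALS`, `overhead`, and `meets_goal` are hypothetical, and the sample figures are the Issue 123 numbers quoted just below.

```python
# Sketch only: these names are illustrative, not part of Unladen Swallow.

# Allowed extra memory relative to CPython, per -Xjit mode (from the goals).
GOALS = {
    "never": 0.00,    # no memory usage over CPython
    "whenhot": 0.10,  # no more than 10% over CPython
}


def overhead(cpython_kb, unladen_kb):
    """Return the fractional memory overhead relative to CPython."""
    return float(unladen_kb - cpython_kb) / cpython_kb


def meets_goal(mode, cpython_kb, unladen_kb):
    """True if the measured overhead is within the goal for this -Xjit mode."""
    return overhead(cpython_kb, unladen_kb) <= GOALS[mode]


if __name__ == "__main__":
    # Figures quoted from Issue 123 (CPython 2.6.4 baseline: 8508 kb).
    for mode, kb in (("never", 15144), ("whenhot", 26768)):
        print("%s: %.0f%% over CPython, goal met: %s"
              % (mode, 100 * overhead(8508, kb), meets_goal(mode, 8508, kb)))
```

Expressing the goals as fractions of the CPython baseline keeps the check independent of the absolute size of any one benchmark.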

There's a significant discrepancy between CPython and Unladen Swallow with JIT disabled (as mentioned in Issue 123):
CPython 2.6.4: 8508 kb
Unladen Swallow -Xjit=whenhot: 26768 kb
Unladen Swallow -Xjit=never: 15144 kb
Considering that with -Xjit=never Unladen Swallow's JIT is disabled and it is essentially running only the CPython interpreter, we would expect to see no memory usage over CPython. More recent numbers are significantly better, but still not ideal. Below are some recent measurements I took using CPython 2.6.5 and the latest Unladen Swallow code at the time of this writing (r1142 with an LLVM 2.7 prerelease backend):
tycho@xerxes ~/Development/unladen-tests $ ./perf.py -f -m -a ',-Xjit=never' /usr/bin/python /opt/unladen/bin/python2.6

[ snip ]

Report on Linux xerxes 2.6.33-sabayon #1 SMP Tue Mar 2 20:35:55 UTC 2010 i686 Genuine Intel(R) CPU T2300 @ 1.66GHz
Total CPU cores: 2

### 2to3 ###
Mem max: 10092.000 -> 11872.000: 1.1764x larger
Usage over time: [http://tinyurl.com/y9fyct4](http://tinyurl.com/y9fyct4)
	
### django ###
Mem max: 10984.000 -> 11436.000: 1.0412x larger
Usage over time: [http://tinyurl.com/yzv8pxk](http://tinyurl.com/yzv8pxk)

### nbody ###
Mem max: 3084.000 -> 3500.000: 1.1349x larger
Usage over time: [http://tinyurl.com/yfr4s5n](http://tinyurl.com/yfr4s5n)

### rietveld ###
Mem max: 15752.000 -> 20364.000: 1.2928x larger
Usage over time: [http://tinyurl.com/yjusn6s](http://tinyurl.com/yjusn6s)

### slowpickle ###
Mem max: 3580.000 -> 3880.000: 1.0838x larger
Usage over time: [http://tinyurl.com/ygjznvr](http://tinyurl.com/ygjznvr)

### slowspitfire ###
Mem max: 86456.000 -> 113616.000: 1.3141x larger
Usage over time: [http://tinyurl.com/yhyalxl](http://tinyurl.com/yhyalxl)

### slowunpickle ###
Mem max: 3488.000 -> 3900.000: 1.1181x larger
Usage over time: [http://tinyurl.com/ydz55uv](http://tinyurl.com/ydz55uv)

### spambayes ###
Mem max: 7684.000 -> 9016.000: 1.1733x larger
Usage over time: [http://tinyurl.com/ycrnvkl](http://tinyurl.com/ycrnvkl)

There are some less easily pinpointed issues such as those mentioned in Issue 68. Some work has already been done on reducing this overhead:

With -Xjit=always, compile all executed functions rather than all functions in the system. (r775)
Changing the model which evaluates hotness helped significantly with -Xjit=whenhot. (r862)
Deleting LLVM IR after compiling it to native code. (r876)
Lazily initializing the runtime feedback DenseMaps. (r1034)

However, with -Xjit=whenhot the JIT still takes as much as 3.4x the amount of memory that CPython 2.6.5 does:
tycho@xerxes ~/Development/unladen-tests $ ./perf.py -f -m -a ',-Xjit=whenhot' /usr/bin/python /opt/unladen/bin/python2.6
[ snip ]

Report on Linux xerxes 2.6.33-sabayon #1 SMP Tue Mar 2 20:35:55 UTC 2010 i686 Genuine Intel(R) CPU T2300 @ 1.66GHz
Total CPU cores: 2

### 2to3 ###
Mem max: 10088.000 -> 18232.000: 1.8073x larger
Usage over time: [http://tinyurl.com/ykmatc9](http://tinyurl.com/ykmatc9)

### django ###
Mem max: 11100.000 -> 20268.000: 1.8259x larger
Usage over time: [http://tinyurl.com/yzpt5wp](http://tinyurl.com/yzpt5wp)

### nbody ###
Mem max: 3224.000 -> 11052.000: 3.4280x larger
Usage over time: [http://tinyurl.com/yfftw29](http://tinyurl.com/yfftw29)

### rietveld ###
Mem max: 14876.000 -> 28532.000: 1.9180x larger
Usage over time: [http://tinyurl.com/yhbejfq](http://tinyurl.com/yhbejfq)

### slowpickle ###
Mem max: 3484.000 -> 10404.000: 2.9862x larger
Usage over time: [http://tinyurl.com/yck5bg6](http://tinyurl.com/yck5bg6)

### slowspitfire ###
Mem max: 85380.000 -> 116484.000: 1.3643x larger
Usage over time: [http://tinyurl.com/ycrznk2](http://tinyurl.com/ycrznk2)

### slowunpickle ###
Mem max: 2500.000 -> 6312.000: 2.5248x larger
Usage over time: [http://tinyurl.com/yaha37n](http://tinyurl.com/yaha37n)

### spambayes ###
Mem max: 6424.000 -> 19736.000: 3.0722x larger
Usage over time: [http://tinyurl.com/yenmftv](http://tinyurl.com/yenmftv)
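
The "Mem max" lines in these reports follow a regular shape, so they can be tabulated and compared against the 10% goal with a few lines of Python. The sketch below assumes the report format is exactly as shown above; `parse_mem_max` and `over_budget` are hypothetical helper names, not part of perf.py.

```python
import re

# Assumed line shapes, taken from the reports shown above:
#   ### nbody ###
#   Mem max: 3224.000 -> 11052.000: 3.4280x larger
HEADER = re.compile(r"^### (\S+) ###$")
MEM_MAX = re.compile(r"^Mem max: ([\d.]+) -> ([\d.]+):")

# Two entries copied from the -Xjit=whenhot report above.
SAMPLE = """\
### nbody ###
Mem max: 3224.000 -> 11052.000: 3.4280x larger
### slowspitfire ###
Mem max: 85380.000 -> 116484.000: 1.3643x larger
"""


def parse_mem_max(report):
    """Return {benchmark: (baseline, changed)} from a perf.py -m report."""
    results = {}
    name = None
    for line in report.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            name = header.group(1)
            continue
        mem = MEM_MAX.match(line)
        if mem and name is not None:
            results[name] = (float(mem.group(1)), float(mem.group(2)))
    return results


def over_budget(results, budget=0.10):
    """List benchmarks whose memory growth exceeds `budget` (10% by default)."""
    return sorted(name for name, (base, changed) in results.items()
                  if changed > base * (1.0 + budget))


if __name__ == "__main__":
    print(over_budget(parse_mem_max(SAMPLE)))
```

Recomputing the comparison from the two raw figures, rather than reading the printed ratio, avoids depending on the "larger"/"smaller" wording at the end of the line.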

Other areas that should be investigated:

Test cases that cause quadratic memory usage in LLVM (e.g. PR3944)
Memory held longer than needed (e.g. data structures used by optimization passes that aren't appropriately freed)
Differences when statically and dynamically linking LLVM
Varying usage of optimization passes in JIT/global_llvm_data.cc

New Benchmarks

Microbenchmarks should be added for each stage in the JIT process. All new benchmarks should be runnable with perf.py.
In order to find memory leaks in LLVM during compilation, benchmarks such as this one can be added to the tree:
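def foo(x):
    return x

def main():
    # Call foo() many times so the JIT has a chance to compile it.
    for x in range(11000):
        foo(x)
    # Ask for foo's LLVM IR repeatedly (Unladen Swallow-specific attribute).
    for y in range(100):
        str(foo.__code__.co_llvm)

if __name__ == "__main__":
    main()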

This essentially recompiles the function foo() over and over again, and checks can be done to see whether the memory usage goes up or stays constant during this process.
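
One way to do the "goes up or stays constant" check is to sample the process's peak resident set size between rounds of the workload. The following is a minimal sketch using the standard `resource` module (Unix only); `peak_rss_samples` and `keeps_growing` are hypothetical names, and the benchmark's `main()` would be passed in as `work`.

```python
import resource


def peak_rss_samples(work, rounds=10):
    """Run `work()` `rounds` times, recording peak RSS after each round.

    ru_maxrss is the process's peak resident set size (kilobytes on Linux,
    bytes on Mac OS X), so the samples never decrease; a leak shows up as
    a peak that keeps climbing instead of levelling off.
    """
    samples = []
    for _ in range(rounds):
        work()
        samples.append(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    return samples


def keeps_growing(samples, settle=3):
    """True if the peak still rose during the last `settle` rounds."""
    if len(samples) <= settle:
        raise ValueError("need more than %d samples" % settle)
    return samples[-1] > samples[-settle - 1]


if __name__ == "__main__":
    # Stand-in workload; in practice this would be the benchmark's main().
    samples = peak_rss_samples(lambda: [0] * 1000, rounds=10)
    print(samples, keeps_growing(samples))
```

Because the figure is a peak, this only answers "is it still growing?"; it cannot show memory being returned, which is where massif or tcmalloc's heap profiler would be more informative.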
And of course, if a memory hotspot is discovered, a reduced test case needs to be developed to see whether or not things are improving after changes are made.

Tools

There are a few different tools which can assist in reducing the memory consumption of Unladen Swallow:

tcmalloc (heap checking and heap profiling)
valgrind (massif for finding large allocations, memcheck for finding leaks)
Mac OS X's MallocDebug tool
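
For the valgrind side, the invocations can be kept in one place so that every benchmark is run the same way. This is a sketch under assumptions: the helper names are hypothetical, and so are the script name and suppressions file in the usage lines; only the valgrind options themselves (`--tool`, `--leak-check`, `--suppressions`, `--massif-out-file`) are standard.

```python
def memcheck_argv(python, script, suppressions=None):
    """argv for running `script` under valgrind's memcheck with leak checking."""
    argv = ["valgrind", "--tool=memcheck", "--leak-check=full"]
    if suppressions:
        # Optional file of known false positives to silence.
        argv.append("--suppressions=%s" % suppressions)
    return argv + [python, script]


def massif_argv(python, script, out_file="massif.out"):
    """argv for profiling heap usage of `script` with valgrind's massif."""
    return ["valgrind", "--tool=massif",
            "--massif-out-file=%s" % out_file, python, script]


if __name__ == "__main__":
    # "bench.py" and "llvm-2.7.supp" are placeholder names for illustration.
    print(" ".join(memcheck_argv("/opt/unladen/bin/python2.6", "bench.py",
                                 suppressions="llvm-2.7.supp")))
    print(" ".join(massif_argv("/opt/unladen/bin/python2.6", "bench.py")))
```

The resulting lists can be handed directly to `subprocess.call`.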

Goal Measurement

In order to measure progress and final results on memory consumption, I will be using Unladen Swallow's perf.py (used for examples above). Benchmarks can be added for specific test cases which are known to cause memory bloat or memory leaks.
To demonstrate elimination of memory leaks, I will be using valgrind's memcheck tool to show before and after results.
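
A before-and-after comparison could be automated by pulling the byte counts out of memcheck's LEAK SUMMARY. The sketch below assumes the usual summary wording ("definitely lost: N bytes in M blocks"); the helper names are hypothetical and the two sample logs contain made-up numbers purely for illustration.

```python
import re

LEAK_LINE = re.compile(
    r"(definitely lost|indirectly lost|possibly lost|still reachable): "
    r"([\d,]+) bytes in ([\d,]+) blocks")

# Made-up logs for illustration only; these are not real measurements.
BEFORE = """\
==1== LEAK SUMMARY:
==1==    definitely lost: 1,024 bytes in 4 blocks
==1==    still reachable: 2,048 bytes in 8 blocks
"""
AFTER = """\
==2== LEAK SUMMARY:
==2==    definitely lost: 0 bytes in 0 blocks
==2==    still reachable: 2,048 bytes in 8 blocks
"""


def leak_summary(log_text):
    """Return {category: bytes} from the LEAK SUMMARY of a memcheck log."""
    summary = {}
    for kind, nbytes, _blocks in LEAK_LINE.findall(log_text):
        summary[kind] = int(nbytes.replace(",", ""))
    return summary


def leak_delta(before_log, after_log):
    """Bytes changed per category between runs (negative means less leaked)."""
    before, after = leak_summary(before_log), leak_summary(after_log)
    return dict((kind, after.get(kind, 0) - before.get(kind, 0))
                for kind in set(before) | set(after))


if __name__ == "__main__":
    print(leak_delta(BEFORE, AFTER))
```

A log with no leak summary at all simply yields an empty dictionary, so a fully clean run shows up as every category dropping to zero in the delta.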

Timeline

Specific tasks:

Locate the source of extra memory consumption when Unladen Swallow's JIT is disabled. (1-2 weeks)
Improve valgrind support in LLVM and Unladen Swallow. This will probably just involve writing suppressions for known false positives in LLVM 2.7, since LLVM trunk supports valgrind, but 2.7 doesn't and won't. (2-3 weeks)
Link tcmalloc into Unladen Swallow, and analyze the results of heap checking/profiling. (1-2 weeks)

Generic tasks, which will be done constantly over the course of this project:

Write new benchmarks which stress the JIT at each phase.
Write reduced test cases for any discovered leaks or memory hotspots.
Find and eliminate memory leaks with the assistance of tcmalloc and valgrind.
Reduce memory overhead incurred by the JIT, primarily focusing on penalties with -Xjit=whenhot. Use -Xjit=always as an extreme case and -Xjit=never as a control variable.
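
To compare the three modes consistently, the perf.py invocations shown earlier can be generated rather than typed by hand. This is a sketch: `JIT_MODES` and `perf_argv` are hypothetical names, and the default interpreter paths are simply the ones used in the example runs above.

```python
JIT_MODES = ("never", "whenhot", "always")  # control, main target, extreme case


def perf_argv(mode, baseline="/usr/bin/python",
              changed="/opt/unladen/bin/python2.6"):
    """argv mirroring the perf.py runs shown earlier, for one -Xjit mode."""
    if mode not in JIT_MODES:
        raise ValueError("unknown -Xjit mode: %r" % (mode,))
    # As I understand perf.py's -a option, the text before the comma goes to
    # the baseline interpreter (nothing here) and the text after it goes to
    # the changed interpreter.
    return ["./perf.py", "-f", "-m", "-a", ",-Xjit=%s" % mode,
            baseline, changed]


if __name__ == "__main__":
    for mode in JIT_MODES:
        print(" ".join(perf_argv(mode)))
```

Running all three from the same helper keeps the baseline interpreter and flags identical, so the only variable between runs is the -Xjit setting.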

The timeline above is extremely rough. It's not easy to estimate the duration of certain tasks, and depending on unexpected inhibiting factors, some tasks may take longer than estimated. Of course, I will not be allowing any task to take too much time. I won't be chasing a 1-2% improvement when a 20-30% improvement is easily found elsewhere.

About Me

I'm Steven Noonan, a computer science major at Central Washington University. I can be reached in the following ways (listed in order of preference):

Skype: neunon
Email: steven@uplinklabs.net
Phone: 1-509-760-8431

I have a resumé visible online as well, which shows some of the open source work I've been doing over the past decade.
And of course I have a blog called Anomalous Anomaly which I can use for the Google Summer of Code weekly reports.