Skip to content

Instantly share code, notes, and snippets.

@AndreasPK
Last active August 14, 2018 15:26
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save AndreasPK/c4a6d8b5f199215ea7747d98d0f65de4 to your computer and use it in GitHub Desktop.
Save AndreasPK/c4a6d8b5f199215ea7747d98d0f65de4 to your computer and use it in GitHub Desktop.
Final report - SoC 2018 - Andreas Klebinger

Improvements to GHCs Code Generator

This is a writeup of the work done by me (Andreas Klebinger) during my SoC 2018 Project.

Improved code layout for GHC

The vast majority of my time was spent working on this part of the Project. While the patch was not yet merged there is not much missing.

The state of the patch can be viewed on GHCs Phab. Some discussion is also on the Bug Tracker

A explaination is also available on the GHC wiki

Preliminary results:

Numbers are speedups, higher is better.

Library Sandy Bridge (Linux) Haswell (Linux) Skylake (Win)
aeson +2.6% +2.3% +1.2%
containers +1.4% +1.1% +1.7%
megaparsec +3.2% +13.6% 1 ) +8.0%
perf-xml 2 ) +0.2% NA +1.1%
text +3.0% NA NA
Vector *2 +2.5% +2.5% +1.3%

Layout project review

Motivation

The idea was born out of neccesity. Some time ago I worked on adding static analysis to GHC to mark branches as likely or unlikely to be taken. Ultimately however this showed almost no benefit on runtime since code layout did not take advantage of this information. Even producing worse code at times.

Initial work

It became quickly clear that in order to improve on this GHC would need more information about control flow available when doing code layout.

There were two ways to approach this: Attach metadata to assembly instructions or have this information available out of band. I chose the later, building up a CFG from the CMM intermediate Language before generating assembly from it. Then doing static analysis on that to assign edge weights which approximate control flow patterns and using these to find a good code layout.

I started with a greedy algorithm for placing blocks based on edge weights. This went surprisingly well and I did not run into major issues during the implementation. It was my favourite part of the project.

First Roadblock

Next I wrote the code required to construct the CFG. However soon when testing the code I hit linker errors. After a good deal of debugging I realized that GHC adds (or removes!) basic blocks at non obvious places in the code. This meant blocks where invisible in the CFG, hence not placed and leading to issues at link time.

Finding all the places that modify the CFG was the first unplanned part which took a larger amount of time. But for the x64 backend the code now updates the CFG as we modify the code.

Benchmarks

Finally I was able to benchmark the actual performance. Which was worse initially. However after experimentation it became clear that this was mostly a matter of adjusting how we assign weights. With results improving ever since.

I initially only used nofib for benchmarks which has it's own drawbacks.

Over time I incorporated various library benchmark suits into my approach. Which in hindsight was important as these showcased both benefits and drawbacks which where not obvious looking only at nofib result.

However benchmarking in general took a lot of work and time. Unexpected issues where:

  • Benchmarks straight up being broken - some of the issues are listed below:
    • Nofib - Some benchmarks did not use the given compiler flags - Fixed during SoC
    • Nofib - Most of the benchmarks did not actually contribute to the reported runtime difference - Fixed during SoC
    • Vector - Various issues - Submitted a PR
    • Aeson - Broken dependency - Local workaround
    • text - benchmark not building on windows
  • Issues with tooling. I've run into a few issues with cabal new-build in particular. These invalidated some results so a good bit of time was spent reporting these issues and finding workarounds.
  • Computing time required - Getting a meaningful performance comparison took a lot of compute time. Especially since it was unclear which approach to assigning edge weights gives the best results. The longest ones are listed below.
    • nofib - >10 Hours.
    • containers - 4 Hours
    • text - 7 hours
  • Benchmarks being broken when using a unreleased GHC version.

I ended up writing a lot of scripts (bash/R) in order to compare the results of benchmarks. They are available here but are better used as a source of inspiration. As they rely a lot on hardcoded paths/flags and the like.

CMOV Support - Unfinished

Convert certain code patterns to branchless instructions. There is a working prototype on Phab.

However there are still bugs with certain edge cases. Ultimately the work on this was posponed in favour of the layout work which promised bigger benefits in more cases.

Improvements to nofib

Updated nofib default runtimes/settings

Driven by the need to get reliable benchmarks I reworked the default settings for the nofib benchmark suite in order to harmonize runtimes between the different benchmarks.

This has not yet been merged but has been accepted to be merged. Tracker, Patch

QOL Patch - Ignore warnings about tabs vs spaces for nofib.

This has been merged upstream.

Nofib pretty consistently uses tabs especially in old benchmarks. Given the nature of the code it doesn't make sense to emit warnings about this. Patch

Bugfix - Don't search for perl binary in a hotcoded path.

Merged upstream. Patch

Bugfix - Some benchmarks ignored given compiler options if these were overriden by O2

Merged upstream. Patch

Other work

Small documentation fixes about GHCs dump flags

Merged upstream. Patch 1, Patch 2

Allow users to hide most uniques when comparing Cmm code

Merged upstream. Patch

This makes it easier to compare Cmm dumps in cases where the uniques are different but the code is the same.

Small performance improvement for GHCs OrdList/Bag

Merged upstream: Patch

Use strict left folds in GHC.

Has been accepted but not yet merged. Patch

Alignment of symbols.

I wrote a patch to specify the alignment of generated functions at compile time to help rule out performance differences caused by alignment.

This work has been merged upstream. Tracker, Phab

Eliminate conditional branches which jump to the same target on true and false.

This has been merged upstream. Tracker,

Invert float comparisons to eliminate explicit parity checks.

Improve generated code by taking advantage of the fact that certain checks always fail on unordered floating point numbers.

The patch is complete but review and acceptance from maintainers is still required.

Tracker, Patch

Contributions to the wider eco system

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment