BertalanD/GSoC22_BertalanD.md

## GSoC22_BertalanD.md

      
GSoC 2022 Final Report: Improvements to the Mach-O LLD linker

During the summer of 2022, I took part in Google Summer of Code, where I contributed to the Mach-O port of the LLD linker (ld64.lld) on behalf of the Chromium project.
Mach-O is the executable format used by Apple's operating systems, and the Chromium browser is built using LLVM and its open-source LLD linker for all platforms. The goal with my contributions was to benefit both the Chromium project and its developers, and the developer community at large.
My project

During the course of these 12 weeks, I

fixed some bugs,
improved diagnostics,
made the linker emit faster code,
reduced the size of Chromium.app by 250 kB, and
implemented support for a new, faster approach to program loading.

The full list of my contributions can be found on my Phabricator profile and in the LLVM GitHub repository. In this section, I'm going to describe what I worked on in detail.
Project 1: Diagnostics tweaks

I spent the first two weeks of the coding period improving ld64.lld's diagnostic messages.
While linker errors can be notoriously cryptic, I have always found the error messages of the ELF port of LLD (ld.lld) easy to read and informative. My goal was to bring ld64.lld in line with them.
In D127696, I changed undefined symbol errors to print the name of the function that references the symbol.

Inputs
$ clang -c -o undef.o -x c - <<EOF
void foo();
int main() { foo(); }
EOF


# Before
$ clang -fuse-ld=lld undef.o
ld64.lld: error: undefined symbol: _foo
>>> referenced by undef.o

# After
$ clang -fuse-ld=lld undef.o
ld64.lld: error: undefined symbol: _foo
>>> referenced by undef.o:(symbol _main+0x8)

Still, these messages could get very overwhelming if the symbol was used in many places. In my next commit (D127753) I made LLD group the errors by the name of the referenced symbol.

Inputs
$ clang -c -o grouped.o -x c - <<EOF
void foo();
int main() { foo(); }
void bar() { foo(); }
EOF


# Before
$ clang -fuse-ld=lld grouped.o
ld64.lld: error: undefined symbol: _foo
>>> referenced by grouped.o:(symbol _main+0x8)

ld64.lld: error: undefined symbol: _foo
>>> referenced by grouped.o:(symbol _bar+0xc)

# After
$ clang -fuse-ld=lld grouped.o
ld64.lld: error: undefined symbol: _foo
>>> referenced by grouped.o:(symbol _main+0x8)
>>> referenced by grouped.o:(symbol _bar+0xc)
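
The grouping step can be illustrated with a minimal Python sketch. It is not LLD's actual implementation (which is C++); the function names `group_undefined` and `format_errors` are hypothetical and exist only for this illustration. The idea is to collect every reference site under its symbol name first, and only then print one error per symbol.

```python
def group_undefined(refs):
    """Group (symbol, location) pairs by symbol, keeping discovery order."""
    grouped = {}
    for symbol, location in refs:
        grouped.setdefault(symbol, []).append(location)
    return grouped


def format_errors(grouped):
    """Emit one error per symbol, followed by all of its reference sites."""
    lines = []
    for symbol, locations in grouped.items():
        lines.append(f"ld64.lld: error: undefined symbol: {symbol}")
        lines.extend(f">>> referenced by {loc}" for loc in locations)
    return "\n".join(lines)


refs = [
    ("_foo", "grouped.o:(symbol _main+0x8)"),
    ("_foo", "grouped.o:(symbol _bar+0xc)"),
]
print(format_errors(group_undefined(refs)))
```

Because Python dictionaries preserve insertion order, the errors come out in the order in which each symbol was first seen, which keeps the output deterministic.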

The last missing piece was source information. Linkers work on object files that have already been compiled, so most of the time symbol names are all that we can work with. But if debug information is enabled, we can correlate each instruction with a specific line in the source code. In D128184 and D128425, I changed undefined and duplicate symbol errors to show the source filenames and line numbers.

Inputs
$ cat > duplicate.c <<EOF
int main() { }
EOF
$ clang -c -g duplicate.c -o first.o
$ cp first.o second.o


# Before
$ clang -fuse-ld=lld first.o second.o
ld64.lld: error: duplicate symbol: _main
>>> defined in first.o
>>> defined in second.o

# After
$ clang -fuse-ld=lld first.o second.o
ld64.lld: error: duplicate symbol: _main
>>> defined in first.o
>>>            duplicate.c:1
>>> defined in second.o
>>>            duplicate.c:1

I am very happy to have chosen diagnostics as my first area of focus. These low-stakes changes helped me ease into LLVM's development process, and along the way I got to familiarize myself with a big chunk of LLD's code base.
Project 2: Linker optimization hints

Linker optimization hints had caught my eye back when I was writing my project proposal. I chose adding support for them as my first larger project.
What are linker optimization hints?

On RISC architectures like arm64, materializing a memory address generally takes multiple instructions. If the referenced symbol is located close enough in memory, fewer instructions are needed.
Linker optimization hints record where addresses are computed. After addresses have been assigned, the linker may be able to replace such a sequence with a shorter one. The linker cannot delete instructions (except in the RISC-V ELF ABI), so we replace the eliminated instructions with no-ops instead. This still leads to faster code as the CPU can skip over NOPs quickly.
Take this code for instance, which loads a 64-bit integer from the second element of the global array _foo into the x2 register:
adrp x0, _foo@PAGE
add  x1, x0, _foo@PAGEOFF
ldr  x2, [x1, #8]
If the referenced symbol ends up close enough (within the ±1 MiB range of a PC-relative load), a single instruction can do the same operation:
nop
nop
ldr x2, [_foo + 8]
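
To make the transformation concrete, here is a minimal Python sketch of the range check and re-encoding for this one case. It is an illustration under stated assumptions, not LLD's code: the helper name `relax_adrp_add_ldr` is hypothetical, only the 64-bit `ldr` form is handled, and the register and instruction-pattern checks a real linker must perform are omitted. The constants are the AArch64 encodings of `nop` and of the PC-relative literal form of `ldr`.

```python
NOP = 0xD503201F            # AArch64 encoding of `nop`
LDR_X_LITERAL = 0x58000000  # `ldr xT, <label>`: 64-bit PC-relative literal load


def relax_adrp_add_ldr(ldr_addr, target_addr, rt):
    """Return the three replacement instruction words, or None if not applicable.

    ldr_addr    -- address of the third instruction (the load)
    target_addr -- final address being loaded from (e.g. _foo + 8)
    rt          -- destination register number (2 for x2)
    """
    delta = target_addr - ldr_addr
    # The literal form stores a signed 19-bit word offset, so the target
    # must be 4-byte aligned relative to the load and within +/- 1 MiB.
    if delta % 4 != 0 or not (-(1 << 20) <= delta < (1 << 20)):
        return None
    imm19 = (delta >> 2) & 0x7FFFF
    return [NOP, NOP, LDR_X_LITERAL | (imm19 << 5) | rt]


words = relax_adrp_add_ldr(0x100004008, 0x100008008, 2)
print([hex(w) for w in words])
```

When the function returns `None`, the original three-instruction sequence is simply left in place, which is always correct.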
My implementation in LLD

I added the initial framework for processing linker optimization hints in D128093. This was one of the first commits that I could push myself to the LLVM git repository. Support for more transformation kinds arrived in D128942, D129059, D129427 and D130505. We are still missing support for two very rare optimization hint kinds. The progress is tracked in #50399.
This series of patches helped me learn about arm64 assembly and instruction encoding. Through the multiple rounds of code review, I also had the opportunity to improve the performance of this linker pass significantly.
Project 3: Miscellaneous tweaks

By the time I landed all the linker optimization hint work I wanted to do, it was already the second week of July. The LLVM 15 branching-off point of July 26th was just around the corner, so instead of working on a large feature that might miss the deadline, I opted to use the remaining time to make smaller improvements.
I spent a day building various open-source programs using LLD as the linker, and I found two issues. Firstly, applications built against Homebrew Qt would fail to link with an "LC_DYLD_INFO_ONLY not found" error. This was because Homebrew dylibs targeting macOS 12 have chained fixups (see below), which use a slightly different load command for specifying the list of exported symbols. I added support for reading the new LC_DYLD_EXPORTS_TRIE load command in D129430.
The second issue reared its head when building GCC. One of its dependencies, the libisl library, would fail at configure time. The log showed an LTO backend error where a temporary file failed to be created for the compiled LTO object. It turned out that Clang specified a temporary directory to the linker in some cases, and a file in others. LLD assumed that it would always be a directory. This issue was fixed in D129705.
I made two performance-related changes. In D130000, I devirtualized a function that's called thousands of times during a usual link job, while in D130234 I eliminated some string length calculations. In total, the two commits amounted to a 4% speedup.
macOS uses position-independent code, so whenever a local symbol's address is stored in a global constant, the object's base address has to be added to it at load time. For this operation, termed rebase, a bytecode is used, which has two main operations: increment a pointer and rebase the memory location to which it points. The bytecode provides a compact encoding for a series of evenly spaced rebases (relevant for CFString constants) and small pointer increments. A smarter algorithm that could generate these was added in D128798 and D130180. These changes amounted to a 55%/250 KiB decrease in the rebase section's size for Chromium.
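The idea of exploiting evenly spaced rebases can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm from the patches and not the real dyld opcode stream: the function `compress_rebases` and its `(start, count, stride)` tuples are hypothetical stand-ins for the "rebase N times, skipping K bytes" style of encoding the bytecode offers.

```python
def compress_rebases(offsets):
    """Greedily group sorted rebase offsets into (start, count, stride) runs."""
    runs = []
    i = 0
    n = len(offsets)
    while i < n:
        start = offsets[i]
        if i + 1 == n:
            # A lone trailing rebase has no stride.
            runs.append((start, 1, 0))
            break
        stride = offsets[i + 1] - start
        j = i + 1
        # Extend the run while the spacing stays constant.
        while j + 1 < n and offsets[j + 1] - offsets[j] == stride:
            j += 1
        runs.append((start, j - i + 1, stride))
        i = j + 1
    return runs


print(compress_rebases([0, 8, 16, 24, 64, 96, 128, 200]))
```

Eight individual rebases collapse into three runs here; a real encoder would additionally pick the cheapest opcode for each run and for the pointer increments between runs.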
In D130473 and D130529 I added support for the -load_hidden and -hidden-l options, closing #51505. These flags can be used for linking dylibs to static libraries without them re-exporting the static library's symbols.
Just after the LLVM 15 deadline, I fixed an issue in D130559 where incorrect N_SO stabs entries would be created for objects that had DWARF 5 debug information (#51668).
As my last contribution during GSoC, I fixed an assertion failure that occurred while generating LC_DATA_IN_CODE if -order_file or call graph profile sorting was used. When D133581 lands, it will be possible to bootstrap a PGO build of LLVM using LLD as the host linker, which may close Chromium issue 1265937.
Project 4: Chained fixups

The largest and still ongoing undertaking of my summer was adding support for chained fixups. This feature was introduced in Apple's Fall 2020 OS releases, and replaces the opcode-based bind and rebase tables I described above.
What are chained fixups?

In this format, most of the metadata necessary for binding symbols and rebasing addresses is stored directly in the memory location that will have the fixup applied. The fixups form singly linked lists; each one covering a single page in memory. The __LINKEDIT,__chainfixups section stores the page offset of the first fixup of each page; the rest can be found by walking the chain using the offset that is embedded in each entry.
This setup allows pages to be relocated lazily at page-in time and without being dirtied. The kernel can discard and load them again as needed. This technique, called page-in linking, was introduced in macOS 13.
The benefits of this format are:

smaller __LINKEDIT segment, as most of the fixup information is stored in the data segment
faster startup, since not all relocations need to be done upfront
slightly lower memory usage, as fewer pages are dirtied

More information about chained fixups can be found in this WWDC 2022 talk.
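The chain-walking scheme described above can be sketched in Python. This is a simplified model rather than dyld's or LLD's code: it handles rebases only, and the bit positions (a 36-bit target in the low bits, a 12-bit `next` field at bit 51, counted in 4-byte units) follow the 64-bit pointer format from Apple's `<mach-o/fixup-chains.h>` as understood here, so treat them as assumptions. The helpers `make_rebase` and `walk_chain` are hypothetical names.

```python
import struct

STRIDE = 4                    # assumption: `next` is counted in 4-byte units
TARGET_MASK = (1 << 36) - 1   # assumption: target lives in bits 0-35
NEXT_SHIFT = 51               # assumption: next lives in bits 51-62
NEXT_MASK = 0xFFF


def make_rebase(target, next_units):
    """Pack a simplified 64-bit rebase entry."""
    return (target & TARGET_MASK) | ((next_units & NEXT_MASK) << NEXT_SHIFT)


def walk_chain(page, first_offset):
    """Yield (offset, target) for each fixup in a page, following `next` links."""
    offset = first_offset
    while True:
        (raw,) = struct.unpack_from("<Q", page, offset)
        yield offset, raw & TARGET_MASK
        next_units = (raw >> NEXT_SHIFT) & NEXT_MASK
        if next_units == 0:   # a zero `next` terminates the chain
            break
        offset += next_units * STRIDE


# Build a 4 KiB page with fixups at offsets 0x10, 0x18 and 0x100.
page = bytearray(4096)
struct.pack_into("<Q", page, 0x10, make_rebase(0x4000, (0x18 - 0x10) // STRIDE))
struct.pack_into("<Q", page, 0x18, make_rebase(0x4010, (0x100 - 0x18) // STRIDE))
struct.pack_into("<Q", page, 0x100, make_rebase(0x8000, 0))

fixups = list(walk_chain(page, 0x10))
print([(hex(o), hex(t)) for o, t in fixups])
```

Only the starting offset `0x10` needs to be stored outside the page; everything else is recovered from the data itself, which is what keeps the `__LINKEDIT` metadata small.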
Our implementation in LLVM

As LLVM requires extensive regression tests to be added for every feature and bug fix, our first concern was to teach llvm-otool about chained fixups. Apple had started upstreaming support for reading them late last year; however, that effort seems to have been put on hold. After discussing the matter with my mentor Nico and folks from Apple, we agreed that we would provide our own implementation, which we will remove when their upstreaming is ready to commence. The two of us submitted:

D131890 [llvm-objdump] Start on -chained_fixups for llvm-otool,
D131897 [llvm-objdump --macho] Rename --dyld_info to --dyld-info,
D131961 [llvm-objdump] Support dumping segment information in -chained_fixups,
D131982 [llvm-objdump] Complete -chained_fixups support,
D132036 [llvm-objdump] Add -dyld_info to llvm-otool.

I continued by submitting a couple of No Functional Change (NFC) commits, small refactorings that will make it easier to add chained fixups: D132367, D132476, ae5d542 and 4f688d0.
In D133010 LLD was made to set the SG_READ_ONLY flag on the __DATA_CONST segment. My testing showed that without this, page-in linking would not be enabled.
The last preparatory patch was D132947, which added support for synthesizing the __init_offsets section from __mod_init_func. ld64 performs this pass when chained fixups are enabled, as the new format lets dyld avoid needing to fix up the pointers to initializer functions.
The final patch that will add support for generating chained fixups is D132560. This has not landed yet, and I will focus on getting it to a mergeable state in the following weeks. The commit encompasses almost a thousand lines of changes, so I expect the review process to take a while.
Project 5: Technical debt removal

After I committed all preparatory work for chained fixups, Nico suggested that I should clean up and parallelize my implementation of linker optimization hints. While I was initially content with the final version of the patch from June which got the overhead down to 3-4%, I tried out a new approach.
In D133274 I simplified linker optimization hint processing by moving it to a separate pass instead of having it be interleaved with relocation handling.
While comparing LLD's output to that of ld64, I noticed that one of our test cases failed to link with Apple's linker. This was because the test was performing a load from an unaligned memory location, while the ldr instruction encodes the address as a multiple of the load's size. Starting with D133269 LLD rejects such unrepresentable relocations.
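The representability check can be shown with a short Python sketch. It is not the code from D133269; the function name `encode_ldr_x_offset` is hypothetical, and only the 64-bit unsigned-offset form of `ldr` is modeled, where the 12-bit immediate holds the byte offset divided by the 8-byte access size.

```python
def encode_ldr_x_offset(byte_offset):
    """Return the imm12 field for `ldr xT, [xN, #byte_offset]`.

    Raises ValueError when the offset cannot be represented.
    """
    size = 8  # access size of a 64-bit load, in bytes
    if byte_offset % size != 0:
        raise ValueError(f"offset {byte_offset:#x} is not {size}-byte aligned")
    imm12 = byte_offset // size
    if not 0 <= imm12 < (1 << 12):
        raise ValueError(f"offset {byte_offset:#x} is out of range")
    return imm12


print(encode_ldr_x_offset(16))  # aligned: stored as 16 / 8
try:
    encode_ldr_x_offset(12)     # unaligned: cannot be encoded
except ValueError as err:
    print("rejected:", err)
```

Silently truncating the division instead of rejecting the offset would make the load read from the wrong address, which is why an error is the safer behavior.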
The last change in this series is D133439, which parallelizes the parsing of linker optimization hints. When the revision is accepted, it will decrease arm64 Chromium's link time by 2-3%, bringing the total overhead of the pass down to 25 milliseconds.
Future plans

Although the coding period has ended, I intend to follow LLD's development and contribute as my time permits. I am watching the lld:MachO label on the GitHub issue tracker and I'm a member of the lld-macho project on Phabricator, so I will be notified of any new issues or patches.
My current priority is getting my three pending patches ready for being committed. I will work towards enabling chained fixups by default, and fixing any issues that arise as a result of that change.
As discussed, I will take part in reverting our temporary llvm-objdump changes and ensuring that the LLD tests work with Apple's implementation.
My takeaways

During my project, I have grown a lot in terms of both technical and social skills. Some of the highlights are:

I learned the importance of regression tests, and how to write good ones.
I got to practice technical writing through long-form documentation comments.
I became more familiar with the AArch64 ISA and macOS debugging tools.
I used benchmarking tools pretty extensively and experimented with profiling.
I got a glimpse into what it's like to work with professional developers.
I improved my communication skills by responding to questions/suggestions in review comments and reasoning about my code.

Acknowledgements

I would like to thank Nico Weber, Jez Ng and the rest of the lld-macho developers for all the guidance they have provided to me and the great suggestions they made during code review. Working with and learning from them has been a fun experience.
I would like to thank the Chromium project for providing me with a GCE virtual machine which was very useful for performing low-noise benchmarks and running tests with sanitizers enabled.
Last but not least, I'm grateful to Google, the Chromium GSoC admins and my mentors for accepting my GSoC proposal, and for the clear communication and guidance they provided throughout the program's duration.