---
theme: ./theme
class: text-center
highlighter: shiki
lineNumbers: false
drawings: persist
css: unocss
routerMode: hash
title: Optimize lld/ELF
---

Optimize lld/ELF

<style> h1 { } </style>

layout: 'intro'

MaskRay (宋方睿)

LLVM contributor since 2017 (4700+ commits), ld.lld and Clang Driver code owner, maintainer of a bunch of components
🛠 toolchain
DWARF Standards Committee
Contributed to AArch64, RISC-V, and x86-64 psABIs

lld

  • LLVM linker: COFF, ELF, wasm, and Mach-O
  • New ELF linker started in July 2015 (by Michael J. Spencer)
  • Early years: Rui Ueyama, Rafael Avila de Espindola, George Rimar
  • Current maintainers: Peter Smith and me; my changes
  • Adopted by:
    • Chrome: Linux/x86-64 default since 2016, all Linux since 2018
    • FreeBSD: aarch64 system linker since 2016, amd64 since May 2018, riscv64 since Jan 2020, all ports since Mar 2020
    • Gentoo's clang profile: default-lld since 2020
    • Android: NDK default since r22 (2021)
    • Chimera Linux since 2021
    • and many companies
  • Some projects that are picky about the toolchain
    • Linux kernel: usable in 2018, pretty good for arm/arm64/powerpc/x86 in 2019, mature in 2020
    • glibc: good since 2021

ELF linkers and when they became relatively mature

timeline
  1994 : GNU ld
  2008 : GNU gold
  2016 : lld
  2021 : mold

Generic linker design

"Implement an ELF linker" https://gist.github.com/MaskRay/601b51e900d19c6ed09d256ba9ec4722

  • Parse command line options
  • Find and scan input files (.o, .so, .a), interleaved with symbol resolution
  • Global transforms (section based garbage collection, identical code folding, etc)
  • Create synthetic sections
  • Map input sections and synthetic (linker-generated) sections into output sections
  • Scan relocations
  • Layout and assign addresses (thunks, sections with varying sizes, symbol assignments)
  • Write headers
  • Copy section contents to output and resolve relocations

Breakdown of link time

Link clang (Release) with lld 13 and lld 17

link clang (Release) with lld 13

link clang (Release) with lld 17

Why isn't ld.lld faster?

In lld, the focus has arguably been more on parity for important features, ideal semantics, code readability, portability, and generality, and less on performance, particularly multi-threading performance. In some cases, simpler code is preferred over a small performance improvement.


Optimize archive member extraction

Consider ld ref.o a.a: the archive member a.a(a.o) is extracted.

Each defined symbol in a.a(a.o) is unnecessarily interned when processing a.a's archive symbol table.

stateDiagram
  Placeholder --> Undefined: Undefined(file=ref.o)
  extraction : Undefined pending extraction of a.a(a.o)
  Undefined --> extraction: LazyArchive(file=a.a), the intern result cannot be reused
  extraction --> Defined: Defined(file=a.a(a.o))
void ArchiveFile::parse() {
  for (const Archive::Symbol &sym : file->symbols())
    symtab->addSymbol(LazyArchive{*this, sym});
  ...
}

template <class ELFT> void ObjFile<ELFT>::initializeSymbols() { ...
  // Some entries have been filled by LazyObjFile.
  for (size_t i = firstGlobal, end = eSyms.size(); i != end; ++i)
    if (!symbols[i])
      symbols[i] = symtab.insert(CHECK(eSyms[i].getName(stringTable), this));
  ...
}

Handle a.a as a collection of lazy object files, like --start-lib ... --end-lib (see "Archives and --start-lib").

stateDiagram
  Placeholder --> Undefined: Undefined(file=ref.o)
  extraction : Undefined pending extraction of a.a(a.o)
  Undefined --> extraction: LazyObject(file=a.a(a.o)), the intern result can be reused
  extraction --> Defined: Defined(file=a.a(a.o))
template <class ELFT> void ObjFile<ELFT>::parseLazy() { ...
  for (size_t i = firstGlobal, end = eSyms.size(); i != end; ++i) {
    if (eSyms[i].st_shndx == SHN_UNDEF)
      continue;
    symbols[i] = symtab.insert(CHECK(eSyms[i].getName(stringTable), this));
    symbols[i]->resolve(LazyObject{*this});
    if (!lazy)
      break;
  }
}

template <class ELFT>
void ObjFile<ELFT>::initializeSymbols(const object::ELFFile<ELFT> &obj) { ...
  // Some entries have been filled by LazyObjFile.
  for (size_t i = firstGlobal, end = eSyms.size(); i != end; ++i)
    if (!symbols[i])
      symbols[i] = symtab.insert(CHECK(eSyms[i].getName(stringTable), this));
  ...
}

Tradeoff

For an archive member:

  • extracted: the new code is strictly better: it avoids the redundant hash table insertion and never touches the archive symbol table
  • not extracted: the new code reads the (larger) per-member ELF symbol tables instead of the compact archive symbol table

As of 2022-02, when linking chrome in the default build, 79% of archive members are extracted according to --print-archive-stats=.


Optimize "Parse input files"

  • initialization of sections (embarrassingly parallel)
  • COMDAT group resolution
  • initialization of local symbols (embarrassingly parallel)
  • initialization of non-local symbols
  • symbol resolution, possibly archive member extraction

mold has parallelized all the above.

  • Initialization of non-local symbols and symbol resolution may affect semantics
  • difficult to do with the current lld architecture
  • some features make full parallelization difficult: SHT_*_ATTRIBUTES, SHT_LLVM_DEPENDENT_LIBRARIES

That said, I managed to parallelize all the "initialization" passes as of lld 16.


for (size_t i = 0; i < files.size(); ++i)
  parseFile(files[i]);

Prior to lld 14, parsing a relocatable object file tangled together section initialization, symbol initialization, and symbol resolution.

template <class ELFT> void ObjFile<ELFT>::parse(bool ignoreComdats) {
  // Initialize sections and handle COMDAT groups.
  initializeSections(ignoreComdats);
  // Initialize local and non-local symbols, and perform symbol resolution.
  // Lazy object file extraction may cause parsing of other files.
  initializeSymbols();
}

flowchart
  comdat["Handle section groups and resolve COMDAT"]
  nonLocalSymbols["Initialize non-local symbols (possiblly interleaved with archive member extraction)"]
  section["(parallel) Initialize non-group sections and local symbols"]
  postParse["(parallel) postParse: report symbol errors, set symbol's section, and
  change symbols defined in non-prevailing sections to Undefined"]
  comdat --> nonLocalSymbols
  nonLocalSymbols --> section
  section --> postParse

  style comdat fill:#eee
  style nonLocalSymbols fill:#eee
  • Parallelize local symbol initialization: D119909
  • Move section assignment from initializeSymbols to postParse: D120626
  • Parallelize input section initialization: D130810

Optimize "Write sections"

Prior to lld 14, lld processed one OutputSection at a time and, for each OutputSection, wrote its input sections in parallel. This strategy does not leverage multi-threading well: threads join after every output section, and small output sections offer little parallelism.

template <class ELFT> void Writer<ELFT>::writeSections() { ...
  for (OutputSection *sec : outputSections)
    if (sec->type != SHT_REL && sec->type != SHT_RELA)
      sec->writeTo<ELFT>(Out::bufferStart + sec->offset);
}

template <class ELFT> void OutputSection::writeTo(uint8_t *buf) {
  ...
  parallelForEachN(0, sections.size(), [&](size_t i) {
    InputSection *isec = sections[i];
    isec->writeTo<ELFT>(buf);
  });
}

[ELF] Parallelize writes of different OutputSections

Partition input sections into ~4 MiB tasks and parallelize writes across different output sections.


template <class ELFT> void Writer<ELFT>::writeSections() { ...
  parallel::TaskGroup tg;
  for (OutputSection *sec : outputSections)
    if (sec->type != SHT_REL && sec->type != SHT_RELA)
      sec->writeTo<ELFT>(Out::bufferStart + sec->offset, tg);
}

template <class ELFT>
void OutputSection::writeTo(uint8_t *buf, parallel::TaskGroup &tg) { ...
  auto fn = [=](size_t begin, size_t end) {
    for (size_t i = begin; i != end; ++i)
      sections[i]->writeTo<ELFT>(buf);
  };
  const size_t taskSizeLimit = 4 << 20;
  for (size_t begin = 0, i = 0, taskSize = 0;;) {
    taskSize += sections[i]->getSize();
    bool done = ++i == numSections;
    if (done || taskSize >= taskSizeLimit) {
      tg.spawn([=] { fn(begin, i); });
      if (done)
        break;
      begin = i;
      taskSize = 0;
    }
  }
}

Optimize "Scan relocations"

For each relocation updating a SHF_ALLOC section, determine whether it is resolved to a link-time constant or requires special treatment (GOT, PLT, dynamic relocation, copy relocation, etc)

  RelocationScanner scanner;
  for (InputSectionBase *sec : inputSections)
    if (sec->isLive() && (sec->flags & SHF_ALLOC))
      scanner.template scanSection<ELFT>(*sec);

template <class ELFT, class RelTy>
static void processRelocAux(InputSectionBase &sec, RelExpr expr, RelType type,
                            uint64_t offset, Symbol &sym, const RelTy &rel,
                            int64_t addend) {
  if (isStaticLinkTimeConstant(expr, type, sym, sec, offset) ||
      (!config->isPic && sym.isUndefWeak())) {
    sec.relocations.push_back({expr, type, offset, addend, &sym});
    return;
  }
  bool canWrite = (sec.flags & SHF_WRITE) || !config->zText;
  if (canWrite) {
    RelType rel = target->getDynRel(type);
    if (expr == R_GOT || (rel == target->symbolicRel && !sym.isPreemptible)) {
      addRelativeReloc(&sec, offset, sym, addend, expr, type);
      return;
    } ...
  }

[ELF] Parallelize relocation scanning

  bool serial = !config->zCombreloc || config->emachine == EM_MIPS || config->emachine == EM_PPC64;
  parallel::TaskGroup tg;
  for (ELFFileBase *f : ctx->objectFiles) {
    auto fn = [f]() {
      RelocationScanner scanner;
      for (InputSectionBase *s : f->getSections())
        if (s && s->kind() == SectionBase::Regular && s->isLive() && (s->flags & SHF_ALLOC))
          scanner.template scanSection<ELFT>(*s);
    };
    tg.spawn(fn, serial);
  }

void RelocationScanner::processAux(RelExpr expr, RelType type, uint64_t offset,
                                   Symbol &sym, int64_t addend) const { ...
  if (needsGot(expr)) {
    if (config->emachine == EM_MIPS) {
      in.mipsGot->addEntry(*sec->file, sym, addend, expr);
    } else if (!sym.isTls() || config->emachine != EM_LOONGARCH) {
      sym.setFlags(NEEDS_GOT);
    }
  } else if (needsPlt(expr)) {
    sym.setFlags(NEEDS_PLT);
  } else if (LLVM_UNLIKELY(isIfunc)) {
    sym.setFlags(HAS_DIRECT_RELOC);
  }
}

Optimize --build-id

--build-id was introduced by Fedora as an "approximation of true uniqueness across all binaries that might be used by overlapping sets of people".

llvm/lib/Support/SHA1.cpp is derived from NetBSD and is not sufficiently optimized.

[ELF] Implement --build-id={md5,sha1} with truncated BLAKE3
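
A sketch of the truncated-BLAKE3 idea using the reference C API from blake3.h (an assumption for illustration; lld uses LLVM's BLAKE3 wrapper and hashes large outputs in parallel chunks). BLAKE3 is an extendable-output function, so producing an md5-sized (16-byte) or sha1-sized (20-byte) build ID is just a matter of requesting fewer output bytes:

#include <blake3.h>
#include <cstddef>
#include <cstdint>

// Hash the output file image and truncate to the requested build ID size:
// 16 bytes for --build-id=md5, 20 bytes for --build-id=sha1.
void computeBuildId(const uint8_t *data, size_t size, uint8_t *out,
                    size_t outLen) {
  blake3_hasher hasher;
  blake3_hasher_init(&hasher);
  blake3_hasher_update(&hasher, data, size);
  blake3_hasher_finalize(&hasher, out, outLen);
}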


Optimize SHF_MERGE|SHF_STRINGS duplicate elimination

C++ [lex.string]

Whether all string-literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.

Linkers can eliminate duplicates.
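
A simplified sketch of the deduplication using standard containers (not lld's actual data structures; lld keys its map on a string with a precomputed xxHash so each piece is hashed only once, which is why the XXH3 work below matters): identical pieces get one copy in the output and one offset, and references to a piece are later rewritten to that offset.

#include <cstdint>
#include <string_view>
#include <unordered_map>
#include <vector>

struct MergedStrings {
  std::vector<char> contents;                        // merged section payload
  // Keys view the input sections, which stay mapped for the whole link.
  std::unordered_map<std::string_view, uint64_t> offsetMap;

  // Add one NUL-terminated string piece; return its offset in the output.
  uint64_t add(std::string_view piece) {
    auto [it, inserted] = offsetMap.try_emplace(piece, contents.size());
    if (inserted) {
      contents.insert(contents.end(), piece.begin(), piece.end());
      contents.push_back('\0');
    }
    return it->second;
  }
};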

pie title numbers of string literals of the given length range in a debug build of clang-16
  "1-3" : 393434
  "4-8" : 2084056
  "9-16" : 2846249
  "17-128" : 5598928
  "129-240" : 1317989
  "241-" : 328058

XXH3 is faster than XXH64.

https://github.com/Cyan4973/xxHash is complex (compiler workarounds and so on)

Devin Hussey's xxhash-clean is clean but not well optimized.

Take the best parts of xxHash and borrow simplification ideas from xxhash-clean.

  • Make XXH3_len_129to240_64b and XXH3_hashLong_64b noinline
  • Unroll XXH3_len_17to128_64b

Patches:


Do less work in common cases


[ELF] Optimize --wrap to only check non-local symbols

  // Before: look up every symbol of every object file in the wrap map.
  parallelForEach(objectFiles, [&](InputFile *file) {
    MutableArrayRef<Symbol *> syms = file->getMutableSymbols();
    for (size_t i = 0, e = syms.size(); i != e; ++i)
      if (Symbol *s = map.lookup(syms[i]))
        syms[i] = s;
  });

  // After: --wrap can only affect non-local symbols, so skip the locals.
  parallelForEach(objectFiles, [&](ELFFileBase *file) {
    for (Symbol *&sym : file->getMutableGlobalSymbols())
      if (Symbol *s = map.lookup(sym))
        sym = s;
  });

MutableArrayRef<Symbol *> getMutableGlobalSymbols() {
  return makeMutableArrayRef(symbols.data(),
      symbols.size()).slice(firstGlobal);
}

COMMON symbols are obsolete. See "All about COMMON symbols".

[ELF] Optimize replaceCommonSymbols

// Before: scan every symbol in the global symbol table.
static void replaceCommonSymbols() {
  for (Symbol *sym : symtab->symbols()) {
    auto *s = dyn_cast<CommonSymbol>(sym);
    if (!s) continue;
    auto *bss = make<BssSection>("COMMON", s->size,
      s->alignment);
    bss->file = s->file;
    inputSections.push_back(bss);
    s->replace(Defined{s->file, s->getName(), s->binding,
      s->stOther, s->type, /*value=*/0, s->size, bss});
  }
}
class ELFFileBase : public InputFile {
  ...
public:
  bool hasCommonSyms = false;
};

// After: skip object files without COMMON symbols (hasCommonSyms is set
// during symbol initialization) and only scan their non-local symbols.
static void replaceCommonSymbols() {
  for (ELFFileBase *file : objectFiles) {
    if (!file->hasCommonSyms)
      continue;
    for (Symbol *sym : file->getGlobalSymbols()) {
      auto *s = dyn_cast<CommonSymbol>(sym);
      if (!s) continue;
      auto *bss = make<BssSection>("COMMON", s->size,
        s->alignment);
      bss->file = s->file;
      inputSections.push_back(bss);
      s->replace(Defined{s->file, s->getName(), s->binding,
        s->stOther, s->type, /*value=*/0, s->size, bss});
    }
  }
}

[ELF] Parallelize --compress-debug-sections=zlib

Split a section into 1 MiB shards and call zlib deflate on the shards in parallel (sketched after this list).

  • use Z_SYNC_FLUSH for all shards but the last to flush the output to a byte boundary to be concatenated with the next shard
  • use Z_FINISH for the last shard to set the BFINAL flag to indicate the end of the output stream (per RFC1951)
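
A minimal sketch of this sharding scheme with plain zlib (an illustration, not lld's code; lld uses LLVM's parallelFor and SmallVector, and the shard loop below is the part that runs in parallel). Each shard is compressed by an independent raw-deflate stream, so the compressed shards can simply be concatenated behind a single zlib header, with the per-shard Adler-32 checksums combined for the trailer:

#include <zlib.h>
#include <algorithm>
#include <cstdint>
#include <vector>

// Compress one shard as a raw deflate stream (windowBits = -15: no per-shard
// zlib header/trailer). flush is Z_SYNC_FLUSH for all shards but the last
// (pads to a byte boundary so the next shard can follow) and Z_FINISH for the
// last (sets BFINAL per RFC 1951).
static std::vector<uint8_t> deflateShard(const uint8_t *in, size_t size,
                                         int flush) {
  z_stream s = {};
  deflateInit2(&s, Z_BEST_SPEED, Z_DEFLATED, -15, 8, Z_DEFAULT_STRATEGY);
  std::vector<uint8_t> out(deflateBound(&s, size) + 16);
  s.next_in = const_cast<uint8_t *>(in);
  s.avail_in = size;
  s.next_out = out.data();
  s.avail_out = out.size();
  deflate(&s, flush);
  out.resize(out.size() - s.avail_out);
  deflateEnd(&s);
  return out;
}

std::vector<uint8_t> compressSharded(const uint8_t *data, size_t size) {
  constexpr size_t shardSize = 1 << 20; // 1 MiB
  const size_t numShards = (size + shardSize - 1) / shardSize;
  std::vector<std::vector<uint8_t>> shards(numShards);
  std::vector<uint32_t> adlers(numShards);
  // In lld this loop runs in parallel: each iteration is independent.
  for (size_t i = 0; i != numShards; ++i) {
    size_t begin = i * shardSize, len = std::min(shardSize, size - begin);
    shards[i] = deflateShard(data + begin, len,
                             i + 1 == numShards ? Z_FINISH : Z_SYNC_FLUSH);
    adlers[i] = adler32(1, data + begin, len);
  }
  // Assemble: 2-byte zlib header, concatenated raw-deflate shards, then the
  // big-endian Adler-32 of the whole uncompressed input as the trailer.
  std::vector<uint8_t> out = {0x78, 0x01};
  uint32_t checksum = 1; // Adler-32 of the empty string
  for (size_t i = 0; i != numShards; ++i) {
    out.insert(out.end(), shards[i].begin(), shards[i].end());
    checksum = adler32_combine(checksum, adlers[i],
                               std::min(shardSize, size - i * shardSize));
  }
  for (int shift = 24; shift >= 0; shift -= 8)
    out.push_back(uint8_t(checksum >> shift));
  return out;
}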

mold links against mimalloc by default. Build llvm-project with -DCMAKE_EXE_LINKER_FLAGS=/path/to/libmimalloc.a for a relatively fair comparison.

Link a -DCMAKE_BUILD_TYPE=Release build of clang

Benchmark 1: numactl -C 20-27 /tmp/out/custom-13/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):     812.0 ms ±   5.1 ms    [User: 646.8 ms, System: 644.0 ms]
Benchmark 2: numactl -C 20-27 /tmp/out/custom-17/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):     525.7 ms ±   9.1 ms    [User: 571.9 ms, System: 700.1 ms]
Benchmark 1: numactl -C 20-27 ~/Dev/mold/out/release/mold --no-fork @response.txt --threads=8
  Time (mean ± σ):     230.6 ms ±   2.8 ms    [User: 759.7 ms, System: 349.4 ms]

Link a -DCMAKE_BUILD_TYPE=RelWithDebInfo build of clang

Benchmark 1: numactl -C 20-27 /tmp/out/custom-13/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      1.604 s ±  0.023 s    [User: 2.887 s, System: 2.675 s]
Benchmark 2: numactl -C 20-27 /tmp/out/custom-17/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      1.263 s ±  0.038 s    [User: 2.595 s, System: 2.878 s]
Benchmark 1: numactl -C 20-27 ~/Dev/mold/out/release/mold --no-fork @response.txt --threads=8
  Time (mean ± σ):     755.1 ms ±   6.1 ms    [User: 3276.0 ms, System: 1047.7 ms]

Link a release with debug info build of scylladb

Benchmark 1: numactl -C 20-27 /tmp/out/custom-13/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      2.609 s ±  0.032 s    [User: 8.913 s, System: 3.375 s]
Benchmark 2: numactl -C 20-27 /tmp/out/custom-17/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      1.853 s ±  0.012 s    [User: 6.999 s, System: 3.772 s]
Benchmark 1: numactl -C 20-27 ~/Dev/mold/out/release/mold --no-fork @response.txt --threads=8
  Time (mean ± σ):      1.371 s ±  0.011 s    [User: 7.429 s, System: 1.474 s]

Exercise

template <class ELFT, class RelTy> InputSection::relocateNonAlloc has performance problems:

  • arch-specific checks in generic code: R_386_GOTPC workaround, x86 R_SIZE, RISC-V ADD/SUB relocations, (future) RISC-V ULEB128
  • virtual function overhead (X86, X86_64, ARM, AArch64, etc : TargetInfo)
    • addend += target.getImplicitAddend(bufLoc, type);
    • RelExpr expr = target.getRelExpr(type, sym, bufLoc);
    • target.relocateNoSym(bufLoc, type, SignExtend64<bits>(sym.getVA(addend)));

How do we make InputSection::relocateNonAlloc arch-specific (lld/ELF/Arch/*.cpp) and eliminate the target dispatch overhead? The result should resemble the arch-specific InputSection::relocateAlloc; a sketch of the devirtualization pattern follows.
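
A minimal sketch of that devirtualization pattern (made-up types, not lld's code): today the generic loop pays virtual TargetInfo calls per relocation; instantiating the loop per concrete target, as Arch/*.cpp already does for relocateAlloc, makes those calls direct and inlinable and gives the arch-specific quirks a natural home.

#include <cstdint>
#include <vector>

struct Reloc { uint32_t type; uint64_t offset; uint64_t value; };

struct TargetInfo {
  virtual ~TargetInfo() = default;
  virtual void relocateNoSym(uint8_t *loc, uint32_t type, uint64_t val) const = 0;
};

struct X86_64 final : TargetInfo {
  void relocateNoSym(uint8_t *loc, uint32_t, uint64_t val) const override {
    for (int i = 0; i < 8; ++i) // pretend every relocation is a 64-bit write
      loc[i] = uint8_t(val >> (8 * i));
  }
};

// Current shape: one virtual call per relocation, plus generic-code checks
// for arch-specific quirks.
void relocateNonAllocGeneric(const TargetInfo &t, uint8_t *buf,
                             const std::vector<Reloc> &relocs) {
  for (const Reloc &r : relocs)
    t.relocateNoSym(buf + r.offset, r.type, r.value);
}

// Arch-specific shape: Target is a concrete class chosen once per file, so
// the per-relocation calls devirtualize, and quirks such as RISC-V ADD/SUB
// can live in lld/ELF/Arch/*.cpp next to relocateAlloc.
template <class Target>
void relocateNonAllocFor(const Target &t, uint8_t *buf,
                         const std::vector<Reloc> &relocs) {
  for (const Reloc &r : relocs)
    t.relocateNoSym(buf + r.offset, r.type, r.value);
}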
