---
theme: ./theme
class: text-center
highlighter: shiki
lineNumbers: false
drawings: persist
css: unocss
routerMode: hash
title: Optimize lld/ELF
---

Optimize lld/ELF

<style> h1 { } </style>

layout: 'intro'

MaskRay (宋方睿)

LLVM contributor since 2017 (4700+ commits), ld.lld and Clang Driver code owner, maintainer of a bunch of components
🛠 toolchain
DWARF Standards Committee
Contributed to AArch64, RISC-V, and x86-64 psABIs

lld

  • LLVM linker: COFF, ELF, wasm, and Mach-O
  • New ELF linker started in July 2015 (by Michael J. Spencer)
  • Early years: Rui Ueyama, Rafael Avila de Espindola, George Rimar
  • Current maintainers: Peter Smith and me; my changes
  • Adopted by:
    • Chrome: Linux/x86-64 default since 2016, all Linux since 2018
    • FreeBSD: aarch64 system linker since 2016, amd64 since May 2018, riscv64 since Jan 2020, all ports since Mar 2020
    • Gentoo's clang profile: default-lld since 2020
    • Android: NDK default since r22 (2021)
    • Chimera Linux since 2021
    • and many companies
  • Some projects that are picky about the toolchain
    • Linux kernel: usable in 2018, pretty good for arm/arm64/powerpc/x86 in 2019, mature in 2020
    • glibc: good since 2021

ELF linkers and when they became relatively mature

timeline
  1994 : GNU ld
  2008 : GNU gold
  2016 : lld
  2021 : mold

Generic linker design

"Implement an ELF linker" https://gist.github.com/MaskRay/601b51e900d19c6ed09d256ba9ec4722

  • Parse command line options
  • Find and scan input files (.o, .so, .a), interleaved with symbol resolution
  • Global transforms (section based garbage collection, identical code folding, etc)
  • Create synthetic sections
  • Map input sections and synthetic (linker-generated) sections into output sections
  • Scan relocations
  • Layout and assign addresses (thunks, sections with varying sizes, symbol assignments)
  • Write headers
  • Copy section contents to output and resolve relocations

Breakdown of link time

Link clang (Release) with lld 13 and lld 17

link clang (Release) with lld 13

link clang (Release) with lld 17

Why isn't ld.lld faster?

In lld, the focus has arguably been more on parity for important features, ideal semantics, code readability, portability, and generality, and less on performance, particularly multi-threading performance. In some cases, simpler code is preferred over a small performance improvement.


Optimize archive member extraction

Consider ld ref.o a.a: the archive member a.a(a.o) is extracted.

Each defined symbol in a.a(a.o) is unnecessarily interned when processing a.a's archive symbol table.

stateDiagram
  Placeholder --> Undefined: Undefined(file=ref.o)
  extraction : Undefined pending extraction of a.a(a.o)
  Undefined --> extraction: LazyArchive(file=a.a), the intern result cannot be reused
  extraction --> Defined: Defined(file=a.a(a.o))
void ArchiveFile::parse() {
  for (const Archive::Symbol &sym : file->symbols())
    symtab->addSymbol(LazyArchive{*this, sym});
  ...
}

template <class ELFT> void ObjFile<ELFT>::initializeSymbols() { ...
  // Some entries have been filled by LazyObjFile.
  for (size_t i = firstGlobal, end = eSyms.size(); i != end; ++i)
    if (!symbols[i])
      symbols[i] = symtab.insert(CHECK(eSyms[i].getName(stringTable), this));
  ...
}

Handle a.a as a collection of lazy object files, like --start-lib ... --end-lib (see "Archives and --start-lib").

stateDiagram
  Placeholder --> Undefined: Undefined(file=ref.o)
  extraction : Undefined pending extraction of a.a(a.o)
  Undefined --> extraction: LazyObject(file=a.a(a.o)), the intern result can be reused
  extraction --> Defined: Defined(file=a.a(a.o))
template <class ELFT> void ObjFile<ELFT>::parseLazy() { ...
  for (size_t i = firstGlobal, end = eSyms.size(); i != end; ++i) {
    if (eSyms[i].st_shndx == SHN_UNDEF)
      continue;
    symbols[i] = symtab.insert(CHECK(eSyms[i].getName(stringTable), this));
    symbols[i]->resolve(LazyObject{*this});
    if (!lazy)
      break;
  }
}

template <class ELFT>
void ObjFile<ELFT>::initializeSymbols(const object::ELFFile<ELFT> &obj) { ...
  // Some entries have been filled by LazyObjFile.
  for (size_t i = firstGlobal, end = eSyms.size(); i != end; ++i)
    if (!symbols[i])
      symbols[i] = symtab.insert(CHECK(eSyms[i].getName(stringTable), this));
  ...
}

Tradeoff

For an archive member:

  • extracted: the new code is strictly better: it avoids the redundant hash table insertion and never touches the archive symbol table
  • not extracted: the new code reads the (larger) per-member ELF symbol tables instead of the compact archive symbol table

As of 2022-02, when linking chrome in the default build, 79% of archive members are extracted according to --print-archive-stats=.


Optimize "Parse input files"

  • initialization of sections (embarrassingly parallel)
  • COMDAT group resolution
  • initialization of local symbols (embarrassingly parallel)
  • initialization of non-local symbols
  • symbol resolution, possibly archive member extraction

mold has parallelized all the above.

  • Initialization of non-local symbols and symbol resolution may affect semantics
  • difficult to do with the current lld architecture
  • some features make full parallelization difficult: SHT_*_ATTRIBUTES, SHT_LLVM_DEPENDENT_LIBRARIES

That said, I managed to parallelize all the "initialization" passes as of lld 16.


for (size_t i = 0; i < files.size(); ++i)
  parseFile(files[i]);

Prior to lld 14, parsing a relocatable object file tangled together section initialization, symbol initialization, and symbol resolution.

template <class ELFT> void ObjFile<ELFT>::parse(bool ignoreComdats) {
  // Initialize sections and handle COMDAT groups.
  initializeSections(ignoreComdats);
  // Initialize local and non-local symbols, and perform symbol resolution.
  // Lazy object file extraction may cause parsing of other files.
  initializeSymbols();
}

flowchart
  comdat["Handle section groups and resolve COMDAT"]
  nonLocalSymbols["Initialize non-local symbols (possiblly interleaved with archive member extraction)"]
  section["(parallel) Initialize non-group sections and local symbols"]
  postParse["(parallel) postParse: report symbol errors, set symbol's section, and
  change symbols defined in non-prevailing sections to Undefined"]
  comdat --> nonLocalSymbols
  nonLocalSymbols --> section
  section --> postParse

  style comdat fill:#eee
  style nonLocalSymbols fill:#eee
  • Parallelize local symbol initialization: D119909
  • Move section assignment from initializeSymbols to postParse: D120626
  • Parallelize input section initialization: D130810

Optimize "Write sections"

Prior to lld 14, lld processed one OutputSection at a time and, for each OutputSection, wrote its input sections in parallel. This strategy does not leverage multi-threading well: threads join after every output section, and small output sections offer little parallelism.

template <class ELFT> void Writer<ELFT>::writeSections() { ...
  for (OutputSection *sec : outputSections)
    if (sec->type != SHT_REL && sec->type != SHT_RELA)
      sec->writeTo<ELFT>(Out::bufferStart + sec->offset);
}

template <class ELFT> void OutputSection::writeTo(uint8_t *buf) {
  ...
  parallelForEachN(0, sections.size(), [&](size_t i) {
    InputSection *isec = sections[i];
    isec->writeTo<ELFT>(buf);
  });
}

[ELF] Parallelize writes of different OutputSections

Partition input sections into ~4 MiB tasks and parallelize writes across different output sections.


template <class ELFT> void Writer<ELFT>::writeSections() { ...
  parallel::TaskGroup tg;
  for (OutputSection *sec : outputSections)
    if (sec->type != SHT_REL && sec->type != SHT_RELA)
      sec->writeTo<ELFT>(Out::bufferStart + sec->offset, tg);
}

template <class ELFT>
void OutputSection::writeTo(uint8_t *buf, parallel::TaskGroup &tg) { ...
  auto fn = [=](size_t begin, size_t end) {
    for (size_t i = begin; i != end; ++i)
      sections[i]->writeTo<ELFT>(buf);
  };
  const size_t taskSizeLimit = 4 << 20;
  for (size_t begin = 0, i = 0, taskSize = 0;;) {
    taskSize += sections[i]->getSize();
    bool done = ++i == numSections;
    if (done || taskSize >= taskSizeLimit) {
      tg.spawn([=] { fn(begin, i); });
      if (done)
        break;
      begin = i;
      taskSize = 0;
    }
  }
}

Optimize "Scan relocations"

For each relocation updating a SHF_ALLOC section, determine whether it is resolved to a link-time constant or requires special treatment (GOT, PLT, dynamic relocation, copy relocation, etc)

  RelocationScanner scanner;
  for (InputSectionBase *sec : inputSections)
    if (sec->isLive() && (sec->flags & SHF_ALLOC))
      scanner.template scanSection<ELFT>(*sec);

template <class ELFT, class RelTy>
static void processRelocAux(InputSectionBase &sec, RelExpr expr, RelType type,
                            uint64_t offset, Symbol &sym, const RelTy &rel,
                            int64_t addend) {
  if (isStaticLinkTimeConstant(expr, type, sym, sec, offset) ||
      (!config->isPic && sym.isUndefWeak())) {
    sec.relocations.push_back({expr, type, offset, addend, &sym});
    return;
  }
  bool canWrite = (sec.flags & SHF_WRITE) || !config->zText;
  if (canWrite) {
    RelType rel = target->getDynRel(type);
    if (expr == R_GOT || (rel == target->symbolicRel && !sym.isPreemptible)) {
      addRelativeReloc(&sec, offset, sym, addend, expr, type);
      return;
    } ...
  }

[ELF] Parallelize relocation scanning

  bool serial = !config->zCombreloc || config->emachine == EM_MIPS || config->emachine == EM_PPC64;
  parallel::TaskGroup tg;
  for (ELFFileBase *f : ctx->objectFiles) {
    auto fn = [f]() {
      RelocationScanner scanner;
      for (InputSectionBase *s : f->getSections())
        if (s && s->kind() == SectionBase::Regular && s->isLive() && (s->flags & SHF_ALLOC))
          scanner.template scanSection<ELFT>(*s);
    };
    tg.spawn(fn, serial);
  }

void RelocationScanner::processAux(RelExpr expr, RelType type, uint64_t offset,
                                   Symbol &sym, int64_t addend) const { ...
  if (needsGot(expr)) {
    if (config->emachine == EM_MIPS) {
      in.mipsGot->addEntry(*sec->file, sym, addend, expr);
    } else if (!sym.isTls() || config->emachine != EM_LOONGARCH) {
      sym.setFlags(NEEDS_GOT);
    }
  } else if (needsPlt(expr)) {
    sym.setFlags(NEEDS_PLT);
  } else if (LLVM_UNLIKELY(isIfunc)) {
    sym.setFlags(HAS_DIRECT_RELOC);
  }
}

Optimize --build-id

--build-id was introduced by Fedora as an "approximation of true uniqueness across all binaries that might be used by overlapping sets of people".

llvm/lib/Support/SHA1.cpp is derived from NetBSD and is not sufficiently optimized.

[ELF] Implement --build-id={md5,sha1} with truncated BLAKE3
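
A sketch of the truncated-BLAKE3 idea using the reference C API from blake3.h (an assumption for illustration; lld uses LLVM's BLAKE3 wrapper and hashes large outputs in parallel chunks). BLAKE3 is an extendable-output function, so producing an md5-sized (16-byte) or sha1-sized (20-byte) build ID is just a matter of requesting fewer output bytes:

#include <blake3.h>
#include <cstddef>
#include <cstdint>

// Hash the output file image and truncate to the requested build ID size:
// 16 bytes for --build-id=md5, 20 bytes for --build-id=sha1.
void computeBuildId(const uint8_t *data, size_t size, uint8_t *out,
                    size_t outLen) {
  blake3_hasher hasher;
  blake3_hasher_init(&hasher);
  blake3_hasher_update(&hasher, data, size);
  blake3_hasher_finalize(&hasher, out, outLen);
}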


Optimize SHF_MERGE|SHF_STRINGS duplicate elimination

C++ [lex.string]

Whether all string-literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.

Linkers can eliminate duplicates.
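
A simplified sketch of the deduplication using standard containers (not lld's actual data structures; lld keys its map on a string with a precomputed xxHash so each piece is hashed only once, which is why the XXH3 work below matters): identical pieces get one copy in the output and one offset, and references to a piece are later rewritten to that offset.

#include <cstdint>
#include <string_view>
#include <unordered_map>
#include <vector>

struct MergedStrings {
  std::vector<char> contents;                        // merged section payload
  // Keys view the input sections, which stay mapped for the whole link.
  std::unordered_map<std::string_view, uint64_t> offsetMap;

  // Add one NUL-terminated string piece; return its offset in the output.
  uint64_t add(std::string_view piece) {
    auto [it, inserted] = offsetMap.try_emplace(piece, contents.size());
    if (inserted) {
      contents.insert(contents.end(), piece.begin(), piece.end());
      contents.push_back('\0');
    }
    return it->second;
  }
};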

pie title numbers of string literals of the given length range in a debug build of clang-16
  "1-3" : 393434
  "4-8" : 2084056
  "9-16" : 2846249
  "17-128" : 5598928
  "129-240" : 1317989
  "241-" : 328058

XXH3 is faster than XXH64.

https://github.com/Cyan4973/xxHash is complex (compiler workarounds and so on)

Devin Hussey's xxhash-clean is clean but not well optimized.

Take the best parts of xxHash and borrow simplification ideas from xxhash-clean.

  • Make XXH3_len_129to240_64b and XXH3_hashLong_64b noinline
  • Unroll XXH3_len_17to128_64b

Patches:


Do less work in common cases


[ELF] Optimize --wrap to only check non-local symbols

  // Before: look up every symbol of every object file in the wrap map.
  parallelForEach(objectFiles, [&](InputFile *file) {
    MutableArrayRef<Symbol *> syms = file->getMutableSymbols();
    for (size_t i = 0, e = syms.size(); i != e; ++i)
      if (Symbol *s = map.lookup(syms[i]))
        syms[i] = s;
  });

  // After: --wrap can only affect non-local symbols, so skip the locals.
  parallelForEach(objectFiles, [&](ELFFileBase *file) {
    for (Symbol *&sym : file->getMutableGlobalSymbols())
      if (Symbol *s = map.lookup(sym))
        sym = s;
  });

MutableArrayRef<Symbol *> getMutableGlobalSymbols() {
  return makeMutableArrayRef(symbols.data(),
      symbols.size()).slice(firstGlobal);
}

COMMON symbols are obsolete. See "All about COMMON symbols".

[ELF] Optimize replaceCommonSymbols

// Before: scan every symbol in the global symbol table.
static void replaceCommonSymbols() {
  for (Symbol *sym : symtab->symbols()) {
    auto *s = dyn_cast<CommonSymbol>(sym);
    if (!s) continue;
    auto *bss = make<BssSection>("COMMON", s->size,
      s->alignment);
    bss->file = s->file;
    inputSections.push_back(bss);
    s->replace(Defined{s->file, s->getName(), s->binding,
      s->stOther, s->type, /*value=*/0, s->size, bss});
  }
}
class ELFFileBase : public InputFile {
  ...
public:
  bool hasCommonSyms = false;
};

// After: skip object files without COMMON symbols (hasCommonSyms is set
// during symbol initialization) and only scan their non-local symbols.
static void replaceCommonSymbols() {
  for (ELFFileBase *file : objectFiles) {
    if (!file->hasCommonSyms)
      continue;
    for (Symbol *sym : file->getGlobalSymbols()) {
      auto *s = dyn_cast<CommonSymbol>(sym);
      if (!s) continue;
      auto *bss = make<BssSection>("COMMON", s->size,
        s->alignment);
      bss->file = s->file;
      inputSections.push_back(bss);
      s->replace(Defined{s->file, s->getName(), s->binding,
        s->stOther, s->type, /*value=*/0, s->size, bss});
    }
  }
}

[ELF] Parallelize --compress-debug-sections=zlib

Split a section into 1 MiB shards and call zlib deflate on the shards in parallel (sketched after this list).

  • use Z_SYNC_FLUSH for all shards but the last to flush the output to a byte boundary to be concatenated with the next shard
  • use Z_FINISH for the last shard to set the BFINAL flag to indicate the end of the output stream (per RFC1951)
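
A minimal sketch of this sharding scheme with plain zlib (an illustration, not lld's code; lld uses LLVM's parallelFor and SmallVector, and the shard loop below is the part that runs in parallel). Each shard is compressed by an independent raw-deflate stream, so the compressed shards can simply be concatenated behind a single zlib header, with the per-shard Adler-32 checksums combined for the trailer:

#include <zlib.h>
#include <algorithm>
#include <cstdint>
#include <vector>

// Compress one shard as a raw deflate stream (windowBits = -15: no per-shard
// zlib header/trailer). flush is Z_SYNC_FLUSH for all shards but the last
// (pads to a byte boundary so the next shard can follow) and Z_FINISH for the
// last (sets BFINAL per RFC 1951).
static std::vector<uint8_t> deflateShard(const uint8_t *in, size_t size,
                                         int flush) {
  z_stream s = {};
  deflateInit2(&s, Z_BEST_SPEED, Z_DEFLATED, -15, 8, Z_DEFAULT_STRATEGY);
  std::vector<uint8_t> out(deflateBound(&s, size) + 16);
  s.next_in = const_cast<uint8_t *>(in);
  s.avail_in = size;
  s.next_out = out.data();
  s.avail_out = out.size();
  deflate(&s, flush);
  out.resize(out.size() - s.avail_out);
  deflateEnd(&s);
  return out;
}

std::vector<uint8_t> compressSharded(const uint8_t *data, size_t size) {
  constexpr size_t shardSize = 1 << 20; // 1 MiB
  const size_t numShards = (size + shardSize - 1) / shardSize;
  std::vector<std::vector<uint8_t>> shards(numShards);
  std::vector<uint32_t> adlers(numShards);
  // In lld this loop runs in parallel: each iteration is independent.
  for (size_t i = 0; i != numShards; ++i) {
    size_t begin = i * shardSize, len = std::min(shardSize, size - begin);
    shards[i] = deflateShard(data + begin, len,
                             i + 1 == numShards ? Z_FINISH : Z_SYNC_FLUSH);
    adlers[i] = adler32(1, data + begin, len);
  }
  // Assemble: 2-byte zlib header, concatenated raw-deflate shards, then the
  // big-endian Adler-32 of the whole uncompressed input as the trailer.
  std::vector<uint8_t> out = {0x78, 0x01};
  uint32_t checksum = 1; // Adler-32 of the empty string
  for (size_t i = 0; i != numShards; ++i) {
    out.insert(out.end(), shards[i].begin(), shards[i].end());
    checksum = adler32_combine(checksum, adlers[i],
                               std::min(shardSize, size - i * shardSize));
  }
  for (int shift = 24; shift >= 0; shift -= 8)
    out.push_back(uint8_t(checksum >> shift));
  return out;
}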

mold links against mimalloc by default. Build llvm-project with -DCMAKE_EXE_LINKER_FLAGS=/path/to/libmimalloc.a for a relatively fair comparison.

Link a -DCMAKE_BUILD_TYPE=Release build of clang

Benchmark 1: numactl -C 20-27 /tmp/out/custom-13/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):     812.0 ms ±   5.1 ms    [User: 646.8 ms, System: 644.0 ms]
Benchmark 2: numactl -C 20-27 /tmp/out/custom-17/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):     525.7 ms ±   9.1 ms    [User: 571.9 ms, System: 700.1 ms]
Benchmark 1: numactl -C 20-27 ~/Dev/mold/out/release/mold --no-fork @response.txt --threads=8
  Time (mean ± σ):     230.6 ms ±   2.8 ms    [User: 759.7 ms, System: 349.4 ms]

Link a -DCMAKE_BUILD_TYPE=RelWithDebInfo build of clang

Benchmark 1: numactl -C 20-27 /tmp/out/custom-13/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      1.604 s ±  0.023 s    [User: 2.887 s, System: 2.675 s]
Benchmark 2: numactl -C 20-27 /tmp/out/custom-17/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      1.263 s ±  0.038 s    [User: 2.595 s, System: 2.878 s]
Benchmark 1: numactl -C 20-27 ~/Dev/mold/out/release/mold --no-fork @response.txt --threads=8
  Time (mean ± σ):     755.1 ms ±   6.1 ms    [User: 3276.0 ms, System: 1047.7 ms]

Link a release with debug info build of scylladb

Benchmark 1: numactl -C 20-27 /tmp/out/custom-13/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      2.609 s ±  0.032 s    [User: 8.913 s, System: 3.375 s]
Benchmark 2: numactl -C 20-27 /tmp/out/custom-17/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      1.853 s ±  0.012 s    [User: 6.999 s, System: 3.772 s]
Benchmark 1: numactl -C 20-27 ~/Dev/mold/out/release/mold --no-fork @response.txt --threads=8
  Time (mean ± σ):      1.371 s ±  0.011 s    [User: 7.429 s, System: 1.474 s]

Exercise

template <class ELFT, class RelTy> InputSection::relocateNonAlloc has performance problems:

  • arch-specific checks in generic code: R_386_GOTPC workaround, x86 R_SIZE, RISC-V ADD/SUB relocations, (future) RISC-V ULEB128
  • virtual function overhead (X86, X86_64, ARM, AArch64, etc : TargetInfo)
    • addend += target.getImplicitAddend(bufLoc, type);
    • RelExpr expr = target.getRelExpr(type, sym, bufLoc);
    • target.relocateNoSym(bufLoc, type, SignExtend64<bits>(sym.getVA(addend)));

How do we make InputSection::relocateNonAlloc arch-specific (lld/ELF/Arch/*.cpp) and eliminate the target dispatch overhead? The result should resemble the arch-specific InputSection::relocateAlloc; a sketch of the devirtualization pattern follows.
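
A minimal sketch of that devirtualization pattern (made-up types, not lld's code): today the generic loop pays virtual TargetInfo calls per relocation; instantiating the loop per concrete target, as Arch/*.cpp already does for relocateAlloc, makes those calls direct and inlinable and gives the arch-specific quirks a natural home.

#include <cstdint>
#include <vector>

struct Reloc { uint32_t type; uint64_t offset; uint64_t value; };

struct TargetInfo {
  virtual ~TargetInfo() = default;
  virtual void relocateNoSym(uint8_t *loc, uint32_t type, uint64_t val) const = 0;
};

struct X86_64 final : TargetInfo {
  void relocateNoSym(uint8_t *loc, uint32_t, uint64_t val) const override {
    for (int i = 0; i < 8; ++i) // pretend every relocation is a 64-bit write
      loc[i] = uint8_t(val >> (8 * i));
  }
};

// Current shape: one virtual call per relocation, plus generic-code checks
// for arch-specific quirks.
void relocateNonAllocGeneric(const TargetInfo &t, uint8_t *buf,
                             const std::vector<Reloc> &relocs) {
  for (const Reloc &r : relocs)
    t.relocateNoSym(buf + r.offset, r.type, r.value);
}

// Arch-specific shape: Target is a concrete class chosen once per file, so
// the per-relocation calls devirtualize, and quirks such as RISC-V ADD/SUB
// can live in lld/ELF/Arch/*.cpp next to relocateAlloc.
template <class Target>
void relocateNonAllocFor(const Target &t, uint8_t *buf,
                         const std::vector<Reloc> &relocs) {
  for (const Reloc &r : relocs)
    t.relocateNoSym(buf + r.offset, r.type, r.value);
}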
