---
theme: ./theme
class: text-center
highlighter: shiki
lineNumbers: false
drawings:
css: unocss
routerMode: hash
title: Optimize lld/ELF
---
🛠 toolchain
DWARF Standards Committee
Contributed to AArch64, RISC-V, and x86-64 psABIs
---
layout: default
image: https://source.unsplash.com/collection/94734566/1920x1080
---
- LLVM linker: COFF, ELF, wasm, and Mach-O
- New ELF linker started in July 2015 (by Michael J. Spencer)
- Early years: Rui Ueyama, Rafael Avila de Espindola, George Rimar
- (current) Peter Smith and I; my changes
- Adopted by:
- Chrome: Linux/x86-64 default since 2016, all Linux since 2018
- FreeBSD: aarch64 system linker since 2016, amd64 since May 2018, riscv64 since Jan 2020, all ports since Mar 2020
- Gentoo's clang profile: default-lld since 2020
- Android: NDK default since r22 (2021)
- Chimera Linux since 2021
- and many companies
- Some projects that are picky about the toolchain
- Linux kernel: usable in 2018, pretty good for arm/arm64/powerpc/x86 in 2019, mature in 2020
- glibc: good since 2021
```mermaid
timeline
    1994 : GNU ld
    2008 : GNU gold
    2016 : lld
    2021 : mold
```
"Implement an ELF linker" https://gist.github.com/MaskRay/601b51e900d19c6ed09d256ba9ec4722
- Parse command line options
- Find and scan input files (`.o`, `.so`, `.a`), interleaved with symbol resolution
- Global transforms (section-based garbage collection, identical code folding, etc.)
- Create synthetic sections
- Map input sections and synthetic (linker-generated) sections into output sections
- Scan relocations
- Layout and assign addresses (thunks, sections with varying sizes, symbol assignments)
- Write headers
- Copy section contents to output and resolve relocations
Link clang (Release) with lld 13 and lld 17
In lld, the focus is more on parity of important features, ideal semantics, code readability, portability, and generality, and less on performance, specifically multi-threading performance. In some cases, simpler code is preferred over a small performance improvement.
`ld ref.o a.a`: the archive member `a.a(a.o)` is extracted. Each defined symbol in `a.a(a.o)` is unnecessarily interned when processing `a.a`'s archive symbol table.
```mermaid
stateDiagram
    Placeholder --> Undefined: Undefined(file=ref.o)
    extraction : Undefined pending extraction of a.a(a.o)
    Undefined --> extraction: LazyArchive(file=a.a), the intern result cannot be reused
    extraction --> Defined: Defined(file=a.a(a.o))
```
```cpp
void ArchiveFile::parse() {
  for (const Archive::Symbol &sym : file->symbols())
    symtab->addSymbol(LazyArchive{*this, sym});
  ...
}
```
```cpp
template <class ELFT> void ObjFile<ELFT>::initializeSymbols() {
  ...
  // Some entries have been filled by LazyObjFile.
  for (size_t i = firstGlobal, end = eSyms.size(); i != end; ++i)
    if (!symbols[i])
      symbols[i] = symtab.insert(CHECK(eSyms[i].getName(stringTable), this));
  ...
}
```
Handle `a.a` as a collection of lazy object files, like `--start-lib ... --end-lib`.

Archives and `--start-lib`
```mermaid
stateDiagram
    Placeholder --> Undefined: Undefined(file=ref.o)
    extraction : Undefined pending extraction of a.a(a.o)
    Undefined --> extraction: LazyObject(file=a.a(a.o)), the intern result can be reused
    extraction --> Defined: Defined(file=a.a(a.o))
```
```cpp
template <class ELFT> void ObjFile<ELFT>::parseLazy() {
  ...
  for (size_t i = firstGlobal, end = eSyms.size(); i != end; ++i) {
    if (eSyms[i].st_shndx == SHN_UNDEF)
      continue;
    symbols[i] = symtab.insert(CHECK(eSyms[i].getName(stringTable), this));
    symbols[i]->resolve(LazyObject{*this});
    // resolve() may trigger extraction and clear `lazy`; stop scanning then.
    if (!lazy)
      break;
  }
}
```
```cpp
template <class ELFT>
void ObjFile<ELFT>::initializeSymbols(const object::ELFFile<ELFT> &obj) {
  ...
  // Some entries have been filled by LazyObjFile.
  for (size_t i = firstGlobal, end = eSyms.size(); i != end; ++i)
    if (!symbols[i])
      symbols[i] = symtab.insert(CHECK(eSyms[i].getName(stringTable), this));
  ...
}
```
For an archive member:
- extracted: the new code is strictly better: it avoids a redundant hash table insertion and avoids accessing the archive symbol table
- not extracted: the new code accesses the individual ELF symbol tables (larger) instead of the compact archive symbol table
As of 2022-02, when linking Chrome in the default build, 79% of archive members are extracted according to `--print-archive-stats=`.
- initialization of sections (embarrassingly parallel)
- COMDAT group resolution
- initialization of local symbols (embarrassingly parallel)
- initialization of non-local symbols
- symbol resolution, possibly archive member extraction
mold has parallelized all the above.
- Initialization of non-local symbols and symbol resolution may affect semantics
- difficult to do with the current lld architecture
- some features make full parallelization difficult: `SHT_*_ATTRIBUTES`, `SHT_LLVM_DEPENDENT_LIBRARIES`
That said, I managed to parallelize all the "initialization" passes as of lld 16.
```cpp
for (size_t i = 0; i < files.size(); ++i)
  parseFile(files[i]);
```
Prior to lld 14, parsing a relocatable object file tangled together section initialization, symbol initialization, and symbol resolution.
```cpp
template <class ELFT> void ObjFile<ELFT>::parse(bool ignoreComdats) {
  // Initialize sections and handle COMDAT groups.
  initializeSections(ignoreComdats);
  // Initialize local and non-local symbols, and perform symbol resolution.
  // Lazy object file extraction may cause parsing of other files.
  initializeSymbols();
}
```
```mermaid
flowchart
    comdat["Handle section groups and resolve COMDAT"]
    nonLocalSymbols["Initialize non-local symbols (possibly interleaved with archive member extraction)"]
    section["(parallel) Initialize non-group sections and local symbols"]
    postParse["(parallel) postParse: report symbol errors, set symbol's section, and
change symbols defined in non-prevailing sections to Undefined"]
    comdat --> nonLocalSymbols
    nonLocalSymbols --> section
    section --> postParse
    style comdat fill:#eee
    style nonLocalSymbols fill:#eee
```
- Parallelize local symbol initialization: D119909
- Move section assignment from `initializeSymbols` to `postParse`: D120626
- Parallelize input section initialization: D130810
Prior to lld 14, lld processed one OutputSection at a time, and for each OutputSection wrote its input sections in parallel. This strategy does not leverage multi-threading well.
```cpp
template <class ELFT> void Writer<ELFT>::writeSections() {
  ...
  for (OutputSection *sec : outputSections)
    if (sec->type != SHT_REL && sec->type != SHT_RELA)
      sec->writeTo<ELFT>(Out::bufferStart + sec->offset);
}

template <class ELFT> void OutputSection::writeTo(uint8_t *buf) {
  ...
  parallelForEachN(0, sections.size(), [&](size_t i) {
    InputSection *isec = sections[i];
    isec->writeTo<ELFT>(buf);
  });
}
```
[ELF] Parallelize writes of different OutputSections
Partition input sections into ~4MiB clusters and parallelize writes of different output sections.
```cpp
template <class ELFT> void Writer<ELFT>::writeSections() {
  ...
  parallel::TaskGroup tg;
  for (OutputSection *sec : outputSections)
    if (sec->type != SHT_REL && sec->type != SHT_RELA)
      sec->writeTo<ELFT>(Out::bufferStart + sec->offset, tg);
}

template <class ELFT>
void OutputSection::writeTo(uint8_t *buf, parallel::TaskGroup &tg) {
  ...
  auto fn = [=](size_t begin, size_t end) {
    for (size_t i = begin; i != end; ++i)
      sections[i]->writeTo<ELFT>(buf);
  };
  const size_t taskSizeLimit = 4 << 20;
  const size_t numSections = sections.size();
  for (size_t begin = 0, i = 0, taskSize = 0;;) {
    taskSize += sections[i]->getSize();
    bool done = ++i == numSections;
    if (done || taskSize >= taskSizeLimit) {
      tg.spawn([=] { fn(begin, i); });
      if (done)
        break;
      begin = i;
      taskSize = 0;
    }
  }
}
```
For each relocation updating a `SHF_ALLOC` section, determine whether it resolves to a link-time constant or requires special treatment (GOT, PLT, dynamic relocation, copy relocation, etc.).
```cpp
RelocationScanner scanner;
for (InputSectionBase *sec : inputSections)
  if (sec->isLive() && (sec->flags & SHF_ALLOC))
    scanner.template scanSection<ELFT>(*sec);
```
```cpp
template <class ELFT, class RelTy>
static void processRelocAux(InputSectionBase &sec, RelExpr expr, RelType type,
                            uint64_t offset, Symbol &sym, const RelTy &rel,
                            int64_t addend) {
  if (isStaticLinkTimeConstant(expr, type, sym, sec, offset) ||
      (!config->isPic && sym.isUndefWeak())) {
    sec.relocations.push_back({expr, type, offset, addend, &sym});
    return;
  }
  bool canWrite = (sec.flags & SHF_WRITE) || !config->zText;
  if (canWrite) {
    RelType rel = target->getDynRel(type);
    if (expr == R_GOT || (rel == target->symbolicRel && !sym.isPreemptible)) {
      addRelativeReloc(&sec, offset, sym, addend, expr, type);
      return;
    }
    ...
  }
}
```
[ELF] Parallelize relocation scanning
```cpp
bool serial = !config->zCombreloc || config->emachine == EM_MIPS ||
              config->emachine == EM_PPC64;
parallel::TaskGroup tg;
for (ELFFileBase *f : ctx->objectFiles) {
  auto fn = [f]() {
    RelocationScanner scanner;
    for (InputSectionBase *s : f->getSections())
      if (s && s->kind() == SectionBase::Regular && s->isLive() &&
          (s->flags & SHF_ALLOC))
        scanner.template scanSection<ELFT>(*s);
  };
  tg.spawn(fn, serial);
}
```
```cpp
void RelocationScanner::processAux(RelExpr expr, RelType type, uint64_t offset,
                                   Symbol &sym, int64_t addend) const {
  ...
  if (needsGot(expr)) {
    if (config->emachine == EM_MIPS) {
      in.mipsGot->addEntry(*sec->file, sym, addend, expr);
    } else if (!sym.isTls() || config->emachine != EM_LOONGARCH) {
      sym.setFlags(NEEDS_GOT);
    }
  } else if (needsPlt(expr)) {
    sym.setFlags(NEEDS_PLT);
  } else if (LLVM_UNLIKELY(isIfunc)) {
    sym.setFlags(HAS_DIRECT_RELOC);
  }
}
```
`--build-id` was introduced by Fedora as an "approximation of true uniqueness across all binaries that might be used by overlapping sets of people".

`llvm/lib/Support/SHA1.cpp` is derived from NetBSD and is not sufficiently optimized.
[ELF] Implement --build-id={md5,sha1} with truncated BLAKE3
C++ [lex.string]:

> Whether all string-literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.

Linkers can eliminate duplicates.
```mermaid
pie title numbers of string literals of the given length range in a debug build of clang-16
    "1-3" : 393434
    "4-8" : 2084056
    "9-16" : 2846249
    "17-128" : 5598928
    "129-240" : 1317989
    "241-" : 328058
```
XXH3 is faster than XXH64.

https://github.com/Cyan4973/xxHash is complex (compiler workarounds and so on). Devin Hussey's xxhash-clean is clean but not well optimized.

Take the best part of xxHash and take simplification ideas from xxhash-clean:

- Make `XXH3_len_129to240_64b` and `XXH3_hashLong_64b` noinline
- Unroll `XXH3_len_17to128_64b`
Patches:
- [ELF] Move gotIndex/pltIndex/globalDynIndex to SymbolAux
- [ELF] Remove unneeded SyntheticSection memset(*, 0, *)
[ELF] Optimize --wrap to only check non-local symbols
```cpp
parallelForEach(objectFiles, [&](InputFile *file) {
  MutableArrayRef<Symbol *> syms = file->getMutableSymbols();
  for (size_t i = 0, e = syms.size(); i != e; ++i)
    if (Symbol *s = map.lookup(syms[i]))
      syms[i] = s;
});
```

```cpp
parallelForEach(objectFiles, [&](ELFFileBase *file) {
  for (Symbol *&sym : file->getMutableGlobalSymbols())
    if (Symbol *s = map.lookup(sym))
      sym = s;
});

MutableArrayRef<Symbol *> getMutableGlobalSymbols() {
  return makeMutableArrayRef(symbols.data(), symbols.size())
      .slice(firstGlobal);
}
```
COMMON symbols are obsolete. See All about COMMON symbols
[ELF] Optimize replaceCommonSymbols
```cpp
static void replaceCommonSymbols() {
  for (Symbol *sym : symtab->symbols()) {
    auto *s = dyn_cast<CommonSymbol>(sym);
    if (!s)
      continue;
    auto *bss = make<BssSection>("COMMON", s->size, s->alignment);
    bss->file = s->file;
    inputSections.push_back(bss);
    s->replace(Defined{s->file, s->getName(), s->binding, s->stOther, s->type,
                       /*value=*/0, s->size, bss});
  }
}
```
```cpp
class ELFFileBase : public InputFile {
  ...
public:
  bool hasCommonSyms = false;
};

static void replaceCommonSymbols() {
  for (ELFFileBase *file : objectFiles) {
    if (!file->hasCommonSyms)
      continue;
    for (Symbol *sym : file->getGlobalSymbols()) {
      auto *s = dyn_cast<CommonSymbol>(sym);
      if (!s)
        continue;
      auto *bss = make<BssSection>("COMMON", s->size, s->alignment);
      bss->file = s->file;
      inputSections.push_back(bss);
      s->replace(Defined{s->file, s->getName(), s->binding, s->stOther,
                         s->type, /*value=*/0, s->size, bss});
    }
  }
}
```
[ELF] Parallelize --compress-debug-sections=zlib

Split a section into 1MiB shards and call zlib `deflate` in parallel.

- Use `Z_SYNC_FLUSH` for all shards but the last, flushing the output to a byte boundary so it can be concatenated with the next shard
- Use `Z_FINISH` for the last shard, setting the `BFINAL` flag to indicate the end of the output stream (per RFC 1951)
mold links against mimalloc by default. Build llvm-project with `-DCMAKE_EXE_LINKER_FLAGS=/path/to/libmimalloc.a` for a relatively fair comparison.
Link a `-DCMAKE_BUILD_TYPE=Release` build of clang:

```text
Benchmark 1: numactl -C 20-27 /tmp/out/custom-13/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):     812.0 ms ±   5.1 ms    [User: 646.8 ms, System: 644.0 ms]
Benchmark 2: numactl -C 20-27 /tmp/out/custom-17/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):     525.7 ms ±   9.1 ms    [User: 571.9 ms, System: 700.1 ms]
Benchmark 1: numactl -C 20-27 ~/Dev/mold/out/release/mold --no-fork @response.txt --threads=8
  Time (mean ± σ):     230.6 ms ±   2.8 ms    [User: 759.7 ms, System: 349.4 ms]
```
Link a `-DCMAKE_BUILD_TYPE=RelWithDebInfo` build of clang:

```text
Benchmark 1: numactl -C 20-27 /tmp/out/custom-13/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      1.604 s ±  0.023 s    [User: 2.887 s, System: 2.675 s]
Benchmark 2: numactl -C 20-27 /tmp/out/custom-17/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      1.263 s ±  0.038 s    [User: 2.595 s, System: 2.878 s]
Benchmark 1: numactl -C 20-27 ~/Dev/mold/out/release/mold --no-fork @response.txt --threads=8
  Time (mean ± σ):     755.1 ms ±   6.1 ms    [User: 3276.0 ms, System: 1047.7 ms]
```
Link a release-with-debug-info build of scylladb:

```text
Benchmark 1: numactl -C 20-27 /tmp/out/custom-13/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      2.609 s ±  0.032 s    [User: 8.913 s, System: 3.375 s]
Benchmark 2: numactl -C 20-27 /tmp/out/custom-17/bin/ld.lld @response.txt --threads=8
  Time (mean ± σ):      1.853 s ±  0.012 s    [User: 6.999 s, System: 3.772 s]
Benchmark 1: numactl -C 20-27 ~/Dev/mold/out/release/mold --no-fork @response.txt --threads=8
  Time (mean ± σ):      1.371 s ±  0.011 s    [User: 7.429 s, System: 1.474 s]
```
`template <class ELFT, class RelTy> InputSection::relocateNonAlloc` has performance problems:

- arch-specific checks in generic code: the `R_386_GOTPC` workaround, x86 `R_SIZE`, RISC-V ADD/SUB relocations, (future) RISC-V ULEB128
- virtual function overhead (`X86`, `X86_64`, `ARM`, `AArch64`, etc.: `TargetInfo`)

```cpp
addend += target.getImplicitAddend(bufLoc, type);
RelExpr expr = target.getRelExpr(type, sym, bufLoc);
target.relocateNoSym(bufLoc, type, SignExtend64<bits>(sym.getVA(addend)));
```
How can we make `InputSection::relocateNonAlloc` arch-specific (`lld/ELF/Arch/*.cpp`) and eliminate the `target` dispatch overhead? The result should resemble the arch-specific `InputSection::relocateAlloc`.