MaskRay/lto.md

## lto.md

      
Concept

An LTO unit is the subset of the linkage unit that is linked together using link-time optimization.
Add module

LTO::add(...) {
}
Symbol resolution


'p': Prevailing. Only one module can be prevailing.
Dead stripping should retain non-prevailing linkonce_odr/weak_odr/available_externally symbols.
Mixing ELF objects and IR objects can cause non-prevailing state within IR objects.
We want to retain them for interprocedural optimizations.
Non-prevailing definitions of other linkages can be discarded.
'l': FinalDefinitionInLinkageUnit. Set dso_local.
'x': VisibleToRegularObj
'r': LinkerRedefined. If Prevailing, set the weak linkage.
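
As a rough illustration of how these letters combine, the sketch below decodes a flag string such as `plx` into the four bits listed above. The struct and function names are hypothetical and are not LLVM's API; only the letter-to-bit mapping comes from the list.

```cpp
#include <cassert>
#include <string>

// Hypothetical mirror of the four resolution bits listed above.
struct ResolutionBits {
  bool Prevailing = false;
  bool FinalDefinitionInLinkageUnit = false;
  bool VisibleToRegularObj = false;
  bool LinkerRedefined = false;
};

// Decode a flag string such as "plx" into the bits it sets.
ResolutionBits parseResolutionFlags(const std::string &Flags) {
  ResolutionBits R;
  for (char C : Flags) {
    switch (C) {
    case 'p': R.Prevailing = true; break;
    case 'l': R.FinalDefinitionInLinkageUnit = true; break;
    case 'x': R.VisibleToRegularObj = true; break;
    case 'r': R.LinkerRedefined = true; break;
    default: assert(false && "unknown resolution flag"); break;
    }
  }
  return R;
}
```

For example, `parseResolutionFlags("plx")` sets every bit except LinkerRedefined.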

GlobalResolution


lto::LTO::GlobalResolution::Partition

External: VisibleToRegularObj or LinkerRedefined or llvm.used or llvm.compiler.used or referenced by a different partition
0: regular LTO
1~n: Thin LTO


lto::LTO::GlobalResolution::VisibleOutsideSummary: VisibleToRegularObj or llvm.used or llvm.compiler.used or has no summary. Affects GUIDPreservedSymbols.
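
A minimal sketch of the partition rule, assuming two sentinel values for "not yet seen" and "External". The names `GlobalRes`, `notePartition`, and the sentinel constants are made up for illustration, and the llvm.used/llvm.compiler.used cases are omitted.

```cpp
#include <cassert>

// Hypothetical sentinels; real partition numbers are 0 (regular LTO) and 1..n (ThinLTO modules).
constexpr unsigned UnknownPartition = ~0u;
constexpr unsigned ExternalPartition = ~0u - 1;

struct GlobalRes {
  unsigned Partition = UnknownPartition;
};

// Record that a module in partition P defines or references the symbol.
void notePartition(GlobalRes &G, unsigned P, bool VisibleToRegularObj,
                   bool LinkerRedefined) {
  assert(P != UnknownPartition && P != ExternalPartition);
  // A symbol already seen in a different partition (or already External) becomes External.
  bool SeenElsewhere = G.Partition != UnknownPartition && G.Partition != P;
  if (VisibleToRegularObj || LinkerRedefined || SeenElsewhere)
    G.Partition = ExternalPartition;
  else
    G.Partition = P;
}
```

Once a symbol reaches the External state in this model it stays there, which is what makes it ineligible for internalization later.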

LTO::run() {
  for (auto &Res : *GlobalResolutions) {
    if (Res.second.IRName.empty())
      continue;
    if (Res.second.VisibleOutsideSummary && Res.second.Prevailing)
      GUIDPreservedSymbols.insert(GUID);
    if (Res.second.ExportDynamic)
      DynamicExportSymbols.insert(GUID);
    GUIDPrevailingResolutions[GUID] = Res.second.Prevailing ? PrevailingType::Yes : PrevailingType::No;
  }

  // Walk refs() and calls() edges and set liveness for non-alias symbols.
  // For non-prevailing symbols, we only set live if linkonce_odr/weak_odr/available_externally.
  // If OptLevel > 0, propagate attributes.
  computeDeadSymbolsWithConstProp(ThinLTO.CombinedIndex, GUIDPreservedSymbols, isPrevailing, Conf.OptLevel > 0);

  runRegularLTO(...);
  runThinLTO(...);
}
lto::runRegularLTO()
  updateVCallVisibilityInModule(...);

  print 0.preopt

  for (auto &R : GlobalResolutions) {
    if (R.second.Partition != 0 && R.second.Partition != GlobalResolution::External)
      continue;
    if (!GV || GV->hasLocalLinkage() || GV->isDeclaration())
      continue;
    if (EnableLTOInternalization && R.second.Partition == 0)
      GV->setLinkage(GlobalValue::InternalLinkage);
  }

  RegularLTO.CombinedModule->addModuleFlag(Module::Error, "LTOPostLink", 1);

  print 2.internalize
  backend(...);
LTO::runThinLTO()
  If -dump-thin-cg-sccs, dump SCC.

  dump combined index file *.index.bc

  ThinLTO.CombinedIndex.collectDefinedGVSummariesPerModule(ModuleToDefinedGVSummaries);

  whole program devirt

  // If -O[123], compute import and export lists ignoring dead global values.
  // -O0 suppresses imports/exports. The import suppression is similar to -import-instr-limit=0.
  if (Conf.OptLevel > 0)
    ComputeCrossModuleImport
  // Compute ExportedGUIDs for prevailing External partition symbols.
  for (auto &Res : GlobalResolutions)
    if (Res.second.Partition == GlobalResolution::External && Res.second.isPrevailingIRSymbol() &&
        ThinLTO.CombinedIndex.isGUIDLive(GUID))
      ExportedGUIDs.insert(GUID);

  // thinLTOInternalizeAndPromoteInIndex(...);
  // Set linkage on ModuleSummaryIndex.
  for (auto &I : ThinLTO.CombinedIndex) {
    ValueInfo VI = ThinLTO.CombinedIndex.getValueInfo(I);
    for (auto &S : VI.getSummaryList())
      if (isExported(S->modulePath(), VI)) {
        if (isLocalLinkage(S->linkage()))
          S->setLinkage(ExternalLinkage);
      } else if (-mllvm=-enable-lto-internalization && (external||linkonce_odr||weak_odr||linkonce||weak||common)) {
        int ExternallyVisibleCopies = non-local-linkage copies in the summary list;
        if (external || prevailing && ExternallyVisibleCopies == 1)
          S->setLinkage(InternalLinkage);
      }
  }

  // thinLTOResolvePrevailingInIndex
  Change prevailing linkonce{,_odr} to weak{,_odr}.
  Change non-prevailing non-local non-alias values to available_externally.
  Compute visibility.

  // Propagate attributes such as memory(none), memory(read), norecurse, and nounwind (https://reviews.llvm.org/D36850)
  thinLTOPropagateFunctionAttrs(ThinLTO.CombinedIndex, isPrevailing);

  // For AddressSanitizer
  generateParamAccessSummary(ThinLTO.CombinedIndex);

  // If cache-dir, try cache first before spawning thinBackend.
  ThinLTO.Backend(Conf, ThinLTO.CombinedIndex, ModuleToDefinedGVSummaries, AddStream, Cache);
lto::thinBackend()
  Mod.setPartialSampleProfileRatio(CombinedIndex);

  print 0.preopt

  // For exporting.
  // Set "thinlto-internalize" attribute on readonly or writeonly GlobalVariables in ModuleSummaryIndex.
  // For GlobalValues which should be promoted, change linkage from internal/private to external and set HiddenVisibility.
  // If ClearDSOLocalOnDeclarations is enabled, clear dso_local if GV.isDeclarationForLinker();
  // otherwise, set dso_local if applicable.
  // If isDeclarationForLinker(), drop comdat.
  renameModuleForThinLTO(Mod, CombinedIndex, ClearDSOLocalOnDeclarations);

  dropDeadSymbols(Mod, DefinedGlobals, CombinedIndex); // if -compute-dead=1

  // Set the most constraining visibility.
  // Set visibility/linkage if GV is definition and both original linkage and summary linkage are non-local.
  //
  // If the original linkage is interposable and the new linkage is available_externally, convert to declaration instead
  // (otherwise interposability is lost and incorrect inlining may happen).
  // If the new linkage is weak_odr and summary index can auto hide, (original linkage must be linkonce_odr & unnamed_addr),
  // set HiddenVisibility.
  //
  // If isDeclarationForLinker(), drop comdat.
  thinLTOFinalizeInModule(Mod, DefinedGlobals, /*PropagateAttrs=*/true);

  print 1.promote

  // For definitions that are not available_externally/dllexport/isExternallyInitializedConstant
  // and are not always preserved (e.g. llvm.used, llvm.global_ctors, __stack_chk_fail), if the linkage
  // recorded in the GlobalValueSummary map is local, internalize it.
  if (!DefinedGlobals.empty())
    thinLTOInternalizeModule(Mod, DefinedGlobals);

  print 2.internalize

  // Set "thinlto-internalize" attribute on readonly or writeonly GlobalVariables in ModuleSummaryIndex.
  // If -enable-import-metadata=true (default: false), set !thinlto_src_module on imported Function.
  //
  // Call renameModuleForThinLTO for importing.
  // For imported definition, change external/linkonce_odr/promoted local non-GlobalAlias to available_externally, available_externally definition to external declaration, weak_odr to external, local (if not promoted) to local, common to common.
  // Non-imported definitions may be set to Hidden.
  // Set or clear dso_local.
  // If isDeclarationForLinker(), drop comdat.
  //
  // if !GV.isDeclaration() && GV.hasAttribute("thinlto-internalize"), internalize.
  Importer.importFunctions(Mod, ImportList)

  updateMemProfAttributes(Mod, CombinedIndex);
  updatePublicTypeTestCalls(Mod, CombinedIndex.withWholeProgramVisibility());

  print 3.import

  // available_externally may be converted to declarations.
  Call lto::opt with ImportSummary and no ExportSummary.
  print 4.opt

  print 5.precodegen
  Run codegen passes.
// Compute import and export lists.
llvm::ComputeCrossModuleImport()
  for (const auto &DefinedGVSummaries : ModuleToDefinedGVSummaries)
    ComputeImportForModule(...)

ComputeImportForModule()
  ...
  SmallVector<EdgeInfo, 128> Worklist;
  for (const auto &GVSummary : DefinedGVSummaries) {
    ...
    Call computeImportForFunction
  }
  while (!Worklist.empty()) {
    Worklist.pop_back_val();
    Call computeImportForFunction
  }
  if (-mllvm=-print-import-failures) {
    dbgs() << "Missed imports into module " << ModName << "\n";
    ...
  }

computeImportForFunction()
  for (const auto &Edge : Summary.calls()) {
    auto NewThreshold = -import-instr-limit value * (critical ? 100 : hot ? 10 : !cold ? 1 : 0);
    Call selectCallee for VI.getGUID().

    if (ExportLists)
      (*ExportLists)[ExportModulePath].insert(VI);

    // Insert the newly imported function to the worklist.
    auto AdjThreshold = IsHotCallsite ? Threshold : Threshold * 0.7;
    Worklist.emplace_back(ResolvedCalleeSummary, AdjThreshold);
  }

selectCallee()
  for (auto QualifiedValue : qualifyCalleeCandidates(Index, CalleeSummaryList, CallerModulePath)) {
    if (Reason != FunctionImporter::ImportFailureReason::None)
      continue;
    auto *Summary = cast<FunctionSummary>(QualifiedValue.second->getBaseObject());
    if ((Summary->instCount() > Threshold) && !Summary->fflags().AlwaysInline &&
        !ForceImportAll) {
      Reason = FunctionImporter::ImportFailureReason::TooLarge;
      continue;
    }
    if (Summary->fflags().NoInline && !ForceImportAll) {
      Reason = FunctionImporter::ImportFailureReason::NoInline;
      continue;
    }
    return Summary;
  }

FunctionImporter::importFunctions()
  SetVector<GlobalValue *> GlobalsToImport;
  // Collect the Function/GlobalVariable/GlobalAlias values to be imported into GlobalsToImport.

  UpgradeDebugInfo(*SrcModule);

  SrcModule->setPartialSampleProfileRatio(Index);

  // For importing (isPerformingImport()).
  // For an imported definition (one of external/linkonce_odr/weak_odr/internal/private/available_externally),
  // change the linkage to available_externally.
  // If ClearDSOLocalOnDeclarations is enabled, clear dso_local if GV.isDeclarationForLinker() || !doImportAsDefinition(&GV);
  // otherwise, set dso_local if applicable.
  // If isDeclarationForLinker(), drop comdat.
  renameModuleForThinLTO(*SrcModule, Index, ClearDSOLocalOnDeclarations, &GlobalsToImport);

  // Handle -print-imports

  Mover.move(std::move(SrcModule), GlobalsToImport.getArrayRef(), nullptr, /*IsPerformingImport=*/true);
foo.5.precodegen.bc can be fed into llc to test codegen issues.
output.index.bc is a bitcode file that contains the combined module summary index.
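
The liveness walk described for computeDeadSymbolsWithConstProp in LTO::run above can be pictured as a plain worklist traversal from the preserved symbols. This is a simplified sketch under stated assumptions: `Graph` and `computeLive` are hypothetical, symbols are identified by name rather than GUID, and the prevailing/non-prevailing distinction and attribute propagation are left out.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the combined index: symbol -> outgoing ref/call edges.
using Graph = std::map<std::string, std::vector<std::string>>;

// Mark everything reachable from the preserved roots as live.
std::set<std::string> computeLive(const Graph &Edges,
                                  const std::set<std::string> &Preserved) {
  std::set<std::string> Live;
  std::vector<std::string> Worklist(Preserved.begin(), Preserved.end());
  while (!Worklist.empty()) {
    std::string Sym = Worklist.back();
    Worklist.pop_back();
    if (!Live.insert(Sym).second)
      continue; // already visited
    auto It = Edges.find(Sym);
    if (It == Edges.end())
      continue; // no summary: nothing to walk
    for (const std::string &Target : It->second)
      Worklist.push_back(Target);
  }
  assert(Live.size() >= Preserved.size()); // every root is live
  return Live;
}
```

Anything absent from the returned set is a candidate for dropDeadSymbols in the backend.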
Symbols unknown to the IR symbol table

Middle-end library function optimizations may reference a runtime library function that is not in the referencer's symbol table.
;--- a.ll
define void @_start(ptr %a, ptr %b) {
entry:
  call void @llvm.memcpy.p0.p0.i64(ptr %a, ptr %b, i64 1024, i1 false)
  ret void
}

;--- b.ll
define void @memcpy() {
  ret void
}

If the definition is provided by a lazy bitcode file, we will find that we need to extract the lazy bitcode file after LTO compilation.
However, the bitcode file did not participate in the LTO compilation and our model does not allow repeated LTO compilation.
As a result, the runtime library function will be either undefined or defined as a symbol without an associated section.
To address this issue, we make two changes:

Change the linker to extract all runtime library functions defined in lazy bitcode files. We cannot foretell what runtime library functions will be referenced, so we conservatively retain all (https://reviews.llvm.org/D50017).
Set the VisibleToRegularObj bit for all runtime library functions in an IR symbol table. This prevents the symbol from being internalized or discarded.

Similar to middle-end library function optimizations, codegen passes may reference a runtime library function that is not in the referencer's symbol table, e.g. __stack_chk_guard.
Importing

There are many reasons that a function/global variable cannot be imported.

A local linkage variable with an explicit section cannot be imported.
If function-local inline assembly references a local symbol defined in module-level inline assembly, the function cannot be imported.
Conservatively, if module-level inline assembly defines a local symbol, all functions with inline assembly are marked as not eligible.
A variable containing blockaddress cannot be imported.
Similarly, a function that indirectly branches to a blockaddress cannot be imported.

-thinlto-workload-def= (#74545)
Internalize

TODO vtable internalization
Two usages

In-process ThinLTO

aka implicit ThinLTO.
clang -flto=thin -c a.c b.c
clang -fuse-ld=lld -flto=thin a.o b.o
The initial compilation creates bitcode files as linker input.
Clang calls clang/lib/CodeGen/BackendUtil.cpp EmitAssemblyHelper::RunOptimizationPipeline -> llvm/lib/Passes/PassBuilderPipelines.cpp PassBuilder::buildThinLTOPreLinkDefaultPipeline.
lld calls lld/ELF/Driver.cpp compileBitcodeFiles -> lld/ELF/LTO.cpp compile -> llvm/lib/LTO/LTO.cpp LTO::run.
A backend calls PassBuilder::buildThinLTODefaultPipeline.
https://reviews.llvm.org/D87966 introduced a module ordering to improve parallelism.
The concurrency can be set by --thinlto-jobs= (--plugin-opt=jobs=).
The Clang Driver option -flto-jobs= passes the linker option to the linker.
Distributed ThinLTO

Let's say we want to compile a.c, b.c, and c.c with LTO and link them with two ELF relocatable files elf0.o and elf1.o.
We link LLVM bitcode files b.o and c.o as lazy files, which have archive semantics (surrounded by --start-lib and --end-lib).
echo 'int bb(); int main() { return bb(); }' > a.c
echo 'int elf0(); int bb() { return elf0(); }' > b.c
echo 'int cc() { return 0; }' > c.c
echo 'int elf0() { return 0; }' > elf0.c && clang -c elf0.c
echo '' > elf1.c && clang -c elf1.c

clang -c -O2 -flto=thin a.c b.c c.c
clang -flto=thin -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp,--thinlto-emit-imports-files -Wl,--thinlto-prefix-replace=';lto/' elf0.o a.o -Wl,--start-lib b.o c.o -Wl,--end-lib elf1.o
clang -c -O2 -fthinlto-index=lto/a.o.thinlto.bc a.o -o lto/a.o
clang -c -O2 -fthinlto-index=lto/b.o.thinlto.bc b.o -o lto/b.o
clang -c -O2 -fthinlto-index=lto/c.o.thinlto.bc c.o -o lto/c.o
clang -fuse-ld=lld @a.rsp elf0.o elf1.o  # a.rsp contains lto/a.o and lto/b.o
--thinlto-index-only (--plugin-opt=thinlto-index-only) performs a thin link.
Here we use a variant --thinlto-index-only=a.rsp which additionally creates a response file.
The response file lists ELF relocatable files whose names are derived from the input file names. Unextracted lazy LLVM bitcode files are omitted.
The generated *.thinlto.bc files are special bitcode files that contain minimum metadata information and a module summary.
c.o is an unextracted lazy LLVM bitcode file. It gets a nearly empty .thinlto.bc.
If --thinlto-emit-imports-files is specified, ld.lld will create import files lto/[abc].o.imports.
lto/a.o.imports lists files from which compiling a.o will import.
lto/c.o.imports will be empty: the build system does not need to know whether a lazy LLVM bitcode file is extracted or not.
clang -fthinlto-index= calls clang/lib/CodeGen/BackendUtil.cpp clang::EmitBackendOutput -> runThinLTOBackend -> llvm/lib/LTO/LTOBackend.cpp lto::thinBackend.
The response file @a.rsp is reordered before all ELF relocatable files. This may cause strange behaviors in presence of ODR violations.
Archives are unsupported because it's unclear where to emit the generated files for an archive member and clang -fthinlto-index= doesn't support taking an archive member as input. See https://discourse.llvm.org/t/running-distributed-thinlto-without-thin-archives/52261.
--lto-obj-path= sets c.AlwaysEmitRegularLTOObj and forces a regular LTO backend compile.
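
The file names above follow from a prefix replacement on the input path. Here is a minimal sketch of that naming, assuming the `old;new` form of --thinlto-prefix-replace; the helper names are hypothetical and only model the string manipulation.

```cpp
#include <cassert>
#include <string>

// If Path starts with OldPrefix, swap that prefix for NewPrefix; otherwise keep Path.
std::string replacePrefix(const std::string &Path, const std::string &OldPrefix,
                          const std::string &NewPrefix) {
  if (Path.compare(0, OldPrefix.size(), OldPrefix) != 0)
    return Path;
  return NewPrefix + Path.substr(OldPrefix.size());
}

// Name of the per-module index file emitted by the thin link.
std::string indexFileFor(const std::string &Path, const std::string &OldPrefix,
                         const std::string &NewPrefix) {
  assert(!Path.empty());
  return replacePrefix(Path, OldPrefix, NewPrefix) + ".thinlto.bc";
}

// Name of the imports file emitted with --thinlto-emit-imports-files.
std::string importsFileFor(const std::string &Path, const std::string &OldPrefix,
                           const std::string &NewPrefix) {
  assert(!Path.empty());
  return replacePrefix(Path, OldPrefix, NewPrefix) + ".imports";
}
```

With `';lto/'` the old prefix is empty, so `indexFileFor("a.o", "", "lto/")` yields `lto/a.o.thinlto.bc`, matching the backend compile commands above.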
Minimized bitcode files

When compiling a source file with -flto=thin, we additionally specify -fthin-link-bitcode= to create a minimized bitcode file.
(See https://reviews.llvm.org/D31027 and https://reviews.llvm.org/D31050. 1dec57d5b0fb6b7044c9afa80e7c3b6295d36fd3 for size reduction.)
Here is an example (simplified from how Bazel implements distributed ThinLTO):
for i in a b c; do clang -c -O2 -flto=thin -fthin-link-bitcode=$i.min.o $i.c; done
clang -flto=thin -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp -Wl,--thinlto-prefix-replace=';lto/',--thinlto-object-suffix-replace='.min.o;.o' elf0.o a.min.o -Wl,--start-lib [bc].min.o -Wl,--end-lib elf1.o
clang -c -O2 -fthinlto-index=lto/a.o.thinlto.bc a.o -o lto/a.o
clang -c -O2 -fthinlto-index=lto/b.o.thinlto.bc b.o -o lto/b.o
clang -c -O2 -fthinlto-index=lto/c.o.thinlto.bc c.o -o lto/c.o
clang -fuse-ld=lld @a.rsp elf0.o elf1.o  # a.rsp contains lto/a.o and lto/b.o
In the thin link, we specify --thinlto-object-suffix-replace=.min.o;.o to change IR module names.
In the emitted index files (*.thinlto.bc) and import files (*.imports), we will see [abc].o instead of [abc].min.o.
% llvm-bcanalyzer --dump lto/a.o.thinlto.bc | grep '<ENTRY'
    <ENTRY abbrevid=6 op0=0 op1=97 op2=46 op3=111/> record string = 'a.o'
    <ENTRY abbrevid=6 op0=1 op1=98 op2=46 op3=111/> record string = 'b.o'
% llvm-dis lto/a.o.thinlto.bc -o -

^0 = module: (path: "a.o", hash: (779884205, 3375561811, 318634325, 320943247, 1976514647))
^1 = module: (path: "b.o", hash: (2355201238, 1356860354, 536522957, 2663460937, 3889312581))
^2 = gv: (guid: 580323572204678433, summaries: (function: (module: ^1, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 1, dsoLocal: 1, canAutoHide: 0), insts: 2, funcFlags: (readNone: 0, readOnly: 0, noRecurse: 0, returnDoesNotAlias: 0, noInline: 0, alwaysInline: 0, noUnwind: 1, mayThrow: 0, hasUnknownCall: 0, mustBeUnreachable: 0))))
^3 = gv: (guid: 15822663052811949562, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 1, dsoLocal: 1, canAutoHide: 0), insts: 2, funcFlags: (readNone: 0, readOnly: 0, noRecurse: 0, returnDoesNotAlias: 0, noInline: 0, alwaysInline: 0, noUnwind: 1, mayThrow: 0, hasUnknownCall: 0, mustBeUnreachable: 0), calls: ((callee: ^2, tail: 1)))))
^4 = flags: 97
^5 = blockcount: 0

opt -thinlto-bc -thin-link-bitcode-file= also generates a minimized bitcode file.
Note, thin links only require minimized bitcode files, while backend compiles only require full bitcode files.
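
The effect of --thinlto-object-suffix-replace on module names is a plain suffix swap. A small sketch with a hypothetical helper; it leaves paths without the suffix unchanged, and what a real linker does in that case is not modelled here.

```cpp
#include <cassert>
#include <string>

// Replace a trailing OldSuffix with NewSuffix; paths without the suffix are returned as-is.
std::string replaceSuffix(const std::string &Path, const std::string &OldSuffix,
                          const std::string &NewSuffix) {
  assert(!OldSuffix.empty());
  if (Path.size() < OldSuffix.size() ||
      Path.compare(Path.size() - OldSuffix.size(), OldSuffix.size(), OldSuffix) != 0)
    return Path;
  return Path.substr(0, Path.size() - OldSuffix.size()) + NewSuffix;
}
```

`replaceSuffix("a.min.o", ".min.o", ".o")` gives `a.o`, which is the module path recorded in the index dump above.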
In Bazel, when we have an action that may take a tree artifact as an input, the action takes as input all the files inside the directory of the tree artifact or none of them.
If these bitcode files are within the same directory, Bazel will have to ship both full and minimized bitcode files to a backend compile action, causing waste.
To ship just full bitcode files in a backend compile action, full and minimized bitcode files need to be in different directories (e.g. a.bc and minimized/a.bc).
While we're here, let's make two more changes. First, let's use .bc instead of .o for the bitcode file extension name.
Second, let's place the initially compiled bitcode files in thin/ instead of the current working directory for clarity.
mkdir -p thin minimized
for i in a b c; do clang -c -O2 -flto=thin -fthin-link-bitcode=minimized/$i.bc $i.c -o thin/$i.bc; done
clang -flto=thin -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp -Wl,--thinlto-prefix-replace='thin/;index/;obj/;minimized/' elf0.o minimized/a.bc -Wl,--start-lib minimized/[bc].bc -Wl,--end-lib elf1.o
clang -c -O2 -fthinlto-index=index/a.bc.thinlto.bc thin/a.bc -o obj/a.o  # index/a.bc.imports specifies the needed full bitcode files, e.g. thin/a.bc
clang -c -O2 -fthinlto-index=index/b.bc.thinlto.bc thin/b.bc -o obj/b.o
clang -c -O2 -fthinlto-index=index/c.bc.thinlto.bc thin/c.bc -o obj/c.o
clang -fuse-ld=lld @a.rsp elf0.o elf1.o  # a.rsp specifies obj/a.o and obj/b.o
In -Wl,--thinlto-prefix-replace=';index/;obj/;minimized/', we introduce two new components.
The third component, obj/, specifies the output directory for the backend compile.
For tree artifacts, an output directory can only be used by one action type.
Since index/ is created by the thin link, we cannot place backend compiled ELF relocatable object files (*.o) there.
Backend compiled ELF relocatable object files have to use a separate directory (obj/).
The fourth component, minimized/, specifies the directory for minimized bitcode files and replaces the previous --thinlto-object-suffix-replace='.min.o;.o'.
When ld.lld communicates module names to LLVM LTO, it forms the module identifier a.bc from the input filename minimized/a.bc (strip the first component (an empty string) and prepends the fourth component minimized/).
Pipelines

Pre-link
ThinLTO pre-link compiles invoke buildThinLTOPreLinkDefaultPipeline.
buildModuleSimplificationPipeline is important to ensure that the thin link
will make a better decision on function importing.
Some function attributes are important for the thin link to infer attributes.
Note: OptimizerEarlyEPCallbacks and OptimizerLastEPCallbacks are called here.
Clang can't add post-link OptimizerLastEPCallbacks for in-process ThinLTO, so deferring them to post-link would miss sanitizer instrumentation.
In addition, pre-link IR simplification makes the linker-input bitcode smaller.
However, pre-link compiles don't do buildModuleOptimizationPipeline, which will be made redundant after function importing and inlining (i.e. less return-on-investment).
Pre-link compilation is the phase that offers the best scalability (it can be distributed and cached).
Without distributed ThinLTO (which imposes rigid requirements on build system support), post-link compiles often don't scale as well as wanted.
Import
Post-link
debug

% llvm-lto -thinlto -print-summary-global-ids a.bc -o /dev/null
GUID 12110082535487358335(12110082535487358335) is A
GUID 15822663052811949562(15822663052811949562) is main

find ./ -iname '*.o' | xargs llvm-modextract -n=0 -o - | llvm-lto -thin-lto -print-summary-global-ids - >& hashnames.txt
FunctionImport

The clang -fthinlto-index= step can use -mllvm=-print-imports to dump the imported functions.
D28806: allow linkonce/weak non-prevailing definitions to be marked as available_externally in the summary index. LTOBackend thinLTOResolvePrevailingInModule converts the definitions to declarations.
Because ThinLTO cache skips thinBackend, --save-temps will not produce .0.preopt.bc and various temporary files.
--thinlto-cache-dir
Implicit ThinLTO and distributed ThinLTO are two different build system strategies.
My understanding of distributed ThinLTO is: explicit and separate thin link step (-Wl,--thinlto-index-only (-Wl,-plugin-opt=thinlto-index-only)) + (a clang process for each translation unit) clang -fthinlto-index=a.thinlto.bc a.o (a.o is -flto=thin compile output; a.thinlto.bc is thinlto-index-only output).
In implicit ThinLTO, the thin link, ELF object file generation by the ThinLTO backend, and the final ELF link all happen in one ld.lld/LLVMgold.so process.
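
The summary-index linkage resolution mentioned for D28806 above (and in the thinLTOResolvePrevailingInIndex notes) can be summarized as a small mapping. This sketch follows only the two rules stated in these notes; the enum and function are hypothetical, and conditions the real implementation may add (for example, when a prevailing linkonce copy actually needs to be kept) are not modelled.

```cpp
#include <cassert>

// Hypothetical subset of linkages relevant to the rule.
enum class Linkage {
  External, LinkOnceAny, LinkOnceODR, WeakAny, WeakODR,
  AvailableExternally, Internal
};

// Linkage recorded in the index after prevailing resolution.
Linkage resolveLinkage(Linkage L, bool IsPrevailing, bool IsAlias) {
  if (L == Linkage::Internal)
    return L; // local symbols are not resolved by the linker
  if (IsPrevailing) {
    // Prevailing linkonce{,_odr} becomes weak{,_odr} so the copy is not dropped.
    if (L == Linkage::LinkOnceAny) return Linkage::WeakAny;
    if (L == Linkage::LinkOnceODR) return Linkage::WeakODR;
    return L;
  }
  if (IsAlias)
    return L; // aliases are left alone
  // Non-prevailing, non-local, non-alias copies only serve optimization.
  assert(L != Linkage::Internal);
  return Linkage::AvailableExternally;
}
```

The backend then turns such available_externally copies into declarations when that is required for correctness, as noted for interposable linkages.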
Inline assembly

LLVM IR's inline assembly representation does not track the list of defined and undefined symbols.
This has caused several problems and LTO parses some assembly IR to alleviate the problems.
We know that ThinLTO needs a module summary. Regular LTO needs a module summary as well for a non-Apple triple (shouldEmitRegularLTOSummary).
When emitting a module summary, ModuleSymbolTable::CollectAsmSymbols is called to parse module-level inline assembly and collects defined and undefined symbols.
Function-local inline assembly is not parsed.
For ThinLTO, the collected symbols are serialized in the output bitcode file.
Symbols in the unprocessed function-local inline assembly are not collected into the IR symbol table.
For regular LTO, the symbols are not serialized, but they are collected again at link time (addRegularLTO=>ModuleSymbolTable::addModule=>CollectAsmSymbols).

Inaccurate target CPU/features

At -c -flto={full,thin} time, ModuleSymbolTable::CollectAsmSymbols needs a MCSubtargetInfo to parse inline assembly.
It uses an empty feature string and therefore certain instructions needing specific features cannot be parsed, causing Issue #67698 (discussion).
cat > a.c <<e
asm(".globl func; func: cm.mvsa01 s1, s0; ret");
int main() { return 0; }
e
clang --target=riscv64 -march=rv64izcmp -fuse-ld=lld -nostdlib a.c
clang --target=riscv64 -march=rv64izcmp a.c -c -flto      # error: instruction requires the following: 'Zcmp' (sequenced instuctions for code-size reduction)
clang --target=riscv64 -march=rv64izcmp a.c -c -flto=thin # error: instruction requires the following: 'Zcmp' (sequenced instuctions for code-size reduction)
Clang 18 and later suppress the output and exit with status 1 for the -flto compiles.
Even if we manage to create a bitcode file, the LTO link will cause the same error.
For regular LTO, CollectAsmSymbols is called again and fails for the same reason due to an empty feature string.
For ThinLTO, thinBackend constructs the feature string from the default target features from the triple plus Conf.MAttrs specified features.
The feature string is used to create MCSubtargetInfo and TargetMachine.
In code generation, AsmPrinter::emitInlineAsm will fail due to missing the required feature.
Since Conf.MAttrs is consulted, the failure can be worked around using ld.lld a.o -mllvm='-mattr=+zcmp'.
Sometimes -Wl,-mllvm,-mattr=+c,-mllvm,-mattr=+relax is needed to enable C and relaxation.
Unknown definitions in function-local inline assembly

If a symbol is defined in function-local inline assembly and referenced by a bitcode file, lld will see an undefined symbol.
After bitcode compilation, the output file will contain the definition. The definition will be used to resolve references.
This scheme works in ELF and Mach-O as they use the model that an undefined symbol is only an error when referenced.
There is no relocation information before bitcode compilation and therefore ELF/Mach-O ports just don't report errors at that stage.
However, in COFF, an undefined symbol error is reported regardless of whether the symbol is referenced.
This can be worked around by specifying lld-link /force:unresolved.
A better approach is to create a weak definition:
# foo is defined in function-local inline assembly, unknown to the IR symbol table.
cat > a.c <<e
void def() { asm(".globl foo; foo: ret"); }
void foo();
void mainCRTStartup() { foo(); }
int main() { def(); foo(); }
e
# Dummy weak definition
echo '__attribute__((weak)) void foo() {}' > b.c
clang -c --target=x86_64-windows-msvc -flto a.c b.c
lld-link a.o b.o  # good
PGO

-fprofile-generate

non-LTO: instrument
ThinLTO prelink: instrument
ThinLTO postlink: no-op
linking: link against clang_rt.profile.a

-fcs-profile-generate

non-LTO: instrument
ThinLTO prelink: runs PGOInstrumentationGenCreateVar to create __llvm_profile_raw_version
ThinLTO postlink: instrument
linking: link against clang_rt.profile.a

# -fprofile-generate instruments the module
clang -O2 -c -flto=thin -fprofile-generate=profile a.c b.c
# -fprofile-generate just links in the runtime
clang -flto=thin -fprofile-generate=profile -fuse-ld=lld a.o b.o
# collect profile
./a.out
llvm-profdata merge profile/*.profraw -o default.profdata

# -fprofile-use performs optimization. -fcs-profile-generate at ThinLTO prelink phase just runs PGOInstrumentationGenCreateVar to create __llvm_profile_raw_version.
clang -O2 -c -flto=thin -fprofile-use=default.profdata -fcs-profile-generate=csprofile a.c b.c
# -fcs-profile-generate at ThinLTO postlink phase instruments the module. Moreover, link in the runtime.
clang -flto=thin -fcs-profile-generate=csprofile -fuse-ld=lld a.o b.o
# collect CS profile
./a.out
# merge it into the regular PGO profile
llvm-profdata merge default.profdata csprofile/*.profraw -o default.profdata

clang -O2 -c -flto=thin -fprofile-use=default.profdata a.c b.c
clang -flto=thin -fuse-ld=lld a.o b.o
# final executable
./a.out
-Xclang -flto-unit

Using -flto= passes -flto-unit to cc1. This can be disabled by -Xclang -fno-lto-unit.
cc1 -flto-unit causes Clang to emit type metadata in LLVM IR to support LTO unit features (CFI, whole program devirtualization).
https://lists.llvm.org/pipermail/llvm-dev/2019-January/128973.html

but currently by default each module is split into
two - a regular LTO module containing vtable defs and a thin LTO module
containing the rest. This is required by CFI so that the vtables can be
globally laid out via regular LTO. If you don't disable this splitting
(via "-Xclang -fno-lto-unit"), then you need to capture the result of this
regular LTO portion via the obj-path plugin option and link it as well.

With split LTO units, a symbol may occur twice in the symbol table, once as undefined and once as defined.
-fsplit-lto-unit

-fsanitize=cfi and whole program devirtualization default to -fsplit-lto-unit.
(as an exception, PlayStation and Apple platforms default to -fno-split-lto-unit for -flto=thin.)
-fno-split-lto-unit cannot use -fsanitize=cfi.
In the -fsplit-lto-unit mode, the emitted LLVM IR contains a module flag metadata "EnableSplitLTOUnit".
If a module contains both type metadata and "EnableSplitLTOUnit", ThinLTOBitcodeWriter.cpp will write two modules into the bitcode file.
The first module looks like a regular ThinLTO module. Definitions with type metadata are converted to declarations.
The second module is a regular LTO module with a module summary index. It has a module flag metadata "ThinLTO" with a value of 0.
The second module contains global variables with type metadata (vtables), certain eligible virtual functions cloned as available_externally, and if a comdat is cloned, comdat members (local linkage variables).
The cloned available_externally virtual functions can be used to perform optimizations like constant propagation during linking.
In a rare circumstance that llvm::getUniqueModuleId returns an empty string (there is no external non-comdat definition), the first module (ThinLTO) will be skipped.
If -fthin-link-bitcode= is specified, the minimized bitcode file gets two modules like the full bitcode file. The first module (regular ThinLTO) is minimized while the second module (regular LTO) is not modified.
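
A toy model of the split described above, assuming each global is tagged only by whether it carries type metadata. `Global`, `SplitResult`, and `splitLTOUnit` are hypothetical names; the cloning of eligible virtual functions and comdat members into the second module is not modelled.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Global {
  std::string Name;
  bool HasTypeMetadata; // e.g. a vtable
};

struct SplitResult {
  std::vector<std::string> ThinLTODefs;    // stay defined in the first module
  std::vector<std::string> ThinLTODecls;   // become declarations in the first module
  std::vector<std::string> RegularLTODefs; // defined in the second module
};

// Partition definitions the way the two-module bitcode file is described above.
SplitResult splitLTOUnit(const std::vector<Global> &Globals) {
  SplitResult R;
  for (const Global &G : Globals) {
    assert(!G.Name.empty());
    if (G.HasTypeMetadata) {
      R.RegularLTODefs.push_back(G.Name);
      R.ThinLTODecls.push_back(G.Name);
    } else {
      R.ThinLTODefs.push_back(G.Name);
    }
  }
  return R;
}
```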

IPO: Introduce ThinLTOBitcodeWriter pass.
[LTO] Record whether LTOUnit splitting is enabled in index
[LTO] Add option to enable LTOUnit splitting, and disable unless needed

Mixing -fsplit-lto-unit and -fno-split-lto-unit object files results in an error: Function Import: link error: linking module flags 'EnableSplitLTOUnit': IDs have conflicting values in 'a.o' and 'b.o'.
--lto-partitions=

ld.lld defaults to --lto-partitions=1: number of regular LTO partitions.
LLVMLTO calls llvm::SplitModule and performs parallel code generation passes.
Whole program devirtualization

Type metadata

https://llvm.org/docs/TypeMetadata.html introduced by Dead Virtual Function Elimination
--lto-whole-program-visibility

TODO
LTO visibility

cat > a.h <<'eof'
struct A { virtual int foo(); };
int bar(A *a);
eof
cat > a.cc <<'eof'
#include "a.h"
int A::foo() { return 1; }
int bar(A *a) { return a->foo(); }
eof
cat > b.cc <<'eof'
#include "a.h"
struct B : A { int foo() { return 2; } };
int baz() { B b; return bar(&b); }
eof
cat > main.cc <<'eof'
#include "a.h"
#include <stdio.h>
extern int baz();
int main() {
  A a;
  printf("%d %d\n", bar(&a), baz());
}
eof

clang++ -c -flto=thin -fwhole-program-vtables -O main.cc a.cc b.cc
clang++ -c -O b.cc -o b0.o
% clang++ -flto=thin -Wl,--lto-whole-program-visibility -fuse-ld=lld main.o a.o b.o && ./a.out
1 2
% clang++ -flto=thin -Wl,--lto-whole-program-visibility -fuse-ld=lld main.o a.o b0.o && ./a.out
1 1
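
The `1 1` result can be understood with a hand-written model of the transformation. With b0.o compiled as a native object, struct B is invisible to LTO, and --lto-whole-program-visibility asserts that the IR modules see every override, so A::foo looks like the only implementation and the virtual call in bar may become a direct call. This is a sketch of the effect, not compiler output; `bar_devirtualized` is a hypothetical name.

```cpp
#include <cassert>

struct A {
  virtual int foo() { return 1; }
  virtual ~A() = default;
};
struct B : A {
  int foo() override { return 2; }
};

// Ordinary virtual dispatch: the result depends on the dynamic type.
int bar(A *a) {
  assert(a != nullptr);
  return a->foo();
}

// What single-implementation devirtualization effectively produces when B is not visible.
int bar_devirtualized(A *a) {
  assert(a != nullptr);
  return a->A::foo();
}
```

Calling `bar_devirtualized` on a B object returns 1 instead of 2, which mirrors the wrong answer from baz() in the second link.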

Bazel configuration

Say, our binary target is named $package:b and has a source file $package/b.cc along with other dependencies.
The -flto=thin -fthin-link-bitcode= compile creates two files under bazel-bin/$package:

bazel-bin/$package/_objs/b/b.pic.o: bitcode file
bazel-bin/$package/_objs/b/b.pic.indexing.o: minimized bitcode file

Bazel generates bazel-bin/$package/b-lto-index.params, which references minimized bitcode files.
We perform a Thin link (-Wl,--thinlto-index-only=bazel-bin/$package/b-lto-final.params,--thinlto-emit-imports-files,--thinlto-prefix-replace=;bazel-bin/$package/b.lto/;bazel-bin/$package/b.lto-obj/,--thinlto-object-suffix-replace=.indexing.o;.o,--lto-obj-path=bazel-bin/$package/b.lto.merged.o) to create the following files:

bazel-bin/$package/b-lto-final.params: response file for the final native link (it references ELF relocatable object files after backend compiles, e.g. bazel-bin/$package/b.lto/$package/_objs/b/b.pic.o)
bazel-bin/$package/b.lto/$package/_objs/b/b.pic.o.thinlto.bc: index file
bazel-bin/$package/b.lto/$package/_objs/b/b.pic.o.imports: import file (it may reference full bitcode files from other targets like bazel-bin/$package/_objs/c/c.pic.o)
bazel-bin/$package/b.lto.merged.o: regular LTO output file

The ThinLTO backend compile creates bazel-bin/$package/b.lto-obj/bazel-bin/$package/_objs/b/b.pic.o (ELF relocatable object file).
We perform a native link using bazel-bin/$package/b-2.params as the response file.
The file references the --thinlto-index-only= output bazel-bin/$package/b-lto-final.params and the --lto-obj-path output bazel-bin/$package/b.lto.merged.o.
Older Bazel

The thin link command is different: -Wl,--thinlto-index-only=bazel-bin/$package/b-lto-final.params,--thinlto-emit-imports-files,--thinlto-prefix-replace=bazel-bin;bazel-bin/$package/b.lto,--thinlto-object-suffix-replace=.indexing.o;.o,--lto-obj-path=bazel-bin/$package/b.lto.merged.o.
The ThinLTO backend compile creates bazel-bin/$package/b.lto/$package/_objs/b/b.pic.o (ELF relocatable object file).
Misc

--opt-remarks-filename=a.yaml
If a build system ensures all bitcode files are produced from Clang of the same version, consider enabling -mllvm=-disable-auto-upgrade-debug-info for backend compiles to decrease compile time.
-Xclang=-mlink-bitcode-file -Xclang=other.bc can specify a bitcode file that should be linked with the main file.
PGOInstrumentationGenCreateVar needs to add a symbol before thin link.
-plugin-opt=stats-file=a.stats dumps llvm::Statistic counters.
--lto-CGO[0-3] is available to control CodeGenOpt::Level independent of the LTO optimization level.
Code size reduction

LTO can optimize out small functions (inlining is better than a function call) and larger functions that are called just once (inlining is better than a standalone definition).
The code size benefit is probably most significant with -Os or -Oz.
I just analyzed llvm-objdump built with ThinLTO.
An abstract class MCAsmBackend's vtable is retained in a non-LTO build because its constructor references the vtable.
In an LTO build, the constructor is inlined and discarded, causing the vtable to be discarded as well.
auto hidden

See Mach-O .weak_def_can_be_hidden
-mllvm -import-instr-limit=N

Linux kernel's top-level Makefile uses -mllvm -import-instr-limit=5
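
As a worked example of what such a small limit means, the sketch below applies the multipliers and the 0.7 decay quoted in the computeImportForFunction notes above, and mimics the size and noinline checks in selectCallee. All type and function names are hypothetical, and the ForceImportAll escape hatch is omitted.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Hotness { Cold, Unknown, Hot, Critical };

// Per-edge threshold: limit * (critical ? 100 : hot ? 10 : !cold ? 1 : 0).
double calleeThreshold(double Limit, Hotness H) {
  assert(Limit >= 0);
  switch (H) {
  case Hotness::Critical: return Limit * 100;
  case Hotness::Hot: return Limit * 10;
  case Hotness::Cold: return 0;
  case Hotness::Unknown: break;
  }
  return Limit;
}

// Threshold used when walking the calls of a newly imported function.
double nextThreshold(double Threshold, bool IsHotCallsite) {
  return IsHotCallsite ? Threshold : Threshold * 0.7;
}

// Hypothetical stand-in for a function summary.
struct Candidate {
  std::string Module;
  unsigned InstCount;
  bool NoInline;
  bool AlwaysInline;
};

// First candidate that is small enough (or alwaysinline) and not noinline, else nullptr.
const Candidate *pickCallee(const std::vector<Candidate> &Cs, double Threshold) {
  for (const Candidate &C : Cs) {
    if (C.InstCount > Threshold && !C.AlwaysInline)
      continue; // too large
    if (C.NoInline)
      continue;
    return &C;
  }
  return nullptr;
}
```

With a limit of 5, a callee on an edge of unknown hotness is only imported if its summary has at most 5 instructions, while a hot edge allows up to 50.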
FatLTO

.llvm.lto SHT_LLVM_BITCODE
Unified LTO

clang -funified-lto
-funified-lto bitcode files cannot be linked together with bitcode files compiled without -funified-lto.
-foffload-lto

See OpenMP Offloading Features in LLVM 15