luqmana/Inline Assembly Miscompilation.md Secret

## Inline Assembly Miscompilation.md

      
    Raw
  

              Inline Assembly Miscompilation.md
            
          
    EDIT: Finally got around to making a blog so the canonical location of this post is now here.
$ cargo build --example basic --features usdt-probes
[...snip...]
error: could not compile `dropshot`

Caused by:
  process didn't exit successfully: `rustc [...snip...]` (signal: 11, SIGSEGV: invalid memory reference)
Achievement unlocked: rustc segfault.

Stack trace
fffffc7fce3fcbc0 librustc_driver-77cef3efbfa7284c.so`llvm::BranchProbabilityInfo::computeEestimateBlockWeight(llvm::Function const&, llvm::DominatorTree*, llvm::PostDominatorTree*)+0xd84()
fffffc7fce3fd370 librustc_driver-77cef3efbfa7284c.so`llvm::BranchProbabilityInfo::calculate(llvm::Function const&, llvm::LoopInfo const&, llvm::TargetLibraryInfo const*, llvm::DominatorTree*, llvm::PostDominatorTree*)+0x131()
fffffc7fce3fd3c0 librustc_driver-77cef3efbfa7284c.so`llvm::BranchProbabilityAnalysis::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&)+0x134()
fffffc7fce3fd5f0 librustc_driver-77cef3efbfa7284c.so`llvm::detail::AnalysisPassModel<llvm::Function, llvm::BranchProbabilityAnalysis, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>::Invalidator>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&)+0x2f()
fffffc7fce3fd6a0 librustc_driver-77cef3efbfa7284c.so`llvm::AnalysisManager<llvm::Function>::getResultImpl(llvm::AnalysisKey*, llvm::Function&)+0x2de()
fffffc7fce3fd6d0 librustc_driver-77cef3efbfa7284c.so`llvm::BlockFrequencyAnalysis::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&)+0x3f()
fffffc7fce3fd710 librustc_driver-77cef3efbfa7284c.so`llvm::detail::AnalysisPassModel<llvm::Function, llvm::BlockFrequencyAnalysis, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>::Invalidator>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&)+0x26()
fffffc7fce3fd7c0 librustc_driver-77cef3efbfa7284c.so`llvm::AnalysisManager<llvm::Function>::getResultImpl(llvm::AnalysisKey*, llvm::Function&)+0x2de()
fffffc7fce3fdc30 librustc_driver-77cef3efbfa7284c.so`llvm::AlwaysInlinerPass::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&)+0xa2c()
fffffc7fce3fdc50 librustc_driver-77cef3efbfa7284c.so`llvm::detail::PassModel<llvm::Module, llvm::AlwaysInlinerPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&)+0x15()
fffffc7fce3fddc0 librustc_driver-77cef3efbfa7284c.so`llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&)+0x4b5()
fffffc7fce3ff170 librustc_driver-77cef3efbfa7284c.so`LLVMRustOptimizeWithNewPassManager+0x7f2()
fffffc7fce3ff3a0 librustc_driver-77cef3efbfa7284c.so`rustc_codegen_llvm::back::write::optimize_with_new_llvm_pass_manager+0x372()
fffffc7fce3ff5b0 librustc_driver-77cef3efbfa7284c.so`rustc_codegen_llvm::back::write::optimize+0x388()
fffffc7fce3ff900 librustc_driver-77cef3efbfa7284c.so`rustc_codegen_ssa::back::write::execute_work_item::<rustc_codegen_llvm::LlvmCodegenBackend>+0x1f3()
fffffc7fce3ffdb0 librustc_driver-77cef3efbfa7284c.so`std::sys_common::backtrace::__rust_begin_short_backtrace::<<rustc_codegen_llvm::LlvmCodegenBackend as rustc_codegen_ssa::traits::backend::ExtraBackendMethods>::spawn_named_thread<rustc_codegen_ssa::back::write::spawn_work<rustc_codegen_llvm::LlvmCodegenBackend>::{closure#0}, ()>::{closure#0}, ()>+0xf7()
fffffc7fce3fff60 librustc_driver-77cef3efbfa7284c.so`<<std::thread::Builder>::spawn_unchecked_<<rustc_codegen_llvm::LlvmCodegenBackend as rustc_codegen_ssa::traits::backend::ExtraBackendMethods>::spawn_named_thread<rustc_codegen_ssa::back::write::spawn_work<rustc_codegen_llvm::LlvmCodegenBackend>::{closure#0}, ()>::{closure#0}, ()>::{closure#1} as core::ops::function::FnOnce<()>>::call_once::{shim:vtable#0}+0xa9()
fffffc7fce3fffb0 libstd-ef15f81a900bedf3.so`std::sys::unix::thread::Thread::new::thread_start::h24133bfe318082b5+0x27()
fffffc7fce3fffe0 libc.so.1`_thrp_setup+0x6c(fffffc7fed642280)
fffffc7fce3ffff0 libc.so.1`_lwp_start()


Ok, so we're faulting somewhere in LLVM it seems like. From Cliff's initial investigation:

Anyway, yeah, something about the CFG construction there is generating either an empty basic block or a basic block ending in an unexpected type of instruction (something that is not an LLVM IR terminator instruction) and triggering https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/BasicBlock.h#L121

First order of business then is to just check if the IR is valid. LLVM has a pass to do just that and we can ask rustc to run it first by passing -Z verify-llvm-ir=yes (note we need to switch to nightly to use -Z flags):
$ RUSTFLAGS="-Z verify-llvm-ir=yes" cargo +nightly build --example basic --features usdt-probes
Haha, nope:
Basic Block in function '_ZN8dropshot6server24http_request_handle_wrap28_$u7b$$u7b$closure$u7d$$u7d$17h503b14ddd4edd1deE' does not have terminator!
label %bb24
LLVM ERROR: Broken module found, compilation aborted!

# Demangle w/ rustfilt (c++filt works well enough too)
# Single quotes important to not misinterpret $ as shell vars!

$ rustfilt '_ZN8dropshot6server24http_request_handle_wrap28_$u7b$$u7b$closure$u7d$$u7d$17h503b14ddd4edd1deE'
dropshot::server::http_request_handle_wrap::{{closure}}

The IR generated for a closure in dropshot::server::http_request_handle_wrap is invalid—some basic block is missing a terminator.
Ok, is it rustc generating the bad IR directly or the result of some transformation pass miscompiling it?
But first, let's cheat and just get the final failing rustc command so we don't need to rebuild all the deps anytime we change RUSTFLAGS. Re-running the failing cargo command should just output the failing rustc invocation:
$ cargo +nightly build --example basic --features usdt-probes
   Compiling dropshot v0.6.1-dev (/src/dropshot/dropshot)
error: could not compile `dropshot`

Caused by:
  process didn't exit successfully: `rustc [...snip...]` (signal: 11, SIGSEGV: invalid memory reference)
From this point we can just directly run the rustc command as outputted with a few modifications:

add +nightly otherwise the rustc wrapper will attempt to use the rust version mentioned in rust-toolchain.toml
remove the --error-format=json and --json=... flags for human-readable output
add -Z verify-llvm-ir=yes
change the --emit argument to --emit=llvm-ir because that should be enough to trigger the issue and we'd like to look at the IR later

Stick this in a simple shell script to easily modify it and run it; call it repro.sh. Verify it still fails as expected:
$ ./repro.sh
Basic Block in function '_ZN8dropshot6server24http_request_handle_wrap28_$u7b$$u7b$closure$u7d$$u7d$17h503b14ddd4edd1deE' does not have terminator!
label %bb24
LLVM ERROR: Broken module found, compilation aborted!
Now back to figuring out where this invalid IR is coming from. Even though we're doing a debug build, there are still some LLVM passes that get run. So if we want to verify the IR that rustc directly generated, we need to make sure no LLVM passes are run at all (aside from the verify pass itself). The way to do that is via -C no-prepopulate-passes so let's edit our repro.sh and run it again:
$ ./repro.sh
$ echo $?
0
Ok rustc has been proven innocent. Looks like some LLVM pass generates invalid IR which really shouldn't happen! ⚠️
Well, now what? Let's try to find out what pass is responsible!
Our first attempt is by asking LLVM to print the IR after each pass—maybe we'll get lucky and see the offending pass last. We do this by modifying repro.sh again:

remove -C no-prepopulate-passes & -Z verify-llvm-ir=yes
add -C llvm-args=--print-after-all to print the IR after every pass
add -C codegen-units=1 -Z no-parallel-llvm to make the output a bit more readable

Alas, this doesn't go the way we want as we get the same segfault as before without any of the actual output we wanted :(
Ok, new attempt. Let's skip rustc and see if we can just invoke the LLVM machinery directly via opt. For that, let's first install it:
$ rustup component add --toolchain nightly llvm-tools-preview
It is not the most discoverable because it just gets plopped somewhere into rustc's sysroot directory:
$ OPT=$(find $(rustc +nightly --print sysroot) -name opt)
We also need the actual IR to pass to opt so let's go back and modify our repro.sh to only pass -C no-prepopulate-passes. We should find our initial rustc generated IR. It's also worth remove the -C debuginfo=2 to make the IR a bit smaller:
$ ls ./target/debug/examples/basic*.ll
./target/debug/examples/basic-5f5f0491fbb5b7d3.ll
Let's try something simple first and just run the IR through opt without any flags as a smoke test:
$ $OPT ./target/debug/examples/basic-5f5f0491fbb5b7d3.ll 
opt: ./target/debug/examples/basic-5f5f0491fbb5b7d3.ll:425470:1: error: expected instruction opcode
bb25:                                             ; preds = %bb24
^
😐 Wat. Looking at the IR around that line, we find this:
bb24:
; [...snip...]
  %186 = invoke i64 asm sideeffect inteldialect "990:   clr rax\0A\0A                    .pushsection set_dtrace_probes,\22aw\22,\22progbits\22\0A                    .balign 8\0A            991:\0A                    .4byte 992f-991b    // length\0A                    .byte 1\0A                    .byte 0\0A                    .2byte 1\0A                    .8byte 990b         // address\0A                    .asciz \22dropshot\22\0A                    .asciz \22request-start\22\0A                             // null-terminated strings for each argument\0A                    .balign 8\0A            992:    .popsection\0A                    \0A                    .pushsection yeet_dtrace_probes\0A                    .8byte 991b\0A                    .popsection\0A                \0A        ", "=&{ax}"() #23
          to label %bb25 unwind label %cleanup26, !srcloc !38
  store i64 %186, i64* %is_enabled, align 8

bb25:                                             ; preds = %bb24
  %_78 = load i64, i64* %is_enabled, align 8
  %187 = icmp eq i64 %_78, 0
  br i1 %187, label %bb46, label %bb26
Well that looks awfully like the error the LLVM IR verifier was telling us about (%bb24 not having a terminator)!
So looks like our assumption about rustc not being the one generating valid IR is wrong. Where did we go wrong?
Some light digging into the rustc source reveals that using the new LLVM pass manager (default for LLVM >= 13 thus Rust >= 1.56) means -Z verify-llvm-ir=yes is ignored when combined with -C no-prepopulate-passes. Whelp :/
(To be clear, it's not really the new pass manager's fault but rather the way it is setup in LLVMRustOptimizeWithNewPassManager).
So we've found one (minor) rustc bug so far but that doesn't help solve our original question. No worries, we can either switch to the old pass manager (-Z new-llvm-pass-manager=no) or just manually add the verifier pass (-C passes="verify"), either way we get the same ol error:
$ ./repro.sh
Basic Block in function '_ZN8dropshot6server24http_request_handle_wrap28_$u7b$$u7b$closure$u7d$$u7d$17h503b14ddd4edd1deE' does not have terminator!
label %bb24
LLVM ERROR: Broken module found, compilation aborted!
This brings us back to the actual culprit: rustc!

(We really should have suspected this after the brief foray with trying to print the resulting IR after every LLVM pass didn't give us anything: the IR we feed it was botched to begin with!).
So back to the invalid IR: at the end of bb24 we are using invoke (a terminator) with our inline assembly from the usdt probes in dropshot followed by a store instruction. Clearly this is wrong because store isn't a terminator and thus we shouldn't end a basic block with it. Let's see what the corresponding MIR (Rust's Mid-level IR) looks like.
Since the failing code is coming from dropshot and not the basic example itself, we can't use our repro.sh hack and so back we go to cargo and RUSTFLAGS:
Using -Z dump-mir='http_request_handle_wrap':
$ RUSTFLAGS="-Z dump-mir='http_request_handle_wrap'" cargo +nightly build --example basic --features usdt-probes
[...snip...]
(signal: 11, SIGSEGV: invalid memory reference)
$ ls mir_dump
mir_dump: No such file or directory
Ok, that isn't working as expected (-Z dump-mir=F should print just the MIR for functions matches the filter F and place it in a mir_dump folder). A bit annoying and we can only shave so many yaks right now but nothing a bigger hammer can't fix (just use --emit=mir to dump out all the mir into the target folder and find the corresponding one for dropshot):
$ RUSTFLAGS="--emit=mir" cargo +nightly build --example basic --features usdt-probes
$ ls target/debug/deps/dropshot-*.mir
target/debug/deps/dropshot-30d947b7471013cc.mir
Ok, now this looks like reasonable:
bb24: {
[...snip...]
        asm!("990:   clr rax

                    .pushsection set_dtrace_probes,\"aw\",\"progbits\"
                    .balign 8
            991:
                    .4byte 992f-991b    // length
                    .byte 1
                    .byte 0
                    .2byte 1
                    .8byte 990b         // address
                    .asciz \"dropshot\"
                    .asciz \"request-start\"
                             // null-terminated strings for each argument
                    .balign 8
            992:    .popsection
                    
                    .pushsection yeet_dtrace_probes
                    .8byte 991b
                    .popsection
                
        ", out("ax") _77, options(NOMEM | PRESERVES_FLAGS | NOSTACK)) -> [return: bb25, unwind: bb217]; // scope 10 at dropshot/src/lib.rs:581:1: 581:41
    }

    bb25: {
        _78 = _77;                       // scope 9 at dropshot/src/lib.rs:581:1: 581:41
        switchInt(move _78) -> [0_u64: bb46, otherwise: bb26]; // scope 9 at dropshot/src/lib.rs:581:1: 581:41
    }
Note in MIR, asm itself is a terminator and so bb24 here appropriately says that under normal control flow to go to bb25 or if unwinding go to bb217. In bb25 we see a simple statement, _78 = _77;, which is assigning the output (_77 i.e. is_enabled) from the asm and this should correspond to the store we saw in the LLVM IR.
So looks like something in the lowering from Rust MIR to LLVM IR is not quite right. Just eyeballing it, it seems like the store of the asm output is getting added to the wrong LLVM basic block. Wherein it should be part of the "normal" basic block taken by the invoke (indicated by the to label %bb25 argument), instead it is incorrectly placed directly after the invoke.
That seems to happens here in rustc.
Now that we have a pretty good idea of why things fail we should be able to make a smaller repro (*):
#![feature(asm_unwind)]

fn main() {
    let _x = String::from("string here just cause we need something with a non-trivial drop");
    let foo: u64;
    unsafe {
        std::arch::asm!(
            "mov {}, 1",
            out(reg) foo,
            options(may_unwind)
        );
    }
    println!("{}", foo);
}
$ rustc +nightly asm-miscompile.rs
[1]    6057 segmentation fault (core dumped)  rustc +nightly asm-miscompile.rs
Cool, we've got our smaller repro segfaulting, but is it the same issue? Let's take a look at what the LLVM IR says:
$ rustc +nightly asm-miscompile.rs --emit=llvm-ir -C no-prepopulate-passes
$ less asm-miscompile.ll

[...snip...]
bb1:                                              ; preds = %start
  %1 = invoke i64 asm sideeffect alignstack inteldialect unwind "mov ${0:q}, 1", "=&r,~{dirflag},~{fpsr},~{flags},~{memory}"()
          to label %bb2 unwind label %cleanup, !srcloc !9
  store i64 %1, i64* %foo, align 8

bb2:
[...snip...]
Would you look at that, we've got a store as the last instruction in the basic block with the asm.
Now that we have a small repro and know roughly where the issue is in rustc we can try fixing it:
diff --git a/compiler/rustc_codegen_llvm/src/asm.rs b/compiler/rustc_codegen_llvm/src/asm.rs
index 03c390b4bd4..91d132eb343 100644
--- a/compiler/rustc_codegen_llvm/src/asm.rs
+++ b/compiler/rustc_codegen_llvm/src/asm.rs
@@ -290,6 +290,11 @@ fn codegen_inline_asm(
         }
         attributes::apply_to_callsite(result, llvm::AttributePlace::Function, &{ attrs });
 
+        // Switch to the 'normal' basic block if we did an `invoke` instead of a `call`
+        if let Some((dest, _, _)) = dest_catch_funclet {
+            self.switch_to_block(dest);
+        }
+
         // Write results to outputs
         for (idx, op) in operands.iter().enumerate() {
             if let InlineAsmOperandRef::Out { reg, place: Some(place), .. }
A little waiting later and we can try compiling our small repro again:
$ rustc +stage1 asm-miscompile.rs
$ echo $?
0
🎉 Success! So what does the LLVM IR look like now?
$ rustc +stage1 asm-miscompile.rs --emit=llvm-ir -C no-prepopulate-passes
$ less asm-miscompile.ll

[...snip...]
bb1:                                              ; preds = %start
  %1 = invoke i64 asm sideeffect alignstack inteldialect unwind "mov ${0:q}, 1", "=&r,~{dirflag},~{fpsr},~{flags},~{memory}"()
          to label %bb2 unwind label %cleanup, !srcloc !9

bb2:
  store i64 %1, i64* %foo, align 8
[...snip...]
The store has moved down into %bb2 right where it should be.
(*) Note: we had to add options(may_unwind) and an unused String variable to actually get it to fail. Removing either of those will stop it from segfaulting. The difference being, the LLVM IR that rustc generates. Without both pieces, rustc just uses a simple call instruction for the inline asm whereas in the broken case, it's using invoke which is considered a terminator unlike call. After the invoke, the control transfer goes to either the 'normal' label or the 'unwind' label. By marking our asm with options(may_unwind) we essentially tell rustc to opt our inline assembly into participating in unwinding. The unused string is there so that there's actually something to cleanup in the case that we do unwind.
But, looking at the asm from the original failing code in dropshot there's no mention of MAY_UNWIND:
asm!("[...snip...]", out("ax") _77, options(NOMEM | PRESERVES_FLAGS | NOSTACK)) -> [return: bb25, unwind: bb217]; // scope 10 at dropshot/src/lib.rs:581:1: 581:41
usdt certainly isn't adding it so what gives?
Poking through rustc, it seems like that may be due to the way MIR inlining is implemented. A cleanup block may be assigned if a terminator (like inline asm) gets inlined. The lowering pass will use the presence of such a cleanup target to ultimately decide whether to use call or invoke.
But that theory is quickly shot down because the MIR inliner is still disabled by default.
Just searching through rustc for other places the unwind target of a terminator might be set yields something promising when it comes to generators. This would track as the original dropshot failure case is in the context of an async method. Further confirmation that this is where our InlineAsm terminator is getting an unwind target set is that the posion_block mentioned in that bit of code lines up. It has a single statement and if we hop back to the MIR of the failing dropshot example, lo and behold:
    asm!("[...snip...]", out("ax") _77, options(NOMEM | PRESERVES_FLAGS | NOSTACK)) -> [return: bb25, unwind: bb217];

[...snip...]

bb217 (cleanup): {
    discriminant((*(_1.0: &mut [static generator@dropshot/src/server.rs:651:43: 741:2]))) = 2; // scope 0 at dropshot/src/server.rs:651:43: 741:2
    resume;                          // scope 0 at dropshot/src/server.rs:651:43: 741:2
}
bb217 there is the unwind target and it matches exactly with the poison_block as constructed in the rustc.
Armed with some new evidence, we can adapt our simple repro to more closely match the async scenario encountered in dropshot:
extern crate futures; // 0.3.21

async fn bar() {
    let foo: u64;
    unsafe {
        std::arch::asm!(
            "mov {}, 1",
            out(reg) foo,
        );
    }
    println!("{}", foo);
}

fn main() {
    futures::executor::block_on(bar());
}
(Segfaults on playground.)
Thus all the mysteries are solved:

the MIR -> LLVM IR lowering for inline assembly outputted invalid LLVM IR when generated with an invoke instruction (Fix submitted here).
every async fn in rust is implemented as a generator and as part of that, terminators in the basic blocks of such a function are modified to include a cleanup target if they can unwind (*) so as to poison the generator (**).
We also discovered a bug with -Z verify-llvm-ir=yes and -C no-prepopulate-passes along the way (Fix submitted here).

(*) note the segfault in the above playground goes away if you remove the println! from foo because that is the only part that may actually unwind.
(**) if you've ever seen a panic saying "async fn resumed after panicking", this is why.
Update: The above bugs are fixed as of rustc 1.62.0-nightly (52ca603da 2022-04-12).