@davepacheco
Last active June 22, 2020 18:02
notes from debugging an easy Rust issue

I've figured this bug out. I'm thinking out loud here about what could have been better about this debugging experience, because it was way more circuitous (and time-consuming) than it should have been.

In my test suite, I have:

let testctx = test_setup().await.external_api;

where the return value of test_setup().await is a struct into which I've stashed an HTTP server that I had successfully started (but it's not in the external_api field). The behavior I observed was that when I went to make a request against that HTTP server, I got ECONNREFUSED.

(I first confirmed with Wireshark that the client was connecting to the correct IP and port of the server I started, and also confirmed that the server was successfully responding to requests (by inserting a sleep after starting the server and hitting it with curl). So far, this is mostly reasonable.)

I then checked the log from the test and saw that the server was initiating a graceful shutdown early in the test process. I sort of suspected RAII + accidentally letting something go out of scope was causing this. I figured this should be completely obvious from a stack trace and/or core file. I saved a core file where the log message was being printed (using dtrace), but this was useless because the logging is async. (I already have a TODO to fix this.)

At this point I looked briefly at rust-usdt, which seems to be a thing, and would have helped, but since I had to change the program anyway to put a probe in, I just put a panic at the point where we generate the log message. Like I mentioned in chat, I found that there's no way to configure panic to generate a core file in the test suite. I did get a stack trace, from which I cannot for the life of me figure out where the code is that's triggering the shutdown.

I gave up and fell back to guessing, which really feels like defeat. It turned out to be easy enough to guess: because I never assigned the full result of test_setup().await to a variable -- only its external_api field -- the compiler moved that field out of the temporary and dropped the rest of the struct at the end of the statement, including my HTTP server, which shut it down.

There are a bunch of unfortunate things here:

  • I need to fix logging in this program to not be async, for this and other reasons.
  • It sure would be nice to have USDT probes at log points like we did in Node!
  • No core files from panic in tests?!
  • For the HTTP server: there are some objects where it'd be nice to be able to say "I'll manage the lifecycle explicitly, thanks", if for no other reason than to assert program correctness. Basically, I don't want it to just shut down because it went out of scope -- I want to have to close it myself.
  • The stack trace seemed to leave a lot to be desired.

Here's the stack trace:

thread 'test_disks' panicked at 'dropshot server shutting down without invoking close(): RecvError(())', dropshot/src/server.rs:144:13
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:77
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1052
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1426
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:204
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:224
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:472
  11: rust_begin_unwind
             at src/libstd/panicking.rs:380
  12: std::panicking::begin_panic
  13: std::panicking::begin_panic
  14: core::result::Result<T,E>::expect
             at /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libcore/result.rs:991
  15: dropshot::server::HttpServer::new::{{closure}}
             at dropshot/src/server.rs:144
  16: <std::future::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libstd/future.rs:43
  17: <hyper::server::shutdown::Graceful<I,S,F,E> as core::future::future::Future>::poll
             at /Users/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.13.4/src/server/shutdown.rs:71
  18: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libcore/future/future.rs:119
  19: std::future::poll_with_tls_context
             at /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libstd/future.rs:98
  20: dropshot::server::HttpServer::run::{{closure}}
             at dropshot/src/server.rs:99
  21: <std::future::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libstd/future.rs:43
  22: tokio::task::core::Core<T>::poll
             at /Users/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.2.13/src/task/core.rs:128
  23: tokio::task::harness::Harness<T,S>::poll::{{closure}}::{{closure}}
             at /Users/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.2.13/src/task/harness.rs:119
  24: core::ops::function::FnOnce::call_once
             at /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libcore/ops/function.rs:232
  25: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libstd/panic.rs:318
  26: std::panicking::try::do_call
             at /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libstd/panicking.rs:305
  27: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:86
  28: std::panicking::try
             at /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libstd/panicking.rs:281
  29: std::panic::catch_unwind
             at /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libstd/panic.rs:394
  30: tokio::task::harness::Harness<T,S>::poll::{{closure}}
             at /Users/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.2.13/src/task/harness.rs:100
  31: tokio::loom::std::causal_cell::CausalCell<T>::with_mut
             at /Users/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.2.13/src/loom/std/causal_cell.rs:41
  32: tokio::task::harness::Harness<T,S>::poll
             at /Users/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.2.13/src/task/harness.rs:99
  33: tokio::task::raw::poll
             at /Users/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.2.13/src/task/raw.rs:162
  34: tokio::task::raw::RawTask::poll
             at /Users/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.2.13/src/task/raw.rs:113
  35: tokio::task::Task<S>::run
             at /Users/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.2.13/src/task/mod.rs:381
  36: tokio::runtime::basic_scheduler::SchedulerPriv::tick
             at /Users/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.2.13/src/runtime/basic_scheduler.rs:192
  37: tokio::runtime::basic_scheduler::BasicScheduler<P>::block_on
             at /Users/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.2.13/src/runtime/basic_scheduler.rs:142
  38: tokio::runtime::Runtime::block_on::{{closure}}
             at /Users/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.2.13/src/runtime/mod.rs:413
  39: tokio::runtime::context::enter
             at /Users/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.2.13/src/runtime/context.rs:72
  40: tokio::runtime::handle::Handle::enter
             at /Users/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.2.13/src/runtime/handle.rs:34
  41: tokio::runtime::Runtime::block_on
             at /Users/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-0.2.13/src/runtime/mod.rs:410
  42: test_disks::test_disks
             at tests/test_disks.rs:44
  43: test_disks::test_disks::{{closure}}
             at tests/test_disks.rs:44
  44: core::ops::function::FnOnce::call_once
             at /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libcore/ops/function.rs:232
  45: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
             at /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/liballoc/boxed.rs:1015
  46: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:86
  47: std::panicking::try
             at /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libstd/panicking.rs:281
  48: std::panic::catch_unwind
             at /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libstd/panic.rs:394
  49: test::run_test_in_process
             at src/libtest/lib.rs:539
  50: test::run_test::run_test_inner::{{closure}}
             at src/libtest/lib.rs:452

My test is at frame 42. I think what happened is that in that function, the compiler generated a call to drop, which closed a sending channel, which caused a future that was waiting for a message on that channel to complete -- that's what's happening at frame 15. So the context where the drop happened is lost. This is a pretty classic async/await debugging issue.

davepacheco commented Jun 22, 2020

With @cbiffle's help, I've got a better understanding of what likely happened.

  • drop() was invoked for the HttpServer, as I suspected above
  • that caused the channel Sender to be dropped
  • at this point, the stack likely unwound back up to the executor. (That is, the following step doesn't happen synchronously with the Sender's drop.)
  • the Receiver's rx.await returned with an error indicating the sender was dropped, but at this point the cause (the drop call) isn't on the stack any more

This explains why drop isn't on the stack, and it's a good example of how even with long stack traces in async/await, debugging can still be harder than with synchronous code.

@davepacheco

Also noteworthy: tests/test_disks.rs:44 appears twice, once as test_disks and once as test_disks::{{closure}} because the #[tokio::test] macro wraps our async function with a sync function that creates a single-thread executor and uses block_on to run our function.
