The case for deadlines

Rust aims to be the foundation of fast and robust software. A thorn in that side is the choice of timeout-centric APIs for I/O and multithreading. I posit this was a bad choice and that it should be remedied as soon as possible, before more APIs propagate this (possibly accidental) design decision.

Motivating example: time limited network client request

Task: connect to a remote server, send a request, and receive the response, with an end-to-end timeout of 10 seconds.

Attempt #1

use std::io::{Error, ErrorKind, Read, Result, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

fn attempt1(remote: &SocketAddr, req: &[u8], resp: &mut Vec<u8>) -> Result<()> {
    let start = Instant::now();
    let timeout = Duration::from_secs(10);
    let mut stream = TcpStream::connect_timeout(remote, timeout)?;
    stream.set_write_timeout(Some(timeout - start.elapsed()))?;
    stream.write_all(req)?;
    stream.set_read_timeout(Some(timeout - start.elapsed()))?;
    stream.read_to_end(resp)?;
    Ok(())
}

While the above code sets timeouts, the total runtime is not bounded by them at all. A remote server that sends 100 kilobytes at 1 kilobyte per second will keep the client connected for 100 seconds. This is because read_to_end keeps calling read, and the read_timeout we set applies to each individual read call.

write_all exhibits similar behavior.
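To make the failure mode concrete, here is a small self-contained simulation of the internal loop under a per-call timeout (the 1 KB/s server is hypothetical, mirroring the numbers above):

```rust
use std::time::Duration;

// Simulate read_to_end against a server that delivers one 1 KB chunk per
// second: each individual read finishes well within the per-call timeout,
// so the timeout never fires, yet the total grows linearly with the
// response size.
fn simulated_total_time(chunks: u64, secs_per_chunk: u64, per_read_timeout: Duration) -> Duration {
    let mut total = Duration::ZERO;
    for _ in 0..chunks {
        let one_read = Duration::from_secs(secs_per_chunk);
        // Each read is checked against the timeout in isolation...
        assert!(one_read < per_read_timeout);
        // ...so nothing bounds the accumulated total.
        total += one_read;
    }
    total
}

fn main() {
    // 100 KB at 1 KB/s with a 10 s read_timeout: no single read ever times
    // out, but the client stays connected for 100 simulated seconds.
    let total = simulated_total_time(100, 1, Duration::from_secs(10));
    println!("{}", total.as_secs()); // 100
}
```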

To fix this we need to try harder and practically hand-roll both write_all and read_to_end, setting a fresh timeout before each call.

Attempt #2

fn attempt2(remote: &SocketAddr, mut req: &[u8], resp: &mut Vec<u8>) -> Result<()> {
    let start = Instant::now();
    let timeout = Duration::from_secs(10);
    let mut stream = TcpStream::connect_timeout(remote, timeout)?;
    while !req.is_empty() {
        stream.set_write_timeout(Some(timeout - start.elapsed()))?;
        let n = stream.write(req)?;
        if n == 0 {
            return Err(Error::new(ErrorKind::UnexpectedEof, ""));
        }
        req = &req[n..];
    }
    let mut buf = [0; 1024];
    loop {
        stream.set_read_timeout(Some(timeout - start.elapsed()))?;
        let n = stream.read(&mut buf)?;
        if n == 0 {
            break;
        }
        resp.extend_from_slice(&buf[..n]);
    }
    Ok(())
}

What we learned so far:

  • Timeouts at the socket level are hard to use and make little sense in the context of higher-level functions that call read or write many times.
  • write_all, read_to_end, and similar functions are at best unintuitive here; no amount of documentation makes their behavior in the presence of timeouts easy to reason about.
  • To implement the correct semantics and cap our end-to-end time at 10 seconds, we had to set the timeout before each individual read or write operation.

The code is also not ergonomic at all and is difficult to get right. Worse, even attempt #2 is not entirely correct: start.elapsed() can exceed timeout, in which case the subtraction timeout - start.elapsed() panics. So much for robustness...
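If we keep the timeout-per-call shape, a minimal fix is to compute the remaining budget with Duration::checked_sub, which returns None instead of panicking on underflow (the helper name remaining is made up for this sketch):

```rust
use std::io::{Error, ErrorKind, Result};
use std::time::{Duration, Instant};

// Remaining budget, or a TimedOut error once it is exhausted.
// Duration::checked_sub returns None instead of panicking when
// start.elapsed() exceeds timeout.
fn remaining(start: Instant, timeout: Duration) -> Result<Duration> {
    timeout
        .checked_sub(start.elapsed())
        .filter(|left| !left.is_zero())
        .ok_or_else(|| Error::new(ErrorKind::TimedOut, "time budget exhausted"))
}

fn main() {
    let start = Instant::now();
    // Fresh budget: plenty of time left.
    assert!(remaining(start, Duration::from_secs(10)).is_ok());
    // Exhausted budget: reported as an error instead of panicking.
    assert!(remaining(start, Duration::ZERO).is_err());
}
```

Filtering out a zero remainder also matters in practice, since set_read_timeout and set_write_timeout reject Some(Duration::ZERO) with an error.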

Attempt #3

If we add a timeout parameter to each call, perhaps we can improve the ergonomics.

fn write_for(stream: &mut TcpStream, buf: &[u8], timeout: Duration) -> Result<usize> {
    stream.set_write_timeout(Some(timeout))?;
    stream.write(buf)
}

fn read_for(stream: &mut TcpStream, buf: &mut [u8], timeout: Duration) -> Result<usize> {
    stream.set_read_timeout(Some(timeout))?;
    stream.read(buf)
}

fn attempt3(remote: &SocketAddr, mut req: &[u8], resp: &mut Vec<u8>) -> Result<()> {
    let start = Instant::now();
    let timeout = Duration::from_secs(10);
    let mut stream = TcpStream::connect_timeout(remote, timeout)?;
    while !req.is_empty() {
        let n = write_for(&mut stream, req, timeout - start.elapsed())?;
        if n == 0 {
            return Err(Error::new(ErrorKind::UnexpectedEof, ""));
        }
        req = &req[n..];
    }
    let mut buf = [0; 1024];
    loop {
        let n = read_for(&mut stream, &mut buf, timeout - start.elapsed())?;
        if n == 0 {
            break;
        }
        resp.extend_from_slice(&buf[..n]);
    }
    Ok(())
}

This is slightly better, but we still have to track the elapsed time and keep recomputing the remaining timeout to pass to each call. It would be much better if TcpStream itself handled this, so we wouldn't have to repeat ourselves.

Attempt #4

If we use a deadline instead of a timeout, we won’t have to recompute the remaining timeout on each call.

fn connect_until(remote: &SocketAddr, deadline: Instant) -> Result<TcpStream> {
    let now = Instant::now();
    if deadline <= now {
        return Err(Error::new(ErrorKind::TimedOut, ""));
    }
    TcpStream::connect_timeout(remote, deadline - now)
}

fn write_until(stream: &mut TcpStream, buf: &[u8], deadline: Instant) -> Result<usize> {
    let now = Instant::now();
    if deadline <= now {
        return Err(Error::new(ErrorKind::TimedOut, ""));
    }
    stream.set_write_timeout(Some(deadline - now))?;
    stream.write(buf)
}

fn read_until(stream: &mut TcpStream, buf: &mut [u8], deadline: Instant) -> Result<usize> {
    let now = Instant::now();
    if deadline <= now {
        return Err(Error::new(ErrorKind::TimedOut, ""));
    }
    stream.set_read_timeout(Some(deadline - now))?;
    stream.read(buf)
}

fn attempt4(remote: &SocketAddr, mut req: &[u8], resp: &mut Vec<u8>) -> Result<()> {
    let deadline = Instant::now() + Duration::from_secs(10);
    let mut stream = connect_until(remote, deadline)?;
    while !req.is_empty() {
        let n = write_until(&mut stream, req, deadline)?;
        if n == 0 {
            return Err(Error::new(ErrorKind::UnexpectedEof, ""));
        }
        req = &req[n..];
    }
    let mut buf = [0; 1024];
    loop {
        let n = read_until(&mut stream, &mut buf, deadline)?;
        if n == 0 {
            break;
        }
        resp.extend_from_slice(&buf[..n]);
    }
    Ok(())
}

Slightly better than before, but still not very ergonomic. Now that we have a deadline, however, we can implement write_all_until and read_to_end_until on top of write_until and read_until, and they will respect the deadline.

Attempt #5

fn write_all_until(stream: &mut TcpStream, mut buf: &[u8], deadline: Instant) -> Result<()> {
    while !buf.is_empty() {
        let n = write_until(stream, buf, deadline)?;
        if n == 0 {
            return Err(Error::new(ErrorKind::UnexpectedEof, ""));
        }
        buf = &buf[n..];
    }
    Ok(())
}

fn read_to_end_until(stream: &mut TcpStream, out: &mut Vec<u8>, deadline: Instant) -> Result<()> {
    let mut buf = [0; 1024];
    loop {
        let n = read_until(stream, &mut buf, deadline)?;
        if n == 0 {
            break;
        }
        out.extend_from_slice(&buf[..n]);
    }
    Ok(())
}

fn attempt5(remote: &SocketAddr, req: &[u8], resp: &mut Vec<u8>) -> Result<()> {
    let deadline = Instant::now() + Duration::from_secs(10);
    let mut stream = connect_until(remote, deadline)?;
    write_all_until(&mut stream, req, deadline)?;
    read_to_end_until(&mut stream, resp, deadline)?;
    Ok(())
}

At this point, attempt #5 looks pretty decent. We can use higher-level functions to read/write all bytes on the socket while respecting our time limit.

What we learned so far:

  • Deadlines compose very well, which allows building higher-level functions that are robust and well defined.
  • Timeouts do not compose at all, because the right timeout value is context-sensitive: it depends on how much of the budget has already been spent.
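The composition point can be sketched directly: a deadline is an absolute Instant, so every layer passes the same value through unchanged, and only the leaf that actually blocks converts it into a relative timeout (the layered function names here are illustrative):

```rust
use std::io::{Error, ErrorKind, Result};
use std::time::{Duration, Instant};

// Only the blocking leaf converts the absolute deadline into a relative
// timeout; None means the deadline has already passed.
fn to_timeout(deadline: Instant) -> Option<Duration> {
    deadline.checked_duration_since(Instant::now())
}

// Hypothetical leaf operation: would set the socket timeout and block.
fn leaf_op(deadline: Instant) -> Result<()> {
    match to_timeout(deadline) {
        Some(_timeout) => Ok(()), // e.g. set_read_timeout(Some(_timeout)) here
        None => Err(Error::new(ErrorKind::TimedOut, "deadline passed")),
    }
}

// A higher layer composes freely: the same deadline value stays correct
// across any number of calls, with no bookkeeping in between.
fn mid_op(deadline: Instant) -> Result<()> {
    leaf_op(deadline)?;
    leaf_op(deadline)
}

fn main() {
    assert!(mid_op(Instant::now() + Duration::from_secs(5)).is_ok());
}
```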

Proposal

With the lessons above, it is clear that timeout-centric APIs are not good building blocks for robust applications. I therefore propose that all timeout APIs in the standard library be deprecated; luckily there are only a handful of them. Alongside the deprecations, deadline-centric variants of each operation should be introduced.

Furthermore, Futures and async/await should also be built with deadline-centric APIs instead of timeouts. For async I/O, deadlines are even more important. For best ergonomics, deadlines in async programming should be set on core::task::Context, for reasons I will cover in another post.

@Nemo157

Nemo157 Apr 10, 2018

Using a deadline ties cancellation purely to time, sometimes you have other reasons to cancel an operation (e.g. in a proxy server, when the client goes away you can cancel your request to a server). This appears to be very close to a time specific version of C#'s CancellationToken system (used for Tasks, the equivalent of Futures, but I don't see any reason it couldn't be used for synchronous IO as well), it has the same property of passing an extra parameter down through every function to allow cancelling the operation externally from the bottom up (although, being able to use the low-level set_{read,write}_timeout may have better performance implications than a general token system).
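For reference, the token idea can be sketched with an atomic flag (a rough analogue only; C#'s CancellationToken has a much richer API, and these names are invented):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Rough sketch of a cancellation token: a shared flag the caller can set
// and the worker polls between units of work.
#[derive(Clone, Default)]
struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    fn cancel(&self) {
        self.0.store(true, Ordering::Relaxed);
    }
    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::Relaxed)
    }
}

// A worker that checks the token between chunks; cancellation here is
// caller-driven and not tied to the clock at all.
fn run_chunks(token: &CancelToken, chunks: u32) -> u32 {
    let mut done = 0;
    for _ in 0..chunks {
        if token.is_cancelled() {
            break;
        }
        done += 1; // stand-in for one unit of real I/O
    }
    done
}

fn main() {
    let token = CancelToken::default();
    assert_eq!(run_chunks(&token, 5), 5);
    token.cancel();
    assert_eq!(run_chunks(&token, 5), 0);
}
```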

Also, for futures as far as i'm aware the standard way to implement timeouts would be:

async fn attempt_async(remote: &SocketAddr, req: &[u8], resp: &mut Vec<u8>) -> io::Result<()> {
    let operation = async_block! {
        let mut stream = await!(TcpStream::connect(remote))?;
        await!(stream.write_all(req))?;
        await!(stream.read_to_end(resp))?;
        Ok(())
    };
    await!(select(operation, Timeout::from_secs(10)))?;
    Ok(())
}

i.e. applying a timeout to the operation as a whole externally rather than trying to manage internal timeouts on the individual operations, you then rely on Drop implementations on the futures to cleanup if the timeout is hit.
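In synchronous code, the closest analogue to bounding the whole operation externally is to run it on a helper thread and apply a single recv_timeout at the top (a sketch; unlike dropping a future, the worker thread is not actually cancelled when the timeout fires):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run `op` on a helper thread and wait for its result with one end-to-end
// timeout, instead of managing per-call timeouts inside the operation.
fn with_overall_timeout<T: Send + 'static>(
    timeout: Duration,
    op: impl FnOnce() -> T + Send + 'static,
) -> Option<T> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails harmlessly if the caller has already timed out.
        let _ = tx.send(op());
    });
    // One timeout bounds the whole operation; the thread itself keeps
    // running to completion in the background (no true cancellation).
    rx.recv_timeout(timeout).ok()
}

fn main() {
    assert_eq!(with_overall_timeout(Duration::from_secs(5), || 42), Some(42));
    let slow = || {
        thread::sleep(Duration::from_millis(200));
        1
    };
    assert_eq!(with_overall_timeout(Duration::from_millis(10), slow), None);
}
```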

@chrish42

chrish42 Apr 10, 2018

First, yes! Second, are you aware of @njsmith's work on Python on similar ideas? I feel these would apply to Rust too, but I haven't thought about this in depth: https://vorpus.org/blog/timeouts-and-cancellation-for-humans/

It would be really nice if Rust ended up with an ergonomic interface for this, which made it easy to do the right thing. Hopefully there'll be a RFC about this in the future...

@alkis

alkis Apr 10, 2018

@Nemo157 I didn't go very deep in this post to avoid complicating the discussion. Cancellations are orthogonal to timeout vs deadlines. I don't think there is a portable way to handle cancellation in blocking operations.

How would the example with futures be implemented? Would the timeout enter the underlying select loop?

@chrish42 yes, I read that. I tried to find posts about timeouts vs deadlines on the public web, and there is surprisingly little information on the topic. Sadly this seems to be one of those things that gets rediscovered independently over and over. The first step toward a nice model is to introduce deadline-centric APIs. After that, other abstractions can be built on top: scoped deadlines are one such feature.

@christophebiocca

christophebiocca Apr 10, 2018

How would the example with futures be implemented? Would the timeout enter the underlying select loop?

The behaviour is (not would be, this is an already existing system) as follows:

  • Futures are polled to completion. This means something like read-to-end is done as a bunch of smaller non-blocking read calls, separated by waiting for more data to be available. This polling is driven by whoever immediately depends on the Future.
  • The read and the timeout both get scheduled (probably both using select|epoll|whatever, although that's an internal detail). Future::select drives both, waiting until one finishes.
  • If the timeout finishes first, the read Future is dropped on the floor. When the select|epoll|whatever call comes back from the OS, the signal is known to be useless, and is silently ignored by the event loop. Alternatively the read Future might be able to unregister from select|epoll|whatever when it gets dropped, but that's an internal detail.
  • If the read finishes first, the timeout Future is dropped on the floor. When the select|epoll|whatever call comes back from the OS, the signal is known to be useless, and is silently ignored by the event loop. Alternatively the timeout Future might be able to unregister from select|epoll|whatever when it gets dropped, but that's an internal detail.

@peterdelevoryas

peterdelevoryas Apr 10, 2018

tokio has deadlines in its timer module

@CAFxX

CAFxX Apr 11, 2018

Slightly orthogonal, but I'll bring this up since we're on topic. I would like to suggest what I already suggested for Go: all I/O, and especially network calls, should default to having some sort of timeout/deadline. There are good reasons to sometimes have no timeout at all, but I would argue they are a minority compared to the common case. If that's true, then it stands to reason that the default should fit the common case, not the uncommon one.

@anguslees

anguslees Apr 18, 2018

I strongly oppose adding artificial timeouts by default fwiw. Many times I've had to deal with a production outage where the correct fix was "just wait longer", because people set arbitrary timeouts based on whatever is normal for their experience - but deviation from normal is not an error.

I strongly suggest timeouts/deadlines should be imposed by the caller(*) and not internally fabricated "just because". In the general default/no-deadline case, this means "wait forever" (in practice this is of course finite when the user gets bored and presses ^c or the TCP connection/OS/whatever is restarted).

(*) The exception here is where the timeout is part of a sophisticated internal error recovery/retry strategy.

@CAFxX

CAFxX Apr 19, 2018

Many times I've had to deal with a production outage where the correct fix was "just wait longer", because people set arbitrary timeouts based on whatever is normal for their experience - but deviation from normal is not an error.

@anguslees I argue that the opposite is true as well: many reliability problems I witnessed in production stem from missing timeouts.

My point is the following: the common case can't be "wait forever" because "waiting forever" is the same as "hanging", i.e. a bug (if for no other reason that you're consuming resources to do so, and therefore this is the same as consuming resources "forever", i.e. a resource leak).

Now, I'm not claiming that there are no good reasons for having no timeout set. What I am claiming is that the common case does not require "infinite timeouts" and that, since "infinite timeouts" are equivalent to leaking resources, the safe default should be some form of finite default timeout rather than no timeout at all. See e.g. the discussion on golang/go#23459 (comment).

@Ixrec

Ixrec Apr 19, 2018

Personally, I'd lean toward "wait forever"/"no default timeouts", because in a futures/async world it's fairly straightforward to override a global lack of timeouts by adding a single timeout in your application code (as demonstrated above). But if the problem is a default timeout that's too small, your ability to override it is restricted by the APIs of however many libraries there are between you and the one with the timeout, which means there's a good chance there simply is no way to override it without serious hacks like forking that entire dependency tree.

If you somehow have a library that cannot possibly be used async-ly, then... well that sounds very niche to me, so I'd be tempted to say "mandatory timeout/deadline arguments everywhere for all clients" and call it a day.
