The case for deadlines

@alkis · Last active April 19, 2018 12:49

Rust aims to be the foundation of fast and robust software. A thorn in its side is the choice of timeout-centric APIs for I/O and multithreading. I posit this was a bad choice and that it needs to be remedied as soon as possible, before more APIs propagate this (possibly accidental) design decision.

Motivating example: time-limited network client request

Task: connect to a remote server, send a request, and receive the response, all with an end-to-end timeout of 10 seconds.

Attempt #1

use std::io::{Error, ErrorKind, Read, Result, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

fn attempt1(remote: &SocketAddr, req: &[u8], resp: &mut Vec<u8>) -> Result<()> {
    let start = Instant::now();
    let timeout = Duration::from_secs(10);
    let mut stream = TcpStream::connect_timeout(remote, timeout)?;
    stream.set_write_timeout(Some(timeout - start.elapsed()))?;
    stream.write_all(req)?;
    stream.set_read_timeout(Some(timeout - start.elapsed()))?;
    stream.read_to_end(resp)?;
    Ok(())
}

While the above code sets timeouts, the total runtime is not bounded by them at all. A remote server that sends 100 kilobytes at 1 kilobyte per second will keep the client connected for 100 seconds. This is because read_to_end keeps calling read, and the read timeout we set applies to each individual read call.

write_all exhibits similar behavior.
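
To see why, here is a simplified sketch of the standard library's default write_all loop (write_all_sketch is a name introduced here; the real implementation is equivalent in spirit). The timeout set via set_write_timeout applies to each individual write call inside the loop, so every successful partial write effectively re-arms the full timeout.

fn write_all_sketch(stream: &mut TcpStream, mut buf: &[u8]) -> Result<()> {
    while !buf.is_empty() {
        // The socket's write timeout governs this single call only; a slow
        // peer that accepts a few bytes before each timeout expires keeps
        // the loop alive indefinitely.
        match stream.write(buf) {
            Ok(0) => return Err(Error::new(ErrorKind::WriteZero, "failed to write whole buffer")),
            Ok(n) => buf = &buf[n..],
            Err(ref e) if e.kind() == ErrorKind::Interrupted => {} // retry
            Err(e) => return Err(e),
        }
    }
    Ok(())
}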

To fix this we need to try harder and essentially hand-roll both write_all and read_to_end, setting a fresh timeout before each call.

Attempt #2

fn attempt2(remote: &SocketAddr, mut req: &[u8], resp: &mut Vec<u8>) -> Result<()> {
    let start = Instant::now();
    let timeout = Duration::from_secs(10);
    let mut stream = TcpStream::connect_timeout(remote, timeout)?;
    while !req.is_empty() {
        stream.set_write_timeout(Some(timeout - start.elapsed()))?;
        let n = stream.write(req)?;
        if n == 0 {
            return Err(Error::new(ErrorKind::UnexpectedEof, ""));
        }
        req = &req[n..];
    }
    let mut buf = [0; 1024];
    loop {
        stream.set_read_timeout(Some(timeout - start.elapsed()))?;
        let n = stream.read(&mut buf)?;
        if n == 0 {
            break;
        }
        resp.extend_from_slice(&buf[..n]);
    }
    Ok(())
}

What we learned so far:

  • Socket-level timeouts are hard to use and make little sense in the context of higher-level functions that call read or write many times.
  • write_all, read_to_end, and similar functions are at best unintuitive in the presence of timeouts, and there is very little documentation can do to fix that.
  • To implement the correct semantics and cap our end-to-end time at 10 seconds, we had to set the timeout before each individual read or write operation.

The code is also not ergonomic at all and is difficult to get right. Worse, even attempt #2 is not entirely correct: start.elapsed() can be greater than timeout, in which case the subtraction panics. So much for robustness...
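
One way to patch that particular bug is to compute the remaining budget with Duration::checked_sub, which returns None instead of panicking when the budget is exhausted. A minimal sketch (the helper name remaining is introduced here, not part of the gist):

fn remaining(start: Instant, timeout: Duration) -> Result<Duration> {
    // checked_sub yields None once start.elapsed() exceeds the budget,
    // so we can surface a clean TimedOut error instead of panicking.
    timeout
        .checked_sub(start.elapsed())
        .ok_or_else(|| Error::new(ErrorKind::TimedOut, ""))
}

Each timeout - start.elapsed() expression above would then become remaining(start, timeout)?. Note that set_read_timeout and set_write_timeout also reject a zero duration, so a fully robust version would treat a zero remainder as timed out as well.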

Attempt #3

If we add a timeout parameter to each call, perhaps we can improve the ergonomics.

fn write_for(stream: &mut TcpStream, buf: &[u8], timeout: Duration) -> Result<usize> {
    stream.set_write_timeout(Some(timeout))?;
    stream.write(buf)
}

fn read_for(stream: &mut TcpStream, buf: &mut [u8], timeout: Duration) -> Result<usize> {
    stream.set_read_timeout(Some(timeout))?;
    stream.read(buf)
}

fn attempt3(remote: &SocketAddr, mut req: &[u8], resp: &mut Vec<u8>) -> Result<()> {
    let start = Instant::now();
    let timeout = Duration::from_secs(10);
    let mut stream = TcpStream::connect_timeout(remote, timeout)?;
    while !req.is_empty() {
        let n = write_for(&mut stream, req, timeout - start.elapsed())?;
        if n == 0 {
            return Err(Error::new(ErrorKind::UnexpectedEof, ""));
        }
        req = &req[n..];
    }
    let mut buf = [0; 1024];
    loop {
        let n = read_for(&mut stream, &mut buf, timeout - start.elapsed())?;
        if n == 0 {
            break;
        }
        resp.extend_from_slice(&buf[..n]);
    }
    Ok(())
}

This is slightly better, but we still have to track the elapsed time and recompute the remaining timeout before each call. It would be much better if TcpStream handled this itself so we wouldn't have to repeat ourselves.

Attempt #4

If we use a deadline instead of a timeout, we won’t have to recompute the remaining timeout on each call.

fn connect_until(remote: &SocketAddr, deadline: Instant) -> Result<TcpStream> {
    let now = Instant::now();
    if deadline <= now {
        return Err(Error::new(ErrorKind::TimedOut, ""));
    }
    TcpStream::connect_timeout(remote, deadline - now)
}

fn write_until(stream: &mut TcpStream, buf: &[u8], deadline: Instant) -> Result<usize> {
    let now = Instant::now();
    if deadline <= now {
        return Err(Error::new(ErrorKind::TimedOut, ""));
    }
    stream.set_write_timeout(Some(deadline - now))?;
    stream.write(buf)
}

fn read_until(stream: &mut TcpStream, buf: &mut [u8], deadline: Instant) -> Result<usize> {
    let now = Instant::now();
    if deadline <= now {
        return Err(Error::new(ErrorKind::TimedOut, ""));
    }
    stream.set_read_timeout(Some(deadline - now))?;
    stream.read(buf)
}

fn attempt4(remote: &SocketAddr, mut req: &[u8], resp: &mut Vec<u8>) -> Result<()> {
    let deadline = Instant::now() + Duration::from_secs(10);
    let mut stream = connect_until(remote, deadline)?;
    while !req.is_empty() {
        let n = write_until(&mut stream, req, deadline)?;
        if n == 0 {
            return Err(Error::new(ErrorKind::UnexpectedEof, ""));
        }
        req = &req[n..];
    }
    let mut buf = [0; 1024];
    loop {
        let n = read_until(&mut stream, &mut buf, deadline)?;
        if n == 0 {
            break;
        }
        resp.extend_from_slice(&buf[..n]);
    }
    Ok(())
}

Slightly better than before, but still not very ergonomic. Now that we have a deadline, however, we can implement write_all_until and read_to_end_until on top of write_until and read_until, and they will respect the deadline.

Attempt #5

fn write_all_until(stream: &mut TcpStream, mut buf: &[u8], deadline: Instant) -> Result<()> {
    while !buf.is_empty() {
        let n = write_until(stream, buf, deadline)?;
        if n == 0 {
            return Err(Error::new(ErrorKind::UnexpectedEof, ""));
        }
        buf = &buf[n..];
    }
    Ok(())
}

fn read_to_end_until(stream: &mut TcpStream, out: &mut Vec<u8>, deadline: Instant) -> Result<()> {
    let mut buf = [0; 1024];
    loop {
        let n = read_until(stream, &mut buf, deadline)?;
        if n == 0 {
            break;
        }
        out.extend_from_slice(&buf[..n]);
    }
    Ok(())
}

fn attempt5(remote: &SocketAddr, req: &[u8], resp: &mut Vec<u8>) -> Result<()> {
    let deadline = Instant::now() + Duration::from_secs(10);
    let mut stream = connect_until(remote, deadline)?;
    write_all_until(&mut stream, req, deadline)?;
    read_to_end_until(&mut stream, resp, deadline)?;
    Ok(())
}

At this point, attempt #5 looks pretty decent. We can use higher-level functions to read and write all bytes on the socket while respecting our time limit.

What we learned so far:

  • Deadlines compose very well, which allows building higher-level functions that are robust and well defined (see the retry sketch below).
  • Timeouts do not compose at all, because they are context-sensitive: each layer has to know how much of the overall budget remains.
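
To make the composition point concrete, here is a sketch of a retry loop that threads one deadline through every attempt. The names request_until and request_with_retries are introduced here; request_until is just attempt5 refactored to accept the deadline from its caller.

fn request_until(remote: &SocketAddr, req: &[u8], resp: &mut Vec<u8>, deadline: Instant) -> Result<()> {
    let mut stream = connect_until(remote, deadline)?;
    write_all_until(&mut stream, req, deadline)?;
    read_to_end_until(&mut stream, resp, deadline)?;
    Ok(())
}

fn request_with_retries(remote: &SocketAddr, req: &[u8], resp: &mut Vec<u8>) -> Result<()> {
    let deadline = Instant::now() + Duration::from_secs(10);
    loop {
        resp.clear();
        match request_until(remote, req, resp, deadline) {
            Ok(()) => return Ok(()),
            // Retry transient errors; the shared deadline still caps the
            // total time across all attempts, with no budget arithmetic.
            Err(ref e) if e.kind() != ErrorKind::TimedOut && Instant::now() < deadline => continue,
            Err(e) => return Err(e),
        }
    }
}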

Proposal

Given the lessons above, it is clear that timeout-centric APIs are not good building blocks for robust applications. I therefore propose that all timeout APIs in the standard library be deprecated; luckily, there are only a handful of them. Alongside the deprecation, deadline-centric variants of each operation should be introduced.
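
As one illustration of what the deadline-centric variants could look like, here is a hypothetical extension trait built on today's timeout setters. The trait and method names are invented for this sketch and are not a concrete std design.

trait DeadlineExt {
    fn set_read_deadline(&self, deadline: Instant) -> Result<()>;
    fn set_write_deadline(&self, deadline: Instant) -> Result<()>;
}

impl DeadlineExt for TcpStream {
    fn set_read_deadline(&self, deadline: Instant) -> Result<()> {
        // A real implementation inside std would store the deadline and
        // re-arm the OS-level timeout before every read, exactly as
        // read_until does above; this sketch only arms it once.
        let left = deadline
            .checked_duration_since(Instant::now())
            .ok_or_else(|| Error::new(ErrorKind::TimedOut, ""))?;
        self.set_read_timeout(Some(left))
    }

    fn set_write_deadline(&self, deadline: Instant) -> Result<()> {
        let left = deadline
            .checked_duration_since(Instant::now())
            .ok_or_else(|| Error::new(ErrorKind::TimedOut, ""))?;
        self.set_write_timeout(Some(left))
    }
}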

Furthermore, Futures and async/await should also be built with deadline-centric APIs instead of timeouts. For async I/O, deadlines are even more important. For the best ergonomics, deadlines in async programming should be set on core::task::Context, for reasons I will cover in another post.
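
For a taste of deadline-centric async APIs, the tokio runtime (well after this gist was written) grew a deadline-based combinator, tokio::time::timeout_at. A sketch assuming tokio 1.x with the "net", "time", and "io-util" features enabled:

use tokio::io::AsyncWriteExt;
use tokio::net::TcpStream;
use tokio::time::{timeout_at, Instant};

async fn request(addr: &str, req: &[u8]) -> std::io::Result<()> {
    // One deadline covers both awaits end to end; tokio's Elapsed error
    // converts into a std::io::Error of kind TimedOut via the outer `?`.
    let deadline = Instant::now() + std::time::Duration::from_secs(10);
    let mut stream = timeout_at(deadline, TcpStream::connect(addr)).await??;
    timeout_at(deadline, stream.write_all(req)).await??;
    Ok(())
}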

@CAFxX commented Apr 11, 2018

Slightly orthogonal, but I'll bring this up since we're on topic: I would like to suggest what I already suggested for Go. All I/O calls, and especially network calls, should default to having some sort of timeout/deadline. There are good reasons to sometimes have no timeout at all, but I would argue they are a minority compared to the common case. If that's true, then it stands to reason that the default should fit the common case, not the uncommon one.

@anguslees

I strongly oppose adding artificial timeouts by default fwiw. Many times I've had to deal with a production outage where the correct fix was "just wait longer", because people set arbitrary timeouts based on whatever is normal for their experience - but deviation from normal is not an error.

I strongly suggest timeouts/deadlines should be imposed by the caller(*) and not internally fabricated "just because". In the general default/no-deadline case, this means "wait forever" (in practice this is of course finite when the user gets bored and presses ^c or the TCP connection/OS/whatever is restarted).

(*) The exception here is where the timeout is part of a sophisticated internal error recovery/retry strategy.

@CAFxX commented Apr 19, 2018

> Many times I've had to deal with a production outage where the correct fix was "just wait longer", because people set arbitrary timeouts based on whatever is normal for their experience - but deviation from normal is not an error.

@anguslees I argue that the opposite is true as well: many reliability problems I witnessed in production stem from missing timeouts.

My point is the following: the common case can't be "wait forever", because "waiting forever" is the same as "hanging", i.e. a bug (if for no other reason than that you're consuming resources to do so, which is the same as consuming resources forever, i.e. a resource leak).

Now, I'm not claiming that there are no good reasons for having no timeout set. What I am claiming is that the common case does not require "infinite timeouts", and that since "infinite timeouts" are equivalent to leaking resources, the safe default should be some form of finite timeout rather than no timeout at all. See e.g. the discussion in golang/go#23459.

@Ixrec commented Apr 19, 2018

Personally, I'd lean toward "wait forever"/"no default timeouts", because in a futures/async world it's fairly straightforward to override a global lack of timeouts by adding a single timeout in your application code (as demonstrated above). But if the problem is a default timeout that's too small, your ability to override it is restricted by the APIs of however many libraries sit between you and the one with the timeout, which means there's a good chance there is simply no way to override it without serious hacks like forking the entire dependency tree.

If you somehow have a library that cannot possibly be used async-ly, then... well that sounds very niche to me, so I'd be tempted to say "mandatory timeout/deadline arguments everywhere for all clients" and call it a day.
