@jeanbza
Created November 20, 2019 18:10
Retry Additional Notes

The following are additional notes about which RPCs should be retried, and on which errors.

Idempotency and pragmatism

A common concern with retrying CREATE and DELETE operations is that they are non-idempotent operations. The functional concern can be summarized with this example:

Get("Foo")
> NOT_FOUND
Create("Foo") // Let's work under the assumption that this should be retried.
// ----> UNAVAILABLE
// ----> Create("Foo") # retry #1
// ----> UNAVAILABLE   # actually, the request succeeded, but the connection died during the response
// ----> Create("Foo") # retry #2
// ----> ALREADY_EXISTS
> ALREADY_EXISTS

Here, an unexpected result - ALREADY_EXISTS - is returned from a Create. The resource didn't exist before the call was initiated, so the user may wonder, "How did I receive this result?"

The definition of idempotency is unfortunately vague in this regard. The Oxford dictionary defines idempotent as:

denoting an element of a set which is unchanged in value when multiplied or otherwise operated on by itself.

One interpretation of this definition is that an operation is idempotent if repeating the operation results in the same response.

This definition allows for retrying READ and UPDATE requests, since repeated runs always result in the same state and response code. However, this definition precludes retrying CREATE calls, since ALREADY_EXISTS may be returned unexpectedly (as in the above example). It also precludes retrying DELETE calls, since NOT_FOUND may be returned unexpectedly.

Another interpretation of this definition is that an operation is idempotent if repeating the operation results in the same state (or value).

This second definition allows for retrying all CRUD operations, since the interpretation is only concerned with state, and repeating any CRUD operation always results in the same state.

This document's retryability guidance follows the latter interpretation, which allows for retrying all CRUD operations. The reason is that retrying all operations offers users much more value, with little downside.
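The difference between the two interpretations can be sketched with a toy in-memory store. Everything here (`Store`, `ErrAlreadyExists`, `ErrNotFound`) is a hypothetical stand-in for a real service and its status codes, not part of any actual client library.

```go
package main

import "errors"

// ErrAlreadyExists and ErrNotFound are hypothetical stand-ins for the
// ALREADY_EXISTS and NOT_FOUND status codes.
var (
	ErrAlreadyExists = errors.New("ALREADY_EXISTS")
	ErrNotFound      = errors.New("NOT_FOUND")
)

// Store is a toy in-memory resource store, used only for illustration.
type Store struct {
	items map[string]bool
}

// NewStore returns an empty Store.
func NewStore() *Store { return &Store{items: make(map[string]bool)} }

// Create adds name, or reports ErrAlreadyExists if it is already present.
func (s *Store) Create(name string) error {
	if s.items[name] {
		return ErrAlreadyExists
	}
	s.items[name] = true
	return nil
}

// Delete removes name, or reports ErrNotFound if it is absent.
func (s *Store) Delete(name string) error {
	if !s.items[name] {
		return ErrNotFound
	}
	delete(s.items, name)
	return nil
}

// Exists reports whether name is currently stored.
func (s *Store) Exists(name string) bool { return s.items[name] }
```

Calling `Create("Foo")` twice yields two different responses (success, then `ErrAlreadyExists`), so it is not idempotent under the first interpretation. `Exists("Foo")` is true after either call, so it is idempotent under the second.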

Consider the downside to retrying the CREATE and DELETE operations. For users it generally just results in an extra code check:

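resp, err := client.Create("Foo")
if err != nil {
  if status.Convert(err).Code() == codes.AlreadyExists { // <- extra code check
    // It already exists, so fetch the existing resource instead.
    // (Assumes Get, like Create, returns a response and an error.)
    resp, err = client.Get("Foo")
    if err != nil {
      panic(err)
    }
  } else {
    // An actual error.
    panic(err)
  }
}
_ = resp // TODO: Use resp.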

The above code is quite natural and exists in many codebases already: it is the easy answer to asynchrony within a system, and a cheaper/more performant way of writing code that doesn't know about state (the alternative is to always perform a READ before a CREATE/DELETE, which is superfluous).

In comparison to the relatively small downside, there is a large advantage. Users will almost always want to retry INTERNAL and UNAVAILABLE errors. Building retries into the client library saves users from implementing retry loops, deadlines, exponential backoff, and jitter themselves.

Finally, consider that most users will have to write the if statement either way. After all, a user who implements retry logic themselves will still run into the same problem (and solution) described above.

Absolute correctness

There is an argument that some users may need to be absolutely sure of the state of a system before retrying a non-GET request. Such users are an argument in favour of only retrying GET requests and never retrying UPDATE, CREATE, or DELETE. However, this set of users likely has very niche requirements, and is dwarfed by the set of users for whom retrying is advantageous.

A simple answer exists to support this set of users: provide an opt-out for retries. This approach fits the generally accepted goal that client libraries should set good defaults and offer opt-outs for the minority of users.
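One way such an opt-out might look is a constructor option, sketched below. `Client`, `Option`, `WithoutRetry`, `NewClient`, and `Call` are all hypothetical names for illustration and do not describe any existing library's API. Backoff is omitted to keep the sketch short.

```go
package main

import "errors"

// ErrUnavailable is a hypothetical stand-in for a retryable status code.
var ErrUnavailable = errors.New("UNAVAILABLE")

// Client is a hypothetical API client. Retries are enabled by default.
type Client struct {
	retryEnabled bool
}

// Option configures a Client.
type Option func(*Client)

// WithoutRetry is a hypothetical option that disables automatic retries.
func WithoutRetry() Option {
	return func(c *Client) { c.retryEnabled = false }
}

// NewClient returns a Client that retries unless the caller opts out.
func NewClient(opts ...Option) *Client {
	c := &Client{retryEnabled: true}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

// Call invokes op once, or up to three times when retries are enabled and op
// keeps returning ErrUnavailable.
func (c *Client) Call(op func() error) error {
	attempts := 1
	if c.retryEnabled {
		attempts = 3
	}
	var err error
	for i := 0; i < attempts; i++ {
		err = op()
		if err == nil || !errors.Is(err, ErrUnavailable) {
			return err
		}
	}
	return err
}
```

Most users would call `NewClient()` and get retries for free. A user who needs to inspect system state after every failure would call `NewClient(WithoutRetry())` and see each error immediately.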

The concurrency fallacy

Another argument against retrying CREATE and DELETE goes as follows:

Thread1: Get("Foo")
Thread1: > NOT_FOUND
Thread2: Get("Foo")
Thread2: > NOT_FOUND
Thread1: Create("Foo") // Let's work under the assumption that this should be retried.
Thread1: // ----> UNAVAILABLE
Thread2: Create("Foo")
Thread2: // ----> OK
Thread1: // ----> Create("Foo") # retry #1
Thread1: // ----> ALREADY_EXISTS
Thread1: > ALREADY_EXISTS

Here, a system has two threads that convince themselves that "Foo" does not exist, and then try to create it. Because Thread1 retried the first failed Create request, it unexpectedly returned ALREADY_EXISTS.

However, it is easy to see that one of these requests would have to receive ALREADY_EXISTS, even without retries. So, the problem here is not retrying: the problem is that there is no synchronization between two components of the system that are both concerned with the same resource. Removing the retry logic would not solve the problem.

There are several ways to solve this problem:

  • Run these requests in a read-write transaction, ensuring atomicity.
  • Synchronize the two components in the system.
  • Add a piece of code that considers ALREADY_EXISTS a success case (as described above in "Idempotency and pragmatism").