The following are additional notes about which RPCs should be retried, and on which failures.
A common concern with retrying CREATE and DELETE operations is that they are
non-idempotent. The functional concern can be summarized with this example:
Get("Foo")
> NOT_FOUND
Create("Foo") // Let's work under the assumption that this should be retried.
// ----> UNAVAILABLE
// ----> Create("Foo") # retry #1
// ----> UNAVAILABLE # actually, the request succeeded, but the connection died during the response
// ----> Create("Foo") # retry #2
// ----> ALREADY_EXISTS
> ALREADY_EXISTS
Here, an unexpected result (ALREADY_EXISTS) is returned from a Create. The
resource didn't exist before the call was initiated, so the user may wonder,
"How did I receive this result?"
The definition of idempotency is unfortunately vague in this regard. The Oxford dictionary defines idempotent as:
"denoting an element of a set which is unchanged in value when multiplied or otherwise operated on by itself."
One interpretation of this definition is that an operation is idempotent if repeating the operation results in the same response.
This definition allows for retrying READ and UPDATE requests, since repeated
runs always result in the same state and response code. However, this
definition precludes retrying CREATE calls, since ALREADY_EXISTS may be
returned unexpectedly (as in the above example). It also precludes retrying
DELETE calls, since NOT_FOUND may be returned unexpectedly.
Another interpretation of this definition is that an operation is idempotent if repeating the operation results in the same state (or value).
This second definition allows for retrying all CRUD operations, since the interpretation is only concerned with state, and repeating any CRUD operation always results in the same state.
This document's retryability guidance follows the latter interpretation, which allows for retrying all CRUD operations. The reason is that retrying all operations offers users much more value, with little downside.
Consider the downside to retrying the CREATE and DELETE operations. For
users it generally just results in an extra code check:
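// Assumes imports of "google.golang.org/grpc/codes" and "google.golang.org/grpc/status".
resp, err := client.Create("Foo")
if err != nil {
	if status.Convert(err).Code() == codes.AlreadyExists { // <- extra code check
		// It already exists, so fetch the existing resource instead.
		resp, err = client.Get("Foo")
		if err != nil {
			panic(err)
		}
	} else {
		// An actual error.
		panic(err)
	}
}
_ = resp // TODO: Use resp.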
The above code is quite natural and exists in many codebases already: it is the
easy answer to asynchrony within a system, and a cheaper/more performant way
of writing code that doesn't know about state (the alternative is to always
perform a READ before a CREATE/DELETE, which is superfluous).
In comparison to the relatively small downside, there is a large advantage.
Users will almost always retry INTERNAL and UNAVAILABLE. By building retries
into a client library instead, users are saved from implementing retry loops,
deadlines, exponential backoff, and jitter.
Finally, consider that most users will have to write the if statement either way. After all, a user who implements retry logic themselves will still run into the same problem (and solution) described above.
There is an argument that some users may need to be absolutely sure of the
state of a system before retrying a non-GET request. Such users are an
argument in favour of only retrying GET requests and never retrying UPDATE,
CREATE, or DELETE. However, this set of users likely has very niche
requirements, and is dwarfed by the set of users for whom retrying is
advantageous.
A simple answer exists to support this set of users: provide an opt-out for retries. This approach fits the generally accepted goal that client libraries should set good defaults and offer opt-outs for the minority of users.
Another argument against retrying CREATE and DELETE is as follows:
Thread1: Get("Foo")
Thread1: > NOT_FOUND
Thread2: Get("Foo")
Thread2: > NOT_FOUND
Thread1: Create("Foo") // Let's work under the assumption that this should be retried.
Thread1: // ----> UNAVAILABLE
Thread2: Create("Foo")
Thread2: // ----> OK
Thread1: // ----> Create("Foo") # retry #1
Thread1: // ----> ALREADY_EXISTS
Thread1: > ALREADY_EXISTS
Here, a system has two threads that convince themselves that "Foo" does not
exist, and then try to create it. Because Thread1 retried the first failed
Create request, it unexpectedly received ALREADY_EXISTS.
However, it is easy to see that one of these requests would have to receive
ALREADY_EXISTS, even without retries. So, the problem here is not retrying:
the problem is that there is no synchronization between two components of the
system that are both concerned with the same resource. Removing the retry logic
would not solve the problem.
There are several solutions to this problem:
- Run these requests in a read-write transaction, ensuring atomicity.
- Make the two components in the system synchronous.
- Add a piece of code that considers ALREADY_EXISTS a success case (as described above in "Idempotency and pragmatism").