jeanbza/retry_adtl_notes.md

## retry_adtl_notes.md

      
The following are additional notes about which RPCs should be retried, and on which failures.

### Idempotency and pragmatism

A common concern with retrying CREATE and DELETE operations is that they are
non-idempotent operations. The functional concern can be summarized with this
example:

```
Get("Foo")
> NOT_FOUND
Create("Foo") // Let's work under the assumption that this should be retried.
// ----> UNAVAILABLE
// ----> Create("Foo") # retry #1
// ----> UNAVAILABLE   # actually, the request succeeded, but the connection died during the response
// ----> Create("Foo") # retry #2
// ----> ALREADY_EXISTS
> ALREADY_EXISTS
```

Here, an unexpected result - ALREADY_EXISTS - is returned from a Create. The
resource didn't exist before the call was initiated, so the user may wonder,
"How did I receive this result?".

The definition of idempotency is unfortunately vague in this regard. The Oxford
dictionary defines idempotent as:

> denoting an element of a set which is unchanged in value when multiplied or
> otherwise operated on by itself.

One interpretation of this definition is that an operation is idempotent if
repeating the operation results in the same response.

This definition allows for retrying READ and UPDATE requests, since
repeated runs always result in the same state and response code. However, this
definition precludes retrying CREATE calls, since ALREADY_EXISTS may be
returned unexpectedly (as in the above example). It also precludes retrying
DELETE calls, since NOT_FOUND may be returned unexpectedly.

Another interpretation of this definition is that an operation is idempotent if
repeating the operation results in the same state (or value).

This second definition allows for retrying all CRUD operations, since the
interpretation is only concerned with state, and repeating any CRUD operation
always results in the same state.
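
The difference between the two interpretations can be made concrete with a small sketch. The `store` type, its string results, and the `repeat` helper below are hypothetical stand-ins for a real service and its status codes; the sketch only checks whether running an operation twice yields the same response and whether it yields the same state.

```go
package main

import "fmt"

// store is a hypothetical in-memory resource store; the returned strings
// mimic the status codes used in the text.
type store map[string]string

func (s store) Create(name string) string {
	if _, ok := s[name]; ok {
		return "ALREADY_EXISTS"
	}
	s[name] = "value"
	return "OK"
}

func (s store) Delete(name string) string {
	if _, ok := s[name]; !ok {
		return "NOT_FOUND"
	}
	delete(s, name)
	return "OK"
}

// repeat runs op twice against s and reports whether the two responses match
// and whether the store's contents after each run match.
func repeat(s store, op func() string) (sameResponse, sameState bool) {
	first := op()
	stateAfterFirst := fmt.Sprint(s)
	second := op()
	stateAfterSecond := fmt.Sprint(s)
	return first == second, stateAfterFirst == stateAfterSecond
}

// repeatCreateAndDelete applies repeat to a Create and then to a Delete.
func repeatCreateAndDelete() (createResp, createState, deleteResp, deleteState bool) {
	s := store{}
	createResp, createState = repeat(s, func() string { return s.Create("Foo") })
	deleteResp, deleteState = repeat(s, func() string { return s.Delete("Foo") })
	return
}
```

In this sketch both Create and Delete fail the "same response" test, because the second call reports ALREADY_EXISTS or NOT_FOUND, but both pass the "same state" test, because the store's contents do not change after the first call.
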

This document's retryability guidance follows the latter interpretation,
which allows for retrying all CRUD operations. The reason is that retrying all
operations offers users much more value, with little downside.

Consider the downside to retrying the CREATE and DELETE operations. For
users, it generally just results in an extra code check:
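
```go
// Assumes the google.golang.org/grpc/codes and google.golang.org/grpc/status
// packages are imported, and that client is an already-constructed client.
resp, err := client.Create("Foo")
if err != nil {
  if status.Convert(err).Code() == codes.AlreadyExists { // <- extra code check
    // It already exists, so fetch it instead.
    resp, err = client.Get("Foo")
    if err != nil {
      panic(err)
    }
  } else {
    // An actual error.
    panic(err)
  }
}
_ = resp // TODO: Use resp.
```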

The above code is quite natural and exists in many codebases already: it is the
easy answer to asynchrony within a system, and a cheaper, more performant way
of writing code that doesn't know about state (the alternative is to always
perform a READ before a CREATE/DELETE, which is superfluous).

In comparison to the relatively small downside, there is a large advantage.
Users will almost always want to retry INTERNAL and UNAVAILABLE errors. By
building retries into the client library instead, users are saved from
implementing retry loops, deadlines, exponential backoff, and jitter.
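
To give a sense of what users would otherwise write themselves, here is a minimal sketch of such a retry loop. It is not any particular client library's implementation: the `retry` function, its parameters, and the sentinel errors standing in for the INTERNAL and UNAVAILABLE status codes are all assumptions made for illustration. The deadline comes from the `context.Context`, and the jitter strategy shown (a random sleep between zero and the current backoff) is one common choice among several.

```go
package main

import (
	"context"
	"errors"
	"math/rand"
	"time"
)

// Stand-ins for the INTERNAL and UNAVAILABLE status codes named in the text.
var (
	ErrInternal    = errors.New("INTERNAL")
	ErrUnavailable = errors.New("UNAVAILABLE")
)

func retryable(err error) bool {
	return errors.Is(err, ErrInternal) || errors.Is(err, ErrUnavailable)
}

// retry calls op until it succeeds, returns a non-retryable error, or ctx is
// done. The backoff doubles after each failed attempt, capped at maxBackoff.
func retry(ctx context.Context, initial, maxBackoff time.Duration, op func() error) error {
	if initial <= 0 {
		initial = time.Millisecond
	}
	if maxBackoff < initial {
		maxBackoff = initial
	}
	backoff := initial
	for {
		err := op()
		if err == nil || !retryable(err) {
			return err
		}
		// Jitter: sleep for a random duration in [0, backoff).
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-ctx.Done():
			return ctx.Err() // deadline exceeded or cancelled
		case <-time.After(sleep):
		}
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

// retryDemo runs an operation that fails twice with UNAVAILABLE and then
// succeeds, under a one-second deadline.
func retryDemo() (calls int, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err = retry(ctx, time.Millisecond, 8*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return ErrUnavailable
		}
		return nil
	})
	return calls, err
}
```

Even this small version has several details that are easy to get wrong (capping the backoff, honouring the deadline while sleeping, deciding which errors are retryable), which is the case for building it into the library once.
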

Finally, consider that most users will have to write the if statement either
way. After all, a user who implements retry logic themselves will still run
into the same problem (and solution) described above.

### Absolute correctness

There is an argument that some users may need to be absolutely sure of the
state of a system before retrying a non-GET request. Such users are an
argument in favour of only retrying GET requests, and never retrying UPDATE,
CREATE, or DELETE. However, this set of users likely has very niche
requirements, and is dwarfed by the set of users for whom retrying is
advantageous.

A simple answer exists to support this set of users: provide an opt-out for
retries. This approach fits the generally accepted goal of client libraries:
set good defaults, with opt-outs for the minority of users.
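
One way such an opt-out could look is a per-call option. The sketch below is only an illustration under assumed names: `CallOption`, `WithoutRetry`, `invoke`, and the default of five attempts are all hypothetical and do not describe any real library's API. For brevity it retries on every error, whereas a real client would retry only on selected codes and would back off between attempts.

```go
package main

import "errors"

// ErrUnavailable stands in for the UNAVAILABLE status code.
var ErrUnavailable = errors.New("UNAVAILABLE")

// callSettings holds per-call behaviour. Retrying is the default.
type callSettings struct {
	maxAttempts int
}

// CallOption adjusts the settings for a single call (hypothetical name).
type CallOption func(*callSettings)

// WithoutRetry opts a single call out of automatic retries.
func WithoutRetry() CallOption {
	return func(s *callSettings) { s.maxAttempts = 1 }
}

// invoke runs op until it succeeds or the attempt budget is used up, and
// reports how many attempts were made.
func invoke(op func() error, opts ...CallOption) (attempts int, err error) {
	s := callSettings{maxAttempts: 5} // assumed default
	for _, opt := range opts {
		opt(&s)
	}
	for attempts < s.maxAttempts {
		attempts++
		if err = op(); err == nil {
			return attempts, nil
		}
	}
	return attempts, err
}
```

A caller who needs to inspect the system's state after every failure would pass `WithoutRetry()` and receive the first error directly; everyone else gets retries without writing any extra code.
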

### The concurrency fallacy

Another argument against retrying CREATE and DELETE goes as follows:

```
Thread1: Get("Foo")
Thread1: > NOT_FOUND
Thread2: Get("Foo")
Thread2: > NOT_FOUND
Thread1: Create("Foo") // Let's work under the assumption that this should be retried.
Thread1: // ----> UNAVAILABLE
Thread2: Create("Foo")
Thread2: // ----> OK
Thread1: // ----> Create("Foo") # retry #1
Thread1: // ----> ALREADY_EXISTS
Thread1: > ALREADY_EXISTS
```

Here, a system has two threads that convince themselves that "Foo" does not
exist, and then try to create it. Because Thread1 retried its failed
Create request, the call unexpectedly returned ALREADY_EXISTS.

However, it is easy to see that one of these requests would have to receive
ALREADY_EXISTS, even without retries. So, the problem here is not retrying:
the problem is that there is no synchronization between two components of the
system that are both concerned with the same resource. Removing the retry logic
would not solve the problem.
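
The claim that one caller receives ALREADY_EXISTS even without retries can be checked with a small simulation. The `server` type and `raceCreate` helper are hypothetical; each goroutine plays the role of one thread and issues exactly one Create with no retry logic at all.

```go
package main

import "sync"

// server is a hypothetical in-memory service. Its own bookkeeping is
// consistent (guarded by a mutex), as a real server's would be.
type server struct {
	mu        sync.Mutex
	resources map[string]bool
}

func (s *server) Create(name string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.resources[name] {
		return "ALREADY_EXISTS"
	}
	s.resources[name] = true
	return "OK"
}

// raceCreate has n goroutines each issue exactly one Create for the same
// name, with no retries, and counts how many saw each result.
func raceCreate(n int) map[string]int {
	s := &server{resources: map[string]bool{}}
	results := make(chan string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			results <- s.Create("Foo")
		}()
	}
	wg.Wait()
	close(results)

	counts := map[string]int{}
	for r := range results {
		counts[r]++
	}
	return counts
}
```

With two callers, exactly one Create returns OK and the other returns ALREADY_EXISTS, whichever goroutine happens to run first. Retries play no part in that outcome.
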

There are several ways to solve this problem:

- Run these requests in a read-write transaction, ensuring atomicity.
- Synchronize the two components of the system.
- Add a piece of code that considers ALREADY_EXISTS a success case (as
  described above in "Idempotency and pragmatism").
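
As a rough illustration of the first two options, the sketch below makes the existence check and the create a single atomic step. It uses an in-process mutex, which only works when both components share a process; across processes, a read-write transaction in the backing service would play the same role. The `registry` type and `GetOrCreate` method are hypothetical names.

```go
package main

import "sync"

// registry is a hypothetical resource store whose check-then-create is
// performed under one lock, so no caller can slip in between the two steps.
type registry struct {
	mu        sync.Mutex
	resources map[string]string
}

// GetOrCreate returns the existing value for name, or stores and returns
// value if name is absent. created reports which of the two happened.
func (r *registry) GetOrCreate(name, value string) (got string, created bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if existing, ok := r.resources[name]; ok {
		return existing, false
	}
	r.resources[name] = value
	return value, true
}

// createdCount has n goroutines call GetOrCreate for the same name and
// returns how many of them actually created the resource.
func createdCount(n int) int {
	r := &registry{resources: map[string]string{}}
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		created int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, ok := r.GetOrCreate("Foo", "value"); ok {
				mu.Lock()
				created++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return created
}
```

Every caller gets a usable result and none sees an unexpected error: one of them creates "Foo" and the rest receive the existing value.
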