darrenbkl/blog.md Secret

## blog.md

      
    Raw
  

              blog.md
            
          
    Orchestrating Distributed Transaction in Microservices

#blog
Be glad when you start hating Microservices, that’s when you know you’re doing something right.
When writing microservices, it is easy to forget that it is part of a distributed system, especially if an entire team is working on a service (which can be subdivided further into sub-services), we need to be aware of the implicit network cost and failure.
I like to think of microservices as a stateless function, akin to the one in functional programming. An ideal microservice does only 1 thing, it has no dependency to other services, like a hunter in a dark forest, it makes no assumption about its surrounding, it only cares about how its own resiliency and survival.
On the other hand, a microservice that is aware of its surrounding is fragile.
Distributed Transaction

A distributed transaction is a transaction that spans multiple resources across the network, while preserving ACID properties. However, it is not unique to microservices. Its prevalence in microservices is due to the distributed nature of microservices, so transaction are usually implicitly distributed.
Hard Problem

To see why distributed transaction is inherently hard, let’s take a look at a textbook yet extremely common in real life application example - Ordering and Payment system.
Say we have an inventory management system, ordering system and a payment system, each modeled as a microservice. The boundary is clear, ordering system takes order, inventory system allocate stocks, while the payment system deals only with payment and refund related issue.
A single order transaction consist of creating an order + reserve stock + payment, in any order. Failure at any point during the transaction should revert everything before it. For example, a payment failure will cause the inventory system to release the reserved stocks, and the ordering system to cancel the order.
How to Implement a Fragile Distributed Transaction

A naive implementation typically uses HTTP for RPC, that’s fine if you’re working on a quick demo.

 diagram
When an order request comes in, ordering system takes the order, it makes an HTTP call to the inventory system to reserve stock. If the stock reservation succeeds, it calls the payment system to try to make payment with the credit card provided by the user.
Otherwise, stock is not reserved.
Now, if the payment fails, we gotta roll back the stock reservation by issuing a release stock, and roll back the order by issuing an order cancellation.
 diagram

There are some serious flaw with this approach.

Fallacy of distributed system -> Relies heavily on the stability of the network throughout the transaction
Transaction could end up in an indeterminate state -> Synchronous HTTP, timeout,
Fragile to topography changes -> Each component has explicit knowledge about its dependency

Synchronous HTTP call blocks indefinitely, imagine payment service calls some 3rd party API like PayPal or Stripe, the transaction is effectively out of your control. What happen if the API is down or throttled, or that there is network disruption along the network path, the transaction get stucked. Or, one of the 3 services is down, due to any of the 1000 reasons from application bug to submarine cable disruption.
To counter this, we could put in place a client timeout. But what would you set? 5s? 10s? 30s? Any number is arbitrary and making implicit assumption of the network. In fact, when a connection timed out, it doesn’t says anything about the state of the transaction at all, it merely concludes that the transaction has taken more than your specified timeout, and that the transaction is currently in Stock

 diagram showing the current state of the transaction in the middle between 2 state

Concretely, if the inventory system managed to reserve some stocks, but the payment system timed out for whatever reason, we cannot say that the payment has failed, as explained above. If we treat timeout as failure, we would have rolled back the stock reservation and cancel the order, but the payment actually did go through, perhaps the external payment API is taking more time than usual or network disruption, so we cut off the connection before payment service has a chance to respond. Now the transaction is in both Paid and Stock Released state simultaneously.
All these knowledge about the surrounding services forces a service to deal with the specifics, instead of improving its resiliency. This primordial form of distributed transaction relies heavily on the interaction with other services, and the network being reliable, succumbing to the  It is highly fragile to the topography changes or the slightest network disturbance.
Bandaids like exponential back-off Retry doesn’t solve the issue, although it may look like it does for 50% of the time. Let’s see how we can rearchitect this into something more robust.
Robust Strategy

A robust distributed transaction strategy should have the following properties

No explicit inter-service communication
Does not make assumption about the reliability of the network and the services
Global transaction as a series of local ACID transactions
Always in a known state
State is managed externally
Eventual consistency, but consistent nonetheless

We will try to improve the naive implementation with a simple Saga pattern.
Saga Pattern models the global distributed transaction as a series of local ACID transaction, with compensation as a rollback mechanism. The global transaction maintains a state and move between different  state depending on the result of local transaction execution.
There are generally 2 kinds of saga implementation.

Orchestration
Choreography

Orchestration-Based Saga

Orchestration-based Saga is a natural evolution from the naive implementation, because it can be incrementally adopted.
An orchestrator, or transaction manager is responsible for coordinating the transaction flow, that is, communicating with the appropriate services that involves in the distributed transaction, manages the state of the transaction, and orchestrate the necessary compensation. The orchestrator is aware of the context of the transaction, but the services is not aware of it.
Implementing the Orchestrator and compensation feature.
The orchestrator manages the transaction state, and calls the appropriate compensation depending on the services’ response. This way the services is not unaware of the transaction in progress, and decoupled from the surround services.
However, the orchestrator remains the bottleneck of the whole transaction, since we have just move the communication between services, to communication between orchestrator and services. We need a way to remove the this explicit communication, so that no one is aware of each other.
Remove synchronous HTTP communication

Recall the inventory service that gets stuck waiting for payment service, that’s because of the RPC in the middle of its local transaction.
A service’s local atomic transaction should ideally consist of just 2 steps

Local business logic
Notify a 3rd party system of its work done
That’s all, no long, synchronous, blocking RPC somewhere in the middle of the local transaction. When it has done its job, it just need to tell the that it’s done.


 diagram showing RPC in transaction, and no RPC

Now the 2nd step is tricky, because notifying a message broker is essentially an RPC in itself, albeit a much shorter one, but an RPC nevertheless. Is there a way to totally eliminate any form of RPC so that we can reduce the local transaction into one that is purely ACID-based?
Notification of Local Work Done via Transactional Outbox

To combine step 1 and 2 into one ACID transaction, we can make use of Transactional Outbox pattern.
When we write the result of the local transaction into the database, we include the work done message as part of the transaction as well. This way, the demarcation of local transaction and notification is clear. The services has done its job, and the work done message is ready to be picked up by the message broker for publishing.
Now we only need to figure a way to know WHEN to pickup the message, there must be a way for the database to notify the message broker that something has just been inserted. Enter Change Data Capture.
Debezium and Kafka Connect is an excellent tool to capture the notification and stream it into Kafka.

 diagram

Handling Request Idempotency

The orchestrator also ensure request idempotency to prevent invoking the transaction twice. If the orchestrator receives a duplicated request, i.e. a request with an duplicated idempotent key, it is returned the result, or return an error (what’s the error code for duplicated idempotency?). This also means that we need to store both the idempotency key and all API response.
Relation to Antifragility
Though not exactly an antifragile architecture, I’m not even sure if

This architecture adheres to the core principle of antifragility. The naive implementation will work perfectly probably around 90% of the time, but subjected to 10% catastrophe with huge loss. The saga based architecture may seems less optimized due to the eventually consistent model, but it is tolerant to the fluctuation of network, in the event of catastrophe, its downside is limited. This is the essence of anti fragility.
Conclusion

Distributed Transaction is hard, doing it well requires some effort. One way of doing it through Orchestration- based Saga pattern. It is not important to note this is not a remedy to apply a “traditional transaction” at the level of distributed system, rather, it is Long Running Transaction that models transaction as a state machine, with each service’s local ACID transaction as a state transition function. It guarantees that the transaction is always in one of the many defined state due to the local ACID transaction. In the event of network disruption, you can always fix the problem and resume the transaction from the last known state.
Advantage

No dependency between services

Disadvantages

We are just removing the dependency of the fine grain services to the orchestration layer (diagram), of course the this layer will consist of significantly lesser orchestrator then the services, but they have a dependency to other services, so the more orchestrator the more fragile the system.