craigtp/AdvancedDistributedSystemDesignCourseNotes.md

## AdvancedDistributedSystemDesignCourseNotes.md

      
    Raw
  

              AdvancedDistributedSystemDesignCourseNotes.md
            
          
    Advanced Distributed System Design Course - Udi Dahan

Notes by Craig Phillips
Fallacies of Distributed Computing


There are 11 fallacies of Distributed Computing:

The network is reliable
Latency isn’t a problem
Bandwidth isn’t a problem
The network is secure
The topology won’t change
The administrator will know what to do
Transport cost isn’t a problem
The network is homogeneous
The system is atomic/monolithic
The system is finished
Business logic can and should be centralized


Originally, there was only the first 8.  These were mostly formulated by L. Peter Deutsch and James Gosling around 1994-1997.  The last 3 were added by Ted Neward in 2006.
First rule of distributed objects is "Don't distribute objects".  This is the idea of proxy objects that appear to be "local" calls but that mask a network call.  e.g.  var service = new SomeService(); var result = service.Process(data); where the call to .Process masks a network call as the method call goes against a proxy.  This is a distributed object and can be evil!
For latency, if a single CPU clock cycle was scaled up to say it takes 1 second, access to RAM is equivalent to 6 minutes, a TCP packet from New York to San Francisco is 4 years, and a re-transmission of a TCP packet takes between 105 to 317 years!
Of all the developments in computing where speeds/quantities have increased almost exponentially in recent years, network bandwidth has grown the slowest.
Adding more servers on a network is like adding more cars to a motorway.  Saturation happens quickly and everything slows down.
Network bandwidth, latency and reliability are all interlinked.  If bandwidth gets maxed out, packets need to be resent, which increases latency and decreases reliability.
You can never be 100% secure and most security lapses are attacks against the human aspect of the system.
With limited bandwidth, we can't eager-load or lazy-load everything in a large object graph, so the solution is to move towards multiple, smaller domain models specific to different use cases.  Allows loading all data for given use case as it's not too large.
It's not a question of "if" you'll get hacked, it's a question of "when".
Network topology should be expected to change.  Machines can lose network connectivity or can change IP Address, move to a different subnet etc.  Never hard-code IP Addresses, use resilient protocols and discovery mechanisms
Backwards compatibility is hard.  You need to have the ability to concurrently run version N and version N+1 of a given service in production.  Failure to do this will result in system downtime.
Don't over log.  Some logging is helpful, but too much can be a burden.  When you're looking for a needle in a haystack, don't create a bigger haystack!
Serialization and deserialization adds cost to calls over the network, this can be exacerbated when running in the cloud.  Beware as it's often code that we don't profile for performance.
Networks are not homogenous, even using JSON over REST/HTTP can have problems with one serializer from one platform not being compatible with a serializer from another platform.
Semantic interoperability is hard so should be budgeted for.  This is the "meaning" of data being correctly understood by both sending and receiving parties, or two systems that have different requirements around data properties.
Maintenance is hard in "big balls of mud" as changing one part of the system often affects other parts.
Integration through the database creates coupling.  Things get even worse when schemas are locked down so developers start to store JSON/XML/CSV of multiple values in a single string column.
Scaling systems across multiple machines that were never designed for such may actually hurt performance.  This is especially true if there's a single common database used, as more servers hitting the same database makes things worse.
Maintenance costs over the lifetime of a system are greater than its development costs, however, there's really no such thing as software "maintenance" as software is never really "finished".
The business doesn't really give developers requirements.  They have an inventive workaround to achieve something they need done, and it's that that is presented as a requirement.
Estimates for work should always be given as "Given as well-formed team of size S, that is not working on anything else, I'm C% confident work will take between T1 and T2".
If business logic is centralized in a single place, places that need to use that logic naturally have to take a dependency on it, which creates undesired coupling.
With discipline in development effort, we can have traceability of all code that implements a given feature or fixes a given bug.  Can often be achieved by tracking from issue/ticket to the pull request, through to the actual code files modified inside of our Repository/CI/CD tools (i.e. GitHub, Azure DevOps etc.)
In summary, best practices have yet to catch up to "best thinking", technology cannot solve all problems and adding more hardware doesn't necessarily help.

Coupling


Coupling is a measure of dependencies between two parts of a system.
There's two types of coupling, Afferent and Efferent.  Afferent coupling is "incoming" - who depends on you.  Efferent coupling is "outgoing" - who you depend on.
When we refer to "tightly coupled" code, it's more related to outgoing (Efferent) coupling rather than incoming coupling.  Incoming coupling is usually found in low-level generic code such as a framework (logging, ORM, DI etc.).  Outgoing coupling is usually against domain-specific code.
DTO's often have more incoming (afferent) coupling than outgoing (efferent) coupling.
Most important is to understand if you're coupled to stable dependencies or volatile dependencies.  Generic framework code is more stable than domain-specific code.
Beware of "hidden" coupling. This can be seen with database integration.  If A and B both talk to the same database, we can effectively have A calling B (or vice-versa) but it's not explicit so not seen.  Make coupling explicit!
Keep coupling low, but not "mechanically" low - i.e. Don't reduce 5 different methods into a single generic method that just hides the coupling.
Coupling exists in platforms too. This is also known as "interoperability".  Only use protocols that exist on both platforms talking to each other and remember the tenets of service-oriented architecture (SOA), "Share contracts and schema, not classes or types".
We have Temporal coupling in which time affects services depending upon each other.  If the processing time of B can affect A, then A is temporally coupled to B.  Synchronous communication is often temporally coupled, but asynchronous communication is usually not.
Spatial coupling refers to the amount by which one service is bound to a specific machine to complete an operation.  e.g. A well constructed web application that can load balance across multiple machines has low/no spatial coupling.  A service running on a single machine will introduce high spatial coupling for consumers of that service.
Schemas for text-based contracts between services are primarily designed for developer productivity, rather than for interoperability.
Multi-threaded code should be avoided if possible.  Synchronous code is far easier to reason about than asynchronous / multi-threaded code.
To overcome temporal issues with request/response, we move towards a callback model.  This is more of a pub/sub model so the service that needs the data subscribes and keeps it's own cached copy prior to use.
Temporal constraints mean that subscribers must be able to make decisions based on stale data, it requires a strong division of responsibility between publishers and subscribers and there must be a single logical publisher of a given kind of event (single source of truth).
Publish/Subscribe communication naturally implies eventual consistency - things make be out of sync for a given amount of time before becoming consistent.
In designing events, avoid requests or commands instead state something that has already happened, in the past tense.  Subscribers should not be able to invalidate the event.
If including data inside of events (i.e. a Product Price), strongly consider also including validity of that data (i.e. Price valid from [some_date] to [some_other_date/time]).
If business requirements demand consistency between two things, these things have to be in the same service and can't be separated communicating only via pub/sub.  Can't even talk with simple request/response either unless it's possible to add a distributed transaction around it.
To solve spatial coupling issues, look into the Service Agent Pattern.
Avoid document-centric messaging. This is where some parts of a larger document have to be routed to specific places for processing (i.e. government for that says, if you tick this, skip to section 4).  Doing this makes spatial coupling a core part of the technological solution due to it being content-based routing.  Prefer strongly-typed messages for individual document sections that can be published and subscribed by concerned services.
In summary, coupling is a function of the 5 different dimensions.  These are sometimes interdependent, so reducing one type of coupling may require more of another type.  There's not really loosely coupled or strongly coupled, but "for this component, is the coupling right?"

Introduction To Messaging


Messaging is an effective mechanism to reduce both afferent and efferent coupling.  It also reduces temporal coupling and standard serialization mechanisms (such as JSON/AMQP) reduces platform coupling.
One-way, fire-and-forget asynchronous messaging is the most basic type of messaging, and all other types of messaging is built on top of this.
Most message queues use store-and-forward.  Published messages are actually stored in an outbox on the local machine.  This is synchronous.  Eventually the messages in the local outbox are forwarded to the messaging server and this is asynchronous and has automatic retries built-in as part of the messaging platform.
All messages should have a unique Id.  This is useful for the messaging server's store-and-forward capabilities but also for our own abilities to identify duplicate messages.
When comparing RPC vs Messaging, it can appear the RPC calls and Messaging have equivalent throughput at first, however, if monitored over time, RPC throughput decreases due to increased memory load waiting for calls to return whereas messaging throughput tends to increase and flatten at maximum throughput.
Messaging is more resilient under load as it can use durable disk to handle the backlog of messages to be processed versus RPC which requires memory and threads to handle the backlog of requests.
Standard service layers of an API can get very large and it becomes difficult to split individual methods into low vs high priority.
Instead, we represent methods as messages and all messages are strongly-typed.
Messages are handled by the recipient with separate handling logic based on polymorphism for a IHandleMessages<T> interface.  We can have multiple handlers per message to handle different versions of the same message.
In the event of failures, RPC calls will require any retry logic to be implemented explicitly otherwise data can be lost (i.e. call to webserver is ok, but webserver fails to talk to DB).  With Messaging, we often get retry logic as part of the messaging infrastructure - a message that fails to be correctly processed remains on the queue and will be retried.
Messaging affords the ability to easily implement auditing or journaling by simply having an additional subscriber that receives a copy of every message from the queue.  By using Correlation Ids on messages, we can create visualisations / logs to show the exact stream and order of messages and which messages were created as a result of prior messages.
Don't violate the "rule of 2".  Any given process must only use one type of communication infrastructure i.e. only messaging or only a DB call or only a web service request/response. If you try to do two of these in the same process, you can potentially duplicate messages with potentially different message Ids (thus, preventing downstream processes from being able to de-duplicate itself).
If a third-party web service doesn't de-duplicate, it's possible to attempt to mimic this functionality if the web service allows the ability to query it to ask if it's seen a given message before, although this is not bulletproof and can fail on some edge cases.

Messaging Patterns


There is a problem known as the "sequential update" problem.  When one user issues two sequential updates, say to a customer's address.  This is not a messaging problem, although messages can get delivered out of order, but is actually an asynchronicity or parallel problem.  This can be solved by ensuring that each update has a version or sequence number associated with it so that subsequent updates include the prior version number to check in the database, otherwise it's considered a concurrency exception
The Return Address pattern is for when you're expecting a response to a message.  You have two different channels on your queue, one for requests and one for responses.  Responses are simply messages just like requests.
We need to use Correlation Ids on all messages to ensure that Response messages, which can arrive asynchronously at a later time, contain the same Correlation Id as the Request message had to allow the subscriber to determine which response goes with which request.
A single request message can have multiple response messages.  This is effectively equivalent to a HTTP request/response that returns with a HTTP 202 status code to indicate a URL the client should poll for the result.
The Pub/Sub pattern (Publish/Subscribe) is actually better called Sub/Pub (Subscribe/Publish).  Pub/Sub is too easy to mentally translate into Request/Response, which is incorrect, as subscribe has to happen first before the publisher can provide messages.  This is in the context of subscribers communicating directly with publishers (i.e. when not using a message broker).
Don't forget consistency boundaries!  If a subscriber is scaled out to multiple machines, make sure you know whether only one machine should receive the message from the publisher, or whether all machines can receive the message.
When moving from in-process events to distributed events, the publisher, by definition, cannot know when the subscriber has received and processed the message whereas this is possible when running in-process.
Don't throw exceptions for immediate failures to send an event/message.  If there's a failure to process, we can put the message back on the queue to retry again.  If we fail to process after N number of times, we "remove" the message and write to the error queue (aka dead letter/poisoned message).  This means that logs from the business code are far less important than monitoring the error queue to determine system health.
If you're trying to introduce messaging into a company that's never used it before, don't immediately jump to installing a messaging framework (RabbitMQ etc.).  Start small, say use messaging for a simple email sender service, and use a database for the message persistence.  Get developers used to writing event driven code using message classes/interfaces and message handlers.  Allow operations and admin people to feel comfortable with the underlying infrastructure. Once everyone is comfortable, explain how database licenses could be reduced by moving to a messaging framework etc.

Architectural Styles - Bus & Broker


What is an "architectural style"?  It defines what is and what is not allowed inside an architecture, and a system is likely to have multiple different architectural styles.
There are two types of architectural styles for messaging, bus and broker.
The Broker pattern is also known as "Hub & Spoke" or "Mediator". It's a thing that effectively sits in the middle of all other services that need to communicate with each other. Each service communicates with the single broker rather than trying to communicate with multiple other services.
Advantages of the broker style are that all communication is centralized, so can be centrally managed and configured, it enables intelligent routing and transformation of data and doesn't require changes to the applications/services that communicate with it.
Disadvantages of the broker style are that it embodies the fallacy of centralizing business logic, make unit testing difficult since the broker is central to the operation and communication of the system and prevent applications from gaining true autonomy.  The rise of the SOA (Service-Oriented Architecture) was a reaction to the large brokers of the 80's and 90's.
The Bus architectural style is designed around event sources and event sinks - publishers and consumers of events. It was designed to allow independent evolution of sources and sinks.
The hardware PCI bus is the same pattern.  Ethernet networks are also based on the same pattern. There's a physical unique MAC addresses for each device, and a logical IP address that sits on top and allows for routing messages.  It's based on the concept of "smart endpoints, dumb pipes".
For the Bus style, there's no physical central bus.  Like an ethernet, the bus is effectively everywhere, distributed across all devices on the network.
The bus style is simpler than the broker style as is has no content-based routing or data transformations.
Broker style is about integrating existing applications/services. The Bus style is about creating a communication mechanism that allows for application/services that may come in the future to be able to communicate.
Many commercial Enterprise Service Bus products are actually Brokers and not Buses.
Advantages of the bus style are that there's no single point of failure and it doesn't break service autonomy.
Disadvantage of bus style is that it's harder to design distributed systems for a bus than it is for a centralized broker.
It's not bus versus broker, but more of, where does it make more sense to use a bus and where does it make more sense to use a broker.  You'll likely use both in a given system.

SOA Building Blocks


What is a service?  Tenets of service-orientation are:

Services are autonomous.
Services have explicit boundaries
Services share contract & schema, not class or type
Service interaction is controlled by policy.


The original tenets didn't really relate SOA to the business needs, so a better definition of a service is that a service is the technical authority for a specific business capability.
Anything that provides only a function (calculation, validation etc.) or anything that provides only data (CRUD) are not services, even if we sometimes incorrectly use that term to describe them.
Services have complete autonomy over the business capability that they exist to serve.  This means that they can be supplied with all of the data they require to perform their required functionality.
Business processes are entirely contained within the business capability, and thus within the service.  An Enterprise process can span multiple services.
With separate, independent services, we don't have an orchestrator that invokes one service after another.  This is the very essence of the bus implementation. Each service receives messages and can emit other messages which other services can react to.
A classic example of separate services is an Amazon product web page.  There's not really a "page" as such, as various data - product name/price, product inventory, product ratings will all come from different services - including the display responsibility.
Brokers tend to be aligned more with system boundaries (where system here could be mobile client, web client, back-end etc.).  Buses tend to be more cross-cutting, encapsulating a slice through a number of systems.
For service deployments, many services can be deployed onto the same machine, or deployed in the same application or co-operating within a single workflow.
To share schema between different services, this is usually done by putting all of the message classes/interfaces in a single class library and packaging it up as a (e.g.) NuGet package that the services needing to communicate can take a dependency on.
When looking at "business entities" (i.e a customer) we need to look at all the data associated with the entity and ask if it belongs together.  If a customer has an Id and Name, Address, Phone Number & PreferredStatus fields, we don't necessarily put them all on one "entity" or object.  We ask if each data field is strongly affiliated with others - e.g. Is the PreferredStatus strongly affiliated with the customer Name?  Address?  No.  It's only affiliated with other data (e.g. Purchases) that are held in a different service and different entities.  Therefore, the PreferredStatus only relates to the Customer Id, so the "entity" here would consist of only CustomerId and PreferredStatus fields.
A main design point: When you hear domain language that says "entity has some value" see if you can break it apart to say "Value is associated with entity Id, but it belongs somewhere else".  This will help avoid creating larger monolithic entities that frequently require request/response between them.
A general rules of thumb for SOA is: "Don't try to model the real world, as it doesn't exist".  More often than not, what we think of as a real world entity i.e. a Product, a Customer, is really separate collections of attributes that each live in different places (services) with only an Id to tie them together.
In order to allow child components to evolve, instead of having the parent point to the child, have the child point to the parent.  i.e. Instead of an Order having a Product Id, and a Shipping Address Id etc., we have the Shipping Address contain the Order Id and the Product (for the order) contain the Order Id.
By creating the Order Id (the main identifier) in advance, it can be used by separate services for persistence - i.e. when the user enters shipping information for the order, even though the order process isn't completed yet.  The Order Id could be stored in session state in the browser so that other services owning other parts of the order process can use that Order Id.
Using Id's in this manner means that separate services don't need to pass too much data to each other.
To create view models that represent all the required data from different services for one page, we should use an SOA-friendly view model, these are different from a more "traditional" view model.  A traditional view model showing product name, price and inventory may look something like public List<ProductInfo> ProductInfos; where ProductInfo has properties for Name, Price and Inventory.  A more SOA-friendly view model would have multiple dictionaries associating the product Id with the various properties i.e. Dictionary<string, string> containing ProductId and Name and a separate Dictionary<string, decimal> containing the ProductId and Price and finally a Dictionary<string, int> containing ProductId and Inventory.  Each separate service populates the respective dictionary with the data that it owns.
There are IT/Ops concerns for all of our services.  These are not direct dependencies, but a set of standards, or conventions, that all services should adhere to.  The message queue platform, the type authentication/authorisation used etc.
Be careful not to stray into business logic territory here.  Keep it generic.
IT/Ops components can use a hub & spoke mechanism to perform its processes.  A good example for this is where the system requires importing data, perhaps from a third-party system.  Each service will have code that talks, in-process, to the IT/Ops process code in order to invoke the process logic.  e.g. An OrderAcceptedHandler could depend upon an IBillCustomerForOrders object which comes from the IT/Ops code.  This in turn is responsible for, and depend upon other code such as IProvideOrderTotal and IProvideCreditCard.
IT/Ops encapsulates all of the technical concerns required in the business.  It's not a business/domain capability per-se, but it the technology required within them.

Services Modelling


The exercise presented is to model the service boundaries of a simple hotel with search availability, book room, check-in and check-out.
It's ok to share such data as Id's between services, but not data such as (say) a Price of a product.  It's all about volatilities.  The concept of an OrderId or a hotel ReservationId is unlikely to change over time.  A price of a room - or what actually constitutes a "price" - can change over time.
Identifying service boundaries is hard, especially in new domains.  It doesn't necessarily get easier even when done multiple times.
Be aware of using departmental boundaries as identifying service boundaries, they're usually not.  Departments are fiefdoms and largely defined by politics and managerial power struggles.
Be aware of processes, usually ending in "ing" at the end of their names (i.e. Billing, Shipping etc.) are processes rather than capabilities.  Most processes involve multiple capabilities working together. i.e. Booking a hotel room would likely involve multiple services.
Be aware of using entities as the basis for services (i.e. Booking, Room, Reservation, Guest). This is the traditional, non-SOA way of analysing the business domain.  It leads to lots of request/response between the entities and also means that bits of each entity end up in multiple different services.
Make use of "anti-requirements".  e.g. When searching hotel rooms, we can intuit that price and description would come from different services, so we frame a question to the business which might try to refute our intuition - "Would you ever need a business rule that sets minimum price if the description is more than 20 characters?"  The business will laugh and think it's stupid, but you've found a completely stable business invariant.
Software Architecture is all about finding the stable business abstractions and aligning the software abstractions with those boundaries and also to encapsulate the volatile abstractions. One of the indications that we've got this right is that we don't need a lot of request/responses between them.
In the event that a Room's price can change for a booking, we might initially think to copy the price from the price service into the record in the booking service.  It's better to put the BookingId from the booking service into the record in the Price service as the BookingId is a less volatile concept than the Price.  It also means that, should the business wish to start accepting Bitcoin as payment, we only need to change the Price service and nothing else.
Within User-facing processes, the service-to-service interaction is done in-process and any remote calls are usually done within a service boundary.
Be aware of using terminology from different domains that you may understand better into this domain.  You may wish to call something that feels like "inventory" by that name, but inventory in retailing may allow for easy replenishment when diminished, but in the hotel domain, inventory is usually very fixed based on physical rooms in a building.

Business & Autonomous Components


Business Components are how we sub-divide a Business Capability.  This is especially useful if the software implementing the Business Capability feels as if it's got "too fat".
Business Component and Business Capability share the same acronym as a Domain-Driven Design Bounded Context, and although that's coincidental, it's actually very useful!  Business Capability is the same as a Bounded Context!
A Business Component is the technical authority for a section of the larger Business Capability.
One indicator of when to break a capability into multiple components might be when you realise some customers are buying 1 million products at a time instead of 10.  The business would not accept that the biggest customers are potentially receiving the worse quality of service as, technically, it'll take longer to process 1 million products rather than 10.  So we might break our Sales service to have two components, one for "normal" customers and one for "strategic" customers.
Another, technological, indicator is when you have lots of code in a service that starts to become "schizophrenic" and is being pulled in two different directions.  This is usually manifested as having lots of conditional logic / branching in key places.
When you sub-divide a business environment into services, you'll have many, maybe 7 to 10. But when you sub-divide a service into business components, you should never really have more than 2 or 3.
Other examples of differentiators for needing separate components: Airlines have standard class and business class. Shipping may ship paper products, but then need to ship ice cream in temperature controlled vehicles. Customers may be invoiced once per month whilst others may be directly charged via credit card.
Business components don't talk to each other.  Although they usually deal with the same business function (e.g. charging a customer), they do it in two very different ways.
Autonomous components are created to handle different types of messages in the system.  One differentiator is messages that change state (commands) versus messages that don't change state (queries).  Business Components are composed of multiple Autonomous Components, with each Autonomous Component being responsible for one or more message types.  For example, a set of CRUD operations would be a single autonomous unit, so grouped into it's own Autonomous Component.
Autonomous Components are usually packaged as their own package (NuGet).
Autonomous Components should be completely decoupled.  Don't try to reuse code between multiple Autonomous Components.  They should be small enough that they could be thrown away and re-written from scratch and should be written to solve today's problem only, not tomorrows. JFHCI (Just $%&*ing hard-code it!)
Autonomous Components are the unit of packaging within SOA - it's what comes out of our CI pipelines.  This is the technological view of the architecture.  The Autonomous Components are then deployed into Systems/Services - these are a purely logical view of the architecture.
When defining service boundaries, the correct boundaries are usually obvious.  For example, in the insurance industry, services are usually split into Policy (front-office) and Claims (back-office), but for any insurance product, there's a coupling between Claim and Policy as any Claim needs lots of data from the Policy.  A much better set of service boundaries is across different types of insurance products (home, motor etc.) as motor insurance is handled differently from home insurance.  This will be obvious to the business, but that's how you know you've got the boundaries correct.
To determine boundaries, always look for the things that change together, or require data/behaviour from other areas.
Reporting within an SOA system can be difficult.  An insurance customer may have both home and motor insurance, but the customer data will be duplicated within each service boundary so it can be difficult to provide a customer report to show all insurance owned by each given customer.
Many reports that users want can be encoded into the domain by examining and digging into the values they use for their filters and queries on the reports.  e.g. Hotels might want a "report" of all guests checking in on a given day, but we can encode that domain logic requirement into the service that owns that data and proactively send an email to the user each day with the data they require.
For many purposes, customers use reports as a tool to essentially provide pattern-matching over data, but by encoding the report logic in the domain, we provide the customer with a better tool to achieve the same result.
Reporting is really orthogonal to SOA.  SOA is about finding stable business abstractions, but reporting is about trying to uncover something about the business that no-one else knows which is not a stable business abstraction.
Referential Integrity within an SOA architecture is that we can't enforce foreign key relationships between data owned by different services.  This is ultimately solved with Eventual Consistency.
Relational databases having strict referential integrity can be problematic as there are genuine business states of partial consistency that are still valuable.  These are often held in memory, session state or a memory cache before they're ultimately persisted to the database once consistent.
Eventual Consistency is able to work due to the system being able to recover from transient inconsistency. i.e. checking for some data that may not exist is retried slightly later when the data then may exist.
Cascading deletes are evil - we don't delete all outstanding orders and inventory when we delete a product. There is the concept of private data - data that has never left the boundaries of the service - and public data - data that has had it's Id "published" outside of the service boundary.  For this reason, it can be helpful to give users the ability to "publish" for any data they maintain in the service (data is simply flagged as such).  Can CRUD to heart's content when not published, but as soon as published, cannot ever delete.
Team Structure for building SOA architectures is usually based around individual teams for each service, often with a small group of IT/Ops people that straddle all of those groups helping to steer things like language/framework choice etc.  This can, however, sometimes lead to silo-ism.
An alternative approach is to use "Task Forces". These are ad-hoc teams created to implement specific features within any specific service, but this is often combined with core small collections of people who retain the knowledge of the code for specific services, allowing such knowledge to be disseminated throughout other teams.
Be aware that SOA architecture doesn't really work in some types of business.  Start-ups exploring an idea to see if a business can be made, or Generic/Extensible Platform type products (think SAP) are two types of business for whom SOA won't really help.

Service Structure - Command/Query Responsibility Segregation


The start of the CQRS journey starts with multi-user collaboration, and is primarily around the single-writer, multiple-readers scenario.
Different data has different read/write properties.  A book description on Amazon has one writer but many readers.  A book rating would have many writers that make up the overall rating value.
A fundamental truth to realise is that, when users read data, they're reading "stale" data.  This is true of any software information system.
Scaling databases out in non-collaborative domains can be achieved with sharding. This is splitting the data written over multiple servers, and it works well for data that is non-collaborative with one (or very few) writers for a given piece of data.
In collaborative domains with traditional solutions, we often encounter the situation where high numbers of writes against the same data record (say purchases for a single popular book) cause the database to lock, this creates contention and ties up connections, draining the connection pool and rendering the entire database unusable for all users (people looking to buy other books).
A naive approach to fix this might be to put a queue in front of the database and queue the updates.  This helps the database, but hurts throughput as now people who are trying to buy the less popular books where high contention does not exist now have to wait in line in the queue.
Another approach might be to add the orders for the books to a new table, thus performing an insert operation for each order, rather than an update.  But, if the order operation checks inventory too, we've now moved that responsibility from the command side (update database record whilst checking inventory) to the query side - we can no longer check inventory within the command, but in the query, we look at inventory and take into account the inserted records representing orders.  Reads don't lock records in the DB, so this can create a situation where two or more orders doing the query to determine inventory can both read the same value concurrently leading to negative inventory.
The actual answer here is not to worry about temporarily negative inventory.  We allow that new interim business state to exist, but we can work around this (by getting better at replenishing inventory). This is the price we pay for highly available and highly concurrent systems.
CQRS isn't just an implementation pattern.  It's also about the approach and business analysis, focused on uncovering all the "collaboration" areas on specific data within the domain.
When thinking about how to implement a new feature, focus on the data model to ensure it can handle concurrency.  Look at updates to that model - are certain individual records likely to have high contention? Look at the data as if it were on pieces of paper on a table - how does it naturally relate?  If there's a natural relational model, or a natural graph model, that will inform the choice of database used for the service.  This is why different services may use entirely different database technologies.
For queries within CQRS, be explicit about query staleness.  Ensure all queries/models include a timestamp of when the read model was created (as at).
Keep queries simple.  Code should be able to pretty much directly access the data store and invoke simple queries in order to retrieve the required read model (i.e. SELECT * FROM [MyReadModel] WHERE Id = X).
Be aware if you're creating read models that appear to differ in a minor way (i.e. a report for standard users containing CustomerId, Name, PhoneNumber and a report for supervisors that contains the same fields and adds LifetimeValue).  Lifetime value is derived from orders and prices, and will exist in a different service than customer names and phone numbers, so always re-evaluate the service boundaries to ensure they're correct.
When we cache read models close to the UI tier of the solution, we can leverage the read model for quick validation checks for new data being entered into the system.
For the "unique user name" problem, we send the command to create the user with a client-generated UserID (Guid) and username, the server checks and sees that the username is taken, but we persist the user record as it's the UserID that's most important. We change the username (e.g. "billsmith" becomes "billsmith25") and return that to the user with an apology that the name is taken.  User can always change the username after the fact, but the UserID remains the same.  So long as we know the command won't fail and the UserID will persist, we can actually issue more commands from the client before we've even got a response from the server relating to the original sign-up request command.
We need to perform sufficient business analysis to attempt to be able to create commands that won't fail.  There can often be race conditions between multiple users updating the same data that would cause a command to fail.  We should examine race conditions to see if the two races, if reversed, end in the same result.  CQRS allows us to get to the point of finding the command failures based upon collaboration and to find those collaborations, some of which can be hidden at first look.
Very frequently, our UI's are designed in ways that work against the ability to allow collaboration.  When we present users with many fields or tables of data that they can edit in one big batch and click save at the end, we'll often run into optimistic concurrency violations.  Users may learn to perform smaller edits before clicking save, but the UI is fundamentally working against the ability to perform collaborative updates. In these cases, we should really fix the UI first.
Some domains, such as booking seats in a theatre, where the "collaboration" against the data can't really work in an eventually consistent manner won't benefit from having CQRS applied against them.
Some approaches to help (but not completely fix) the problem is to divide the complete set of theatre seats into sections, perhaps priced differently.  Each section can be handled by a different server.  We're taking a concurrency problem of 10000 people fighting over all of the seats and breaking it down to 10 x 1000 people fighting over sub-sets of the seats.
When uncovering a collaborative domain and running into scalability problems, many times you'll have to almost re-invent how the domain works in order to resolve the collaboration problems, such as having entirely different UI / user workflows etc.
If you have contention concerns on the write side (commands), you'll need to take those concerns into account when updating read models. i.e. Can't update read model on every command, must batch the updates.
The concept of "CRUD on an entity" is a product of a single-user mentality.
The CAP Theorem states you can't have consistency, availability and partition tolerance (another way of saying distributed).  Easiest way to scale is on a single big server - you're keeping consistency and availability but sacrificing partition tolerance.  If there's business pushback against that approach and a declaration that "in this other system, we achieved it", it's often that consistency has been sacrificed somewhere - possibly even resulting in data loss.
When users ask "build me something that works like Excel" and follow it up with a request to be able to save some specific combination of filters etc. with a given name - this is a clear indication of a more specific domain concept and requirement.  We can even "look through" the set of criteria the user is using to uncover e.g. specific constants that are used more than once in different computations.  These are often other domain concepts that can be uncovered and modelled explicitly.
There's something known as the "Engine Pattern".  This is used for business functions that require complex calculations i.e. for calculating a product price, determining if an order is fraudulent etc.  This takes the form of having each service involved in the calculation operating on it's own data, and performing a transform of the input data. e.g. Hotel room has list price of $500.  The service dealing with dates can multiply the rate by 1.2 due to high season, the next service dealing with guests may decrement price by multiplying by 0.7 due to discount for many people etc.  It's important that each step of the transformation is owned by the service that owns the data involved in the step. The total calculation is spread across multiple services.
When building engines with this pattern, it's often done by a "team" that consists of members from each of the service teams that are involved in the engine's calculation.  The code would live in the IT/Ops areas of each of the various services.

Scalability and Flexibility / Monitoring and Management


Rather than using microservices, we usually deploy our services as a collection of NuGet packages for each service.  The advantage of microservices is that we simplify deployment - one deployment means all clients benefit immediately.  The disadvantage is that if we have a bug, we break everything immediately.  When using multiple NuGet packages for each service (e.g. WebShop.Branding, Webshop.Schema, Webshop.Frontend etc.) we deploy each package into different services - if some other service needs the Webshop schema, we deploy Webshop.Schema into it.  This complicates our deployment model but means we can upgrade only the backend part of our system without upgrading the frontend part thereby minimizing blast radius for errors.
When naming Autonomous Components, for the backend, we name them following the message that they're responding to (e.g. CustomerRegistered).  For the frontend, we following the use case (e.g. CustomerSignup).
Queues are usually named the same way as Autonomous Components.
By monitoring queue-based systems, we get much more visibility into what is actually happening and flowing through the whole system.  Our error queues help to notify admins for errors.
We can identify bottlenecks by monitoring the number of messages in a queue and the throughput of those messages (messages processed per second).  This allows us to calculate the average time to process a single message.  By doing this continually over time, we can see where in the system we get the greatest load and where we might need to increase our ability to process messages more quickly.
In order to scale our systems, we can use the "competing consumers" pattern.  We deploy the same Autonomous Component to multiple servers allowing many copies to pull from the same queue.
Competing Consumers can also be implemented in a "distributor" manner, which is where each consumer has their own local queue which can be filled from the main message queue.
With a queue-based environment, as opposed to a request/response microservices based one, we can scale individual system components separately and they can be of differing capabilities in terms of power and throughput.  With HTTP microservices, where the performance of one microservice is dependent upon the performance of another microservice, individual components need to be scaled together.
If we're using multiple database platforms, DBA's may find it difficult to backup everything easily. This is best handled by using a SAN (Storage Area Network) which allows for SAN Snapshots which give a fully consistent system-wide backup as at a point in time.
When there's pushback from DBA's / admins when they ask "Why can't you just use SQL Server for everything?" it's usually driven by concerns over backup and restore capability, so moving to a SAN based environment can solve that.
Versioning strategies within a service and queue based environment are simpler.  We reduce both physical and temporal coupling meaning individual components can be swapped out and deployed independently without publisher/consumers having to stop publishing/consuming messages.
We need to ensure the next version is backwards compatible with the previous version, however, services allow us to greatly reduce the surface area of code that is required to be backwards compatible.
Ensure version upgrades are performed "back to front".  Start with DB changes, then server version, then message contract and finally client.

Long Running Processes


What is a process?  It's a set of activities performed in a certain sequence as a result of internal and external triggers.  Normal processes have triggers that occur fairly quickly to move from one activity to another.  Long-running processes have an indeterministic amount of time for when the trigger can be received to move from one activity to the next.  When triggers are received, they're handled by the same process - therefore the process is stateful.
Long-running processes can be handled by the Saga pattern.  They're message-driven and are just like a standard message handler with the exception that the Saga has it's own state.
The Saga process should only ever update it's own "process state" never update the "master data" when in the middle of running the process.  Master data is only updated at the end of the Saga process or, if it must update master data before the end of the process, it only ever does so by sending out it's own events to be handled by something else.  Sagas shouldn't really read master data either. They should be entirely self-contained, transactional processes that do not rely on any data outside of them.
Sagas need to have their own Ids and these need to be added to messages that are sent/received from the Saga so that messages can be sent to the correct Saga instance (whose state may have been persisted between activity steps).
Long-running processes can often be required to handle time.  That is, the process might "timeout" if more than 24 hours passes between the receipt of one trigger message and the next.  This is frequently achieved by having the Saga process send a message to itself, leveraging the functionality often built into the underlying queuing platforms to defer the sending of a message until some later time.
Long-running processes can be "long running" for a few minutes, a few days or even a few years.
Orchestration is not a service in itself.  Divide up workflows/orchestrations along service
boundaries.  Events are published at the end of the sub-flow in a service and can trigger a sub-flow in other services.
Sagas are a prime candidate for unit testing, especially for time-bound sagas.

Service Layer - Domain Model Interaction


The Domain Model pattern was originally formulated by Martin Fowler in the book "Patterns of Enterprise Application Architecture".
Prior to the Domain Model pattern, we had a more traditional layered architecture, but this often meant spreading the business rules throughout many different layers.
The Domain Model pattern doesn't encapsulate the entirety of the business logic.  The pattern itself says to use the pattern "...if you have complicated and ever-changing rules..." but that "If you have simple not-null checks and a couple of sums to calculate, a Transaction Script is a better bet".
A common mistake is to have classes representing Customer, Product, Order etc. and build links/references saying "Orders have products, Customers have orders etc.". We then try to add domain logic/business rules into them.  This is wrong.  The classes represent the stable structural concerns - things that don't change i.e. Customers will always have orders, orders will always have products so they're not complicated or ever-changing.  The Domain logic is the ever-changing business logic, but if we apply that to these classes, we create a schizophrenic codebase of code that, on the one hand, wants to change but on the other hand wants to stay the same.  Not every bunch of POCOs is a domain model.
We can have many Domain Models and they can be deployed to different tiers of the solution.  Domain Models can even exist in the database layer. This is especially useful for batch imports. If you have 1TB of data and 1MB of domain model, you can either move the data to the domain model (i.e. reading the data in, processing in code and writing back) or you can move the domain model to the data (i.e. put the logic in the database, bulk copy in the data and process in-place).  It'll always be quicker to move the smaller thing closer to the bigger thing!
There's a few options for concurrency models with domain models. Optimistic concurrency - which can be either "first one wins" or "last one wins" - either way someone loses data.  There's pessimistic concurrency, which means everyone has to stand in line and each data update is dealt with one at a time.  However, neither option is good for realistic real-world concerns.
Beware of the false sense of security of wrapping things in a database transaction. By default, transactions operate in READ COMMITTED mode, so if the transaction does two reads, some computation, then an update, the things being read can be updated by other transactions before your transaction ends.
Keep updates as small as possible. Don't read two, three or more objects before making an update.  This is especially true with entity relationships, which expose us to potential inconsistencies (i.e. Looping over order line in an order).
Entities and entity relationships are bad.  The good news is that, if we follow correct service boundaries, entity relationships will largely not exist.
Many apparent race conditions are symptoms of how an underlying problem is solved with current systems.  Use techniques such as the "five whys" which can uncover the underlying problem which can then often be solved in such a way as the race condition goes away.
Understand how temporality can impact potential behaviours and actions in the domain.  Statistics might show that 90% of orders, if cancelled, are done so in the first 30 minutes after placing the order, so we might defer the publishing of the OrderAccepted event to other parts of the system to allow for a potential cancellation.
Note that this temporality means that many of these processes and system behaviours are better suited to being implemented with the Saga pattern.  This leads us to realise that our domain models are actually Sagas.
These kinds of Sagas can be referred to as "Policies". They're not individual business processes as such, but more rules that govern how multiple processes will behave in relation to the policy.
To explain SOA to business executives, we say that we're looking to align the software according to the business capabilities and business policies, rather than aligning with the business processes and business entities.  We're Capability-centric via Services and Policy-centric via Sagas.

Organisational Transition to SOA


If you're in an organisation with legacy, non-SOA based systems, don't do a big-bang rewrite.  You won't have the time to complete it and will, in effect, only ultimately contribute to the big ball of mud as your new code ends up merely integrating with the old, not replacing it.
The only value from the "big rewrite" project is the budgetary conversation that you have with the business.  They need to understand the costs involved of doing something new and different.
To transition to SOA, we need to approach the transition in phases.  We follow the maxim of "slow as possible, fast as necessary".
The political component needs to be addressed.  Every small win needs to celebrated extensively.  This is required as subsequent phases in the SOA transition will incur a bigger amount of tax (additional time/resource that the business gives you), and in order to get that from the business, the political game needs to be played.
Phase 1 is to implement good programming practices, unit testing for all new code, good deployment practices via CI/CD.  Importantly, for every use case touched in code, make sure you publish an event - you're starting to implement code around the edge that will ultimately help to move away from the BBOM (Big Ball Of Mud).  We shouldn't implement huge infrastructure changes - if the company doesn't already use any queue technology, use the database as a queue.
In doing this, it's on top of existing business-as-usual work.  The business is paying a "tax" on top of the existing work.
Phase 1 should be a small amount of tax, say 5%. Phase 1 will take anything from 3 to 9 months.
Phase 2 of the transition involves implementing subscribers to the events.  These should be implemented intelligently - don't just add them for the sake of it, implement subscribers that have the most value impact.
Another activity for Phase 2 might be the implementation of a composite UI, showing data from the old BBOM and the new subscribers on the same screen.
Phase 2 is where the SOA analysis kicks in.  As subscribers start to take on more functionality and responsibility, we perform more analysis around service boundaries to ensure we're heading in the general right direction.
Phase 2 tax is around 10% tax.  Phase 2 should last for anything from 9 months to 18 months.  This is because you need to convince the business to pay for the tax in the next phase.
Phase 3 is about taking functionality out of the BBOM and transferring it to the subscribers previously extracted in Phase 2.  This will take a lot of time and effort as each time it's done, the BBOM needs to be re-stabilised.
As subscribers take more functionality from the BBOM, they'll become message publishers themselves and will start to communicate between themselves.
Phase 3 is where we should start to get significant time and input from domain experts.  We'll need their help to really zone in on the service boundaries and ensure we can get the details of where business data should really reside.
Phase 3 is where we can implement report replacements - the sagas that respond in real time to business events and can email business stakeholders every day with key business intelligence data.  This helps to demonstrate improvements to the total solution.
Phase 3 tax is around 30%.  Phase 3 should last for anything from 1 year to 3 years.
Phase 4 is where you attack the database.  Like taking functionality out of the BBOM into services in Phase 3, you do the same to the database - taking parts of it out of the monolithic database and putting it into it own relevant service.  This is perhaps the most difficult Phase as problems with splitting apart the database could bring the whole system down.
During Phase 4 it is likely time to consider team re-organisation as the current team structure was based around the BBOM structure, but will not be structurally aligned with the new services that have been extracted.  We don't do this in Phase 3 as the database is still a monolith, so we only attempt the team re-org after both the code and the data are split and migrated.
Although you won't get more "tax" in Phase 4, the functionality shrinkage from the BBOM into separate services should mean that the business-as-usual work should be able to be completed more easily and quickly.
Phase 4 tax is still at 30% (the business won't ever give you more than this). Phase 4 will last at least 1 year, likely much longer.
The entire transition process, on the low-end, could take around 3 years for small organisations with smaller existing codebases. More realistically, it's likely to take closer to 5 to 7 years.

Web Services & User Interfaces


Be aware that distributed caches don't store all keys/values on every machine.  Code that e.g. loops over every key in the cache will cause excessive network calls as the cache has to reach out to other nodes to retrieve data the current node doesn't have in memory.
It's a universal truth that there's always be more data to cache than you have memory available to store it.  Beware with cache invalidation.  Your cache implementation may appear to work great for a given load of users, but if the load increases old cache items are invalidated. The first idea to fix this might be to cache more aggressively but this just makes things worse.
The reason to introduce a cache is to help with the high load scenarios, not to help when the load is low, but a lot of caching is set up such that it's less and less effective as the system load increases.  To fix this, ensure you're only caching static data, not per-user/per-request transient data.
Most caches have a metric called "hit rate".  This is the percentage of how many times a value for a key was asked for and the cache had the value.  We need to measure this metric to ensure our cache is actually making things better for us.
Be careful with CDN's (Content Delivery Networks). You need to understand your user base.  The more geographically distributed your userbase, the more CDN's improve the end-user's experience.  Conversely, the less distributed your userbase, the less benefit you'll get from a CDN.
Leverage the ability to cache different parts of a single web page, based upon the volatility of the content.  Even though this might result in a single page making many more requests to the server to render the complete page, it can often improve user experience if many of the page parts can be cached.
Even though it can seem unintuitive, making more HTTP requests to render your page can actually improve performance than making fewer requests.  This is because requesting many smaller pieces can often be served by caches in a faster time than making a single request that may not be able to be served by an intermediary cache.  Most modern browsers will also parallelize many requests for a single page, so overall wall clock time for all requests to complete can still be very fast.
Don't use query strings (?) in URL's as it'll prevent the nodes on the internet from being able to cache resource effectively.  Use a slash and make it part of the URL.  (i.e. Don't do http://weather.myurl.com?location=london but instead do http://weather.myurl.com/location/london).