In Summary:
- Microservices are small autonomous services
- Microservices are modeled around business concepts
- Microservices encourage a culture of automation
- Microservices should be highly observable
- Microservices should hide implementation details
- Microservices should isolate failure
- Microservices should be deployed independently
- Microservices should decentralise all the things
The long list...
- Cohesion: group related code together
- Gather together things that change for the same reason
- Separate those things that change for different reasons
- If behaviour is spread across services, then a change in behaviour requires deploying updates to multiple services
- Draw service boundaries so that related behaviour is located in one place
- Microservices make it obvious where code lives for a given behaviour
- Thus avoiding the problem of a service growing too large
- Avoid structuring services around technical concepts, aim for business bounded contexts
- Routing is a business requirement (I want to direct users to somewhere)
- Page Composition is a business requirement (I want to put a page together for the user)
- Source of data is a business requirement (I want a place where I can manage by config/templates)
- Each microservice should be hosted on its own machine (don't pack services together in order to save cost)
- Multiple microservices on one host means a failure of one impacts the others
- This also means you're now unable to scale appropriately for the demands of any one microservice
- Ensure services are evenly distributed across different regions and availability zones to improve resiliency
- Utilise Load Balancers to help balance the incoming traffic (as well as SSL termination; as long as services are within a VPC)
- Services need to change independently of each other
- Services need to be loosely coupled (e.g. changed & deployed by themselves without requiring consumers to change)
- Services should have a clear contract/interface
- Services should try to be stateless, with idempotent operations, as this reduces complexity and makes scaling easier
- Otherwise consuming services can become coupled to an internal representation
- Choose technology agnostic APIs (e.g. REST over HTTP)
- This means avoiding integration technology that dictates what technology stacks we can use to implement our microservice
- Microservices allow choosing the right tool for the job
- Microservices facilitate SPOF handling (offer a gracefully degraded service when part of the system fails)
- Microservices allow us to align architecture with the organisation (focus on team ownership)
- Microservices facilitate easy rewriting of services due to their small size and well defined boundaries
- Avoid shared libraries as they can restrict your ability to deploy easily/quickly
- Don't let shared code leak outside your service boundary (otherwise this introduces a form of coupling)
- You also lose technology heterogeneity with libraries (consumer needs to be the same language; e.g. Alephant)
- Define good 'principles', followed by good 'practices' that support/guide those principles
- Different teams with different technical 'practices' can then share a common 'principle'
- It is essential that we can see a coherent, cross-service view of our system's health
- This has to be system-wide, not service-specific
- Inspecting service-specific health is useful only when diagnosing a wider problem
- All services should have a consistent mechanism for emitting health indicators/metrics as well as logging
- Down/Upstream services should shield themselves accordingly from other unhealthy services
- Provide templates (generators; e.g. CloudKit) that allow developers to follow best practices/architectural guidelines easily
- The team who creates the templates shouldn't be gatekeepers, they should be open to accepting suggestions/changes
- Avoid a centralised framework that does too much and affects developer productivity (rather than improve it)
- Microservices allow greater ownership from multiple sources
- Boundaries in code (e.g. think object-orientation) can become candidates for their own microservices
- Services can be nested (in an abstraction sense) behind an encompassing service, though this can depend on organisational structure
- Good integration means simplicity. RPC may be good for performance but tightly couples our services with too much context
- RPC exposes too much internal representation detail and should be avoided unless performance is absolutely critical
- Always have interfaces/APIs in front of a data store (e.g. change from relational to nosql should not affect consumers)
- Asynchronous communication is harder to co-ordinate but offers looser coupling (as opposed to sync request/response)
- RPC sometimes causes problems when devs aren't aware calls are 'remote' as opposed to 'local' (affecting overall performance)
- RPC typically isn't versioned and so you could implement a breaking change that requires 'lock-step releases' (i.e. coupling)
- Collection and central aggregation of as much 'data' (e.g. logs/metrics) as we can get
- We do this with logs going into Sumo Logic (I wish for something better than Sumo though)
- We also do this with metrics going into CloudWatch and then out into Grafana (we can do better though)
- Aim for a consistent format for Metrics and Logs so they can be easily filtered via an aggregation service
- This is made easier via standardised tools (shared custom logging abstractions; e.g. Alephant Logger)
- Being able to generate services with tools pre-baked in is useful, but you have to be careful about centralised authority stagnating progress
- But we're still not doing this properly as far as tracing a call appropriately
- Synthetic Monitoring (e.g. a synthetic transaction): a way to automate a fake request and store outcomes into a test bucket for analysis
- Synthetic Monitoring can help identify when a service is unable to communicate with/to another service (but is otherwise healthy)
- Make sure the synthetic testing system doesn't accidentally trigger unwanted 'side-effects' (less of an issue for us as we're just displaying text content)
- Correlation IDs: a poor man's "distributed tracing" (generate a unique guid and pass it along to all log calls)
- There might be a clever way to expose a session guid to the logger (one suggestion has been via HTTP headers)?
- Remember that the service needs to pass the header over to the next service as well (this is where a form of consistency - contract - is required)
- This may be a poor man's tracing but it would be supremely useful in tracking a single request from start to finish
- Especially considering that most people find Zipkin to be a bit heavyweight
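
The correlation ID idea above could be sketched roughly as follows, using only the Python standard library. This is a minimal sketch, not how any particular service does it: the header name `X-Correlation-ID` and the helper names (`begin_request`, `outgoing_headers`, `build_logger`) are assumptions made for illustration.

```python
import contextvars
import logging
import uuid

# Hypothetical header name; the notes only say "via HTTP headers".
CORRELATION_HEADER = "X-Correlation-ID"

# Holds the ID for whichever request is currently being handled.
_correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = _correlation_id.get()
        return True


def begin_request(incoming_headers):
    """Reuse the caller's ID if present, otherwise generate a new guid."""
    cid = incoming_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid


def outgoing_headers(extra=None):
    """Headers for the next service, always carrying the correlation ID."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = _correlation_id.get()
    return headers


def build_logger(stream):
    """Logger whose every line is prefixed with the correlation ID."""
    logger = logging.getLogger("service")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(correlation_id)s %(message)s"))
    handler.addFilter(CorrelationFilter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A request handler would call `begin_request(...)` first, log as normal, and use `outgoing_headers(...)` for every call to the next service. Real HTTP header lookups are case-insensitive, which this plain-dict sketch ignores.
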
- Circuit Breakers help handle cascading service failures in a more elegant fashion
- An aggregated network health status visibility system (e.g. my Heka hackday from 2015 or 2014) is recommended
- Authentication inside a VPC perimeter can be made more efficient by terminating SSL at the front door and using internal load balancers
- The downside is that if an attacker breaches your internal network, then without HTTPS you stand no chance of preventing them from reading your network traffic
- But I'd argue if your VPC is compromised, you have much bigger issues
- Implement network segregation (e.g. we do this already via VPCs, but could apply them at a more granular level; Morph & Mozart should be/are)
- Whether the segregation is based on 'team ownership' or 'risk level' is up to your organisation to decide what's more appropriate
- Tightly coupled organisations generally appear to produce tightly coupled software architecture by their natural influence
- Similarly, loosely coupled organisations generally appear to produce very modular and loosely coupled software architecture
- Having multiple teams trying to manage a code base makes it difficult to communicate, coordinate and to reason about the service
- Distributed teams need to identify portions of a service that they can take ownership of and introduce clear service boundaries
- A single team that owns many services is increasingly likely to lean towards tight coupling between them
- Team ownership of a service means they can do what they like as long as they don't break contracts/interfaces their consumers rely upon
- Unless indicated via a versioning system
- Having 'feature teams' also doesn't work as it means those teams cross over the responsibility boundaries
- Internal 'open-source' (IOS) - let's face it: that's Alephant - can help avoid the need for 'feature teams'
- IOS relies on core custodians, while other teams can help push a particular service's functionality forward and so avoid bottlenecks
- Balance the need for complete automation of scaling against the service requirements (e.g. does a basic dashboard need 100% uptime or not?)
- Degrade your service functionality gracefully (as best you can to suit the requirements of your users/consumers)
- Cascading failures are more likely to be caused by 'slow' responding services than failing ones (monitor and react accordingly)
- Put timeouts on all 'out-of-process' calls to try and avoid slow services causing bottlenecks and knock-on effects
- Circuit Breakers help defend your service against upstream services that are having problems
- Plan for failure (e.g. Chaos Monkey).
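
To make the Circuit Breaker points above concrete, here is a minimal sketch under stated assumptions. The class name, the thresholds and the injectable `clock` parameter are all hypothetical choices for illustration; production libraries typically add more states and per-error policies.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of waiting on an unhealthy upstream.
                raise CircuitOpenError("upstream marked unhealthy")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The wrapped function should itself enforce a timeout (for example `urllib.request.urlopen(url, timeout=2)`), so that a slow upstream counts as a failure rather than tying up the caller.
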
- Implement 'Bulkheads'. These are sections of your code that can be closed off to prevent sinking your entire application
- Bulkheads are subtly different from Circuit Breakers (the former shuts down aspects of your own service; the latter is for upstream services)
- Bulkheads aren't always logic based (e.g. if bad thing happens, disable feature X) they are also part of the software design process
- e.g. the use of different connection pools for each upstream service; if one upstream is slow then only that one part of our service shuts down
- Teasing apart functionality into microservices is another form of Bulkhead (failing of one microservice shouldn't affect another)
- Timeouts and Circuit Breakers free up resources when they become constrained
- Bulkheads ensure resources don't become constrained in the first place
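
The "different connection pools for each upstream service" style of Bulkhead can be approximated with one bounded semaphore per upstream. This is only a sketch: the `Bulkhead` class, the slot counts and the upstream names `articles` and `comments` are made up for illustration.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when an upstream's compartment has no free slots."""


class Bulkhead:
    def __init__(self, max_concurrent):
        # Caps how many calls to one upstream may be in flight at once.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately rather than queueing behind a slow upstream.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slots for this upstream")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One compartment per upstream service (names are hypothetical).
bulkheads = {"articles": Bulkhead(2), "comments": Bulkhead(2)}
```

If `articles` becomes slow and its two slots stay occupied, further `articles` calls are rejected straight away, while calls through `bulkheads["comments"]` carry on unaffected.
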
- Avoid designing a system where one service relies on another being up
- e.g. Mozart Composition tries to solve that problem by serving from a page level cache if Morph is unavailable
- This also means that much less coordination is needed between services (we become more loosely coupled)
- Don't be afraid to start again and redesign (the beauty of microservices means a rebuild shouldn't be as costly as for a monolith)
- Identify your business model (reads vs writes) and aim to scale your services and resources appropriately
- Implement caching at as many levels as is appropriate (HTTP, application, CDN etc)
- You can even design your system in such a way that high bursts of 'writes' are cached and then flushed at a later stage ("write-behind cache")
- Cached writes could be as simple as firing the data off to a queue to be processed asynchronously (depending on your business model)
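
A write-behind cache along the lines described above might look like the following in-process sketch. It assumes a dict-like `store` standing in for the slow backing store and a `flush()` that some background worker or scheduler would call; in a real system the queue would more likely be an external service.

```python
import queue


class WriteBehindCache:
    def __init__(self, store):
        self._store = store            # stand-in for the slow backing store
        self._cache = {}               # latest value per key, readable at once
        self._pending = queue.Queue()  # writes waiting to be persisted

    def write(self, key, value):
        """Accept the write immediately; persistence happens later."""
        self._cache[key] = value
        self._pending.put((key, value))

    def read(self, key):
        """Serve from the cache first, falling back to the store."""
        if key in self._cache:
            return self._cache[key]
        return self._store[key]

    def flush(self):
        """Drain pending writes into the store; returns how many were applied."""
        flushed = 0
        while True:
            try:
                key, value = self._pending.get_nowait()
            except queue.Empty:
                return flushed
            self._store[key] = value
            flushed += 1
```

The trade-off is durability: anything still sitting in the queue is lost if the process dies before `flush()` runs, which is why this depends on the business model.
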
- Utilise AutoScaling and its variants (reactive, scheduled) more intelligently to suit your business needs
- e.g. scale down services on a scheduled basis overnight if they're only utilised heavily during office hours (lunch time peak for a news org)
- Understand CAP Theorem and what sacrifices (trade-offs) you can make that will best fit your business needs
- Automate documentation wherever possible as this allows it to stay fresh (e.g. trigger an automated documentation update on each code commit)