In Summary:
⁃ Microservices are small autonomous services
⁃ Microservices are modeled around business concepts
⁃ Microservices encourage a culture of automation
⁃ Microservices should be highly observable
⁃ Microservices should hide implementation details
⁃ Microservices should isolate failure
⁃ Microservices should be deployed independently
⁃ Microservices should decentralise all the things
The long list...
⁃ Cohesion: group related code together
⁃ Gather together things that change for the same reason
⁃ Separate those things that change for different reasons
⁃ If behaviour is spread across services, then change in behaviour requires deploying updates to multiple services
⁃ Focus service boundaries where we can ensure related behaviour is located in one place
⁃ Microservices make it obvious where code lives for a given behaviour
⁃ Thus avoiding the problem of a service growing too large
⁃ Avoid structuring services around technical concepts, aim for business bounded contexts
⁃ Routing is a business requirement (I want to direct users to somewhere)
⁃ Page Composition is a business requirement (I want to put a page together for the user)
⁃ Source of data is a business requirement (I want a place where I can manage my config/templates)
⁃ Each microservice should be hosted on its own machine (don't pack services together in order to save cost)
⁃ Multiple micro services on one host means a failure of one impacts the other
⁃ This also means you're now unable to scale appropriately for the demands of any one microservice
⁃ Ensure services are evenly distributed across different regions and availability zones to improve resiliency
⁃ Utilise load balancers to distribute incoming traffic (and to handle SSL termination, provided services sit within a VPC)
⁃ Services need to change independently of each other
⁃ Services need to be loosely coupled (e.g. changed & deployed by themselves without requiring consumers to change)
⁃ Services should have a clear contract/interface
⁃ Services should aim to be stateless, with idempotent operations, as this reduces complexity and makes scaling much easier
⁃ Otherwise consuming services can become coupled to an internal representation
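The idempotency point above can be sketched with a hypothetical write handler that dedupes on a client-supplied idempotency key, so retried requests don't apply a write twice (the in-memory store and field names here are illustrative assumptions, not a prescribed design):

```python
# Sketch: an idempotent write handler keyed on a client-supplied
# "idempotency key". A retried request with the same key returns the
# original result instead of applying the write a second time.

class PaymentService:
    def __init__(self):
        self._results = {}   # idempotency_key -> previously returned result
        self._balance = 0

    def apply_charge(self, idempotency_key, amount):
        # A retry of an already-processed request is a no-op.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self._balance += amount
        result = {"balance": self._balance}
        self._results[idempotency_key] = result
        return result
```

A consumer (or an intermediary that retries on timeout) can then safely replay a request without double-charging.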
⁃ Choose technology agnostic APIs (e.g. REST over HTTP)
⁃ This means avoiding integration technology that dictates what technology stacks we can use to implement our microservice
⁃ Microservices allow choosing the right tool for the job
⁃ Microservices facilitate handling single points of failure (SPOF) by offering a gracefully degraded service when part of the system fails
⁃ Microservices allow us to align architecture with the organisation (focus on team ownership)
⁃ Microservices facilitate easy rewriting of services due to their small size and well-defined boundaries
⁃ Avoid shared libraries as they can restrict your ability to deploy easily/quickly
⁃ Don't let shared code leak outside your service boundary (otherwise this introduces a form of coupling)
⁃ You also lose technology heterogeneity with libraries (consumer needs to be the same language; e.g. Alephant)
⁃ Define good 'principles', followed by good 'practices' that support/guide those principles
⁃ Different teams with different technical 'practices' can then share a common 'principle'
⁃ It is essential that we can see a coherent, cross-service view of our system's health
⁃ This has to be system-wide, not service-specific
⁃ Inspecting service-specific health is useful only when diagnosing a wider problem
⁃ All services should have a consistent mechanism for emitting health indicators/metrics, as well as logging
⁃ Down/Upstream services should shield themselves accordingly from other unhealthy services
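The "consistent mechanism" bullet can be sketched as a shared health-payload shape every service emits, so an aggregator can reason about all of them uniformly (the field names and status values here are assumptions for illustration, not an established schema):

```python
import json
import time

def health_payload(service, checks):
    """Build a health document in one consistent shape.
    `checks` maps dependency name -> bool (is it healthy?)."""
    status = "healthy" if all(checks.values()) else "degraded"
    return {
        "service": service,
        "status": status,
        "checks": {name: ("up" if ok else "down") for name, ok in checks.items()},
        "timestamp": int(time.time()),
    }

# Each service would serve this from a well-known endpoint, e.g. /health
print(json.dumps(health_payload("page-composer", {"cache": True, "renderer": False})))
```

Because every service emits the same shape, a system-wide dashboard only needs one parser, which supports the cross-service view described above.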
⁃ Provide templates (generators; e.g. CloudKit) that allow developers to follow best practices/architectural guidelines easily
⁃ The team who creates the templates shouldn't be gatekeepers, they should be open to accepting suggestions/changes
⁃ Avoid a centralised framework that does too much and affects developer productivity (rather than improve it)
⁃ Microservices allow greater ownership from multiple sources
⁃ Boundaries in code (e.g. think object-orientation) can result in becoming candidates for their own microservices
⁃ Services can be nested (in an abstraction sense) behind an encompassing service, though this can depend on organisational structure
⁃ Good integration means simplicity. RPC may be good for performance but tightly couples our services with too much context
⁃ RPC exposes too much internal representation detail and should be avoided unless performance is absolutely critical
⁃ Always have interfaces/APIs in front of a data store (e.g. change from relational to nosql should not affect consumers)
⁃ Asynchronous communication is harder to co-ordinate but offers greater loose coupling (as opposed to sync request/response)
⁃ RPC sometimes causes problems when devs aren't aware calls are 'remote' as opposed to 'local' (affecting overall performance)
⁃ RPC typically isn't versioned and so you could implement a breaking change that requires 'lock-step releases' (i.e. coupling)
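The "interface in front of a data store" bullet can be sketched as a small repository abstraction: consumers depend only on the contract, so swapping relational storage for NoSQL doesn't touch them (the names here are hypothetical, chosen to echo the templates example above):

```python
from abc import ABC, abstractmethod

class TemplateStore(ABC):
    """The contract consumers depend on; nothing about the
    underlying storage technology leaks through it."""
    @abstractmethod
    def get(self, name): ...
    @abstractmethod
    def put(self, name, body): ...

class InMemoryTemplateStore(TemplateStore):
    """One interchangeable implementation of the contract."""
    def __init__(self):
        self._data = {}
    def get(self, name):
        return self._data.get(name)
    def put(self, name, body):
        self._data[name] = body

# Swapping InMemoryTemplateStore for, say, a NoSQL-backed
# implementation changes nothing for consumers of TemplateStore.
```

The same idea applies one level up: an HTTP API in front of the service plays the role of `TemplateStore` for external consumers.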
⁃ Collection and central aggregation of as much 'data' (e.g. logs/metrics) as we can get
⁃ We do this with logs going into Sumo Logic (I wish for something better than Sumo though)
⁃ We also do this with metrics going into CloudWatch and then out into Grafana (we can do better though)
⁃ Aim for consistency in the format for metrics and logs to enable easy filtering via an aggregation service
⁃ This is made easier via standardised tools (shared custom logging abstractions; e.g. Alephant Logger)
⁃ Being able to generate services with tools pre-baked in is useful, but you have to be careful about centralised authority stagnating progress
⁃ But we're still not doing this properly as far as tracing a call appropriately
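A minimal sketch of the consistent log format idea: emit one JSON object per line with a fixed set of base fields, so an aggregation service (Sumo Logic, in the notes above) can filter uniformly. The field names are assumptions, not Alephant Logger's actual schema:

```python
import json
import sys
import time

def log(level, message, **fields):
    """Emit one JSON object per line so an aggregation service
    can filter on level/service/etc. uniformly across services."""
    record = {"ts": time.time(), "level": level, "message": message}
    record.update(fields)
    print(json.dumps(record, sort_keys=True), file=sys.stdout)
    return record  # returned to make the function easy to test

log("info", "page rendered", service="mozart", duration_ms=42)
```

A shared logging abstraction like this is exactly the kind of "pre-baked" tooling a service template can ship with.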
⁃ Synthetic Monitoring (e.g. a synthetic transaction): a way to automate a fake request and store outcomes into a test bucket for analysis
⁃ Synthetic Monitoring can help identify when a service is unable to communicate with/to another service (but is otherwise healthy)
⁃ Make sure that synthetic testing system doesn't accidentally trigger unwanted 'side-effects' (less of an issue for us just displaying text content)
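The synthetic-monitoring bullets can be sketched as a small runner: execute a fake transaction, record the outcome into a bucket for analysis. Injecting the transaction keeps the sketch free of real network calls and side-effects (per the warning above); the outcome shape is an assumption:

```python
# Sketch: run a fake ("synthetic") transaction and record the
# outcome into a bucket for later analysis. The transaction is
# injected so no real side-effects are triggered here.

def run_synthetic_check(name, transaction, bucket):
    try:
        transaction()
        outcome = {"check": name, "ok": True}
    except Exception as exc:
        outcome = {"check": name, "ok": False, "error": str(exc)}
    bucket.append(outcome)   # a real system would ship this to storage
    return outcome
```

Run on a schedule, a failing check here can reveal that a service is healthy in isolation but unable to reach a dependency.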
⁃ Correlation IDs: a poor man's "distributed tracing" (generate a unique guid and pass it along to all log calls)
⁃ There might be a clever way to expose a session guid to the logger (one suggestion has been via HTTP headers)?
⁃ Remember that the service needs to pass the header over to the next service as well (this is where a form of consistency - contract - is required)
⁃ This may be a poor man's tracing, but it would be supremely useful in tracking a single request from start to finish
⁃ Especially considering that most people find Zipkin to be a bit heavyweight
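The correlation-ID bullets can be sketched in a few lines: reuse the caller's ID or mint one at the edge, and copy it onto every outbound request and log call. The header name here is an assumption (it's a common convention, not a standard):

```python
import uuid

HEADER = "X-Correlation-Id"   # conventional name; an assumption, not a standard

def inbound_correlation_id(headers):
    """Reuse a caller's correlation id, or mint one at the edge."""
    return headers.get(HEADER) or str(uuid.uuid4())

def outbound_headers(headers, correlation_id):
    """Copy the id onto the request made to the next service along."""
    forwarded = dict(headers)
    forwarded[HEADER] = correlation_id
    return forwarded

# Every log call then tags the id, so one request can be traced
# across services, e.g.: log("info", "fetched template", correlation_id=cid)
```

The contract part of the bullet above is the key discipline: every service must forward the header, or the trace breaks at that hop.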
⁃ Circuit Breakers help handle cascading service failures in a more elegant fashion
⁃ An aggregated network health status visibility system (e.g. my Heka hackday from 2015 or 2014) is recommended
⁃ Authentication inside a VPC perimeter can be made more efficient by terminating from the front door and using internal load balancers
⁃ Downside is if an attacker breaches your internal network then you stand no chance of preventing them reading your network traffic without HTTPS
⁃ But I'd argue if your VPC is compromised, you have much bigger issues
⁃ Implement network segregation (e.g. we do this already via VPCs, but have them on a more granular level; Morph & Mozart should be/are)
⁃ Whether the segregation is based on 'team ownership' or 'risk level' is up to your organisation to decide what's more appropriate
⁃ Tightly coupled organisations generally appear to produce tightly coupled software architecture by their natural influence
⁃ Similarly, loosely coupled organisations generally appear to produce very modular and loosely coupled software architecture
⁃ Having multiple teams trying to manage a code base makes it difficult to communicate, coordinate and to reason about the service
⁃ Distributed teams need to identify portions of a service that they can take ownership of and introduce clear service boundaries
⁃ A single team that owns many services will increasingly tend to couple them tightly
⁃ Team ownership of a service means they can do what they like as long as they don't break contracts/interfaces their consumers rely upon
⁃ Unless indicated via a versioning system
⁃ Having 'feature teams' also doesn't work as it means those teams cross over the responsibility boundaries
⁃ Internal 'open-source' (IOS) - let's face it: that's Alephant - can help avoid the need for 'feature teams'
⁃ IOS uses the idea of core custodians but that other teams can help towards pushing a particular service functionality forward and avoid bottlenecking
⁃ Balance the need for complete automation of scaling against the service requirements (e.g. does a basic dashboard need 100% uptime or not?)
⁃ Degrade your service functionality gracefully (as best you can to suit the requirements of your users/consumers)
⁃ Cascading failures are more likely to be caused by 'slow' responding services than failing ones (monitor and react accordingly)
⁃ Put timeouts on all 'out-of-process' calls to try and avoid slow services causing bottlenecks and knock-on effects
⁃ Circuit Breakers help defend your service against upstream services that are having problems
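The timeout and circuit-breaker bullets can be sketched together: after a threshold of consecutive failures the circuit opens and calls fail fast, and after a cool-off one trial call is let through (half-open). This is a minimal sketch of the well-known pattern, not a production implementation; the parameters are illustrative:

```python
import time

class CircuitBreaker:
    """Minimal sketch: after `threshold` consecutive failures the
    circuit opens and calls fail fast; after `reset_after` seconds
    a single trial call is allowed through (half-open)."""
    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0           # any success closes the circuit
        return result
```

In practice `fn` would be an out-of-process call wrapped in a timeout, so slow responses (the more dangerous case, per the bullet above) count as failures too.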
⁃ Plan for failure (e.g. Chaos Monkey).
⁃ Implement 'Bulkheads'. These are sections of your code that can be closed off to prevent sinking your entire application
⁃ Bulkheads are subtly different from Circuit Breakers (the former shuts down aspects of your own service; the latter is for upstream services)
⁃ Bulkheads aren't always logic based (e.g. if a bad thing happens, disable feature X); they are also part of the software design process
⁃ e.g. the use of different connection pools for each upstream service; if one upstream is slow then only that one part of our service shuts down
⁃ Teasing apart functionality into microservices is another form of Bulkhead (failing of one microservice shouldn't affect another)
⁃ Timeouts and Circuit Breakers free up resources when they become constrained
⁃ Bulkheads ensure resources don't become constrained in the first place
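The connection-pool example above can be sketched as a bulkhead that caps concurrent calls per upstream: a slow dependency can only exhaust its own slots, never the whole service (semaphores stand in here for separate connection pools; the limits are illustrative):

```python
import threading

class Bulkhead:
    """Sketch: cap concurrent calls per upstream so one slow
    dependency exhausts only its own slots, not the whole service."""
    def __init__(self, limits):
        # e.g. {"search-api": 10, "template-store": 5}
        self._slots = {name: threading.Semaphore(n) for name, n in limits.items()}

    def call(self, upstream, fn):
        sem = self._slots[upstream]
        if not sem.acquire(blocking=False):
            # Reject immediately rather than queue behind a slow upstream
            raise RuntimeError(f"bulkhead full for {upstream}")
        try:
            return fn()
        finally:
            sem.release()
```

This is the design-time flavour of bulkhead from the bullets above: resources are partitioned up front, so they never become constrained across the board.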
⁃ Avoid designing a system where one service relies on another being up
⁃ e.g. Mozart Composition tries to solve that problem by serving from a page level cache if Morph is unavailable
⁃ This also means that much less coordination is needed between services (we become more loosely coupled)
⁃ Don't be afraid to start again and redesign (the beauty of microservices means a rebuild shouldn't be as costly as for a monolith)
⁃ Identify your business model (reads vs writes) and aim to scale your services and resources appropriately
⁃ Implement caching at as many levels as is appropriate (HTTP, application, CDN etc)
⁃ You can even design your system in such a way that high bursts of 'writes' are cached and then flushed at a later stage ("write-behind cache")
⁃ Cached writes could be as simple as fire off the data to a queue to be processed asynchronously (depending on your business model)
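The write-behind idea above can be sketched as a buffer that absorbs bursts of writes and flushes them to the backing store as one batch later (the batch writer is injected; in practice it might push to a queue or database, which is an assumption here):

```python
from collections import deque

class WriteBehindBuffer:
    """Sketch: absorb bursts of writes in memory and flush them to
    the backing store later, in a single batch."""
    def __init__(self, store_writer):
        self._queue = deque()
        self._write = store_writer   # e.g. sends a batch to a DB or queue

    def write(self, record):
        self._queue.append(record)   # fast path: just enqueue

    def flush(self):
        """Drain the buffer into the store; returns how many were flushed."""
        batch = list(self._queue)
        self._queue.clear()
        if batch:
            self._write(batch)
        return len(batch)
```

`flush()` would typically run on a timer or from a background worker, which is what makes the write path cheap during a burst.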
⁃ Utilise AutoScaling and its variants (reactive, scheduled) more intelligently to suit your business needs
⁃ e.g. scale down services on a scheduled basis overnight if they're only utilised heavily during office hours (lunch time peak for a news org)
⁃ Understand CAP Theorem and what sacrifices (trade-offs) you can make that will best fit your business needs
⁃ Automate documentation wherever possible as this allows it to stay fresh (e.g. on code commit trigger documentation automation update)