bloo/msa_updater.md

## msa_updater.md

      
          
Table of Contents  generated with DocToc

Overview

Rules
Service types

worker
gateway
step


Message types

events
tasks


Service base - language and framework

NodeJS Stats
SenecaJS for NodeJS

Relevant plugins


Build, test, deploy - CI/CD with Docker

Feature Acceptance - Branch based previews and E2E testing via Runnable.com
Production Deployment

Docker image update
Zero downtime
Database migration
Cluster infrastructure options:


Cloud Service Providers

Internal MicroService
Cluster


Patterns

Reactive
Authentication
Authorization
REST vs CQRS vs GraphQL
Complex workflows and Long-running business transactions

Saga Pattern
State machines


Phases of implementation


Overview

This is a general plan and preliminary spec for Updater's Microservices
architecture. The overall Microservices system will be responsible for
performing much of the business logic, data manipulation (reconciliation,
normalization, references), and service request fulfillment.
Rules


Isolate everything

failures shouldn't bring down other services (Bulkheading)
use asynchronous messaging between services (events and tasks)


Single Responsibility Principle

Do ONE THING, and DO IT WELL
business domains and responsibilities are not entangled
LOC stays small; testing is easier
maintenance and frequent continuous deployments are easier


Own your state EXCLUSIVELY

maintain your own persistent data stores, caches, etc
share anything via message bus events


Location Transparency

services can be recycled, relocated, fail, upgrade or scale
should be addressable through a virtual network - usually a cluster DNS entry
addressable through single unit no matter how many instances or their locations


Service types

We'll employ 3 main types of services, differentiated by how they're
executed and how they communicate with other services.
All service types emit events and tasks, and manage their
own state.
worker

Consumes asynchronous events and tasks via message bus.
This is the general, long-running worker, communicating solely by asynchronous
messaging via the event bus and queues, processing a single function, business
or otherwise.
gateway

Responds to upstream synchronous requests for entity data via RESTful, CQRS or
GraphQL endpoints.
Gateways serve synchronous requests from upstream systems such as the core API
or publicly accessible clients. They sit on the edge of the Microservices
ecosystem, and require an Authorization header to identify the calling
user.
step

Finite running jobs, launched by a state machine or workflow engine.
These are single tasks that are grouped into collective systems that complete
Complex Workflows. Such a workflow may be modeled as a state machine, which may
in turn adhere to the Saga Pattern. Steps are launched when needed, perform
their function, and exit.
Message types

events

Events are published by any service type and follow the naming convention
<entity>.<verb>, where the verb is past tense. These represent "things that
have happened" and are subscribed to by any interested service. Examples:

user.authed
mover.created
address.received
address.reconciled

Any service can subscribe to message topics by their names and process the
payload of these events.
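
As a rough illustration of the naming convention above, the sketch below builds an event envelope and rejects names that do not have the <entity>.<verb> shape. The makeEvent helper, the envelope fields, and the addressId payload are hypothetical and not part of this spec; the check covers only the shape of the name and cannot tell whether the verb is past tense.

```javascript
// Hypothetical helper: validates only the <entity>.<verb> shape.
const EVENT_NAME = /^[a-z][a-z0-9-]*\.[a-z][a-z0-9-]*$/;

function makeEvent(name, payload) {
  if (!EVENT_NAME.test(name)) {
    throw new Error(`invalid event name: ${name}`);
  }
  const [entity, verb] = name.split('.');
  // Envelope fields are assumptions made for this sketch.
  return { name, entity, verb, payload, emittedAt: new Date().toISOString() };
}

const event = makeEvent('address.reconciled', { addressId: 'a-123' });
console.log(event.entity, event.verb);
```

Rejecting malformed names at publish time keeps topic names predictable for subscribers.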
tasks

Tasks are published by any service type and follow the naming convention
<service>.<entity>.<request>. These represent "things that need to be done"
and by which service. This is an asynchronous way of requesting work. Examples:

auth.user.authorize
movers.mover.create
addressing.address.get
addressing.address.reconcile
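
Because the task name carries the target service, a publisher can derive where to send the task from the name alone. The following is a minimal sketch of that idea; parseTaskName, queueFor, and the "tasks.<service>" queue naming are assumptions made for illustration, not decisions recorded in this spec.

```javascript
// Splits a task name into its <service>.<entity>.<request> parts.
function parseTaskName(name) {
  const parts = name.split('.');
  if (parts.length !== 3 || parts.some((part) => part.length === 0)) {
    throw new Error(`expected <service>.<entity>.<request>, got: ${name}`);
  }
  const [service, entity, request] = parts;
  return { service, entity, request };
}

// Hypothetical convention: one task queue per owning service.
function queueFor(taskName) {
  return `tasks.${parseTaskName(taskName).service}`;
}

console.log(queueFor('addressing.address.reconcile'));
```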

Service base - language and framework

NodeJS Stats


excellent tooling, profiling and debugging
npm hosts > 100k packages
most devs already know JS
concurrency via callbacks can lead to "callback hell", but promises and async generators alleviate it
mature; JS evolves slowly due to older standard
inexplicit error handling (throw/catch or vague callbacks)
performance on the rise, but dynamism of runtime can cause hindrances
many frameworks available

SenecaJS for NodeJS

SenecaJS is an application framework that separates
business logic into separate, composable blocks that communicate with each other
(async or sync) regardless of the communication mechanism between them.
Behind the scenes plumbing for transport (MQ or RPC), auth, etc are abstracted
away from the functional building blocks of a Seneca app via plugins.
Relevant plugins


Web - map web routes to actions
Message - map message events to actions
User - map JWT tokens to identity context
AWS Lambda - invoke Lambda for actions

Build, test, deploy - CI/CD with Docker


What would it take for someone to spin up a new service?

Feature Acceptance - Branch based previews and E2E testing via Runnable.com

This can be done right on a CI build server with multiple docker-compose.yml
files, extending a common one and building against dependent services using
environment variables based on local vs CI server vs production clusters, but
Runnable.com looks like a less fragile way of doing the
same while eliminating the CI server.
In-house services shouldn't have a direct dependency on any other running
service, only events from the pub/sub event bus, simplifying the use case where
docker-compose is leveraged in lieu of Runnable.com.

install Docker toolset, clone Git repos
create new Git branch (automatically launching a Preview Environment)
spin up local dev server via docker-compose, that:


attaches the service to a local event bus, database server, and other services
(e.g. SMTP) needed to build/test on the local machine
local-specific config envs should vary minimally or not at all if kept
within the confines of the docker-compose environment


push changes to Git, which automatically:


run unit tests
update/deploy to Preview Environment
notify Slack, Jira, GitHub PR, etc
run functional and end-to-end tests

Runnable.com Pros

Runnable aims to let you run all your End-to-End tests continuously
cross-team changes can be validated by connecting your environment to other
services in co-development
support is quick to respond

Runnable.com Caveats

Runnable.com documentation isn't complete
using docker-compose.yml for your Preview Environments requires support assistance (for now)
supports RabbitMQ only (unless you pay for Enterprise support and install it on AWS)

Production Deployment


How would it operate in terms of deployment?

Docker image update


code review on PR (enforceable via new GitHub PR rules or team convention)
merge to master, which automatically:


run unit tests
update/deploy to Preview Environment
notify Slack, JIRA, GitHub PR, etc
run functional and end-to-end tests


trigger deployment script from Docker image repository notification


upon new image, built by master deploy, webhook to deployment script
Docker stacks and deploys are managed via Docker Cloud + Docker BYOH

Zero downtime

For zero downtime Docker container updates, you need:

at least 3 running container instances per image
at least 5 compute nodes to spread all containers across

Database migration

Tools like https://flywaydb.org/ make database migrations easier for developers

migrate at service startup, fast fail on schema inconsistencies
able to drop and rebuild entire database in test environments
able to drop everything but the schemas for clean start during testing
cluster safe - locking migrations
does NOT support Drops for rollbacks by design

https://flywaydb.org/documentation/faq.html#downgrade
schema shouldn't apply destructive changes alongside code that depends on them
use snapshots if rollbacks are needed


Cluster infrastructure options:


Docker 1.13 Swarm
Kubernetes - I've had issues w/ inter-container discovery via cluster DNS
AWS Elastic Container - pricier

Cloud Service Providers

Internal MicroService

Dependencies that each of our services require but we don't want to maintain
can be outsourced to relevant cloud service providers. These include:

RDBMS (Heroku Postgres or Google Spanner)
REDIS K/V (Redis Cloud or AWS ElastiCache)
SMTP (SendGrid or Mandrill by Mailchimp)

Note: Assuming we host our cluster environment on AWS, it may be beneficial
to use as many compatible services from the AWS ecosystem for performance and
cost reasons.
These types of dependent services can be spun up by docker-compose or
configured within Runnable.com and seeded for local development and Feature
Acceptance testing. Publicly available Docker images that also work for test:

Official Redis image
Official Postgres image
SMTP "Mailcatcher" image

Cluster

Cloud service providers the cluster itself requires:


Event Message Bus and Queue - Amazon SNS + SQS or RabbitMQ
RabbitMQ is AMQP compliant and would not require updating each service's queue
subscription mechanism if we moved cloud providers.


Cluster Logging (Elastic Cloud or Logit.io)
Elastic is more mature.


Metrics and Monitoring (Prometheus.io and Grafana - hosted version coming soon)
Docker 1.13 can expose a Prometheus-format metrics endpoint, which Prometheus scrapes (pull-based metrics).


Secrets and Config Management (Hashicorp Vault)
Already being implemented at Updater.


Patterns

Reactive

Having every component in an ecosystem adhere to reactive patterns lends all
moving parts to decoupling, failure isolation, scalability and overall
resiliency.
By communicating between services through topics and queues, we isolate services
from each other's failures. If a consuming service goes down, the queuing
system will keep a replayable backlog.
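
The replayable backlog can be pictured with a small in-memory model. This is only a sketch of the idea, not how SNS + SQS or RabbitMQ work internally; the Topic class, the consumer id, and the payload fields are hypothetical.

```javascript
// In-memory stand-in for a topic that remembers each consumer's position.
class Topic {
  constructor() {
    this.log = [];            // append-only backlog of published messages
    this.offsets = new Map(); // consumer id -> index of next unread message
  }

  publish(message) {
    this.log.push(message);
  }

  // Returns every message this consumer has not seen yet, then advances it.
  poll(consumerId) {
    const from = this.offsets.get(consumerId) ?? 0;
    const pending = this.log.slice(from);
    this.offsets.set(consumerId, this.log.length);
    return pending;
  }
}

const addressReceived = new Topic();
addressReceived.publish({ addressId: 'a-1' });
const firstBatch = addressReceived.poll('reconciler'); // consumer is up

// The consumer goes down here; publishers are unaffected and keep publishing.
addressReceived.publish({ addressId: 'a-2' });
addressReceived.publish({ addressId: 'a-3' });

// The consumer restarts and picks up exactly what it missed: a-2 and a-3.
const backlog = addressReceived.poll('reconciler');
console.log(backlog.map((message) => message.addressId));
```

The publisher never learns that the consumer was down, which is the failure isolation the paragraph above describes.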
Authentication

An Authentication Service is a gateway service, employed to handle user
authentication events. Given credentials received by web or mobile clients, this
service handles authentication and emits the user.authed event with JWT
payload.
Tasks to be performed or events to be handled are to be done on behalf of an
identity, authenticated in the case of a non-system identity:

system for maintenance jobs, etc
user-mover for movers who are requesting to perform moving services provided
user-client for companies operating through dashboards that provide mover data
user-business for major companies operating through dashboards that provide mover services

If any service performs a request on behalf of an external entity, a JSON Web
Token (JWT) will be generated by the Authentication
Service and emitted to the Microservices cluster as a user.authed event
with the JWT as part of the payload.
This token can be used by any service to:

cache currently authenticated users in memory
embed in HTTP headers when accessing external services
embed in message headers when emitting subsequent events performed on
behalf of said user in session
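
The three uses listed above might look roughly like this inside a consuming service. Everything here is a sketch under assumptions: the payload.token field, the role claim, the lowercase authorization message header, and the helper names are hypothetical, and the demo token is unsigned. The decoder reads claims without verifying the signature, which a real service must not skip.

```javascript
// WARNING: decodes claims WITHOUT verifying the signature (sketch only).
function decodeClaims(token) {
  const base64 = token.split('.')[1].replace(/-/g, '+').replace(/_/g, '/');
  return JSON.parse(Buffer.from(base64, 'base64').toString('utf8'));
}

// 1. Cache currently authenticated users in memory, keyed by the "sub" claim.
const authedUsers = new Map();
function onUserAuthed(event) {
  const token = event.payload.token; // hypothetical payload field
  authedUsers.set(decodeClaims(token).sub, token);
}

// 2. Embed the token in HTTP headers when calling external services.
function httpHeadersFor(userId) {
  return { Authorization: `Bearer ${authedUsers.get(userId)}` };
}

// 3. Embed the token in message headers of subsequent events.
function eventOnBehalfOf(userId, name, payload) {
  return { name, headers: { authorization: `Bearer ${authedUsers.get(userId)}` }, payload };
}

// Demo: build an unsigned token so the sketch runs without any library.
const encodePart = (value) => Buffer.from(JSON.stringify(value)).toString('base64')
  .replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
const demoToken = `${encodePart({ alg: 'none' })}.${encodePart({ sub: 'user-42', role: 'user-mover' })}.`;

onUserAuthed({ name: 'user.authed', payload: { token: demoToken } });
const followUp = eventOnBehalfOf('user-42', 'address.received', { addressId: 'a-1' });
console.log(followUp.headers.authorization === httpHeadersFor('user-42').Authorization);
```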

Authorization

Mapping of global roles (as provided by the user.authed event) to authz roles
specific to the service handling the event.
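
One way to picture this mapping is a per-service lookup table. The sketch below assumes the identity types listed under Authentication double as global roles and invents permission names for a hypothetical addressing service; none of these names come from the spec.

```javascript
// Hypothetical role map owned by an "addressing" service.
const ROLE_MAP = new Map([
  ['system', ['address:read', 'address:write', 'address:reconcile']],
  ['user-mover', ['address:read', 'address:write']],
  ['user-client', ['address:read']],
  ['user-business', ['address:read']],
]);

// Unions the service-specific permissions granted by each global role.
// Unknown roles grant nothing.
function permissionsFor(globalRoles) {
  const granted = new Set();
  for (const role of globalRoles) {
    for (const permission of ROLE_MAP.get(role) ?? []) {
      granted.add(permission);
    }
  }
  return granted;
}

function isAllowed(globalRoles, permission) {
  return permissionsFor(globalRoles).has(permission);
}

console.log(isAllowed(['user-mover'], 'address:write'));
console.log(isAllowed(['user-client'], 'address:write'));
```

Keeping the table inside the service means other services never need to know its permission vocabulary.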
REST vs CQRS vs GraphQL

All 3 patterns are implemented as synchronous services for upstream systems
and clients.
GraphQL and REST provide CRUD operations on resource
entities, while CQRS is a pattern to request an action on a resource and
receive the result of the action's operation in the response payload.
Complex workflows and Long-running business transactions

Saga Pattern

For long running business transactions (hours, days) that pass through multiple
services, we may have a business need to handle them like a state machine
with accompanying rules to rollback or fail all of them. Each change at each
service will have to maintain reversal rules, and a broadcast message upon
business transaction success or failure will need to be designed to trigger
those rules upon failure.
https://medium.com/@roman01la/confusion-about-saga-pattern-bbaac56e622
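
The reversal-rule idea can be sketched in a few lines. This is an in-process model under assumptions, not a design for the real message flow: the SagaParticipant class, the transaction id, and the outcome message shape are hypothetical, and a plain loop stands in for the broadcast over the message bus.

```javascript
// Each service records how to undo its own change for a given transaction.
class SagaParticipant {
  constructor(name) {
    this.name = name;
    this.reversals = new Map(); // txId -> stack of undo functions
  }

  // Applies a local change and stores its reversal rule.
  apply(txId, change, undo) {
    change();
    const stack = this.reversals.get(txId) ?? [];
    stack.push(undo);
    this.reversals.set(txId, stack);
  }

  // Handler for the broadcast success/failure message.
  onOutcome({ txId, status }) {
    const stack = this.reversals.get(txId) ?? [];
    if (status === 'failed') {
      while (stack.length > 0) {
        const undo = stack.pop(); // newest change is reversed first
        undo();
      }
    }
    this.reversals.delete(txId); // nothing left to track either way
  }
}

const state = { moverCreated: false, addressReconciled: false };
const movers = new SagaParticipant('movers');
const addressing = new SagaParticipant('addressing');

movers.apply('tx-1',
  () => { state.moverCreated = true; },
  () => { state.moverCreated = false; });
addressing.apply('tx-1',
  () => { state.addressReconciled = true; },
  () => { state.addressReconciled = false; });

// Stand-in for broadcasting the business transaction's failure.
const outcome = { txId: 'tx-1', status: 'failed' };
for (const participant of [movers, addressing]) {
  participant.onOutcome(outcome);
}
console.log(state);
```

On a success outcome the stored reversal rules are simply discarded, so each service only pays the cost of undoing when the transaction fails.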
State machines

The state machine mechanics can be managed by frameworks like:

Netflix Conductor or
AWS Step Functions - visual workflow

and dispatched on one-time-use compute nodes such as:

AWS Lambda
IronWorker (Iron.io)
or internal cluster managed tasks as dispatched by Swarm or Kubernetes
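
To make the split between state machine mechanics and step execution concrete, here is a minimal sketch. The definition format is invented for illustration and is not the Netflix Conductor or AWS Step Functions format; runWorkflow, the state names, and the handler results are hypothetical, and a synchronous function call stands in for launching a step on Lambda or a cluster.

```javascript
// Hypothetical workflow definition: each state names the task its step runs.
const workflow = {
  start: 'fetch',
  states: {
    fetch: { task: 'addressing.address.get', next: 'reconcile' },
    reconcile: { task: 'addressing.address.reconcile', next: null },
  },
};

// Walks the states in order, handing each task to the dispatcher.
function runWorkflow(definition, dispatch, input) {
  let stateName = definition.start;
  let data = input;
  const visited = [];
  while (stateName !== null) {
    const state = definition.states[stateName];
    if (!state) throw new Error(`unknown state: ${stateName}`);
    data = dispatch(state.task, data); // the step runs, returns, and exits
    visited.push(stateName);
    stateName = state.next;
  }
  return { visited, data };
}

// Stand-in step implementations with made-up results.
const handlers = {
  'addressing.address.get': (data) => ({ ...data, address: '1 Main St' }),
  'addressing.address.reconcile': (data) => ({ ...data, reconciled: true }),
};

const result = runWorkflow(workflow, (task, data) => handlers[task](data), { moverId: 'm-1' });
console.log(result.visited, result.data);
```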

Phases of implementation


automated testing, deploying, rolling image updates
set up configuration environment management (vault or otherwise)
cluster monitoring, logging, metrics, alerts with dashboard(s)
design message bus event and tasks specs, health check and reporting checkpoints
set up pub/sub event bus
docker base images for service types, triggered jobs


coding convention, choose linting libraries to enforce
test coverage requirements, set up code coverage plugins accordingly
decide on testing frameworks for unit, functional, and e2e
decide on persistence, caching, HTTP endpoints for reporting, management


build services
optimize for HA, performance