
@archmangler
Last active January 10, 2022 00:53
Migrating Traditional on Premise Monoliths to Cloud Based Containerisation Platforms

Analysis

Clarifying the requirements, the organisation desires an application deployment and operating model with these five key characteristics:

1. Agility:

The development, testing, integration and deployment of new features and improvements to the application (and its infrastructure) is several times to orders of magnitude faster than the current speed of these processes.

2. Zero Downtime:

The deployment and scaling process for both the application and infrastructure results in no service downtime (individual component downtime may be acceptable).

3. Continuous Deployment:

The application and infrastructure code and configuration should always be in a known, validated state to be deployed at any time (no deploy windows needed).

4. Automated Scaling:

The application infrastructure, application and supporting components should scale automatically in response to suitable metrics (e.g. processing load).

5. Observability (delivered through monitoring, logging, reporting and metrics)

The implementation of a solution delivering these characteristics should also meet the following non-functional requirements:

  • a) Application and platform reliability ("stability")
  • b) Application Code Quality ("bug free deliveries")
  • c) Application, Platform and service Security
  • d) Application and Platform Scalability

To re-engineer the application, its infrastructure and the processes surrounding it to have these characteristics, the following changes should be made:

NOTE: These changes can be made in the following order, or in parallel.

Change 1. Code versioning practices should change as follows in order to establish a "guaranteed deployable master branch"

No direct commits to master without quality and vetting gates. All commits to master must pass through:

  • a) A series of automated quality tests for known regressions, bugs and code-quality best practices. Failing these tests blocks creation of a PR.

  • b) Following successful creation of a PR, a peer review process must happen via the SCM's PR reviewing tools (e.g. GitHub pull request reviews)

  • Before these steps, developers should test features on a non-production environment (cloned from production) and include test results in the PR

  • This should be based on a feature branch branched off the current tip of the master branch.

  • Following a merge to the master branch, master should be tagged with a semantic version identifier to mark the version of the code.

  • The new features and bugfixes of the version should be documented somewhere (preferably in a README in the master repo).

  • NOTE: This semantic version tag will be the identifier selected to deploy to production from the master branch.
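The tagging convention above can be sketched with a small helper for computing the next semantic version after a merge (a hypothetical helper, shown purely for illustration):

```python
# Minimal sketch of semantic-version bumping for release tags (hypothetical helper).
def bump_semver(tag: str, change: str) -> str:
    """Return the next tag for a given change type: 'major', 'minor' or 'patch'."""
    major, minor, patch = (int(part) for part in tag.lstrip("v").split("."))
    if change == "major":
        return f"v{major + 1}.0.0"
    if change == "minor":
        return f"v{major}.{minor + 1}.0"
    if change == "patch":
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")

# Example: after merging a feature branch to master, tag the merge commit,
# e.g. `git tag -a v1.5.0 -m "release 1.5.0"`.
print(bump_semver("v1.4.2", "minor"))  # → v1.5.0
```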

Change 2. Implement a platform that natively supports fast deployment, automated scaling, high availability and fast failure recovery.

  • One of the key components of an agile software system is platform and infrastructure services which come with "built in" High Availability, Scalability and Automatability (i.e. an API to access these features). Historically these have been bespoke platforms custom-built by the organisation.
  • These days open-source projects like Kubernetes, Mesosphere DC/OS, Docker Swarm, OpenStack and OpenShift provide PaaS and IaaS systems which deliver these features, so it's not necessary to build one from scratch.
  • Containerisation is supported by most of these platforms and is a key enabler of agile application deployment.

Recommended change:

  • a) Deploy a Kubernetes platform initially on-premise, with the longer-term objective of moving it to the public cloud (one of Azure, AWS, GCP). The open-source Kubernetes project provides a high degree of automation and features for implementing highly available services.
  • b) Partition the platform into Dev, UAT and production areas using Kubernetes Namespaces or Devspaces. Alternatively create separate Kubernetes clusters for prod/non-prod.
  • c) Begin packaging the monolithic application to run as-is in containers (Docker specifically), and use this exercise to test how the application will run on a containerisation platform like Kubernetes. Containerisation will be a key strategy for speeding up application deployment and lifecycle management.
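As a sketch of step (c), running the monolith as-is in a container needs little more than a minimal Dockerfile (the base image, jar path and port below are assumptions for illustration, not details from this design):

```dockerfile
# Hypothetical Dockerfile for running the Java monolith as-is in a container.
# Base image, jar location and port are illustrative assumptions.
FROM eclipse-temurin:17-jre
WORKDIR /opt/app
COPY target/monolith.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]
```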

NOTE:

  • The deployment of this platform should be completely automated via versioned code in a repository (e.g Git).
  • This code repository should be managed exactly as recommended in point 1 above.
  • Automatic scalability based on CPU load or incoming application requests (custom metrics) can be used by Kubernetes to scale the number of application pods up/down.
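The autoscaling rule in the last note can be made concrete: the Kubernetes Horizontal Pod Autoscaler computes the desired replica count as the ceiling of the current replica count scaled by the ratio of the observed metric to its target. A minimal sketch of that calculation:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Kubernetes HPA scaling rule: desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 pods averaging 90% CPU against a 60% target → scale out to 6 pods.
print(desired_replicas(4, 90, 60))  # → 6
```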

Security Notes

  • All access credentials required to access the Kubernetes cluster should be stored in an automated vault solution (e.g. HashiCorp Vault)
  • Application containers running on the platform should access application secrets from the Vault via Kubernetes integration
  • Kubernetes pod security policies should be used to control what applications on the platform can access, and which other pods they can access.
  • Kubernetes Network Security Policies can be used to restrict how traffic is allowed to flow to and from application pods on Kubernetes. This can be used to secure network access to the application pods.
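To illustrate the network policy point above, a minimal NetworkPolicy might look like the following (the pod labels and namespace are hypothetical; this is a sketch, not a prescribed policy):

```yaml
# Illustrative NetworkPolicy: only pods labelled app=frontend may reach
# pods labelled app=orders on TCP 8080. Labels and namespace are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: orders-allow-frontend
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: orders
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```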

Change 3. Implement a highly automated deployment pipeline to enable CI and CD

  • With systems 1 and 2 above in place, we should be ready to implement an automated workflow (pipeline) to deliver code from the well-managed master branch of the application repository (1) to the platform that will host the application (2).
  • This deployment pipeline should be implemented as code, in its own SCM repository, and managed exactly as recommended in 1. above. This is the "Pipeline as Code" concept.
  • Here we require a tool which can automate this process, integrating with the SCM system (e.g. Git) and the target platform (e.g. Kubernetes); examples are Jenkins or GitLab.

Recommended change:

  • Deploy a Jenkins CI/CD server which will provide a large collection of features and tools to implement and integrate automated deployment pipelines.
  • Deploy supporting components of the CI/CD server, e.g. an artifact repository (e.g. JFrog Artifactory) and a secrets vault solution (e.g. HashiCorp Vault)
  • Implement the deployment pipeline as Jenkins Pipeline-code ("Jenkinsfile"): Define and code the stages required to build, test, configure and install the Java application.
  • Configure the pipeline to read parameters from a configuration source (either the code repo, or a CMDB) which will decide which of Kubernetes non-prod or prod the app is deployed to.
  • Include tests in the final stage of the pipeline to health-check the application service once it's deployed and validate it.
  • Security note: Ensure that all secrets required by the pipeline during its operation are sourced or stored in the vault solution integrated with the Jenkins server (the recommendation here is HashiCorp Vault).
  • Integrate the change approval process into the deployment pipeline by triggering a deployment on the Jenkins server to a non-prod environment whenever a PR results in a merge of a new feature branch into the master branch:
  • If the deployment is successful (based on the tests integrated in the pipeline), then update a status dashboard so that the stability of the deployment can be made visible.
  • If the deployment fails, alert engineers to immediately investigate and fix any issues with the code (or infrastructure)
  • This will set up a sustainable cycle between code changes, testing and deployment which will build confidence in the deployable state of the branch.
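The pipeline stages described above could be sketched in a declarative Jenkinsfile along these lines (stage names, the credentials ID and the helper scripts are assumptions for illustration, not a prescribed implementation):

```groovy
// Illustrative declarative Jenkinsfile; credentials ID, scripts and the
// TARGET_ENV parameter are hypothetical.
pipeline {
  agent any
  stages {
    stage('Build') { steps { sh 'mvn -B package' } }
    stage('Test')  { steps { sh 'mvn -B verify' } }
    stage('Deploy') {
      steps {
        withCredentials([string(credentialsId: 'kubeconfig-token', variable: 'KUBE_TOKEN')]) {
          // TARGET_ENV (prod/non-prod) is read from pipeline parameters or a CMDB
          sh './deploy.sh ${TARGET_ENV}'
        }
      }
    }
    stage('Health check') { steps { sh './healthcheck.sh' } }
  }
}
```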

NOTE: Integrate pipeline performance metrics, which would be stored and graphed using tools like Prometheus and Grafana.

NOTE: The above automated pipeline will become important for speeding up the development required to re-engineer the Java app itself.

Security Notes

  • During the deployment pipeline run, all credentials required to access the Kubernetes platform and the code repos, as well as other application components, should be accessed from a vault via Jenkins-Vault integration (e.g. HashiCorp Vault)

Change 4: Decouple the monolithic Java application further to support n+1 scaling in all or most tiers.

  • The key objective here is to decouple the monolithic Java application into smaller components which can be tested and deployed faster.
  • Initially decouple the largest functional components of the application, e.g. separate the 10 workflows into 10 separate "microservices", i.e. Java applications which each deliver just one workflow and communicate with the other workflows using simple APIs.
  • Containerise these decoupled applications, converting each from a program running on a VM or bare metal to a Dockerised service.
  • If possible, redesign/recode these 10 workflow applications to be parallelised, i.e. working in worker pools to handle requests from a queuing service and writing requests to the database via an asynchronous queuing service (e.g. Kafka, Pulsar, RabbitMQ)
  • NOTE: Be sure to instrument the Java application with monitoring libraries which expose all possible application metrics via a simple API and protocol (e.g. JMX). Also add dedicated monitoring components to each workflow application, e.g. by adding a monitoring agent in a side-car container to run alongside the application container and scrape stats from the JMX metrics port.
  • Run the Queuing system as a containerised application (e.g Kafka supports this) and automate the management and operations of the Queuing system using a tool like Ansible or Chef. Ensure the queueing system is monitored for all possible metrics like request latency, request size, failures, queue volume. Ensure these metrics are clearly dashboarded and define thresholds for alerts which drive action by developers or SREs.
  • Monitoring tools like Prometheus (metrics collection), Fluentd (logging), Elasticsearch (log indexing) and Kibana (log search and visualisation) can be used here to dashboard application metrics as well as platform metrics (e.g. Kubernetes integrates with Prometheus)
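The worker-pool pattern described above can be sketched in miniature: a pool of workers drains requests from a shared queue in parallel. Python's in-process queue stands in for Kafka/Pulsar/RabbitMQ here, and the "workflow logic" is a placeholder:

```python
# Sketch of the worker-pool pattern: several workers consume requests from a
# queue in parallel (an in-process stand-in for a real queuing service).
import queue
import threading

def run_worker_pool(requests, n_workers=4):
    q = queue.Queue()
    for r in requests:
        q.put(r)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                item = q.get_nowait()
            except queue.Empty:
                return                      # queue drained, worker exits
            processed = f"handled:{item}"   # stand-in for real workflow logic
            with lock:
                results.append(processed)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return results

print(sorted(run_worker_pool(["a", "b", "c"])))  # → ['handled:a', 'handled:b', 'handled:c']
```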

Implementing Zero Downtime by Decoupling Service Layer from the Application

  • To implement zero downtime deployments, the monolithic application must be decoupled into a multi-node model where a number of instances of the application can be run in parallel.
  • This means if we decoupled the 10 workflows into 10 separate application instances ("microservices"), we must further make sure each microservice can be run as a set of 2 or more instances which process data in parallel, i.e "clusters".
  • The microservice "clusters" should have their services accessed via a lightweight service load-balancer like NGINX/HAPROXY or Consul, so each workflow service is accessed via a single load-balancing endpoint which routes requests to available microservice instances.
  • During deployment, the pipeline will include logic to ensure zero downtime by removing one microservice instance at a time, replacing it with an updated application container (or Kubernetes pod), and updating the configuration of the load-balancer for that microservice (e.g. by service discovery)
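The one-at-a-time replacement logic can be illustrated with a small simulation: at any step at most one instance is out of the load-balancer pool, so serving capacity never drops to zero (a sketch of the invariant, not the actual pipeline code):

```python
# Simulate a rolling update: replace one instance at a time and track the
# minimum number of instances that stayed available at any point.
def rolling_update(instances, new_version):
    min_available = []
    for i in range(len(instances)):
        instances[i] = None                 # take one instance out of the LB pool
        available = [x for x in instances if x is not None]
        min_available.append(len(available))
        instances[i] = new_version          # replace it and re-register with the LB
    return instances, min(min_available)

fleet, floor = rolling_update(["v1", "v1", "v1"], "v2")
print(fleet, floor)  # → ['v2', 'v2', 'v2'] 2
```

With three instances, capacity never falls below two during the update, which is the zero-downtime property the pipeline must preserve.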

Change 5: Once changes 1 - 4 have been completed, migrate the Kubernetes platform to a public cloud provider

Caveat: only if processing latency and data regulatory compliance allow it. Migrate the Kubernetes platform, CI/CD pipeline, repos and supporting tools to a public cloud provider. The benefits of this would be:

  • a) Better scalability (there are limits to private datacenter capacity)
  • b) Better automatability (which provides Agility and deployment speed of infrastructure via code)
  • c) Better security (leverage the cloud provider's vested interest in keeping its platform and customers secure, e.g. DDoS protection)

Security Implementation

Each phase of the implementation described above has a security dimension which must be addressed at design and implementation time. The key security features that have to be implemented are as follows:

  1. Secrets Management: There should be no handling of passwords/secrets by human users to configure applications or platforms. All secret access and generation by the deployment pipeline, Kubernetes platform and integration scripts should be done using a vault solution (e.g. HashiCorp Vault)
  2. Network Perimeter Security: The cloud network environment (VPC in AWS or VNET in Azure) should be highly secure at the perimeter; both outbound and inbound security rules should be explicitly defined (default deny-all for both outbound and inbound)
  3. Platform Networking Security: Adopting the "Zero Trust" principle, the Kubernetes platform should explicitly define which applications can communicate, with what protocols and to what destinations. Here, Kubernetes provides Network Security Policy to define traffic flow within the Kubernetes cluster.
  4. Data Security (Data at Rest Encryption): Data stored on the application platform should be encrypted at rest (in object storage or block storage)
  5. Application and Platform Role-Based Access Control: Restrict which users and service accounts can perform which operations (e.g. Kubernetes RBAC)
  6. Code Security (Secure Code): Include security-focused static analysis in the automated quality gates on the master branch (Change 1)
  7. Image and Artifact Security: Scan container images and artifacts in the artifact repository for known vulnerabilities before deployment

Evaluation

How do the above changes deliver the desired characteristics specified in the list of requirements? We review as follows:

a) Application and platform reliability ("stability")

  • The adoption of "infrastructure as code" and CD pipelines is a major contributor to platform stability. Another key factor is the use of GitOps, i.e. the versioning of all infrastructure code in a repository, in addition to application code.
  • With IaC and automated deployment based on versioned code, we can make repeatable deployments, assess failures, and incorporate the fixes back into the code in such a way that continuous improvement is possible.
  • The adoption of Kubernetes, a platform which natively handles HA, automated deployment, scale-up and scale-down, is a major contributor to platform stability. Instead of developing bespoke distributed systems and engineering the complex algorithms required to guarantee stability in a distributed system, we rely on proven technology in widespread use.

b) Application Code Quality ("bug free deliveries")

  • The quality and vetting gates on the master branch (Change 1) mean that no code reaches production without automated regression and code-quality tests plus peer review.
  • The automated pipeline (Change 3) deploys every merge to a non-prod environment and health-checks it, building continuous confidence in the deployable state of the branch.

c) Application, Platform and service Security

  • Secrets are handled exclusively through a vault solution, removing human handling of credentials across the pipeline and platform.
  • Perimeter rules, Kubernetes network and pod security policies, data-at-rest encryption and RBAC (see Security Implementation) secure the platform at each layer.

d) Application and Platform Scalability

  • Kubernetes scales application pods automatically in response to metrics such as CPU load or request rate, and the decoupled, queue-driven microservices (Change 4) allow each workflow to scale independently.
  • Migration to a public cloud (Change 5) removes private datacenter capacity limits.
