@askldjd
Last active December 23, 2017 04:37
Draft of the 100 million dollar blog post

For the past year, the VA Enterprise Cloud (VAEC) team has been coordinating a massive effort to migrate the VA's on-premise applications to the cloud. By taking advantage of cloud technologies, the VA is looking to reduce IT costs and to improve the scalability and reliability of its applications. Through Vets.gov and Caseflow, the Digital Service team at the VA is an early adopter of AWS within the agency. With our experience in AWS GovCloud, the Digital Service SRE team was consulted to review the VA's Initial Cloud Reference Architecture for AWS. During the one-hour meeting, we spotted an opportunity to streamline the architecture and save over $100 million over 10 years.

The Cloud Migration

The VA is a massive organization with over 377,000 employees. Nearly 600 on-premise applications have been built over the last 40 years, and they are in different phases of their software lifecycle. Many of these applications are critical systems that Veterans rely on to receive their benefits and assistance. To handle this massive migration, the cloud architecture needs reliable, high-bandwidth connectivity to the on-premise network. It also needs a flexible and scalable environment that can handle a variety of tenants.

Transit VPC and Direct Connect

To meet the VA's scalability requirement, the VA is deploying the Transit VPC solution with AWS Direct Connect. The Transit VPC solution uses a hub-and-spoke topology to allow a large number of VPCs to share the connection to the VA datacenters. To integrate the on-premise network with the cloud, the VA purchased several 10 Gbit AWS Direct Connect connections, enabling low-cost and reliable private connectivity to the AWS GovCloud region.

Figure: Overview of Transit VPC with Direct Connect

Single and Multi-tenant Environment

With the connection infrastructure out of the way, the next key challenge is to design an enterprise-level multi-tenant environment. The VA's applications are at different stages of their lifecycle. Mature applications in the sustainment phase will be migrated into a multi-tenant environment using the lift-and-shift strategy. These tenants are stable, and resource provisioning will be performed at the enterprise level. Modern applications with a DevOps culture (e.g. Caseflow, Vets.gov) will be migrated into single-tenant environments using the cloud-native strategy. These tenants will have more control over their environment, such as deploying their own CI/CD pipelines, and can take full advantage of the scalability of the cloud.

Figure: Mixture of Single and Multi-Tenant VPCs

Initial Reference Architecture

With these basic requirements defined, the VAEC drafted the Initial Reference Architecture, which provided a blueprint for the VA's entire AWS environment. This architecture captures all the requirements for the mixture of single-tenant and multi-tenant environments.

Figure: Simplified view of the Initial Reference Architecture

As we were briefed on this architecture, one thing that caught our attention was its use of GRE tunnels over VPC peering connections as a layer-3 overlay for VPC interconnectivity and access to Direct Connect.

Beyond the AWS-provided VPC Peering and VPN services, these GRE tunnels provide several extra capabilities:

  • allow scaling beyond the limit of 125 peering connections per VPC
  • allow multicasting of packets
  • allow overlapping IP address space among VPCs

Unfortunately, this design has some significant drawbacks. To manage the GRE tunnels with high availability, each Spoke VPC requires at least two Cisco Cloud Services Router (CSR) instances. With a large number of VPCs, the potential cost for Cisco CSR licenses and AWS EC2 instances alone would be enormous, and this does not factor in the cost of maintenance and upgrades. In addition, all Spoke VPCs are required to peer with Transit VPCs. As this architecture scales, additional Transit VPCs will need to be deployed to overcome the VPC peering limit.

CSR Cost

To estimate the cost of this architecture at steady state, we assume that out of the 600 applications in the cloud, 100 qualify for a single-tenant environment with some degree of DevOps culture. We also assume three environments for traffic isolation: dev, staging, and prod.

Here is a rough estimation of the CSR cost breakdown:

  • Total CSR count = 696

    • multi-tenant: 5 env * 2 CSRs * 5 = 50 CSRs
    • multi-tenant (FISMA High): 2 env * 2 CSRs = 4 CSRs
    • single-tenant: 3 env * 2 CSRs * 100 = 600 CSRs
    • shared services: 3 env * 3 services * 2 CSRs = 18 CSRs
    • transit VPC: 3 env * 2 CSRs * 4 (VPC peering limit) = 24 CSRs
  • Cost per CSR = $14,220

    • EC2 cost: $5,220 (c3.4xlarge, reserved at $0.596 per hour)
    • Cisco license cost: ~$9,000 (1 Gbit/sec, 1 year)
  • Total: $9,897,120 per year in CSR resource cost

In the end, we concluded that at steady state, the annual CSR cost is at minimum $9,897,120. After factoring in the engineering maintenance cost (e.g. software upgrades) and the overhead of license renewals, the total annual cost could easily exceed $10 million. Since this architecture may last well over 10 years, it is fair to say that it is a $100 million architecture.
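The breakdown above is simple arithmetic, so it is easy to re-run with different assumptions. Here is a short Python sketch of the cost model; all counts and unit prices are taken directly from the estimate above:

```python
# Rough steady-state CSR cost model for the initial architecture.
# All counts and unit prices come from the breakdown above.
csr_counts = {
    "multi-tenant":              5 * 2 * 5,    # 5 env * 2 CSRs * 5       = 50
    "multi-tenant (FISMA High)": 2 * 2,        # 2 env * 2 CSRs           = 4
    "single-tenant":             3 * 2 * 100,  # 3 env * 2 CSRs * 100 apps = 600
    "shared services":           3 * 3 * 2,    # 3 env * 3 services * 2 CSRs = 18
    "transit vpc":               3 * 2 * 4,    # 3 env * 2 CSRs * 4 (peering limit) = 24
}

EC2_ANNUAL = 5220       # c3.4xlarge, reserved at ~$0.596/hour
LICENSE_ANNUAL = 9000   # Cisco CSR license, 1 Gbit/sec, 1 year
COST_PER_CSR = EC2_ANNUAL + LICENSE_ANNUAL  # $14,220

total_csrs = sum(csr_counts.values())    # 696
annual_cost = total_csrs * COST_PER_CSR  # $9,897,120

print(f"{total_csrs} CSRs -> ${annual_cost:,} per year")
```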

Revised Architecture

The team took a step back and evaluated the need to support these nice-to-have features. We concluded that we should rely heavily on AWS's VPN and VPC Peering services, since they are free and would cover the requirements of the majority of the applications. If we encounter an application that exceeds the capabilities of AWS's services, we can build out GRE tunnels iteratively for those advanced tenants as the VA cloud architecture matures.

The DSVA team worked together with the VAEC team and significantly reduced the number of CSRs in the environment by offloading VPN endpoints to AWS's VGWs. The revised design also significantly reduces the network complexity because VPC Peering is only required when applications need to integrate with each other. With this design, we estimate that we only need 18 CSRs* at steady state. This translates to a total cost of $255,960 per year, a 97.4% reduction compared to the original architecture.

Figure: Revised architecture that simplifies the networking and minimizes CSR cost

* Each CSR can connect to up to 224 VGWs because of subnet and AWS GovCloud limitations. The details are beyond the scope of this blog post.
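Using the same per-CSR cost as the original estimate, the savings of the revised design work out as follows:

```python
# Compare the revised 18-CSR design against the original 696-CSR design,
# using the same per-CSR cost (EC2 + Cisco license) as the estimate above.
COST_PER_CSR = 5220 + 9000            # $14,220 per CSR per year

initial_annual = 696 * COST_PER_CSR   # $9,897,120
revised_annual = 18 * COST_PER_CSR    # $255,960
reduction = 1 - revised_annual / initial_annual

print(f"revised: ${revised_annual:,}/year, a {reduction:.1%} reduction")
```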

Conclusion

Working with the VAEC team, the Digital Service SRE team reduced the complexity of the VA's initial cloud design and ended up saving $100 million in a one-hour meeting. This new cloud design has already been deployed, and we have begun migrating applications to the cloud environment. This effort will no doubt improve the reliability of applications that Veterans rely on every day, and the Digital Service team is very excited to be part of the VA's IT modernization effort.
