@christian-posta
Created September 18, 2019 16:38

Future of microservices:

  • Service mesh is happening in large organizations
  • There are challenges with any new, complicated technology
  • Determine whether you need a service mesh and what challenges to expect up front

Challenges of adopting a service mesh in an enterprise

I have been fortunate to work closely with enterprises adopting service mesh over the past two years, first through my work at Red Hat and now at Solo.io, a startup focused entirely on successful service-mesh adoption. I have seen the progression from "I've never heard of it" to "wow, that's cool" to now "yeah, we're [going to be] doing that". Within the past year, as folks at major enterprises have begun putting rubber to the road, I've been at the forefront of the challenges that have cropped up, some expected, some not, and of how those organizations have chosen to approach solutions. Adopting a service mesh tends to coincide with adopting and operating microservices, so the challenges are multiple and inter-related.

You may have heard the term "service mesh" and probably seen some definitions. I won't spend too much of this article rehashing those, as I've been helping to define the term and the architecture for over two years. A quick, succinct definition goes something like this:

A service mesh is decentralized application infrastructure, typically implemented with sidecar proxies, that solves difficult service-to-service communication challenges such as request-level routing, resilience (timeouts, retries, circuit breaking), telemetry collection, and security, regardless of the language or framework used to implement each service.

You may use a service mesh for large deployments of heterogeneously implemented services (i.e., many different frameworks, languages, etc.) running across heterogeneous infrastructure (containers, VMs, clouds, etc.). Technologies such as Linkerd, Istio, and Consul Connect fit this definition. Before we examine the challenges of adopting one, however, let's see whether you need a service mesh at all. Sometimes it's best to avoid challenges by not taking them on in the first place.

Do you need a service mesh?

Moving toward a microservices architecture is already a complex endeavour. Increasing the number of moving pieces in the application architecture (the services themselves), building CI/CD pipelines, testing across services, and introducing new infrastructure and application patterns all add to the complexity of implementing microservices successfully. A service mesh is yet another layer of complexity in this world, and you should always ask yourself: do I really need this?

Start with an answer of "no". If you're just getting started with microservices and have only a handful of services, make sure you have the foundational pieces in place first. Microservices and their associated infrastructure are an optimization that enables you to change your application faster. You can make a lot of strides toward going faster without a service mesh. You may even want some of the goodness a service mesh brings without all of the complexity; check out something like Gloo, an API gateway built on Envoy proxy.

You may reach a point where a service mesh does make sense for you, for example when you have the following problems:

  • Large deployment of microservices across multiple clusters
  • Hybrid deployment of containers/k8s and VMs
  • A heterogeneous set of languages used to build services
  • Incomplete and inconsistent view of network observability

In that case, you may opt to use a service mesh, but the decision doesn't come without its own challenges. For example, some of the challenges I see in adopting service-mesh technology today:

  • Which mesh to choose?
  • Who's going to support it?
  • Multi-tenancy issues within a single cluster
  • No good way to manage multiple clusters
  • Fitting with existing services (sidecar lifecycle, race conditions, etc.)
  • What's the delineation between developers and operations?
  • Non-container and hybrid environments
  • Centralization vs. decentralization

Let's take a look at three challenges enterprises face when adopting a service mesh.

Challenges of adopting a service mesh in the enterprise

As organizations adopt service-mesh technology, the challenges they face will change over time. Right now, I'd say we are at the very beginning of this adoption en masse, but it's happening. The "today" challenges I see when adopting a service mesh include figuring out how it fits with existing security policies, how to introduce the proxies incrementally with minimal disruption, and the struggle between centralizing a control system like this and decentralizing it.

Introducing technology that touches many parts of the organization can be difficult. Let's explore these challenges head on.

Fitting in with existing security and tenancy policies

Technology organizations have been evolving their networking and security policies to improve their security posture in a world of ever-increasing cyber threats. Balancing security requirements against the practical cost of implementing them has also pushed organizations toward multi-tenant infrastructure. For example, when I see organizations run a Kubernetes cluster in production, they typically have very strict service-to-service networking security policies. They may host multiple teams and services within that cluster, and they typically lock down networking communication between namespaces, using software-defined networking policies to prohibit communication unless it's explicitly allowed. In this model, "micro-segmentation" is implemented at the namespace level, while everything within a single namespace is trusted.
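
To make that concrete, here is a minimal sketch of namespace-level micro-segmentation in Kubernetes, assuming a hypothetical tenant namespace called team-a; traffic from other namespaces is denied unless another policy explicitly allows it:

```yaml
# Deny ingress from outside the namespace; pods within team-a can
# still talk to each other. "team-a" is a hypothetical tenant
# namespace used for illustration.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-from-other-namespaces
  namespace: team-a
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # a bare podSelector only matches pods in this namespace
```

A mesh's sidecars and control plane typically need to talk across exactly these boundaries, which is where the friction starts.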

Introducing a service mesh could end up being a non-starter in this environment. Service-mesh technology often assumes a flat network, with open communication paths between all the services in the mesh. Additionally, a lot of service-mesh control planes have no notion of "tenancy": if a single tenant writes rules that take down the control plane, it takes it down for everyone. This means you may need to relax some of these tenancy or security constraints (or at least make exceptions) to get a service mesh into place. This may not be ideal.

Introducing mesh proxies with minimal interruptions

A service mesh typically relies on some kind of "sidecar" proxy to implement the control and features of the mesh. A common sidecar is Envoy proxy, though some mesh implementations, like Linkerd, have their own custom proxy. With any existing services, you want to introduce this proxy safely, without interrupting existing network communication.

This may seem like a perfectly safe operation in your testing environments, but when you start doing it in production, unexpected issues can arise. For example, when introducing the proxy into a deployment on Kubernetes, you may see race conditions between the application container and the proxy container: the application comes up and tries to initiate network communication before the sidecar proxy is ready, and therefore cannot communicate.
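
One common workaround, sketched below for an Istio-style sidecar, is to block the application's entrypoint until the proxy reports ready. The readiness port and the application details here are assumptions to verify against your installation, not a definitive recipe:

```yaml
# Pod spec fragment: wait for the sidecar before starting the app.
# The image, entrypoint, and readiness port (15021) are hypothetical
# placeholders; Istio has served the proxy's /healthz/ready endpoint
# on different ports across releases.
containers:
  - name: my-app
    image: example/my-app:1.0
    command: ["/bin/sh", "-c"]
    args:
      - |
        until wget -qO- http://127.0.0.1:15021/healthz/ready >/dev/null 2>&1; do
          echo "waiting for sidecar proxy..."
          sleep 1
        done
        exec /app/start   # hypothetical original entrypoint
```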

Another situation that can arise is the sidecar proxy unexpectedly consuming more resources than are allocated to its container in that Kubernetes Pod. You can end up in a situation where your good intentions of slowly introducing the sidecar, one service at a time, take down that service because of misconfigured resource limits. This isn't an issue specific to service mesh, but it does seem to be a big enough issue that many folks run into it.
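
If your mesh injects the sidecar, check whether it lets you size the proxy per workload. As a sketch, Istio's injector honors pod annotations for the proxy's resource requests (annotation names as documented by Istio; other meshes differ, and the workload itself is hypothetical):

```yaml
# Deployment fragment: override the injected proxy's resource
# requests via pod annotations. "my-app" and its image are
# hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        sidecar.istio.io/proxyCPU: "200m"      # request for the proxy container
        sidecar.istio.io/proxyMemory: "256Mi"
    spec:
      containers:
        - name: my-app
          image: example/my-app:1.0
```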

Centralization vs. decentralization

The last challenge I see facing enterprises is not service-mesh specific, but it does come up. There is an eternal battle between decentralization, for autonomy and agility, and centralization, for simplified and consistent management. This battle extends into the implementation of microservices in large organizations. With your service mesh, who is responsible for the mesh's configuration, and what influence, if any, do the development teams have over it? For example, when setting routing rules for an application (i.e., matching rules, context path, etc.), whose responsibility is it? When it comes to circuit breaking, timeouts, and retries, who owns those? When default behavior needs to be overridden, is there a path to do that independently of some centralized team?
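
To see why the ownership question matters, consider a sketch of a routing rule using Istio's API (the service name and values are hypothetical); every field below is a decision someone has to own:

```yaml
# Route /api traffic to the "recommendations" service with a timeout
# and retry policy. Who is allowed to change "attempts: 3" — the app
# team or a central platform team?
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendations
spec:
  hosts:
    - recommendations
  http:
    - match:
        - uri:
            prefix: /api   # context-path routing
      route:
        - destination:
            host: recommendations
      timeout: 3s          # request-level timeout
      retries:
        attempts: 3
        perTryTimeout: 1s
```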

In many ways, previous generations of technology (app servers, ESBs, API management, etc.) became bottlenecks for change because of the organizational centralization around them. Service mesh comes with a lot of promise in terms of improving rapid delivery of services, progressive delivery, and so on, but if it falls into the same traps of organizational centralization, it could indeed prove to be yet another bottleneck. Some hold up the service mesh's centralized control capabilities as its key differentiator compared to previous systems; overcoming the tendency for that centralization to become a drawback in practice is a major hurdle enterprises are dealing with right now.

Service mesh adoption in the enterprise

Adoption of this powerful, yet complicated, technology is not without its hurdles and drawbacks. Before jumping in head first, you should evaluate whether you really need a service mesh right now. The ecosystem is still emerging, with new implementations being announced almost every week. In addition, you could likely implement a lot of the network-control and observability aspects of a service mesh with something simpler, like Envoy itself or an Envoy-based API gateway like Gloo. Lastly, be prepared for the challenges listed above (among others) when introducing one into your environment.
