@bojand
Last active March 1, 2024 19:32
gRPC and Load Balancing

Just documenting docs, articles, and discussion related to gRPC and load balancing.

https://github.com/grpc/grpc/blob/master/doc/load-balancing.md

Seems gRPC prefers thin client-side load balancing, where a client gets a list of backend servers and a load-balancing policy from a "load balancer" and then performs client-side load balancing based on that information. However, this could be useful for traditional load-balancing approaches in cloud deployments.
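
As a concrete illustration (my own sketch using today's grpc-go API, not anything from the linked doc), this is roughly what thin client-side balancing looks like: the DNS resolver supplies the backend list and a round_robin policy spreads RPCs across it. The backend name is a placeholder.

```go
// Minimal sketch of client-side load balancing in grpc-go. The DNS
// resolver plays the role of the "load balancer" handing the client a
// list of addresses (every A record for the name).
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	conn, err := grpc.Dial(
		"dns:///my-service.example.com:50051", // hypothetical backend name
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// round_robin spreads RPCs across the resolved backends.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	// Stubs created from conn now balance per RPC, not per connection.
}
```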

https://groups.google.com/forum/#!topic/grpc-io/8s7UHY_Q1po

gRPC "works" in AWS. That is, you can run gRPC services on EC2 nodes and have them connect to other nodes, and everything is fine. If you are using AWS for easy access to hardware then all is fine. What doesn't work is ELB (aka CLB), and ALBs. Neither of these support HTTP/2 (h2c) in a way that gRPC needs. ELBs work in TCP mode, but you give up useful health checking and the join-shortest-queue behaviour that makes normal HTTP mode ELBs good. It also means you may experience problems with how well balanced your cluster is since only individual client connections are balanced rather than individual requests to the backend. If a single client is generating a lot of requests, they will all go to the same backend rather than being balanced across your available instances. This also means that ECS doesn't really work properly since it only supports the use of ELB and ALB load balancers. If your requirements are not too demanding TCP mode ELBs do work, and you can definitely ship stuff that way. It's just not ideal and has some fairly major problems as your request rates and general system complexity increase

I use gRPC on AWS and it works great. However, I don't believe ALBs support trailers in the HTTP/2 spec, so that won't work. Something may have changed since the last time I looked, but don't count on an HTTP/2 ALB working. I believe it's HTTP/2 to clients of the ELB but HTTP/1.1 to your backend servers.

Alternatively, use ELB/ALB at Layer 4 but put your own HTTP/2-compliant proxy behind it (Envoy, nghttpx, Linkerd, Traefik, ...). I know Lyft does this in production with Envoy.

https://forums.aws.amazon.com/thread.jspa?messageID=749377

We're trying to get the Application Load Balancer cooperating with some ECS-hosted gRPC services. So far it's failing; poking at the server a bit, it looks like requests are coming from the load balancer as HTTP/1.1, while the gRPC server is expecting HTTP/2. The info on the load balancer suggests it supports HTTP/2, but does that only apply to the client side?

Hi. Yes, the requests are sent from the load balancer to the targets as HTTP/1.1. For more information, see http://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-listeners.html#listener-configuration.

https://groups.google.com/forum/#!topic/grpc-io/rgJ7QyecPoY

We sort of have this situation, since we use Google App Engine, and its load balancer and URLFetch service only support HTTP/1.1. We used the PRPC implementation described here, which maps simple unary gRPC requests onto an HTTP/1.1 protocol: http://nodir.io/post/138899670556/prpc. We used the Go implementation from the Chrome tools repository and wrote our own client and server, which were relatively simple but absolutely do not support all of gRPC's features. The "better" approach might be to look at the grpc-web work, and possibly just run grpcwebproxy; see https://github.com/improbable-eng/grpc-web. I think that will also have the problem that if your clients aren't Go or JavaScript, you will need to implement the protocol yourself.

We normally recommend using a proxy that supports HTTP/2 to the backend, like nghttpx and derivatives (Envoy, Istio). If that's not possible, then the solutions tend to involve something that looks like grpc-web. If the proxy you are already using supports HTTP/1.1 trailers, it should be possible to use nghttpx to up-convert back to HTTP/2, but I've not tried that out.

Microservices at Lyst

HTTP load-balancing on gRPC services

Using Envoy to Load Balance gRPC Traffic

nginx now supports gRPC

gRPC Load Balancing with Nginx

DNS Load Balancing in GRPC

gRPC Load Balancing using Kubernetes and Linkerd

Tyk.io supports gRPC

HAProxy now supports gRPC

gRPC + AWS: Some gotchas

On gRPC Load Balancing

gRPC Load Balancing on Kubernetes

gRPC Load Balancing inside Kubernetes

How To Create Load Balancer For GRPC On AWS

Learnings from gRPC on AWS

New – Application Load Balancer Support for End-to-End HTTP/2 and gRPC

Demo for enabling gRPC workloads with end to end HTTP/2 support

Why load balancing gRPC is tricky? - a blog post providing an overview of gRPC load balancing options.

gRPC Client-Side Load Balancing in Go

Load Balancing gRPC services

gRPC load balancing with grpc-go

@mikeraimondi

mikeraimondi commented Jan 5, 2018

This remark:

This also means that ECS doesn't really work properly since it only supports the use of ELB and ALB load balancers.

is no longer true as of the time of writing. ECS supports AWS's Network Load Balancer (NLB), and gRPC appears to work with NLBs in my testing.

@daroczig

daroczig commented Jan 5, 2018

@bojand it seems that gRPC works pretty OK over the Network Load Balancer type of ELB.

@ntindall

@daroczig could you provide some more details about how you have the Network Load Balancer configured? We are having some issues using an NLB to balance gRPC backend services deployed with ECS.

@subhi-chegg

@daroczig, it'd be interesting to know how you set it up.

@dio

dio commented Jan 29, 2018

@daroczig how do you overcome grpc/grpc#7957?

@bojand
Author

bojand commented Feb 6, 2018

Hello, just noticed this conversation.
I have not tried NLB on AWS yet, but I am interested to know how it would work.

Regarding GCE:
https://groups.google.com/forum/#!topic/grpc-io/bfURoNLojHo

Hello everyone
"HTTP/2 + gRPC to backend instances for HTTP(S) LB" is now in Alpha for Google Compute Engine (GCE). Please reach out to your Google Cloud Account managers to sign you up for Alpha. Stay tuned for Beta soon!
Regards,
Prajakta

@chintanop

@bojand - we have been able to successfully run gRPC on NLB. It does round-robin based on TCP connections (note: not on individual HTTP requests), which is not a big deal for us, since all our requests are sent by different threads, which results in different TCP source ports and sequence numbers, and hence they get load-balanced (see the routing details quoted below).

Here is the NLB routing algorithm: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html
"A load balancer node selects a target using a flow hash algorithm, based on the protocol, source IP address, source port, destination IP address, destination port, and TCP sequence number. The TCP connections from a client have different source ports and sequence numbers, and can be routed to different targets. Each individual TCP connection is routed to a single target for the life of the connection."

@bsideup

bsideup commented Feb 13, 2018

I can confirm that it works with NLB, although I recommend reading this article about some of its hidden aspects:
https://hori-ryota.com/blog/failed-to-grpc-with-nlb/
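
Per the comments further down, the linked article is about the NLB dropping idle flows unless the client sends keepalives. A hedged grpc-go sketch of the client side (durations and target are illustrative, not from the article):

```go
// Client-side keepalives in grpc-go: periodic HTTP/2 PINGs keep the
// NLB's idle-flow timeout from silently killing the connection.
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	conn, err := grpc.Dial("my-nlb.example.com:50051", // placeholder target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping after 30s of inactivity
			Timeout:             10 * time.Second, // close if no ack within 10s
			PermitWithoutStream: true,             // ping even with no active RPCs
		}))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
}
```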

@abhimanyu1289

Any info on how to run gRPC on Kubernetes behind an NLB/ELB in proxy mode?

@public

public commented Mar 29, 2018

Hi, I'm the person who wrote the first quoted post at the top.

TCP-mode ELBs were already possible, and NLB remains undesirable for some users for the same reasons. NLB is missing lots of the features that make ELB good, e.g. health checks and least-outstanding-requests load balancing. These two things make it much easier to manage how clients interact with our services.

We actually just use our own small Python HTTP/1.1-compliant gRPC server implementation to work around how annoying this limitation in gRPC is. It's quite similar to how Envoy proxies gRPC over HTTP/1.1. We don't get streams, but everything else works, and it's much easier to standardise than random JSON APIs.

@AlekSi

AlekSi commented Jun 8, 2018

https://tyk.io/ also supports gRPC

@jamisonhyatt

External requests hitting an NLB in front of your gRPC services running on ECS should work out of the box, with one caveat: if other dependent services on the same ECS cluster attempt to connect through the NLB and happen to be on the same host as the gRPC service, the request will fail, because the NLB doesn't support hairpinning.

https://forums.aws.amazon.com/message.jspa?messageID=805080

@yotamoron

I can confirm that it works with NLB, although I recommend reading this article about some of its hidden aspects:
https://hori-ryota.com/blog/failed-to-grpc-with-nlb/

Dude, it's Japanese...

@towens

towens commented Jan 16, 2020

That Japanese article is referencing the grpc-go keepalive package docs (go doc keepalive).

Example (not mine) here

@alisher-agzamov

When I configured an AWS NLB in a Kubernetes cluster (EKS) to balance gRPC traffic between different numbers of application replicas, I found that gRPC clients very often could not establish a connection (approximately every second connection attempt failed). I tried to play with keepalive timeouts but could not fix it. The issue was solved when I created a secure gRPC server with grpc.ServerCredentials.createSsl().
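
(That createSsl call is the Node.js API. For anyone on Go, bringing up a TLS-enabled gRPC server looks roughly like this; certificate paths and the port are placeholders:)

```go
// Sketch of a TLS-enabled gRPC server in grpc-go, the Go counterpart of
// Node's grpc.ServerCredentials.createSsl().
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	// Load the server certificate and private key (placeholder paths).
	creds, err := credentials.NewServerTLSFromFile("server.crt", "server.key")
	if err != nil {
		log.Fatalf("load TLS keypair: %v", err)
	}
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	s := grpc.NewServer(grpc.Creds(creds))
	// Register services here, then serve.
	if err := s.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```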

@ericbrumfield

I can confirm that it works with NLB, although I recommend reading this article about some of its hidden aspects:
https://hori-ryota.com/blog/failed-to-grpc-with-nlb/

I'd second that it works: I've verified it holds a long-duration gRPC bidi stream in an app I'm building for hours, and it is still going strong in my testing today. The key that is missing in that article is that the gRPC server side needs an enforcement policy matching the client-side keepalive settings. In my test setup I have the keepalive time at 30 seconds on both server and client, the server side has an enforcement policy with MinTime of 30 seconds, and both sides have PermitWithoutStream set to true. The project is in Go and is hosted behind an AWS NLB. Without the enforcement policy set up on the gRPC server side, connections would drop somewhere between 8 and 15 minutes in, which is where the Japanese article was stuck.
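
In grpc-go terms, the server-side half of that setup looks roughly like this (a sketch matching the settings described above; if MinTime were larger than the client's keepalive Time, the server would answer the extra pings with a GOAWAY "too_many_pings" and drop the connection):

```go
// Server-side keepalive enforcement matching a client that pings every
// 30s with PermitWithoutStream enabled.
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             30 * time.Second, // matches the client's keepalive Time
			PermitWithoutStream: true,             // allow pings on idle connections
		}),
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time: 30 * time.Second, // server-initiated pings at the same cadence
		}),
	)
}
```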

The latest ALB support for gRPC is great, but for a probably more complex/niche scenario, such as dynamically doing mTLS end to end, I'm at a loss how to achieve that through the new ALB gRPC features, since TLS would terminate at the ALB; the NLB approach should support it. I'm not even sure a setup with API Gateway would achieve a dynamic end-to-end mTLS scenario like the one I have with this gRPC client/server setup.

BTW this is a good gist @bojand.

@SHUFIL

SHUFIL commented Aug 3, 2021

gRPC is working with the AWS Network Load Balancer, but when we use SSL (ACM) on the NLB it stops working. Does anyone know why?

@comerford

I'll add that I successfully had an EKS service deploy a non-secure gRPC API (TLS is actually not recommended for production in this case) behind an NLB that handles the TLS termination. I was able to do so with just the following annotations:

```yaml
service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
service.beta.kubernetes.io/aws-load-balancer-ssl-negotiation-policy: "ELBSecurityPolicy-FS-1-2-Res-2020-10"
service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "grpcapi,httpapi"
service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:eu-west-1:1231212341234:certificate/your-arn"
```

Another fun note: I was configuring it with Terraform, and the switch to NLB was instantaneous from a Terraform perspective, but it takes a while for the LB to actually provision and become available. Hence any references made to the LB (like a CNAME to the LB address) did not pick up the difference immediately. I had to re-run Terraform once the LB was up for it to realise the target LB had changed (and update the Route53 CNAME).
