
Google Cloud: Rate-limiting strategies and techniques

Server-side

  • Even in the cases where the rate limiting is implemented entirely on the server side, the client should be engineered to react appropriately.
  • Decisions about failing open or failing closed are mostly relevant on the server side, but knowledge of what retry techniques the clients use on a failed request might influence the decisions you make about server behavior.
  • In HTTP services, the most common way that services signal that they are applying rate limiting is by returning a 429 status code in the HTTP response. A 429 response can provide additional details about why the limit is applied (for example, a freemium user has a lower quota, or the system is undergoing maintenance).
  • Build your system with robust error handling in case some part of your rate-limiting strategy fails, and understand what users of your service will receive in those situations. [...] Using timeouts, deadlines, and circuit-breaking patterns helps your service to be more robust in the absence of rate limiting.
  • If your service calls other services to fulfill requests, you can choose how you pass any rate-limiting signals from those services back to the original caller. [...] The simplest option is to only forward the rate-limiting response from the downstream service to the caller. An alternative is to enforce the rate limits on behalf of the downstream service and block the caller.
  • To enforce rate limiting, first understand why it is being applied in this case, and then determine which attributes of the request are best suited to be used as the limiting key (for example, source IP address, user, API key). After you choose a limiting key, a limiting implementation can use it to track usage. When limits are reached, the service returns a limiting signal (usually a 429 HTTP response). (A minimal sketch of this keyed-limiting flow follows this list.)
  • If computing a response is expensive or time-consuming, a system might be unable to provide a prompt response to a request, which makes it harder for a service to handle high rates of requests. An alternative to rate limiting in these cases is to shunt requests into a queue and return some form of job ID. [...] The deferred response pattern is easiest to apply when the immediate response to a request holds no real information. If this pattern is overused, then it can increase the complexity and failure modes of your system.
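As a rough illustration of the keyed-limiting flow above, here is a minimal in-process sketch. The window size, limit, and counter layout are assumptions for the example, not values from the article; a real deployment would track counts in shared storage (for example, Redis) so that all replicas agree on usage.

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 60        # assumed window size for this example
LIMIT_PER_WINDOW = 100     # assumed per-key limit for this example

_counters = defaultdict(int)  # (limiting_key, window) -> request count

def check_rate_limit(limiting_key: str):
    """Track usage by limiting key (e.g. source IP, user, or API key)
    and return the HTTP status the service should respond with."""
    window = int(time.time()) // WINDOW_SECONDS
    _counters[(limiting_key, window)] += 1
    if _counters[(limiting_key, window)] > LIMIT_PER_WINDOW:
        # Limit reached: return the usual limiting signal, HTTP 429,
        # with a Retry-After hint so clients know when to come back.
        retry_after = WINDOW_SECONDS - int(time.time()) % WINDOW_SECONDS
        return 429, {"Retry-After": str(retry_after)}
    return 200, {}
```

A request handler would call `check_rate_limit(...)` with whichever request attribute was chosen as the limiting key before doing any expensive work.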

Client-side

  • In response to rate-limiting, intermittent, or non-specific errors, a client should generally retry the request after a delay. It is a best practice for this delay to increase exponentially after each failed request, which is referred to as exponential backoff.
  • Imagine a mobile app with many users that checks in with an API at exactly noon every day, and applies the same deterministic back-off logic. [...] By adding a random offset (jitter) to the time of the initial request or to the delay time, the requests and retries can be more evenly distributed, giving the service a better chance of fulfilling the requests. (See the backoff-with-jitter sketch after this list.)
  • For situations in which the client developer knows that the system they are calling is not resilient to stressful loads and does not support rate-limiting signals (back-pressure), the client library or application developer can choose to apply self-imposed throttling. [...]
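A minimal client-side sketch of exponential backoff with full jitter. The `RateLimitedError` exception and the parameter values are hypothetical stand-ins for whatever your HTTP client raises on a 429 response:

```python
import random
import time

class RateLimitedError(Exception):
    """Hypothetical stand-in for the error your HTTP client raises on 429."""

def call_with_backoff(request_fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay doubles after each failed request (exponential
            # backoff); the random factor (jitter) spreads retries out
            # so many clients don't hammer the service in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```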

Techniques for enforcing rate limits

In general, a rate is a simple count of occurrences over time.

  • Token bucket: A token bucket maintains a rolling and accumulating budget of usage as a balance of tokens. This technique recognizes that not all inputs to a service correspond 1:1 with requests. A token bucket adds tokens at some rate. When a service request is made, the service attempts to withdraw a token (decrementing the token count) to fulfill the request. If there are no tokens in the bucket, the service has reached its limit and responds with backpressure. [...] (See the sketch after this list.)
  • Leaky bucket: A leaky bucket is similar to a token bucket, but the rate is limited by the amount that can drip or leak out of the bucket. This technique recognizes that the system has some degree of finite capacity to hold a request until the service can act on it; any extra simply spills over the edge and is discarded. [...]
  • Fixed window: Fixed-window limits (such as 3,000 requests per hour or 10 requests per day) are easy to state, but they are subject to spikes at the edges of the window, as available quota resets. [...]
  • Sliding window: Sliding windows have the benefits of a fixed window, but the rolling window of time smooths out bursts. Systems such as Redis facilitate this technique with expiring keys.
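As a concrete example of the first technique, here is a minimal token-bucket sketch; the class and parameter names are illustrative, not from the article:

```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum accumulated budget
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Accrue tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1  # withdraw a token to fulfill the request
            return True
        return False  # bucket empty: respond with backpressure (e.g. 429)

# Example: bursts of up to 10 requests, refilling at 5 tokens per second.
bucket = TokenBucket(rate=5, capacity=10)
```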

Additional techniques for greater resilience

Rate limiting at the application level can provide services with increased resilience, but resilience can be further improved by combining application-level rate limiting with other techniques:

  • Caching: Storing results that are slow to compute makes it possible for a service to process a higher rate of requests, which might cause rate-limiting backpressure to be applied less frequently to clients.
  • Circuit breaking: You can make service networks more resilient to problems resulting from propagation of recurring errors by making parts of the system latch temporarily to a quiet state. (See the sketch after this list.)
  • Prioritization: Not all users of a system are of equal priority. Consider additional factors in designing rate-limiting keys to ensure that higher-priority clients are served. You can use load shedding to remove the burden of lower-priority traffic from systems.
  • Rate limiting at multiple layers: If your machine's network interface or OS kernel is being overwhelmed, then application-layer rate limiting might never even have a chance to begin. You can apply rate limits at layer 3 in iptables, or on-premises appliances can limit at layer 4. You might also be exposed to tuneable rate limits applied to your system's I/O for things like disk and network buffers.
  • Monitoring: Being aware that throttling is happening is crucial for operations systems and personnel. Monitoring for rates that exceed quotas is critical for incident management and catching regressions in software. We recommend implementing such monitoring for both the client and server perspectives of services. Not all occurrences of rate limiting should cause alerts that demand immediate attention by operations personnel.
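For the circuit-breaking item above, a minimal sketch of the latch-to-quiet-state behavior. The thresholds, names, and half-open probe policy are assumptions for the example:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Quiet period elapsed: let one probe call through (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Latch to a quiet state instead of propagating the
                # recurring error deeper into the service network.
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit again
        return result
```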