Diagnosing Redis errors on the client side
Customers periodically ask "Why am I getting errors when talking to Redis?" The answer is complicated: it could be a client side or a server side problem. In this article, I am going to talk about client side issues. For server side issues, see here.
Clients can see connectivity issues or timeouts for several reasons. Here are some of the common ones I see:
Memory pressure
Problem: Memory pressure on the client machine leads to all kinds of performance problems that can delay processing of data that the Redis instance sent without any delay. When memory pressure hits, the system typically has to page data out of physical memory to virtual memory, which is on disk. This page faulting causes the system to slow down significantly.
Measurement:
- Monitor memory usage on the machine to make sure that it does not exceed available memory.
- Monitor the Page Faults/Sec perf counter. Most systems will have some page faults even during normal operation, so watch for spikes in this counter that correspond with timeouts.
Resolution: Upgrade to a larger client VM size with more memory or dig into your memory usage patterns to reduce memory consumption.
Burst of traffic
Problem: Bursts of requests on a given client machine can cause client side spikes in CPU, thread creation delays, bandwidth limits being hit, network I/O limits being hit and other problems. The result is that responses sent quickly by Redis are consumed slowly by the client application. For instance, entire responses from Redis can sit idle in the client's underlying socket kernel buffer because the CPU is overwhelmed or the I/O system is waiting for a thread to be available to process the data.
Measurement: Watch for sudden spikes in CPU, I/O, thread counts, etc. In .NET, monitor how your ThreadPool statistics change over time using code like this. You can also look at the TimeoutException message from StackExchange.Redis. Here is an example:
System.TimeoutException: Timeout performing EVAL, inst: 8, mgr: Inactive, queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 64221, ar: 0, IOCP: (Busy=6,Free=999,Min=2,Max=1000), WORKER: (Busy=7,Free=8184,Min=2,Max=8191)
In the above message, there are several issues that are interesting:
- Notice that in the "IOCP" section and the "WORKER" section, the "Busy" value is greater than the "Min" value. This means that your ThreadPool settings likely need adjusting: once the number of busy threads exceeds the minimum, the .NET ThreadPool throttles the creation of additional threads, so a burst of work can be left waiting for a thread.
- You can also see "in: 64221". This indicates that 64221 bytes have been received at the kernel socket layer but haven't yet been read by the application (e.g. StackExchange.Redis). This typically means that your application isn't reading data from the network as quickly as the server is sending it to you.
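To make the two checks above concrete, here is a minimal sketch that pulls those fields out of the example message. It is written in Python purely for illustration and is not part of StackExchange.Redis; the function names `parse_timeout_message` and `find_warnings` are my own, and the parsing assumes the message layout shown in the example above, which may differ between client versions.

```python
import re


def parse_timeout_message(message):
    """Extract the 'in' byte count and the thread pool stats from a timeout message."""
    stats = {}
    in_match = re.search(r"\bin: (\d+)", message)
    if in_match:
        stats["in"] = int(in_match.group(1))
    # Each pool section looks like: IOCP: (Busy=6,Free=999,Min=2,Max=1000)
    for name, body in re.findall(r"(IOCP|WORKER): \(([^)]*)\)", message):
        fields = dict(pair.split("=") for pair in body.split(","))
        stats[name] = {key: int(value) for key, value in fields.items()}
    return stats


def find_warnings(stats):
    """Apply the two checks described in the article to the parsed stats."""
    warnings = []
    for name in ("IOCP", "WORKER"):
        pool = stats.get(name)
        if pool and pool["Busy"] > pool["Min"]:
            warnings.append(f"{name}: Busy={pool['Busy']} exceeds Min={pool['Min']}")
    if stats.get("in", 0) > 0:
        warnings.append(f"{stats['in']} bytes are waiting unread in the socket buffer")
    return warnings


# The example message from this article.
MESSAGE = (
    "System.TimeoutException: Timeout performing EVAL, inst: 8, mgr: Inactive, "
    "queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 64221, ar: 0, "
    "IOCP: (Busy=6,Free=999,Min=2,Max=1000), "
    "WORKER: (Busy=7,Free=8184,Min=2,Max=8191)"
)

for warning in find_warnings(parse_timeout_message(MESSAGE)):
    print(warning)
```

For the example message, this flags both thread pool sections as well as the unread bytes. A script like this can be handy for scanning a log file of timeouts to see which condition shows up most often.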
Resolution: Scale up your client VM size to handle bursts, find ways to smooth out concurrent calls on a given machine, investigate what is causing CPU spikes, etc. In .NET, configure your ThreadPool Settings to make sure that your threadpool will scale up quickly under burst scenarios.
High CPU usage
Problem: High CPU usage on the client is an indication that the system cannot keep up with the work that it has been asked to perform. The response from Redis can come very quickly, but because the CPU isn't keeping up with the workload, the response sits in the socket's kernel buffer waiting to be processed. If the delay is long enough, a timeout occurs in spite of the requested data having already arrived from the server.
Measurement: Monitor the system-wide CPU usage through the Azure portal or through the associated perf counter. Be careful not to monitor process CPU, because a single process can have low CPU usage at the same time that overall system CPU is high. Watch for spikes in CPU usage that correspond with timeouts. As a result of high CPU, you may also see high "in: XXX" values in TimeoutException error messages, as described above in the "Burst of traffic" section. Note that in newer builds of StackExchange.Redis, the client-side CPU will be printed out in the timeout error message as long as the environment doesn't block access to the CPU perf counter and the ConnectionMultiplexer.IncludePerformanceCountersInExceptions property has been set to true.
Note: If you are looking at the Azure portal to determine whether or not you are seeing spikes, please keep in mind that the metrics in the portal are gathered at some sampling rate (e.g. every 30 seconds). We have seen many cases where a CPU spike happens between samples and does not show up in the portal. StackExchange.Redis version 1.1.603 (or newer) prints out "local-cpu" usage when a timeout occurs to help you understand when client-side CPU usage may be affecting performance. However, in some environments, such as Azure App Services, access to system performance counters is blocked. In such cases, you will see "local-cpu: unavailable". Also, when debugging possible performance problems in an app, it is typically recommended that you look at the MAX CPU usage as opposed to AVG CPU. The reason is that AVG can hide shorter-lived CPU spikes that could explain issues like timeouts.
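A small worked example shows why AVG can hide a spike. The readings below are made up for illustration (one hypothetical CPU reading per second over a 30-second window), and `summarize` is just a helper name I chose for this sketch.

```python
def summarize(samples):
    """Return (average, maximum) for a list of CPU percentage readings."""
    return sum(samples) / len(samples), max(samples)


# Hypothetical readings: 27 quiet seconds at 20% followed by a 3-second spike at 100%.
samples = [20] * 27 + [100] * 3

average_cpu, max_cpu = summarize(samples)
print(f"avg={average_cpu:.0f}% max={max_cpu}%")  # prints "avg=28% max=100%"
```

An average of 28% looks healthy, yet for three seconds the CPU was saturated, which is more than long enough to cause timeouts.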
Resolution: Upgrade to a larger VM size with more CPU capacity or investigate what is causing CPU spikes.
Client Side Bandwidth Exceeded
Problem: Different sized client machines have limitations on how much network bandwidth they have available. If the client exceeds the available bandwidth, then data will not be processed on the client side as quickly as the server is sending it. This can lead to timeouts.
Measurement: Monitor how your bandwidth usage changes over time using code like this. Note that this code may not run successfully in some environments with restricted permissions (like Azure WebSites).
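Whatever tool you use to collect the numbers, the calculation itself is simple. The sketch below, in Python for illustration only, assumes you can periodically read a cumulative "bytes received" counter for the network interface; the sample values and the 100 Mbps limit are hypothetical, and `throughput_mbps` is a helper name I made up.

```python
def throughput_mbps(samples):
    """Convert cumulative byte counter samples into per-interval megabits per second.

    samples: list of (timestamp_seconds, cumulative_bytes) tuples, oldest first.
    """
    rates = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        rates.append((b1 - b0) * 8 / (t1 - t0) / 1_000_000)
    return rates


# Hypothetical counter readings taken one second apart.
samples = [(0, 0), (1, 2_500_000), (2, 15_000_000), (3, 17_500_000)]
LINK_LIMIT_MBPS = 100  # assumed limit for this VM size, for illustration only

for rate in throughput_mbps(samples):
    status = "AT LIMIT" if rate >= 0.9 * LINK_LIMIT_MBPS else "ok"
    print(f"{rate:.1f} Mbps {status}")
```

Here the middle interval runs at the assumed limit while the intervals around it are quiet, which is the kind of short burst you want to line up against the timestamps of your timeouts.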
Resolution: Increase Client VM size or reduce network bandwidth consumption.
Large Request/Response Size
Problem: A large request/response can cause timeouts. As an example, suppose your timeout value configured is 1 second. Your application requests two keys (e.g. 'A' and 'B') at the same time using the same physical network connection. Most clients support "Pipelining" of requests, such that both requests 'A' and 'B' are sent on the wire to the server one after the other without waiting for the responses. The server will send the responses back in the same order. If response 'A' is large enough it can eat up most of the timeout for subsequent requests.
Below, I will try to demonstrate this. In this scenario, requests 'A' and 'B' are sent quickly and the server starts sending responses 'A' and 'B' quickly, but because of data transfer times, response 'B' gets stuck behind response 'A' and times out even though the server responded quickly.
|-------- 1 Second Timeout (A)----------|
|-Request A-|
     |-------- 1 Second Timeout (B) ----------|
     |-Request B-|
            |- Read Response A --------|
                                       |- Read Response B-| (**TIMEOUT**)
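The same scenario can be worked through with some rough arithmetic. The numbers in this Python sketch are assumptions chosen for illustration (a 12 MB response 'A', a 1 MB response 'B', a 100 Mbps link, and a server that answers instantly); `completion_times` is a helper name of my own.

```python
def completion_times(response_sizes_bytes, bandwidth_bytes_per_sec):
    """Return the time at which each response has been fully read.

    Responses are read in order on one connection, so each one finishes
    only after every response ahead of it has been transferred.
    """
    finished_at = []
    elapsed = 0.0
    for size in response_sizes_bytes:
        elapsed += size / bandwidth_bytes_per_sec
        finished_at.append(elapsed)
    return finished_at


TIMEOUT_SECONDS = 1.0
BANDWIDTH = 12_500_000  # 100 Mbps expressed in bytes per second (assumed)

a_done, b_done = completion_times([12_000_000, 1_000_000], BANDWIDTH)
for name, done in (("A", a_done), ("B", b_done)):
    result = "TIMEOUT" if done > TIMEOUT_SECONDS else "ok"
    print(f"Response {name} fully read after {done:.2f}s: {result}")
```

With these numbers, 'A' finishes just inside its timeout, while 'B' misses its timeout even though it would need only a fraction of a second on its own.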
Measurement: This is a difficult one to measure. You basically have to instrument your client code to track large requests and responses.
Resolution:
- Redis is optimized for a large number of small values, rather than a few large values. The preferred solution is to break up your data into related smaller values. See here for details around why smaller values are recommended.
- Increase the size of your VM (for client and Redis Cache Server), to get higher bandwidth capabilities, reducing data transfer times for larger responses. Note that getting more bandwidth on just the server or just on the client may not be enough. Measure your bandwidth usage and compare it to the capabilities of the size of VM you currently have.
- Increase the number of ConnectionMultiplexer objects you use and round-robin requests over different connections (e.g. use a connection pool). If you go this route, make sure that you don't create a brand new ConnectionMultiplexer for each request as the overhead of creating the new connection will kill your performance. Also, you may want to consider having different connections for different purposes - e.g. large requests/responses use one set of connections and smaller requests/responses use a different set of connections. This would allow you to have different timeout values for each pool of connections.
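The round-robin idea in the last point can be sketched as follows. This is Python for illustration only and does not use the StackExchange.Redis API; `RoundRobinPool`, `FakeConnection` and the `connect` factory are hypothetical stand-ins for whatever creates a long-lived connection in your client library.

```python
import itertools
import threading


class RoundRobinPool:
    """Hands out a fixed set of long-lived connections in rotation."""

    def __init__(self, connect, size):
        # Create every connection once, up front; never one per request.
        self._connections = [connect() for _ in range(size)]
        self._cycle = itertools.cycle(self._connections)
        self._lock = threading.Lock()

    def get(self):
        # The lock keeps the rotation consistent when many threads call get().
        with self._lock:
            return next(self._cycle)


class FakeConnection:
    """Stand-in for a real client connection, used only to make the sketch runnable."""

    created = 0

    def __init__(self):
        FakeConnection.created += 1
        self.connection_id = FakeConnection.created


# Separate pools for different purposes, so each could use its own timeout.
small_value_pool = RoundRobinPool(FakeConnection, size=4)
large_value_pool = RoundRobinPool(FakeConnection, size=2)

for _ in range(3):
    print("large request goes to connection", large_value_pool.get().connection_id)
```

The important property is that the number of connections created is fixed by the pool size, no matter how many requests are served.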