How do I troubleshoot high latency when using Amazon ElastiCache for Redis?
The following are common reasons for increased latencies or timeout issues in ElastiCache for Redis:
- Latency due to slow commands.
- High memory usage leading to increased paging.
- Latency caused by network issues.
- Client-side latency issues.
- ElastiCache cluster events.
Latency due to slow commands
Redis is mostly single-threaded, so if one request is served slowly, all other clients must wait their turn. This waiting adds to command latencies. Each Redis command also has a time complexity defined using Big-O notation.
Use the Amazon CloudWatch metrics provided by ElastiCache to monitor the average latency for different classes of commands. Note that common Redis operations complete with microsecond latency. CloudWatch metrics are sampled every 1 minute, and the latency metrics show an aggregate of multiple commands. As a result, a single command can cause unexpected results, such as timeouts, without showing significant changes in the metric graphs. In these situations, use the SLOWLOG command to determine which commands are taking the longest to complete. Connect to the cluster and run the slowlog get 128 command in redis-cli to retrieve the list. For more information, see How do I turn on Redis Slow log in an ElastiCache for Redis cache cluster?
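When the slow log holds many entries, it can help to group them by command name before reading them one by one. The following is a minimal sketch, not an official tool. It assumes entries shaped like the dictionaries that the redis-py client's slowlog_get() returns (a 'command' field and a 'duration' field in microseconds). The function name summarize_slowlog is hypothetical.

```python
from collections import defaultdict


def summarize_slowlog(entries, top=5):
    """Group slow log entries by command name, ranked by total time spent.

    Assumption: each entry is a dict with 'command' (bytes or str) and
    'duration' (microseconds), as returned by redis-py's slowlog_get().
    """
    totals = defaultdict(lambda: {"count": 0, "total_us": 0, "max_us": 0})
    for entry in entries:
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        parts = command.split()
        name = parts[0].upper() if parts else "(EMPTY)"
        stats = totals[name]
        stats["count"] += 1
        stats["total_us"] += entry["duration"]
        stats["max_us"] = max(stats["max_us"], entry["duration"])
    # Commands that consumed the most total time come first.
    ranked = sorted(totals.items(), key=lambda kv: kv[1]["total_us"], reverse=True)
    return [(name, dict(stats)) for name, stats in ranked[:top]]
```

With a connected redis-py client, the input would typically come from client.slowlog_get(128), which mirrors the redis-cli command mentioned above.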
You might also see an increase in the EngineCPUUtilization metric in CloudWatch because slow commands block the Redis engine. For more information, see Why am I seeing high or increasing CPU usage in my ElastiCache for Redis cluster?
Examples of complex commands are:
- KEYS in production environments over large datasets, because it sweeps the entire keyspace searching for a specified pattern.
- Long-running Lua scripts.
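A commonly used alternative to KEYS, not named in the list above, is the cursor-based SCAN command, which walks the keyspace in small batches so other clients can be served between calls. The sketch below assumes a client object exposing scan(cursor, match, count) that returns a (next_cursor, keys) pair, which is the shape the redis-py client uses. The helper name iter_keys and the batch_hint default are hypothetical.

```python
def iter_keys(client, pattern, batch_hint=500):
    """Yield keys matching `pattern` one batch at a time using SCAN.

    Assumption: client.scan(cursor=..., match=..., count=...) returns
    (next_cursor, keys), as redis-py does. `count` is only a hint to the
    server about how much work to do per call, not an exact batch size.
    """
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch_hint)
        for key in keys:
            yield key
        # A returned cursor of 0 means the full iteration is complete.
        if int(cursor) == 0:
            break
```

Each SCAN call still costs server time, so the total work is comparable to KEYS; the benefit is that it is split into short steps instead of one long blocking command.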
High memory usage leading to increased paging
Redis starts to swap pages when memory pressure increases on the cluster because it is using more memory than is available. Latency and timeout issues increase as memory pages are transferred to and from the swap area. The following CloudWatch metrics indicate increased paging:
- Increasing SwapUsage.
- Very low FreeableMemory.
- High BytesUsedForCache and DatabaseMemoryUsagePercentage metrics.
SwapUsage is a host-level metric that indicates the amount of memory being swapped. It is normal for this metric to show non-zero values because it is controlled by the underlying operating system and can be influenced by many dynamic factors. These factors include the operating system version, activity patterns, and so on. As an optimization technique, Linux proactively swaps idle keys (rarely accessed by clients) to disk to free up memory space for more frequently used keys.
Swapping becomes a problem when there isn't enough available memory. When this happens, the system starts moving pages back and forth between disk and memory. Specifically, SwapUsage of less than a few hundred megabytes doesn't negatively affect Redis performance. Performance degrades when SwapUsage is high and actively changing, and the cluster doesn't have enough memory available. For more information, see:
- Why am I seeing high or increasing memory usage on my ElastiCache cluster?
- Why is swapping happening in ElastiCache?
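The three paging indicators described above can be combined into a simple check. This is only an illustrative heuristic: the function name paging_signals and all three thresholds are assumptions chosen for the example (300 MiB stands in for "a few hundred megabytes"), not AWS-recommended values, and should be tuned per node type and workload.

```python
MIB = 1024 ** 2


def paging_signals(swap_usage_bytes, freeable_memory_bytes, memory_usage_pct,
                   swap_limit_bytes=300 * MIB,      # assumed threshold
                   freeable_floor_bytes=100 * MIB,  # assumed threshold
                   usage_ceiling_pct=90.0):         # assumed threshold
    """Return the names of the CloudWatch metrics that look unhealthy.

    Inputs are the latest values of SwapUsage, FreeableMemory, and
    DatabaseMemoryUsagePercentage for one node.
    """
    signals = []
    if swap_usage_bytes > swap_limit_bytes:
        signals.append("SwapUsage")
    if freeable_memory_bytes < freeable_floor_bytes:
        signals.append("FreeableMemory")
    if memory_usage_pct > usage_ceiling_pct:
        signals.append("DatabaseMemoryUsagePercentage")
    return signals
```

A single flagged metric is weak evidence on its own; the text above points to swapping as a problem mainly when SwapUsage is high and changing while memory is also scarce.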
Latency caused by network issues
Network latency between the client and the ElastiCache cluster
To isolate network latency between the client and the cluster nodes, use TCP traceroute or mtr tests from the application environment. Or, use a debugging tool such as the AWSSupport-SetupIPMonitoringFromVPC AWS Systems Manager document (SSM document) to test connections from the client subnet.
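As a lightweight complement to traceroute and mtr, the time needed to open a TCP connection can be sampled directly from the application host. The sketch below uses only the Python standard library. The function name tcp_connect_times_ms is hypothetical, and the host and port passed in would be your own node endpoint.

```python
import socket
import time


def tcp_connect_times_ms(host, port, attempts=5, timeout=2.0):
    """Measure how long each TCP connection attempt takes, in milliseconds.

    Only the TCP handshake is timed; no Redis command is sent and no TLS
    handshake is performed, so encrypted clusters will see extra latency
    on top of these numbers.
    """
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
        samples.append(elapsed * 1000.0)
    return samples
```

Comparing samples taken from different client subnets or Availability Zones can help show whether the delay is tied to a particular network path.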
The cluster is reaching the limits of the network
An ElastiCache node shares the same network limits as the corresponding Amazon Elastic Compute Cloud (Amazon EC2) instance type. For example, the cache.m6g.large node type has the same network limits as the m6g.large EC2 instance. For information on checking the three key components of network performance (bandwidth capability, packet-per-second (PPS) performance, and connections tracked), see Monitor network performance for your EC2 instance.
To troubleshoot network limits on your ElastiCache node, see Troubleshooting - Network-related limits.
Clients connect to Redis clusters over a TCP connection. Establishing a TCP connection takes a few milliseconds. The extra milliseconds create additional overhead on the Redis operations run by your application and extra pressure on the Redis CPU. Controlling the volume of new connections is especially important when your cluster uses the ElastiCache in-transit encryption feature, because of the extra time and CPU usage that a TLS handshake requires. A high volume of connections that are rapidly opened (NewConnections) and closed can affect the node's performance. You can use connection pooling to cache established TCP connections in a pool. The connections are then reused each time a new client tries to connect to the cluster. You can implement connection pooling using your Redis client library (if supported), with a framework available for your application environment, or build it from scratch. You can also use aggregated commands such as MSET/MGET as an optimization technique.
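To illustrate the "build it from scratch" option, here is a minimal sketch of a connection pool. It is generic: the factory callable that opens a connection is an assumption supplied by the caller, and the class and attribute names are hypothetical. Many Redis client libraries already ship a pool (redis-py provides redis.ConnectionPool, for example), which is usually preferable to a hand-written one.

```python
import queue


class SimpleConnectionPool:
    """Keep up to `max_idle` idle connections around for reuse.

    `factory` is any zero-argument callable that opens a new connection.
    This sketch caps idle connections only; it does not limit how many
    connections are checked out at once.
    """

    def __init__(self, factory, max_idle=10):
        self._factory = factory
        self._idle = queue.LifoQueue(maxsize=max_idle)
        self.created = 0  # connections opened so far (informational)

    def acquire(self):
        try:
            # Reuse an established connection when one is idle.
            return self._idle.get_nowait()
        except queue.Empty:
            self.created += 1
            return self._factory()

    def release(self, conn):
        try:
            self._idle.put_nowait(conn)
        except queue.Full:
            # Pool is full: close the surplus connection if it can be closed.
            close = getattr(conn, "close", None)
            if callable(close):
                close()
```

Reusing a connection this way skips the TCP (and, where enabled, TLS) handshake on every request, which is the overhead the paragraph above describes.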
There are a large number of connections on the ElastiCache node
It's a best practice to track the CurrConnections and NewConnections CloudWatch metrics. These metrics monitor the number of TCP connections accepted by Redis. A large number of TCP connections might exhaust the 65,000 maxclients limit. This limit is the maximum number of concurrent connections that you can have per node. If you reach the 65,000 limit, you receive the ERR max number of clients reached error. If more connections are added beyond the limit of the Linux server, or beyond the maximum number of connections tracked, the additional client connections result in connection timeout errors. For information on preventing a large number of connections, see Best practices: Redis clients and Amazon ElastiCache for Redis.
Client-side latency issues
Latency and timeouts might originate from the client itself. Check the memory, CPU, and network usage on the client side to determine whether any of these resources are reaching their limits. If your application runs on an EC2 instance, use the same CloudWatch metrics discussed earlier to check for bottlenecks. Latency can also occur in the operating system, which default CloudWatch metrics can't monitor thoroughly. Consider installing a monitoring tool on the EC2 instance, such as atop or the CloudWatch agent.
If the timeout configuration values set on the application side are too small, you may get unnecessary timeout errors. Set the client-side timeout appropriately to give the server enough time to process the request and generate the response. For more information, seeBest Practices: Redis Clients and Amazon ElastiCache for Redis.
The timeout error received from your application contains additional details. These details include whether a specific node is involved, the name of the Redis data type causing the timeouts, the exact timestamp of when the timeout occurred, etc. This information will help you find the pattern of the problem. Use this information to answer questions such as the following:
- Do timeouts happen frequently during a specific time of day?
- Did the timeout occur at one client or at multiple clients?
- Did the timeout occur at one Redis node or at multiple nodes?
- Did the timeout occur at one cluster or at multiple clusters?
Use these patterns to investigate the most likely client or ElastiCache node. You can also use your application logs and VPC Flow Logs to determine whether the latency originated on the client side, on the ElastiCache node, or in the network.
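The pattern questions above can be answered by counting timeout events along each dimension. The sketch below assumes that timeouts have already been extracted from application logs into dictionaries. The field names 'timestamp' (ISO 8601), 'client', and 'node', and the function name timeout_patterns, are hypothetical and depend on what your client library actually logs.

```python
from collections import Counter
from datetime import datetime


def timeout_patterns(events):
    """Count timeout events by hour of day, by client, and by Redis node.

    Assumption: each event is a dict with an ISO 8601 'timestamp' string
    plus 'client' and 'node' identifiers taken from the application logs.
    """
    by_hour = Counter()
    by_client = Counter()
    by_node = Counter()
    for event in events:
        hour = datetime.fromisoformat(event["timestamp"]).hour
        by_hour[hour] += 1
        by_client[event["client"]] += 1
        by_node[event["node"]] += 1
    return {"by_hour": by_hour, "by_client": by_client, "by_node": by_node}
```

If one client or one node dominates its counter, start the investigation there; a spike in a single hour suggests a scheduled job or a maintenance event.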
Redis synchronization is initiated during backup, replacement, and scaling events. This is a compute-intensive workload that can cause latency. Use the SaveInProgress CloudWatch metric to determine whether synchronization is in progress. For more information, see How synchronization and backup are implemented.
ElastiCache Cluster Events
Check the Events section in the ElastiCache console for the time period when latency was observed. Look for background activities, such as node replacement or failover events, that could be caused by ElastiCache managed maintenance and service updates, or by unexpected hardware failures. You receive notification of scheduled events through the PHD dashboard and by email.
The following is an example of an event log:
Finished recovery for cache nodes 0001
Recovering cache nodes 0001
Failover from master node <cluster_node> to replica node <cluster_node> completed
Monitoring best practices with Amazon ElastiCache for Redis using Amazon CloudWatch
Diagnosing latency issues - Redis