Redis is packed with features and can solve many different kinds of problems. One common use case is a database cache: data is written to Redis whenever it is fetched from the database, and subsequent queries with the same parameters first look for the data in Redis and skip the database query if it is found there. This use case, and some others, result in high memory usage and may also result in a high change rate, which bring challenges that this article explains.
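
The caching pattern described above can be sketched as follows. This is a minimal illustration, not a recommended implementation: `cached_query`, `run_query` and the 300 second TTL are hypothetical names and values chosen for the example, and `cache` is assumed to be any client object that offers `get` and `setex` methods (the redis-py `Redis` client does).

```python
import json
from typing import Any, Callable


def cached_query(cache: Any, key: str, run_query: Callable[[], Any],
                 ttl_seconds: int = 300) -> Any:
    """Return the cached result for key, or run the query and cache it."""
    cached = cache.get(key)
    if cached is not None:
        # Cache hit: skip the database query entirely.
        return json.loads(cached)
    # Cache miss: query the database, then store the result for later lookups.
    result = run_query()
    # The TTL is an assumption of this sketch; it lets stale entries expire.
    cache.setex(key, ttl_seconds, json.dumps(result))
    return result
```

Every cache miss turns into a Redis write, so a workload with many distinct query parameters produces both high memory usage and a high change rate.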

Redis eviction policies

One of the more important Redis settings is the data eviction policy, which can be controlled from the Aiven web console. Redis has a max memory setting that controls how much data may be stored, and the data eviction policy controls what happens when that maximum is reached.

Aiven Redis services have the eviction policy set to No eviction by default, which means that if you keep storing values and never remove anything, writes will eventually start failing once the maximum memory is reached. This is fine when data is consumed at a rate similar to the rate at which it is written, but for many use cases Evict all keys, least recently used first (LRU), which starts dropping old keys when max memory is reached, may work better. The other modes are: dropping random keys instead of the oldest first; dropping the oldest or random keys, but only among keys that have an explicit expiration time set; and dropping the keys with the shortest TTL first, again only among keys with an explicit expiration time.
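
To make the difference between No eviction and LRU eviction concrete, here is a toy model in Python. It is only an illustration under simplifying assumptions: `ToyCache` is a hypothetical class, it limits the number of keys rather than bytes of memory, and it tracks exact recency, whereas Redis itself limits memory and approximates LRU by sampling keys.

```python
from collections import OrderedDict


class ToyCache:
    """Toy model of a size-limited key store; not how Redis is implemented."""

    def __init__(self, max_keys: int, evict_lru: bool):
        self.max_keys = max_keys
        self.evict_lru = evict_lru
        self.data = OrderedDict()  # least recently used key comes first

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def set(self, key, value):
        if key not in self.data and len(self.data) >= self.max_keys:
            if not self.evict_lru:
                # "No eviction": the write fails once the store is full.
                raise MemoryError("store is full and eviction is disabled")
            # "LRU": drop the least recently used key to make room.
            self.data.popitem(last=False)
        self.data[key] = value
        self.data.move_to_end(key)
```

With `evict_lru=False` the store rejects new keys once it is full, which mirrors writes failing under No eviction; with `evict_lru=True` it silently drops the key that was touched longest ago.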

Regardless of the eviction policy, if you write enough you will eventually reach the max memory setting.

Behavior in high memory and high change rate scenarios

For new services, the max memory for Redis is set to 70% of available RAM (minus management overhead) plus 10% for the replication log. Memory is restricted below 100% because there are a couple of situations in which Redis performs operations that can require significant additional memory. When a new Redis node connects to the existing master, the master forks a copy of itself that sends the current memory contents to the other node. Similar forking is done whenever Redis persists its current state to disk. At the time of writing, Aiven configures this to happen every 10 minutes.

When Redis creates a fork of itself, all memory pages of the new process are identical to those of the parent and don't actually consume extra memory. However, any changes in the parent process cause the memory of the two processes to diverge and real memory allocation to grow. If the forked process took, say, 4 minutes to perform its task and new data was written at 5 megabytes per second, system memory usage would grow by 1.2 gigabytes during the operation. The duration of the backup and replication operations is directly proportional to the total amount of memory in use, so the larger the plan, the larger the possible memory divergence. The memory reserved for allowing these operations to complete (without using swap) is therefore also proportional to total memory.
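
The estimate above is simple arithmetic: extra memory is roughly the fork duration multiplied by the write rate. A small sketch of that calculation follows; the function name is hypothetical, and the result is only the article's rough estimate, since actual divergence depends on which memory pages the writes touch.

```python
def divergence_megabytes(fork_duration_seconds: float,
                         write_rate_mb_per_s: float) -> float:
    """Rough estimate of extra memory used while a forked child is running."""
    return fork_duration_seconds * write_rate_mb_per_s


# The example from the text: a 4 minute fork at 5 megabytes per second.
print(divergence_megabytes(4 * 60, 5))  # 1200 megabytes, i.e. 1.2 gigabytes
```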

If Redis memory usage grows so large that it needs to use swap, the system can quickly go into a cycle that makes the node completely unresponsive, and the node then needs to be replaced. In the on-disk persistence case the child process wants to dump all of its memory to disk. If the parent process diverges so much that the node runs out of RAM, some pages that the child may not yet have persisted to disk are moved to swap. This makes the child process more IO bound and slower, which increases the divergence and causes more memory to be swapped, making the child even slower, and so on. Quite soon the node is unable to do anything but write and read swap.

The rate at which data can be written depends somewhat on the size of the values being written and on plan details. In general, writing on average 5 megabytes per second works in most cases, and writing on average 50 megabytes per second will almost always result in failure, especially if memory usage nears the maximum allowed or if a new node needs to initialize itself.

Initial synchronization

During system upgrades, or in HA setups in case of node failure, a new Redis node needs to be synchronized with the current master. The new node starts with an empty state, connects to the master and asks the master to send a full copy of its current state. After receiving that full copy, the new node starts following the replication stream from the master to fully catch up.

The initial sync implementation is somewhat CPU intensive, and because Redis doesn't support splitting the workload between multiple CPU cores, the maximum transfer speed is in practice usually around 50 megabytes per second. Additionally, the new node initially persists the data on disk and then loads it in a separate phase, which runs at low hundreds of megabytes per second. For a 20 gigabyte data set the initial sync phase takes around 6 minutes with today's hardware, and it cannot be sped up noticeably.

Once the initial sync is complete, the new node tries to start following the replication stream from the master. The replication log size is set to 10% of the total memory Redis is allowed to use, so for a 28 gigabyte plan, for example, it is just under 2 gigabytes. If the amount of changes made during the 6 minutes that the initial sync takes is larger than the replication log size, the new node cannot start following the replication stream and needs to do a new initial sync. If there are again too many changes by the time that sync completes, it needs to start over once more, and so forth. If new changes are written at 5 megabytes per second, the amount of changes in 6 minutes is 1.8 gigabytes and the new node can start up successfully. A higher constant change rate makes the sync fail unless the replication log size is increased.
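
The check described above can be written out as a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, management overhead is ignored, and the replication log is taken to be 10% of a max memory that is itself 70% of plan RAM, which is one reading of the figures in this article (it gives 1.96 gigabytes for a 28 gigabyte plan).

```python
def replication_log_mb(plan_ram_gb: float,
                       max_memory_fraction: float = 0.70,
                       log_fraction: float = 0.10) -> float:
    """Approximate replication log size in megabytes (overhead ignored)."""
    return plan_ram_gb * max_memory_fraction * log_fraction * 1000


def can_follow_replication(sync_seconds: float,
                           change_rate_mb_per_s: float,
                           log_size_mb: float) -> bool:
    """True if the changes made during initial sync fit in the replication log."""
    changes_mb = sync_seconds * change_rate_mb_per_s
    return changes_mb <= log_size_mb


log_mb = replication_log_mb(28)                    # about 1960 megabytes
print(can_follow_replication(6 * 60, 5, log_mb))   # True: 1800 MB of changes fit
print(can_follow_replication(6 * 60, 6, log_mb))   # False: 2160 MB do not fit
```

Under these assumptions, a 6 minute sync on a 28 gigabyte plan tolerates a sustained change rate of a little over 5 megabytes per second before the new node has to start over.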

Mitigations

Aiven does not currently rate limit Redis traffic, because limiting only the relevant write operations would require a specialized Redis proxy, and restricting all traffic could unnecessarily affect non-write traffic. Rate limiting may be introduced in the future, but for now you should be mindful of your workloads and try to keep Redis writes at a moderate level to avoid node failures caused by too high memory usage or by the inability to initialize new nodes.
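
One way to keep writes at a moderate level is to throttle them in the application itself. The following is a minimal sketch of a client-side byte-rate limiter, not an Aiven feature or a Redis API: `WriteThrottle` is a hypothetical helper, and the `clock` and `sleep` parameters exist only so that the class can be tested without real waiting.

```python
import time


class WriteThrottle:
    """Limits the average number of bytes written per second (token bucket)."""

    def __init__(self, max_bytes_per_second, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(max_bytes_per_second)
        self.clock = clock
        self.sleep = sleep
        self.allowance = self.rate  # allow at most one second of burst
        self.last = clock()

    def wait_for(self, num_bytes):
        """Block until num_bytes may be written without exceeding the rate."""
        now = self.clock()
        # Refill the allowance for the time elapsed, capped at one second's worth.
        self.allowance = min(self.rate, self.allowance + (now - self.last) * self.rate)
        self.last = now
        if num_bytes > self.allowance:
            # Not enough allowance: wait until the shortfall has been refilled.
            delay = (num_bytes - self.allowance) / self.rate
            self.sleep(delay)
            self.last = now + delay
            self.allowance = 0.0
        else:
            self.allowance -= num_bytes
```

An application would call, for example, `throttle.wait_for(len(value))` before each write, with `throttle = WriteThrottle(5_000_000)` matching the 5 megabytes per second figure mentioned earlier. The limiter only covers writes made through the process that owns it, so several writers would each need a share of the overall budget.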

If you have a use case where you need to constantly write large amounts of data, you can contact Aiven support to discuss options for fine-tuning your service configuration.
