For a high-level overview of the High Availability features, please refer to our article on service plans and availability.
Aiven Redis availability features are defined by the service plan level being used:
- Hobbyist and Startup plans: These are single-node plans with limited availability and durability when faults occur.
- Business plans: These are two-node plans (master + standby).
Minor failures, such as a service process crash or a temporary loss of network access, are handled automatically in all plans without any major changes to the service deployment. The service resumes normal operation once the crashed process has been restarted or network access has been restored.
More severe failures, such as losing a node entirely, require more drastic recovery measures. An entire node (virtual machine) can be lost due to, for example, a hardware failure or a sufficiently severe software failure.
The Aiven monitoring infrastructure detects a failing node automatically: either the node's self-diagnostics start reporting problems, or the node stops communicating entirely. When this happens, the monitoring infrastructure automatically schedules the creation of a replacement node.
Note that in case of a database failover, the Service URL of your service remains the same; only the IP address changes to point to the new master node.
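Because the hostname in the Service URL stays fixed while the IP address behind it changes, a client that opens a fresh connection by hostname can reach the new master after a failover. The following is a minimal sketch of that retry pattern using only the Python standard library. The name `retry_with_backoff` and all of its parameters are hypothetical and chosen for illustration; `operation` stands in for whatever call your Redis client library uses to connect with the Service URL and run a command.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retriable: Tuple[Type[BaseException], ...] = (ConnectionError, OSError),
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation() until it succeeds or the attempts run out.

    The operation should connect by hostname on every call, so that a
    changed IP address is picked up after a failover. All defaults here
    are illustrative assumptions, not recommended values.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retriable:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the helper can be exercised without real waiting. How long a failover takes to become visible to clients depends on factors such as DNS caching, so the attempt count and delays would need tuning for a real application.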
Single-node Hobbyist and Startup service plans
Losing the only node of the service immediately starts the automatic process of creating a replacement node. The new node starts up, restores its state from the latest available backup, and resumes serving clients.
Because a single node was providing the service, the service is unavailable for the duration of the restore operation. In addition, any writes made since the last backup are lost.
Highly Available Business service plans
When the failed node is a Redis Standby, the Master node keeps running normally and provides the normal service level to client applications. Once the replacement Standby node is ready and synchronized with the Master, it starts replicating the Master in real time and the situation returns to normal.
When the failed node is a Redis Master, the combined information from the Aiven monitoring infrastructure and the Standby node is used to make a failover decision. If the Master node appears to be gone for good, the Standby node promotes itself to Master and immediately starts serving clients. A replacement node is automatically scheduled and becomes the new Standby node, as described in the Standby node failure case above.
If both the Master and Standby nodes fail at the same time, two new nodes are automatically scheduled for creation and become the new Master and Standby nodes respectively. The Master node restores itself from the latest available backup, which means that some data loss is possible: all writes to the database since the last backup will be lost.
The time it takes to replace a failed node depends mainly on the cloud region in use and the amount of data that needs to be restored. However, with two-node Business plans, the surviving node keeps serving clients while the other node is recreated. All of this is automatic and requires no administrator intervention.
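The failure cases described for both plan types can be summarised as a small decision table. The sketch below is a toy model of the behaviour described in this article, not Aiven's actual implementation. The names `Recovery`, `plan_recovery`, `MASTER`, and `STANDBY` are hypothetical, and the action strings are paraphrases of the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

MASTER = "master"
STANDBY = "standby"


@dataclass(frozen=True)
class Recovery:
    service_available: bool     # are clients served while nodes are replaced?
    restores_from_backup: bool  # True means writes since the last backup are lost
    actions: Tuple[str, ...]


def plan_recovery(node_count: int, failed: FrozenSet[str]) -> Recovery:
    """Return the recovery steps for a set of failed node roles."""
    if node_count not in (1, 2):
        raise ValueError("this sketch only models one- and two-node plans")
    roles = {MASTER} if node_count == 1 else {MASTER, STANDBY}
    if not failed <= roles:
        raise ValueError(f"unknown roles for this plan: {sorted(failed - roles)}")

    if not failed:
        return Recovery(True, False, ())

    if failed == roles:
        # Every node is gone: the service is down until the backup is restored.
        actions = ["create replacement node(s)", "restore master from latest backup"]
        if node_count == 2:
            actions.append("synchronize new standby with master")
        return Recovery(False, True, tuple(actions))

    if failed == {STANDBY}:
        # The master keeps serving clients; only the standby is rebuilt.
        return Recovery(True, False, (
            "create replacement standby",
            "synchronize new standby with master",
        ))

    # Remaining case: the master of a two-node plan failed.
    return Recovery(True, False, (
        "promote standby to master",
        "create replacement standby",
        "synchronize new standby with master",
    ))
```

For example, `plan_recovery(2, frozenset({MASTER}))` yields a plan in which the service stays available and no backup restore is needed, whereas `plan_recovery(1, frozenset({MASTER}))` yields one in which the service is unavailable and writes since the last backup are lost.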