For a high-level overview of the high availability features, please refer to our article on service plans and availability.
Aiven PostgreSQL availability features are defined by the service plan level being used:
- Hobbyist and Startup plans: These are single-node plans with limited availability and a two-day real-time backup history.
- Business plans: These are two-node plans (master + standby) with higher availability and a 14-day backup history.
- Premium plans: These are three-node plans (master + standby + standby) with even higher availability and a 30-day backup history.
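The plan tiers above can be summarized as a small lookup table. The following is a minimal Python sketch for illustration only: the PlanTier class, its field names, and the survives_master_loss helper are hypothetical and not part of any Aiven API. The values are copied from the plan list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanTier:
    """Hypothetical summary of one plan tier (not an Aiven API object)."""
    nodes: int        # total PostgreSQL nodes in the service
    standbys: int     # nodes that can take over if the master fails
    backup_days: int  # length of the backup history in days


# Values taken from the plan list above.
PLAN_TIERS = {
    "hobbyist": PlanTier(nodes=1, standbys=0, backup_days=2),
    "startup": PlanTier(nodes=1, standbys=0, backup_days=2),
    "business": PlanTier(nodes=2, standbys=1, backup_days=14),
    "premium": PlanTier(nodes=3, standbys=2, backup_days=30),
}


def survives_master_loss(plan: str) -> bool:
    """Return True if the plan has a standby that can be promoted."""
    return PLAN_TIERS[plan].standbys > 0
```

For example, survives_master_loss("startup") is False because a single-node plan has no standby to promote, while the same call for "business" or "premium" is True.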
Minor failures, such as service process crashes or a temporary loss of network access, are handled automatically in all plans without any major changes to the service deployment. The service resumes normal operation once the crashed process has been restarted or network access has been restored.
However, more severe failure modes, such as losing a single node entirely, require more drastic recovery measures. An entire node (virtual machine) could be lost, for example, due to a hardware failure or a sufficiently severe software failure.
A failing node is automatically detected by the Aiven monitoring infrastructure: either the node's own self-diagnostics start reporting problems, or the node stops communicating entirely. When this happens, the monitoring infrastructure automatically schedules the creation of a replacement node.
Note that in the case of a database failover, the Service URL of your service remains the same. Only the IP address changes to point at the new master node.
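Because the Service URL stays the same while the underlying IP address changes, a client typically only needs to reconnect using the same URL after a failover. A minimal Python sketch of such a retry loop follows. It assumes the application supplies its own connect callable (for example, a wrapper around its PostgreSQL driver that opens a connection from the Service URL). The function name connect_with_retry and all of its parameters are illustrative and not part of any Aiven or PostgreSQL library.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    initial_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or the attempts run out.

    connect is assumed to open a fresh connection from the unchanged
    Service URL on every call, so the hostname is resolved again and
    can pick up the new master's IP address.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on:
            # Give up on the final attempt; otherwise back off and retry.
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= 2  # exponential backoff between attempts
    raise AssertionError("unreachable")
```

Database drivers usually raise their own exception types on connection failure, so in practice the driver's connection error class would be passed in retry_on. Long-lived cached DNS results or hard-coded IP addresses would defeat this approach, since the point is to resolve the Service URL again on each attempt.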
Single-node Hobbyist and Startup service plans
Losing the only node from the service immediately starts the automatic process of creating a new replacement node. The new node starts up, restores its state from the latest available backup and resumes serving customers.
Since there was just a single node providing the service, the service will be unavailable for the duration of the restore operation. In addition, any writes made since the latest Write Ahead Log (WAL) file was backed up will be lost. Typically this window is limited to either five minutes or one WAL file.
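To get a feel for the size of that window, the following Python sketch estimates an upper bound on how many seconds of writes are at risk. It rests on two assumptions that the text above does not confirm: that the limit is "five minutes or one WAL file, whichever comes first", and that the WAL file size is PostgreSQL's default of 16 MiB. The function name and parameter are hypothetical.

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # PostgreSQL default WAL segment size (assumed)
MAX_WINDOW_SECONDS = 5 * 60           # five-minute limit from the text above


def worst_case_loss_window(write_rate_bytes_per_s: float) -> float:
    """Estimate an upper bound, in seconds, on the writes at risk.

    Assumes a WAL file is backed up when it fills up or when the time
    limit is reached, whichever happens first.
    """
    if write_rate_bytes_per_s <= 0:
        # A WAL file is never filled, so only the time limit applies.
        return float(MAX_WINDOW_SECONDS)
    seconds_to_fill_segment = WAL_SEGMENT_BYTES / write_rate_bytes_per_s
    return min(seconds_to_fill_segment, float(MAX_WINDOW_SECONDS))
```

Under these assumptions, a service generating 1 MiB of WAL per second fills a file in 16 seconds, so the window is bounded by the WAL file size. A mostly idle service is bounded by the five-minute limit instead.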
Highly Available Business and Premium service plans
When the failed node is a PostgreSQL Standby, the Master node keeps running normally and provides the normal service level to client applications. Once the new replacement Standby node is ready and synchronized with the Master, it starts replicating the Master in real time and the situation returns to normal.
When the failed node is a PostgreSQL Master, the combined information from the Aiven monitoring infrastructure and the Standby node is used to make a failover decision. On the nodes themselves we use the open source monitoring daemon PGLookout together with the information from the Aiven system infrastructure. If it looks like the Master node is gone for good, the Standby node promotes itself to be the new Master node and immediately starts serving clients. A new replacement node is automatically scheduled and becomes the new Standby node, as described in the Standby node failure case above.
If both the Master and Standby nodes fail at the same time, two new nodes are automatically scheduled for creation and become the new Master and Standby nodes respectively. The Master node restores itself from the latest available backup, which means that some data loss can occur: any writes made since the latest Write Ahead Log (WAL) file was backed up will be lost. Typically this time window is limited to either five minutes or one WAL file.
The amount of time it takes to replace a failed node depends mainly on the cloud region in use and the amount of data that needs to be restored. However, in services on two-node Business plans, the surviving node keeps serving clients while the other node is being recreated. All of this is automatic and requires no administrator intervention.
Premium plans operate in much the same way as our Business plans. The main difference appears when the Master or one of the Standby nodes fails. Because Premium plans have an additional redundant Standby node, availability is maintained even if two of the nodes are lost. When the Master node fails, PGLookout determines which of the Standby nodes is furthest along in replication (and so has the least potential for data loss) and performs a controlled failover to that node.
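The idea of choosing the Standby that is furthest along in replication can be sketched in Python as follows. This is an illustration only and is not PGLookout's actual code: the function names and the input format (a mapping from a standby name to its last replayed WAL position) are assumptions. The only PostgreSQL fact relied on is that a WAL position (LSN) is written as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit value.

```python
from typing import Mapping


def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def pick_failover_target(standby_lsns: Mapping[str, str]) -> str:
    """Return the name of the standby with the highest replayed LSN.

    standby_lsns maps a (hypothetical) standby name to the last WAL
    position that standby has replayed.
    """
    if not standby_lsns:
        raise ValueError("no standby available for promotion")
    return max(standby_lsns, key=lambda name: parse_lsn(standby_lsns[name]))
```

The positions are compared as integers rather than as strings because a plain string comparison would rank "9/0" above "10/0", even though the second position is further along.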
The additional redundant Standby further reduces the risk of downtime for applications that can never be down. Premium plans also come with a much longer backup history, allowing you to restore your data to any point up to a month in the past.
For backups and restoration, Aiven uses the popular open source backup daemon PGHoard, which Aiven maintains. It makes real-time copies of Write Ahead Log (WAL) files to an object store in a compressed and encrypted format.