Aiven PostgreSQL business and premium plans include one or two standby read-replica servers, respectively. Read-replica servers can be queried, but they accept no writes. If the master server fails, a standby replica server is automatically promoted to master. This differs from the read-replica services that can be created manually for startup, business and premium services: a manually created read-replica service is not promoted if the master server fails.
For business and premium plans, failovers and switchovers occur in two distinct cases:
- Unexpected master/replica leaving (for example, hardware hosting the virtual machine fails)
- Controlled switchover during rolling-forward upgrades
Uncontrolled master/replica leaving
When a server disappears unexpectedly, there is no way to know whether the server is really gone or whether there is only a temporary glitch in the cloud provider's network.
For a replica, there is a 300 second timeout before the Aiven management platform automatically decides that the server is gone and spins up a new one. During this 300 second period, replica.servicename.aivencloud.com points to a server that may no longer serve queries. The DNS record pointing to the master, servicename.aivencloud.com, keeps working. If the replica server does not come back up within the 300 second period, replica.servicename.aivencloud.com is pointed to the master server until a new replica server has been built.
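Because replica.servicename.aivencloud.com can temporarily point to the master, a client that must not put read load on the master may want to check which role the server it reached actually has. The following is a minimal sketch, not part of the Aiven platform: the helper name connected_to_replica is hypothetical, and it assumes a Python DB-API 2.0 cursor (for example one obtained from psycopg2). It relies only on the built-in PostgreSQL function pg_is_in_recovery().

```python
def connected_to_replica(cursor) -> bool:
    """Return True if the server behind this cursor is a read-only standby.

    `cursor` is assumed to be a DB-API 2.0 cursor on an open PostgreSQL
    connection; the helper name is illustrative only.
    """
    # pg_is_in_recovery() is a built-in PostgreSQL function: it returns
    # true on a standby that is replaying WAL and false on the master.
    cursor.execute("SELECT pg_is_in_recovery()")
    row = cursor.fetchone()
    return bool(row[0])
```

A client could call this right after connecting through the replica DNS name and fall back to cached data, or simply log a warning, when it returns False.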
If the master disappears, a replica server waits for 60 seconds before promoting itself to master. During this 60 second timeout the master is unavailable (i.e., servicename.aivencloud.com does not respond), while replica.servicename.aivencloud.com works fine (in read-only mode). After the replica server promotes itself to master, servicename.aivencloud.com points to the new master server, and replica.servicename.aivencloud.com does not change (i.e., it continues to point to the same server, which is now the new master). A new replica server is built automatically, and once it is in sync, replica.servicename.aivencloud.com is pointed to the new replica server.
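Since the master can be unreachable for about 60 seconds during an uncontrolled failover, applications usually need to retry failed connections or queries for somewhat longer than that. The sketch below shows one way to do this with capped exponential backoff. The function name call_with_retry and all of its defaults are assumptions made for illustration; in particular, the 90 second budget is simply a value chosen to exceed the 60 second promotion timeout, not an Aiven recommendation.

```python
import time


def call_with_retry(operation, max_wait_seconds=90.0, initial_delay=1.0,
                    max_delay=10.0, retry_on=(Exception,),
                    sleep=time.sleep, clock=time.monotonic):
    """Call `operation()` until it succeeds or the time budget is used up.

    `operation` is any zero-argument callable, for example a function that
    opens a new connection to servicename.aivencloud.com and runs a query.
    `sleep` and `clock` are injectable so the logic can be tested without
    actually waiting.
    """
    deadline = clock() + max_wait_seconds
    delay = initial_delay
    while True:
        try:
            return operation()
        except retry_on:
            # Give up if the next wait would take us past the deadline;
            # the bare raise re-raises the last error from operation().
            if clock() + delay > deadline:
                raise
            sleep(delay)
            # Double the wait after each failure, up to max_delay.
            delay = min(delay * 2, max_delay)
```

In practice retry_on would be narrowed to the connection errors raised by the database driver in use, and operation should open a fresh connection on every attempt so that the updated DNS record is picked up.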
Controlled switchover during upgrades
When applying upgrades (or plan changes) for business or premium plans, we first replace the standby server(s) and then the master:
- A new server is started up, a backup is restored, and the new server starts following the old master server. After the new server is up and running, replica.servicename.aivencloud.com is changed to point to it, and the old replica server is deleted. For premium plans this step is executed for both replica servers before the master server is replaced.
- Another server is started up, a backup is restored, and the new server is synced up to the old master server. After this is done, replication is changed to quorum commit synchronous replication where available (lower performance, but stronger guarantees against data loss when changing the master server). At this point there is one extra server running: the cluster consists of the old master server and two or three new replica servers (for business and premium plans, respectively).
- When it is time to switch the master to a new server, the old master is terminated (synchronous replication guarantees that the data has been received by at least one of the new replica servers), and one of the new replica servers is immediately promoted to master. At this point servicename.aivencloud.com is updated to point to the new master server. Similarly, the new master is removed from the replica.servicename.aivencloud.com record.
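The order of operations in the steps above can be sketched as a small simulation. This is only an illustrative model, not Aiven's implementation: the Cluster class, the controlled_switchover function and the server names are hypothetical, the extra server is assumed to be listed in the replica record once it is in sync, and promoting the first new replica is an arbitrary choice (the text only says that one of the new replicas is promoted).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    master: str                                         # target of servicename.aivencloud.com
    replicas: List[str] = field(default_factory=list)   # targets of replica.servicename.aivencloud.com
    log: List[str] = field(default_factory=list)


def controlled_switchover(cluster: Cluster, new_servers: List[str]) -> Cluster:
    """Replace every server in `cluster`, standbys first and the master last."""
    old_replicas = list(cluster.replicas)
    if len(new_servers) != len(old_replicas) + 1:
        raise ValueError("need one new server per old replica plus one for the master")

    # Step 1: replace each old standby with a new server restored from backup.
    for old, new in zip(old_replicas, new_servers):
        cluster.replicas.append(new)
        cluster.replicas.remove(old)
        cluster.log.append(f"replaced replica {old} with {new}")

    # Step 2: start one extra server; the old master is still running.
    extra = new_servers[len(old_replicas)]
    cluster.replicas.append(extra)
    cluster.log.append(f"added extra replica {extra}, switched to synchronous replication")

    # Step 3: terminate the old master and promote one new replica
    # (picking the first one is an arbitrary choice in this sketch).
    promoted = cluster.replicas.pop(0)
    cluster.log.append(f"terminated {cluster.master}, promoted {promoted}")
    cluster.master = promoted
    return cluster
```

For a business plan modelled as Cluster(master="old-m", replicas=["old-r1"]), calling controlled_switchover with two new servers leaves one new master and one new replica, which matches the server counts described above.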