Aiven PostgreSQL business and premium plans include one or two standby read-replica servers, respectively. Read-replica servers can be queried, but they do not accept writes. If the master server fails, a standby replica server is automatically promoted to master. This differs from read-replica services, which can be created manually for startup, business and premium services: a manually created read-replica service is not promoted if the master server fails.

For business and premium plans, there are two distinct cases in which a failover or switchover occurs:

  1. Unexpected master/replica leaving (for example, hardware hosting the virtual machine fails)
  2. Controlled switchover during rolling-forward upgrades

Uncontrolled master/replica leaving

When a server disappears unexpectedly, there is no way to know whether it is really gone or whether the cloud provider's network is having a temporary glitch.

For a replica, there is a 300-second timeout before the Aiven management platform decides that the server is gone and spins up a new one. During this 300-second period, the replica DNS record points to a server that may no longer serve queries. The DNS record pointing to the master keeps working normally. If the replica server does not come back within the 300 seconds, the replica DNS record is pointed to the master server until a new replica server has been built.
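Because the replica address can point at an unreachable server for up to 300 seconds, a read-only client may want to fall back to the master address in the meantime. The following is a minimal sketch of that idea, not an Aiven-provided API: `read_with_fallback` and its two callables are hypothetical names, and the sketch assumes each callable opens a connection with a connect timeout, so that an unreachable server raises an error instead of hanging.

```python
def read_with_fallback(run_on_replica, run_on_master, retry_on=(ConnectionError,)):
    """Run a read on the replica; if it is unreachable, run it on the master.

    Both arguments are hypothetical zero-argument callables that open a
    connection, execute the query and return its rows. `retry_on` lists the
    exception types treated as "server unreachable" (an assumption; use the
    connection error type of your database driver).
    """
    try:
        return "replica", run_on_replica()
    except retry_on:
        # The replica address may still point at a server that is gone,
        # so send this read to the master instead.
        return "master", run_on_master()
```

The returned label tells the caller which server answered, which can be useful for logging how often the fallback is taken.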

If the master disappears, a replica server waits for 60 seconds before promoting itself to master. During this 60-second timeout, the master DNS record is unavailable (that is, the server behind it does not respond), while the replica DNS record keeps working in read-only mode. After the replica server has promoted itself to master, the master DNS record points to the new master server. The replica DNS record does not change, so it also continues to point at the new master server. A new replica server is built automatically, and once it is in sync, the replica DNS record is pointed to the new replica server.
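Since the master can be unreachable for about 60 seconds during an uncontrolled failover, client code that writes to the database generally needs to retry connecting for at least that long. The sketch below illustrates one way to do this. `connect_with_retry` is a hypothetical helper, the 90-second deadline and 2-second delay are assumed values rather than Aiven recommendations, and `connect` stands for whatever zero-argument function opens a connection in your driver.

```python
import time


def connect_with_retry(connect, deadline_s=90.0, delay_s=2.0,
                       retry_on=(ConnectionError,),
                       sleep=time.sleep, clock=time.monotonic):
    """Call connect() until it succeeds or deadline_s would be exceeded.

    retry_on lists the exception types treated as transient (an assumption;
    with psycopg2 this would typically be psycopg2.OperationalError).
    sleep and clock are injectable so the helper can be tested without waiting.
    """
    start = clock()
    while True:
        try:
            return connect()
        except retry_on as exc:
            # Give up if another wait would take us past the deadline.
            if clock() - start + delay_s > deadline_s:
                raise TimeoutError("server did not come back in time") from exc
            sleep(delay_s)
```

A deadline somewhat longer than the 60-second promotion timeout leaves room for the DNS change to reach the client.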

Controlled switchover during upgrades

When applying maintenance updates, cloud migrations or plan changes for business or premium plans (major version upgrades are documented separately), the servers are replaced in the following order, starting with the standby server(s):

  1. A new server is started, a backup is restored to it, and the new server starts following the old master server. Once the new server is up and running, the replica DNS record is changed to point to it, and the old replica server is deleted. For premium plans, this step is executed for both replica servers before the master server is replaced.
  2. Another server is started, a backup is restored to it, and the new server is synced up with the old master server. After this, replication is changed to quorum commit synchronous replication where available (lower performance, but stronger guarantees against data loss when changing the master server). At this point there is one extra server running: the old master server plus two or three new replica servers (for business and premium plans, respectively).
  3. When it is time to switch the master to a new server, the old master is scheduled to be terminated (synchronous replication guarantees that the data has been received by at least one of the new replica servers), and one of the new replica servers is immediately promoted to master. At this point, the master DNS record is updated to point at the new master server. Similarly, the new master is removed from the replica DNS record. The old master is kept for a period of time and sets up TCP forwarding to its replacement, so that clients can still connect before they learn the new IP address.
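After a failover or switchover, a client may want to confirm whether the server it is connected to is currently a standby or has become the master. PostgreSQL's built-in `pg_is_in_recovery()` function returns true on a standby and false on a master. The sketch below wraps that query. `is_replica` is a hypothetical helper name, and `conn` is assumed to be any Python DB-API connection to the service (for example one opened with psycopg2).

```python
def is_replica(conn):
    """Return True if the server behind conn is a read-only standby.

    conn is assumed to be a DB-API 2.0 connection to a PostgreSQL server.
    """
    cur = conn.cursor()
    try:
        # pg_is_in_recovery() is true while the server is replaying WAL
        # as a standby, and false once it has been promoted to master.
        cur.execute("SELECT pg_is_in_recovery()")
        (in_recovery,) = cur.fetchone()
        return bool(in_recovery)
    finally:
        cur.close()
```

A client that needs to write can call this after reconnecting and reconnect again if it still reports a standby.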
