An Aiven PostgreSQL server side failover, i.e. replacing the old acting master server with a new one, could happen for multiple reasons:
- The acting master server has failed (hardware, software, disk system failure, etc.)
- Service has been migrated to a different size service plan
- A maintenance update has been performed
- Service has been migrated to a different cloud or region
The failover moment will appear to a client application as losing its existing database connection. Depending on the reason, this could appear as a clean termination (PostgreSQL server terminating the connection over its line protocol), a socket level connection termination or as a completely silent loss of connection that might appear as a socket send or receive timeout.
Most client library implementations will raise an error or exception to the caller when the connection is lost. Many applications will respond to this simply by raising the error further in the call stack until it gets, for example, returned all the way to the end customer as an HTTP error code (e.g. Internal Server Error 500).
Handling database connection reconnections behind the scenes, without immediately throwing the error all the way to the user requires some application logic around the error handling and here we present a general approach that is suitable for many use cases.
Disconnection resilient database connection handling
The customer visible problem happens when a client initiated transaction is aborted due to query failure or when the client connection gets disconnected for any reason. This causes an error (exception) on the client side and if the client/app side implementation passes the error directly thru the call stack, a typical result is an HTTP error code returned to the customer initiated operation.
A transparent, fault tolerant method of handling this kind of scenarios does require support on the client application side so that transactions are handled like so (Pythonish pseudo code):
return actual_query("SELECT foo, bar FROM update_abc(%s)", data)
def actual_query(query, args):
retries_remaining_count = 5
results = db.execute_query(query, args)
return results # Success!
# Transaction was not committed due to loss of DB connection and was rolled back
if retries_remaining <= 0:
db.reconnect() # Could happen automatically as well
continue # Try same query again
# Some error for which we cannot automatically retry the query
Note that the aborted client connection error will always eventually happen at some point, if not due to maintenance updates or service migrations, then there will eventually be a networking or server failure at some point down the road.