An Aiven PostgreSQL server side failover, i.e. replacing the old acting master server with a new one, could happen for multiple reasons:

  • The acting master server has failed (hardware, software, disk system failure, etc.)
  • Service has been migrated to a different size service plan
  • A maintenance update has been performed
  • Service has been migrated to a different cloud or region

The failover moment will appear to a client application as losing its existing database connection. Depending on the reason, this could appear as a clean termination (PostgreSQL server terminating the connection over its line protocol), a socket level connection termination or as a completely silent loss of connection that might appear as a socket send or receive timeout.

Most client library implementations will raise an error or exception to the caller when the connection is lost. Many applications will respond to this simply by raising the error further in the call stack until it gets, for example, returned all the way to the end customer as an HTTP error code (e.g. Internal Server Error 500).

Handling database connection reconnections behind the scenes, without immediately throwing the error all the way to the user requires some application logic around the error handling and here we present a general approach that is suitable for many use cases.

Disconnection resilient database connection handling

The customer visible problem happens when a client initiated transaction is aborted due to query failure or when the client connection gets disconnected for any reason. This causes an error (exception) on the client side and if the client/app side implementation passes the error directly thru the call stack, a typical result is an HTTP error code returned to the customer initiated operation.

A transparent, fault tolerant method of handling this kind of scenarios does require support on the client application side so that transactions are handled like so (Pythonish pseudo code):

def some_high_level_db_operation_abc(data):
  return actual_query("SELECT foo, bar FROM update_abc(%s)", data)

def actual_query(query, args):
  retries_remaining_count = 5
  while True:
    try:
      db.begin()
      results = db.execute_query(query, args)
      db.commit()
      return results  # Success!
    except ConnectionLostError:
      # Transaction was not committed due to loss of DB connection and was   rolled back
      if retries_remaining <= 0:
        raise UnretriableError(...)
      retries_remaining--
      db.reconnect()  # Could happen automatically as well
      continue  # Try same query again
    except:
      # Some error for which we cannot automatically retry the query
      raise UnretriableError(...)

Note that the aborted client connection error will always eventually happen at some point, if not due to maintenance updates or service migrations, then there will eventually be a networking or server failure at some point down the road.

Did this answer your question?