This article explains in detail what happens when you upgrade an Aiven for Apache Kafka cluster. The term "upgrade" here covers maintenance updates, plan changes, cloud region migrations, and manual node replacements performed by an Aiven operator. All of these operations involve creating new broker nodes to replace existing ones.

Upgrade operation procedure

  • New Kafka nodes are started alongside your existing nodes

  • Once the new nodes have been set up, they join the Kafka cluster (which, for a while, contains a mixture of old and new nodes)

  • Aiven's management code coordinates streaming data from the old nodes to the new ones and moves partition leadership to the new nodes. This is CPU intensive: each partition adds coordination overhead on top of the cluster's normal load

  • As the process continues, old nodes are retired from the cluster once they no longer hold any data. Depending on the cluster size, more new nodes may be added (by default, up to 6 nodes may be replaced at a time)

  • The process is complete once the last old node has been removed from the cluster (a sketch for observing this from the client side follows this list)
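As a rough illustration only (this is not Aiven tooling), the sketch below assumes the confluent-kafka Python client and placeholder connection details; it polls cluster metadata so you can watch broker IDs appear and disappear and partition leadership move while nodes are replaced:

```python
# Sketch: poll cluster metadata during an upgrade to watch brokers join/leave
# and partition leadership move. Assumes the confluent-kafka Python client;
# the service URI and certificate paths are placeholders.
import time

from confluent_kafka.admin import AdminClient

admin = AdminClient({
    "bootstrap.servers": "my-kafka-myproject.aivencloud.com:12345",  # placeholder URI
    "security.protocol": "SSL",
    "ssl.ca.location": "ca.pem",
    "ssl.certificate.location": "service.cert",
    "ssl.key.location": "service.key",
})

while True:
    md = admin.list_topics(timeout=10)          # ClusterMetadata: brokers and topics
    print("broker IDs:", sorted(md.brokers))    # this set changes as nodes are replaced
    for name, topic in md.topics.items():
        leaders = {p.id: p.leader for p in topic.partitions.values()}
        print(name, "partition leaders:", leaders)
    time.sleep(30)
```

Aiven's management code performs the reassignment itself; the sketch is only one way to observe it from the outside.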

Console screenshots for reference

The metrics for CPU usage, disk usage and memory spike while the nodes are being replaced.

No downtime during the upgrade

Due to the process described above, there are always active nodes in the cluster, and the same service URI resolves to all of the active nodes. Because the upgrade generates extra load while partitions are transferred, it can slow down or stall normal work if the cluster is already under heavy load. (More on mitigations for this below.)

On the client side, there may be warnings about a partition leader not being found as leadership moves between brokers. This is normal: most client libraries handle it automatically, but the warnings can look alarming in the logs.

https://help.aiven.io/en/articles/4898964-kafka-not_leader_for_partition-error
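As a hedged illustration (assuming the confluent-kafka Python client; the service URI and topic name are placeholders), this is the kind of producer configuration that lets a client ride out leadership changes by retrying transparently:

```python
# Sketch: a producer configured to retry through transient
# "not leader for partition" errors while leadership moves between brokers.
# Assumes the confluent-kafka Python client; the URI and topic are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "my-kafka-myproject.aivencloud.com:12345",  # placeholder URI
    "retries": 10,               # retry transient errors such as leader changes
    "retry.backoff.ms": 500,     # wait between retries while metadata refreshes
    "enable.idempotence": True,  # retries cannot introduce duplicates
})

def delivery_report(err, msg):
    # Only permanent failures reach this point; leader-change retries are transparent.
    if err is not None:
        print(f"delivery failed for {msg.topic()}[{msg.partition()}]: {err}")

producer.produce("example-topic", b"payload", callback=delivery_report)
producer.flush()
```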

Note: Failures can occur during the upgrade process that are beyond your control, such as running out of capacity in a cloud region, or hardware and software errors. Rest assured that your Kafka clusters are constantly monitored and our on-call operators are ready to step in and resolve these issues when they arise.

Duration of the upgrade

It is difficult to predict the time taken for an upgrade because it depends heavily on:

  • The amount of data stored in the cluster

  • The number of partitions (each has an overhead)

  • The spare resources available on the cluster (which in turn depends on the load from connected producers and consumers)

To make the upgrade as quick as possible, we recommend running it at a time of low traffic (to reduce the load from producers and consumers). If a service is already tightly constrained on resources, we recommend disabling non-essential use of the service during the upgrade so that more resources can go to coordinating and streaming data between nodes.
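For illustration only, here is a minimal sketch of pausing a non-essential consumer for the upgrade window; it assumes the confluent-kafka Python client, and the service URI, group ID, topic and pause duration are all placeholders:

```python
# Sketch: pause a non-essential consumer during the upgrade window so its
# fetch load does not compete with partition reassignment. Assumes the
# confluent-kafka Python client; names and durations are placeholders.
import time

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "my-kafka-myproject.aivencloud.com:12345",  # placeholder URI
    "group.id": "non-essential-batch-job",                           # placeholder group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["example-topic"])
consumer.poll(5.0)                       # trigger the rebalance so partitions get assigned

assignment = consumer.assignment()
consumer.pause(assignment)               # stop fetching, but stay in the consumer group

pause_until = time.time() + 60 * 60      # placeholder upgrade window
while time.time() < pause_until:
    consumer.poll(1.0)                   # keep polling so group membership is not lost

consumer.resume(consumer.assignment())   # continue from the committed offsets
```

Simply stopping the non-essential application for the duration of the upgrade works just as well; pausing is shown only because it keeps group membership and committed offsets in place.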

Rolling back

There is no rollback in the sense of returning the old nodes to use, since they are deleted once they have been removed from the cluster. Nodes are never removed while they still hold data, so even if an upgrade cannot progress we will not remove nodes in a way that would lead to data loss. If you find the upgraded plan is too large, you can achieve the same effect as a rollback by downgrading the plan, which starts a new cycle of node replacements.

Impacts and risks

Additional CPU load is generated by partition leadership coordination and by streaming data to the new nodes. This is best mitigated by running the upgrade at a time of low traffic and/or by reducing the normal workload on the cluster, for example by disabling non-essential producers and consumers.

Another risk is disk usage reaching its maximum during the upgrade, which can prevent the upgrade from progressing. The best mitigation is to upgrade the plan before disk usage gets too high, but in an emergency our operations team can work around this by adding additional volumes to the old nodes.
