When nodes are replaced in your Aiven for Kafka cluster, you may find a flood of NOT_LEADER_FOR_PARTITION errors in your producer logs.

The exact error message depends on your client library and log formatting, but it looks something like this:

[2021-02-04 09:01:20,118] WARN [Producer clientId=test-producer] Received invalid metadata error in produce request on partition topic1-25 due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)

Although a sudden rush of these log lines is intimidating, this is expected behaviour during the failover from old nodes to new ones.

Each producer keeps a metadata cache that records which broker is the leader of each partition. When your code produces a message, the producer sends it to the broker that its metadata cache lists as the partition leader.
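To make the role of the cache concrete, here is a minimal toy model in Java. It is a sketch for illustration only, not the Kafka client's real implementation: the class `MetadataCacheSketch`, its fields, and the broker ids are all hypothetical, and a broker "accepting" a message is reduced to a simple lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model for illustration; the real Kafka client is far more involved.
class MetadataCacheSketch {
    // The cluster's actual state: partition -> id of the broker currently leading it.
    final Map<String, Integer> actualLeaders = new HashMap<>();
    // The producer's cached view, which goes stale when leadership moves.
    final Map<String, Integer> cachedLeaders = new HashMap<>();
    // Number of "not leader" warnings this producer has logged.
    int warnings = 0;

    // Copy the cluster's current leadership into the producer's cache.
    void refreshMetadata() {
        cachedLeaders.clear();
        cachedLeaders.putAll(actualLeaders);
    }

    // Send to the cached leader. If that broker no longer leads the partition,
    // log a warning, refresh the cache, and retry.
    // Returns the id of the broker that finally accepted the message.
    int send(String partition) {
        while (true) {
            int target = cachedLeaders.get(partition);
            if (target == actualLeaders.get(partition)) {
                return target;
            }
            warnings++;
            System.out.println("WARN broker " + target + " is not the leader for "
                    + partition + ", refreshing metadata");
            refreshMetadata();
        }
    }

    public static void main(String[] args) {
        MetadataCacheSketch producer = new MetadataCacheSketch();
        producer.actualLeaders.put("topic1-25", 1);
        producer.refreshMetadata();

        producer.send("topic1-25");                 // cache is fresh: no warning

        producer.actualLeaders.put("topic1-25", 4); // node replaced, leadership moves
        int broker = producer.send("topic1-25");    // one warning, then success

        System.out.println("Delivered via broker " + broker
                + " after " + producer.warnings + " warning(s)");
    }
}
```

In this model the message is not lost when the cache is stale. The producer logs a warning, refreshes its view, and sends again.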

When nodes are replaced, partition leaders are expected to change. The leader of every partition changes at least once, and possibly more often depending on the node count and on how many nodes are replaced at a time. If a service has 3000 active partitions, expect at least one of these warnings per partition in each producer.
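As a rough back-of-the-envelope check, the lower bound on warning lines can be computed as below. This is only a sketch: the class `WarningEstimate` and the figure of 10 producers are hypothetical, and it assumes every producer writes to every partition.

```java
// Rough lower bound on warning lines during a node replacement.
class WarningEstimate {
    // Each producer logs at least one warning per partition per leader change.
    static long minWarnings(long partitions, long producers, long leaderChangesPerPartition) {
        return partitions * producers * leaderChangesPerPartition;
    }

    public static void main(String[] args) {
        long partitions = 3000; // from the example above
        long producers = 10;    // hypothetical number of producer instances
        long leaderChanges = 1; // minimum: every leader moves at least once
        System.out.println("At least " + minWarnings(partitions, producers, leaderChanges)
                + " warning lines across all producers");
    }
}
```

With these assumed numbers the lower bound is 30000 lines, which is why the logs can look alarming even when everything is working as intended.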

Usually a producer updates its metadata cache immediately and then continues producing messages without trouble. However, high load on the brokers or unfortunate timing between parallel requests can cause the warning to repeat several times for the same partition. The result can be a massive flood of these warnings in the producer logs, which looks alarming but is in fact harmless.
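In the Java client this error is retriable, so the producer's own retry settings determine how long it keeps trying while leadership settles. The sketch below collects the relevant producer configuration keys in a plain `java.util.Properties` object. The class name `ProducerRetrySketch` is hypothetical and the values are illustrative assumptions, not recommendations. Check the defaults for your client version before changing anything.

```java
import java.util.Properties;

// Sketch of retry-related producer settings; values are illustrative only.
class ProducerRetrySketch {
    static Properties retryProps() {
        Properties props = new Properties();
        // How many times a failed send may be retried.
        props.setProperty("retries", "2147483647");
        // Pause between retries, in milliseconds.
        props.setProperty("retry.backoff.ms", "100");
        // Upper bound on the total time spent on one record, retries included
        // (assumes a client version recent enough to support this setting).
        props.setProperty("delivery.timeout.ms", "120000");
        return props;
    }

    public static void main(String[] args) {
        // Print the settings; in real code they would be merged into the
        // Properties passed to the producer's constructor.
        retryProps().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

If retries are enabled and the delivery timeout is longer than the failover window, the warnings are informational and no application-side handling is normally needed.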
