Check your topic replication factors
Kafka services rely on replication between brokers to preserve data if a node is lost. Consider how business-critical the data in each topic is and make sure its replication factor is set high enough.
The replication factor can be set when creating a new topic and modified for existing topics in the Aiven Console. Note that, to prevent data loss from unexpected node termination, the Aiven Console does not allow a replication factor below 2.
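As an illustration, a pre-flight check like the following sketch (plain Python, with a hypothetical topic spec; the actual creation would go through the Aiven Console or your admin client) can guard against creating under-replicated topics:

```python
def check_replication(topic_spec, minimum=2):
    """Reject topic specs whose replication factor is below the minimum.

    The Aiven Console enforces a minimum of 2 for the same reason: a single
    replica is lost permanently if its node terminates unexpectedly.
    """
    if topic_spec["replication_factor"] < minimum:
        raise ValueError(
            f"{topic_spec['name']}: replication factor "
            f"{topic_spec['replication_factor']} risks data loss on node failure"
        )
    return topic_spec

# Hypothetical spec for a business-critical topic: replication factor 3
# tolerates the loss of one node while still keeping a redundant copy.
orders = check_replication({"name": "orders", "replication_factor": 3})
```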
Be careful to choose a reasonable number of partitions for a topic
Too few partitions can cause bottlenecks in data processing (in the most extreme case, a single partition means that messages are effectively processed sequentially), but too many strain the cluster because of the additional overhead. Since the number of partitions of an existing topic cannot be reduced, it is usually best to start with a low number that allows efficient data processing and only increase it if needed.
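One common rule of thumb (a general heuristic, not an Aiven-specific formula) is to start from your target throughput divided by what a single consumer can sustain per partition:

```python
import math

def suggest_partitions(target_mb_per_s, per_consumer_mb_per_s, minimum=1):
    """Start with just enough partitions to cover the target throughput.

    You can add partitions to a topic later, but never remove them, so
    round up from the estimate rather than over-provisioning.
    """
    return max(minimum, math.ceil(target_mb_per_s / per_consumer_mb_per_s))

# Hypothetical numbers: a 50 MB/s target where each consumer
# handles roughly 10 MB/s suggests starting with 5 partitions.
partitions = suggest_partitions(50, 10)
```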
Periodically examine topics using entity-based partitioning for imbalance
If you partition messages based on an entity id (e.g. user id) then watch out for the danger of heavily imbalanced partitions. This will result in uneven load in your cluster and reduce how effectively it can process messages in parallel.
You can check this in the Aiven Console: click a topic in the list view to open its information panel.
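If you export per-partition message counts (for example from the Console or your monitoring), a quick skew check could look like this sketch, where the counts are hypothetical:

```python
def partition_skew(message_counts):
    """Ratio of the busiest partition to the average.

    A value near 1.0 means the partitions are balanced; larger values mean
    one partition carries a disproportionate share of the load.
    """
    mean = sum(message_counts) / len(message_counts)
    return max(message_counts) / mean

# Hypothetical counts for a 4-partition topic keyed by user id:
# a single "hot" user id can dominate one partition.
skew = partition_skew([400, 50, 50, 100])
```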
Find the right balance between throughput and latency
In your producer and consumer configurations, experiment with batch sizes to find the right trade-off: larger batches increase throughput but add latency for individual messages, while smaller batches reduce per-message latency at the cost of higher per-message overhead and lower overall throughput.
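On the producer side, the two settings most involved in this trade-off are batch.size and linger.ms (real Kafka producer configs); the values below are illustrative starting points for experimentation, not recommendations:

```python
# Favour throughput: accumulate larger batches, accepting extra latency.
throughput_tuned = {
    "batch.size": 131072,  # bytes per batch; larger batches amortise overhead
    "linger.ms": 50,       # wait up to 50 ms to fill a batch before sending
}

# Favour latency: send messages as soon as possible.
latency_tuned = {
    "batch.size": 16384,   # the Kafka default batch size
    "linger.ms": 0,        # do not wait for batches to fill
}
```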
Pay attention to tasks.max for connector configs
By default, connectors run a maximum of 1 task, which for large Kafka Connect services usually leads to underutilization (unless you have many connectors with one task each). In general, it is good to keep the cluster CPUs occupied with connector tasks without overloading them. If your connector is under-performing, try increasing tasks.max to match the number of partitions.
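For example, a connector reading a topic with 6 partitions might be configured as in this sketch (the connector name and partition count are hypothetical; tasks.max is the real Kafka Connect setting):

```python
num_partitions = 6  # assumed partition count of the topic being consumed

connector_config = {
    "name": "example-sink",            # hypothetical connector name
    "tasks.max": str(num_partitions),  # one task per partition; tasks beyond
                                       # the partition count would sit idle
}
```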
Consider a standalone Kafka Connect service
Kafka Connect can run as part of your existing Kafka service (for service plans business-4 and higher), so you can try it out by enabling the feature within that service, but for heavy usage we recommend a standalone Kafka Connect service to run your connectors. This lets you scale the Kafka service and the connector service independently and gives the Kafka Connect service more CPU time and memory.