Check your topic replication factors

Kafka services rely on replication between brokers to preserve data if a node is lost. Consider how business-critical the data in each topic is and ensure the replication factor is set high enough for it.

You can set this when creating a new topic and modify it for existing topics in the Aiven Console. To protect against data loss from unexpected node termination, the Aiven Console does not allow a replication factor below 2.
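The replication factor works together with two standard Kafka settings, `min.insync.replicas` (topic/broker side) and `acks` (producer side), to determine how many broker failures a topic can survive. The following stdlib-only sketch illustrates the arithmetic; the numbers in the example are a common production starting point, not an Aiven-specific recommendation:

```python
# Illustration: fault tolerance implied by replication settings.
# With acks=all, a write is acknowledged only once min.insync.replicas
# brokers have it, so (min_isr - 1) replica losses cannot lose
# acknowledged data, and the topic stays writable while at least
# min_isr replicas are up.

def fault_tolerance(replication_factor: int, min_isr: int) -> dict:
    """How many brokers can fail before writes stall / acked data is at risk."""
    if not 1 <= min_isr <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return {
        "failures_before_unwritable": replication_factor - min_isr,
        "failures_without_data_loss": min_isr - 1,  # assumes producer acks=all
    }

# e.g. replication factor 3 with min.insync.replicas=2
print(fault_tolerance(3, 2))
# -> {'failures_before_unwritable': 1, 'failures_without_data_loss': 1}
```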


Choose a reasonable number of partitions for each topic

Too few partitions can cause bottlenecks in data processing (in the most extreme case, a single partition means messages are effectively processed sequentially), while too many strain the cluster with additional overhead. Since the number of partitions cannot be reduced for an existing topic, it is usually best to start with a low number that allows efficient data processing and increase it only if needed.
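One common way to pick a starting point is to size partitions against a throughput target, since within a consumer group each partition is consumed by at most one consumer. A rough sketch, with assumed (illustrative) throughput numbers:

```python
import math

def partitions_needed(target_mb_per_s: float, per_consumer_mb_per_s: float) -> int:
    """Partitions cap consumer parallelism: within a consumer group, each
    partition is read by at most one consumer, so you need at least
    target / per-consumer-throughput partitions to hit the target."""
    return math.ceil(target_mb_per_s / per_consumer_mb_per_s)

# e.g. a 50 MB/s target where each consumer handles about 10 MB/s
print(partitions_needed(50, 10))  # -> 5
```

Leaving a little headroom above this estimate is reasonable, but starting with hundreds of partitions "just in case" adds overhead without benefit.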

Periodically examine topics using entity-based partitioning for imbalance

If you partition messages by an entity id (e.g. user id), watch out for heavily imbalanced partitions. Imbalance results in uneven load across your cluster and reduces how effectively it can process messages in parallel.
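A simple way to quantify skew is to compare the busiest partition against the mean. The per-partition counts below are hypothetical; in practice you would take them from the console or from your monitoring:

```python
def imbalance_ratio(counts: list[int]) -> float:
    """Ratio of the busiest partition to the mean count; 1.0 is perfectly balanced."""
    return max(counts) / (sum(counts) / len(counts))

balanced = [100, 98, 102, 100]
skewed = [400, 10, 5, 5]  # one hot entity id dominating a partition

print(round(imbalance_ratio(balanced), 2))  # -> 1.02
print(round(imbalance_ratio(skewed), 2))    # -> 3.81
```

If the ratio stays well above 1 over time, consider a different partitioning key or a composite key that spreads the hot entities out.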

This can be checked from the Aiven Console in the information panel opened by clicking a topic in the list view.


Find the right balance between throughput and latency

In your producer and consumer configurations, experiment with batch sizes to find the right trade-off: bigger batches increase throughput but also increase the latency of individual messages, while smaller batches decrease processing latency but raise the per-message overhead and lower overall throughput.

You can set, for example, batch.size and linger.ms in the producer config of your application code.

Kafka Connect

Pay attention to tasks.max for connector configs

By default a connector runs at most 1 task, which for large Kafka Connect services usually leads to underutilization (unless you run many connectors with one task each). In general, aim to keep the cluster's CPUs occupied with connector tasks without overloading them. If your connector is under-performing, try increasing tasks.max up to the number of partitions in the topic.
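For illustration, here is the shape of a connector configuration with tasks.max raised to match a topic with 8 partitions. The connector name, class, and topic are placeholders; only `tasks.max` and the overall structure are the point:

```python
import json

# Hypothetical sink connector config; for a sink, tasks beyond the
# partition count of the consumed topics would sit idle.
connector_config = {
    "name": "my-sink",                     # placeholder name
    "config": {
        "connector.class": "YourSinkConnectorClass",  # placeholder
        "topics": "my-topic",              # placeholder: a topic with 8 partitions
        "tasks.max": "8",                  # one task per partition
    },
}
print(json.dumps(connector_config, indent=2))
```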

Consider a standalone Kafka Connect service

Kafka Connect can run as part of your existing Kafka service (on service plans business-4 and higher), which is a convenient way to try it out: just enable the feature on that service. For heavy usage, however, we recommend a standalone Kafka Connect service to run your connectors. This lets you scale the Kafka service and the connector service independently and gives Kafka Connect more CPU time and memory.
