This article describes some of the non-service-specific metrics available over Prometheus that may be worth monitoring. Before you proceed, you may need to configure the Prometheus integration and set up a Prometheus server.

This article does not provide a full list of metrics. However, you can get the full list from your Prometheus server, or by sending an HTTP GET request with curl:

curl --insecure --user promXXXX:XXXXXXXXXXXXXXXX https://YOUR-SERVICE-FQDN:9273/metrics
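If you would rather process the output in a script, the same endpoint can be read from Python. The following is a minimal sketch, not an official client: fetch_metrics and parse_metrics are hypothetical helper names, the URL and credentials you pass in are the same placeholders as in the curl command, and the parser only handles simple "name{labels} value" lines (it assumes there is no trailing timestamp and that label values contain no spaces).

```python
import base64
import ssl
import urllib.request


def fetch_metrics(url, user, password):
    """Fetch the raw metrics text (hypothetical helper mirroring the curl call)."""
    # Equivalent of curl --insecure: skip TLS certificate verification.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    # Equivalent of curl --user: HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": "Basic " + token})
    with urllib.request.urlopen(request, context=context) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Parse sample lines into {(name, labels): value}, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a "# HELP" / "# TYPE" comment
        # Assumption: the value is the last space-separated field.
        series, value = line.rsplit(" ", 1)
        name, _, labels = series.partition("{")
        samples[(name, labels.rstrip("}"))] = float(value)
    return samples
```

For example, parse_metrics(fetch_metrics("https://YOUR-SERVICE-FQDN:9273/metrics", "promXXXX", "XXXXXXXXXXXXXXXX")) would return a dictionary keyed by metric name and label string.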

CPU usage

For a high-level view of CPU usage for a single CPU service, you can use:

100 - cpu_usage_idle{cpu="cpu-total"}

Be aware that calculating total CPU usage as cpu_usage_system{cpu="cpu-total"} + cpu_usage_user{cpu="cpu-total"} may not be a good idea, because the sum does not include cpu_usage_iowait. Furthermore, CPU time used by a process with a nice value larger than 0 is categorised as cpu_usage_nice, which is not included in cpu_usage_user.
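The difference between the two calculations can be illustrated with a small worked example. This is only a sketch: the sample values are hypothetical percentages for cpu="cpu-total", chosen so that they add up to 100, and cpu_busy_percent is an illustrative helper name.

```python
def cpu_busy_percent(idle):
    """Total CPU usage computed as 100 - cpu_usage_idle."""
    return 100.0 - idle


# Hypothetical sample values for cpu="cpu-total", in percent.
sample = {
    "cpu_usage_idle": 50.0,
    "cpu_usage_system": 5.0,
    "cpu_usage_user": 10.0,
    "cpu_usage_nice": 15.0,
    "cpu_usage_iowait": 20.0,
}

busy = cpu_busy_percent(sample["cpu_usage_idle"])
naive = sample["cpu_usage_system"] + sample["cpu_usage_user"]

print(busy)   # 50.0
print(naive)  # 15.0, misses the nice and iowait time
```

With these numbers, system + user reports 15% while the CPU is in fact not idle for 50% of the time.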

It may also be worth monitoring cpu_usage_iowait{cpu="cpu-total"}.

If this number is high, it indicates that the service node is working on something I/O intensive. For example, 40 means that 40% of CPU time is spent waiting for disk or network I/O.

Some important CPU-related metrics:

cpu_usage_idle        # CPU doing nothing

cpu_usage_system      # kernel code consuming CPU
cpu_usage_user        # user space programs with nice value <= 0
cpu_usage_nice        # user space programs with nice value > 0

cpu_usage_iowait      # time waiting for I/O
cpu_usage_steal       # time waiting for the hypervisor to give
                      # CPU cycles to the VM

cpu_usage_irq         # system is handling hardware interrupts (IRQs)
cpu_usage_softirq     # system is handling software interrupts

cpu_usage_guest       # CPU time for guest OS. Should be zero because
cpu_usage_guest_nice  # we are not running another hypervisor

These metrics are generated by a Telegraf plugin.

Disk

Consider monitoring disk_used_percent and disk_free:

disk_free          # free space on service disk
disk_used          # 8.0e+9 means 8,000,000,000 bytes
disk_total
disk_used_percent  # equal to disk_used / disk_total * 100
                   # 80 means 80% service disk usage

disk_inodes_free   # number of inodes available on service disk
disk_inodes_used   # same as output of "df -i"
disk_inodes_total
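A check on both disk space and inodes could be sketched as follows. The helper names, the 80% threshold and the sample values are assumptions made for illustration; the inode percentage is derived here from disk_inodes_used and disk_inodes_total and is not itself one of the metrics listed above.

```python
def used_percent(used, total):
    """Return used / total * 100, or 0.0 when total is 0."""
    if total == 0:
        return 0.0
    return used / total * 100.0


def disk_warnings(metrics, threshold=80.0):
    """List which of disk space and inodes are at or above the threshold (percent)."""
    warnings = []
    if metrics["disk_used_percent"] >= threshold:
        warnings.append("space")
    inode_percent = used_percent(
        metrics["disk_inodes_used"], metrics["disk_inodes_total"]
    )
    if inode_percent >= threshold:
        warnings.append("inodes")
    return warnings


# Hypothetical sample values for one service disk.
sample = {
    "disk_used_percent": 85.0,
    "disk_inodes_used": 100000.0,
    "disk_inodes_total": 1000000.0,
}

print(disk_warnings(sample))  # ['space']
```

Checking inodes separately is useful because a disk holding many small files can run out of inodes while free space is still available.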

Memory

Consider monitoring mem_available (in bytes) or mem_available_percent, as these give the estimated amount of memory available to applications without swapping. Don't worry if mem_free is low, because Linux automatically caches disk data in memory to speed up read operations.
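To show why mem_free alone can be misleading, here is a small sketch with hypothetical values. It assumes that a mem_total metric (total memory in bytes) is also exposed, which this article does not list; percent_of_total and the 10% limit are illustrative choices as well.

```python
def percent_of_total(value, total):
    """Express a byte count as a percentage of total memory."""
    return value / total * 100.0


# Hypothetical sample values, in bytes. mem_total is an assumed metric name.
sample = {
    "mem_total": 4.0e9,
    "mem_free": 5.0e8,
    "mem_available": 2.0e9,
}

free_percent = percent_of_total(sample["mem_free"], sample["mem_total"])
available_percent = percent_of_total(sample["mem_available"], sample["mem_total"])

# Base the check on available memory, not on free memory.
low_on_memory = available_percent < 10.0

print(free_percent)       # 12.5
print(available_percent)  # 50.0
print(low_on_memory)      # False
```

Here only 12.5% of memory is free, but half of it is still available to applications because cached disk data can be released when needed.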

Network

It may be worth monitoring the number of established TCP sessions, available in the netstat_tcp_established metric.

Conclusion

Prometheus integration allows you to monitor your Aiven services and understand their resource usage. By understanding the resource usage, you can select the right plan and upgrade or downgrade according to your needs. Feel free to contact us if you have questions about these metrics.
