When you enable the metrics/dashboard integration (see this article for more information on enabling the integration), Aiven automatically creates a default PostgreSQL dashboard in Grafana that shows metrics useful for most users.

Before we begin, here are a few key points about default dashboards:

  1. All tables and indexes for all logical databases are shown because the dashboard can't determine which tables/indexes are relevant and which aren't; this may result in convoluted graphs for large deployments.
  2. Some metrics are gathered but not shown in the default dashboard to avoid making it too hard to follow; these are detailed later.
  3. You can create new dashboards that include your preferred set of metrics and filter criteria to make them more useful to you; copying the default dashboard to use as a template makes the process easier.
  4. When you create new dashboards, don't start their names with “Aiven”, because such dashboards are assumed to be system-managed and may be removed or replaced. This also applies to the default dashboard, which should not be edited directly because the changes will be lost at some point.

The default dashboard is split into multiple sections under two larger categories: Generic and PostgreSQL. Let's begin with Generic.

Generic metrics

The dashboard has a fair number of metrics that aren't specific to the service type running on the node. They're mostly CPU, memory, disk and network-related.

Overview

This section shows a high-level overview of service node health. Major problems with the service are often visible directly from this section. 

NB! For Business and Premium services, the numbers shown are averages across all nodes belonging to the service.

For some metrics, e.g. disk space, this typically doesn't matter, but load usually concentrates on the master node (or, in certain setups, the standby nodes), and high values are dampened by the average.

The System metrics section shows the node-specific values.

Uptime: The time the service has been up and running.

Load average: The number of processes that want to execute.

NB! If this number is higher than the number of CPUs on the nodes, the service may be underprovisioned.

Memory available: Memory not allocated by running processes.

Disk free: Amount of unused disk space.

System metrics

This section shows a more detailed listing of various generic system (non-PostgreSQL) related metrics.

CPU: System, user, iowait and irq CPU usage. High iowait means the system is reading or writing too much data to/from disk.

Load average: The number of processes that want to execute.

NB! If this number is higher than the number of CPUs on the node, the service may be underprovisioned.

Memory available: Memory not allocated by running processes.

Memory unused: Memory not allocated by running processes or used for buffer caches.

Context switches: Switches from one process or thread to another.

Interrupts: Number of interrupts per second.

Processes: Number of processes that are actively doing something. Processes that are mostly idle aren't included.

Disk free: The amount of disk space that's currently remaining. 

Tip: Actively monitor this value and associate it with an alert since the database will stop working correctly if it runs out of disk space.

Disk i/o: Number of bytes read/written per second on each of the nodes.

Data disk usage: The amount of disk space that's consumed on the service's data disk.

CPU iowait: Percentage of CPU time that's spent waiting for disk to become available for read/write operations. This number is also reported in the general CPU usage but it is important enough to warrant its own graph.

Tip: Create an alert that's triggered when iowait goes beyond a certain threshold for an extended period to give you an opportunity to respond quickly when the database starts to slow down from too many reads/writes.

Network: Inbound/outbound bytes per second from a node.

Network (sum of all nodes): Same as the network graph, but values aren't grouped by service node.

TCP connections: Number of open TCP connections grouped by node.

TCP socket state total on all nodes: TCP connections across all service nodes, grouped by the TCP connection state.

PostgreSQL-specific metrics

For most metrics, the metric name identifies the PostgreSQL internal statistics view it is based on. Refer to the PostgreSQL documentation for a more detailed explanation of the various metric values.
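
As an illustration, a metric name following the postgresql.pg_stat_database pattern corresponds to the pg_stat_database statistics view, which you can also query directly from psql (the exact set of columns that is collected is an assumption here; check your own metric browser for the authoritative list):

  -- Inspect the statistics view behind the per-database metrics
  SELECT datname, numbackends, xact_commit, xact_rollback, blks_read, blks_hit
  FROM pg_stat_database
  WHERE datname NOT IN ('template0', 'template1');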

Metrics that are currently recorded but not shown in the default dashboard include postgresql.pg_stat_bgwriter and postgresql.pg_class metrics as a whole, as well as some individual values from other metrics.

Overview

The metrics in the PostgreSQL overview section are grouped by logical database. Some of the metrics are additionally grouped by host.

Database size: Size of the files associated with a logical database. 

NB! There are some potentially large files that aren't included in this number. Most notably, the write-ahead log (WAL) isn't included in the size of the logical databases as it isn't tied to any specific logical database.
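
If you want to compare the graph against the database itself, a query along these lines returns roughly the same per-database sizes (WAL excluded, as noted above):

  -- Per-database on-disk size, similar to the "Database size" graph
  SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
  FROM pg_database
  WHERE NOT datistemplate
  ORDER BY pg_database_size(datname) DESC;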

Connections: The number of open connections to the database. 

NB! Each connection puts a large burden on the PostgreSQL server and this number should typically be kept fairly small, even for large plans.

Tip: Connection pooling may be used to reduce the number of connections to the actual database server.
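
To see where the current connections come from, you can group pg_stat_activity by database and state; this is a sketch for checking things by hand, not the query the dashboard itself uses:

  -- Open client connections per database and state (PostgreSQL 10+)
  SELECT datname, state, count(*) AS connections
  FROM pg_stat_activity
  WHERE backend_type = 'client backend'
  GROUP BY datname, state
  ORDER BY connections DESC;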

Oldest running query age: Age of the oldest running query. 

Tip: Typical queries execute in milliseconds and having queries that run for minutes is often an indication of some problem.

Oldest connection age: Age of the oldest connection. Old open connections with open transactions are a problem because they prevent VACUUM from doing its job, resulting in bloat and performance degradation.
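
If either of these graphs shows unexpectedly high values, a query like the following (one possible formulation, not the dashboard's own) helps identify the sessions responsible:

  -- Longest-running queries and oldest open transactions
  SELECT pid, datname, usename, state,
         now() - query_start AS query_age,
         now() - xact_start  AS xact_age
  FROM pg_stat_activity
  WHERE state <> 'idle'
  ORDER BY xact_start NULLS LAST
  LIMIT 10;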

Commits / s: The number of commits per second.

Rollbacks / s: The number of rollbacks per second.

Disk block reads / s: The number of 8 kB disk blocks PostgreSQL reads per second, excluding reads that were satisfied by buffer cache (see below). The reads may have been satisfied by the operating system's file system cache.

Buffer cache disk block reads / s: The number of 8 kB disk blocks PostgreSQL reads per second that were already in buffer cache.
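
Taken together, these two graphs give you a buffer cache hit ratio. A rough way to compute the same thing per logical database (an illustration, not the dashboard's exact formula):

  -- Share of block reads served from PostgreSQL's buffer cache
  SELECT datname,
         round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS cache_hit_pct
  FROM pg_stat_database
  WHERE datname NOT IN ('template0', 'template1');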

Temp files created / min: The number of temporary files that PostgreSQL created per minute. Temporary files are usually created when a query requests a large result set that needs to be sorted or a query joins large result sets. 

Tip: Temporary files only need to be created when the results are too large to fit into memory, so a high number of temp files / temp file bytes may indicate that the work_mem setting should be increased. The queries that cause the temp files could also potentially be optimized to make the temporary result sets smaller, or an index could be added to allow sorting directly based on the index.

Temp file bytes written / s: Number of bytes written to temporary files per second. 

Tip: This value should be kept at reasonable levels to avoid the server becoming IO bound from having to write so much data into temp files.
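
The same counters are exposed per database in pg_stat_database, and the current work_mem value can be checked with SHOW; a quick sketch for investigating high values:

  -- Cumulative temp file usage per database since the last stats reset
  SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_bytes
  FROM pg_stat_database
  ORDER BY temp_bytes DESC;

  -- Current work_mem setting
  SHOW work_mem;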

Deadlocks / min: Number of deadlocks per minute. Deadlocks occur when different transactions obtain row-level locks on two or more of the same rows in a different order.

NB! Deadlock situations may be resolved by retrying on the client side, but they can create significant bottlenecks, and high deadlock counts should be looked into.

Tip: Using SELECT … ORDER BY … FOR UPDATE may be useful for ensuring rows are locked in the desired order.
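
As a sketch of that locking pattern (the accounts table and its columns are made up for illustration), locking rows in a consistent order, for example by primary key, prevents two transactions from waiting on each other:

  -- Hypothetical example: lock both rows in primary-key order before updating
  BEGIN;
  SELECT id FROM accounts
  WHERE id IN (17, 42)
  ORDER BY id
  FOR UPDATE;
  -- ... perform the updates on the locked rows here ...
  COMMIT;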

Indexes

This section has graphs related to size and the use of indexes. Because the default dashboard contains all indexes in all logical databases, it'll become convoluted for complex databases. 

You might want to make a copy of the default dashboard and add additional constraints for the graphs to filter out uninteresting indexes, e.g. for the size graph you might want to include only indexes that are above X megabytes in size.

Index size: Size of indexes on disk.

Index scans / s: Number of scans per second per index.

Index tuple reads / s: Number of tuples read from an index during index scans.

Index tuple fetches / s: Number of table rows fetched during index scans.
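
When building a filtered copy of these graphs, it can help to first check which indexes are actually large or rarely used; a query along these lines (the 100 MB threshold is only an example) shows both:

  -- Large indexes and how often they are scanned
  SELECT schemaname, relname, indexrelname,
         pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
         idx_scan
  FROM pg_stat_user_indexes
  WHERE pg_relation_size(indexrelid) > 100 * 1024 * 1024
  ORDER BY pg_relation_size(indexrelid) DESC;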

Tables

This section has graphs related to the size and use of tables. As with indexes, the graph will be convoluted for complex databases and you may want to make a copy of the dashboard to add additional filters to exclude uninteresting tables.

Table size: Size of tables, excluding indexes and TOAST data.

Table size total: Total size of tables, including indexes and TOAST data.

Table seq scans / s: Number of sequential scans per table per second. 

Tip: For small tables, sequential scans may be the best way of accessing the table data and having a lot of seq scans may be normal, but for large tables sequential scans should be very rare.
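
To spot large tables that are mostly read with sequential scans, you can compare seq_scan and idx_scan in pg_stat_user_tables; the 1 GB threshold below is only an example:

  -- Scan counts for tables over ~1 GB total size
  SELECT schemaname, relname,
         pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
         seq_scan, idx_scan
  FROM pg_stat_user_tables
  WHERE pg_total_relation_size(relid) > 1024 * 1024 * 1024
  ORDER BY seq_scan DESC;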

Table tuple inserts / s: Number of tuples inserted per second.

Table tuple updates / s: Number of tuples updated per second.

Table tuple deletions / s: Number of tuples deleted per second.

Table dead tuples: Number of rows that have become unreferenced due to an update or delete of the same row, once no open transactions older than the update/delete remain. The rows will be marked reusable during the next VACUUM.

Tip: High values here may indicate that vacuuming isn't aggressive enough and that its configuration should be adjusted to make it run more often, because frequent vacuums reduce table bloat and keep the system working well.

NB! The n_live_tup value is also available and can be used to create graphs showing tables with a high dead vs. live tuple ratio.
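
For example, a query along these lines lists the tables with the highest dead vs. live tuple ratio (the 1000-row floor is only there to skip tiny tables):

  -- Tables with the highest share of dead tuples
  SELECT schemaname, relname, n_live_tup, n_dead_tup,
         round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) AS dead_pct
  FROM pg_stat_user_tables
  WHERE n_live_tup + n_dead_tup > 1000
  ORDER BY dead_pct DESC
  LIMIT 20;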

Table modifications since analyze: Number of inserts, updates or deletions since the last ANALYZE.

NB! A high number here means the query planner may end up creating bad query plans because it's operating on obsolete data. 

Tip: Vacuuming also performs ANALYZE and you may want to adjust your vacuum settings if you see slow queries and high table modification counts for the related table(s).

Vacuum and analyze

This section has graphs related to vacuum and analyze runs. The graphs are grouped by table and, for complex databases, you probably want to add additional filter criteria to only show results where values are outside of the expected range.

Last vacuum age: Time since last manual vacuum for a table.

Last autovacuum age: Time since last automatic vacuum for a table.

Last analyze age: Time since last manual analyze for a table.

Last autoanalyze age: Time since last automatic analyze for a table.

Maint ops / min: Number of vacuum/analyze operations per table, per minute.
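
The same information is available per table from pg_stat_user_tables, together with the modification counter behind the "Table modifications since analyze" graph; a sketch of a manual check:

  -- Last (auto)vacuum / (auto)analyze times and modifications since the last analyze
  SELECT schemaname, relname,
         last_vacuum, last_autovacuum,
         last_analyze, last_autoanalyze,
         n_mod_since_analyze
  FROM pg_stat_user_tables
  ORDER BY n_mod_since_analyze DESC
  LIMIT 20;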

Miscellaneous

This section has miscellaneous graphs.

Xact replay lag: The replication lag between master and standby nodes.

Replication bytes diff: Replication lag in bytes. This is the total diff across all replication clients. 

Tip: To differentiate between standby nodes, you can additionally group by the client_addr tag. This graph shows the diff based on write_lsn; flush_lsn is also available.
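
The underlying data comes from the pg_stat_replication view on the master. On PostgreSQL 10 and later, a manual check could look roughly like this (run as a user with sufficient privileges to see the LSN columns):

  -- Per-standby replication lag in bytes, based on write_lsn (flush_lsn also shown)
  SELECT client_addr, state,
         pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) AS write_lag_bytes,
         pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag_bytes
  FROM pg_stat_replication;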

Unfrozen transactions: Number of transactions that have not been frozen and the freeze limit. 

NB! In very busy systems, the number of transactions that haven't been frozen by vacuum may rise rapidly, and you should monitor this value to ensure the freeze limit isn't reached. Reaching the limit will cause the system to stop working. If the transaction count gets close to the freeze limit, vacuum settings need to be made more aggressive and any problems that prevent vacuum from doing its job, such as long-running open transactions, need to be resolved.
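
One way to keep an eye on this from the database side is to compare the age of each database's datfrozenxid with autovacuum_freeze_max_age; this is only a rough sketch, and the freeze limit shown in the graph may be computed differently:

  -- Age of the oldest unfrozen transaction ID per database vs. the autovacuum freeze limit
  SELECT datname,
         age(datfrozenxid) AS unfrozen_xid_age,
         current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age
  FROM pg_database
  ORDER BY age(datfrozenxid) DESC;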
