[RELEASE] ScyllaDB Monitoring 4.10.0

The ScyllaDB team is pleased to announce the release of ScyllaDB Monitoring Stack 4.10.0.

ScyllaDB Monitoring Stack is an open-source stack for monitoring ScyllaDB based on Prometheus and Grafana. ScyllaDB Monitoring Stack 4.10.0 supports:

  • ScyllaDB 2024.1, 2024.2, 2025.1, and the upcoming 2025.2 release
  • ScyllaDB Manager 3.x

This release includes multiple updates to the overview, detailed, alternator, and advanced dashboards, including changes to the CPU utilization and disk utilization panels, among others.

Version updates for ScyllaDB Monitoring Stack 4.10.0

  • Prometheus upgraded to version 3.3.1
  • Grafana upgraded to version 11.6.1

New Information in ScyllaDB Dashboards

Overview Dashboard Change

  • CPU utilization to show priority classes that are involved in query processing #2478

ScyllaDB’s advanced use of service levels for query processing and background operations can cause confusion when observing CPU consumption.

When monitoring ScyllaDB, high CPU usage is not necessarily an indication of system overload.

To clarify this, the overview dashboard now displays only the CPU consumed by query-related priority groups, broken down by group.

  • Switch disk utilization to show percentile #2499

In heterogeneous clusters, which mix nodes of different instance sizes, each with a different number of cores and a different storage volume, disk usage is harder to interpret when viewed in bytes.

Instead, the graph now shows usage as a percentage, displaying the average percentage of disk space used.
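
To illustrate why a percentage is easier to compare than raw bytes, here is a minimal Python sketch. It is not the dashboard's actual query: the function names, the node names, and the choice of an unweighted average across nodes are all assumptions made for the example.

```python
def disk_used_percent(used_bytes: float, total_bytes: float) -> float:
    """Percentage of a node's disk capacity currently in use."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def average_used_percent(nodes: dict[str, tuple[float, float]]) -> float:
    """Unweighted average of the per-node used percentages (an assumption)."""
    percents = [disk_used_percent(used, total) for used, total in nodes.values()]
    return sum(percents) / len(percents)


# Hypothetical cluster: (used_bytes, total_bytes) per node.
TB = 10**12
cluster = {
    "small-node": (0.5 * TB, 1 * TB),  # half full
    "large-node": (2.0 * TB, 4 * TB),  # also half full, but 4x the bytes
}

if __name__ == "__main__":
    for name, (used, total) in cluster.items():
        print(name, disk_used_percent(used, total))
    print("average", average_used_percent(cluster))
```

In bytes, the two hypothetical nodes look very different; as a percentage, both are equally full.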

Detailed Dashboard Change

  • Misleading titles and descriptions for Tombstones panels in Grafana #2481

The descriptions for tombstones in SSTables graphs have been clarified.

The relative numbers represent the number of tombstones found in an SSTable and are updated after flush, compaction, or streaming operations.

This clarification helps users understand when to expect updates to these values.

  • Compressed Bytes Sent by Algorithm - add aggregated value #2500

When viewing the compressed bytes sent graph, it’s useful to track the aggregated total number of bytes.

A new total graph within the panel makes this easier to follow.

  • Add the scylla_load_balancer_load metric to the tablet section #2514

When examining tablet balance across the system, the total number of tablets per node can be misleading in heterogeneous clusters.

Instead, the tablet load balancer’s load metric should be used. This metric defines load in proportion to each node’s capacity.

In a balanced heterogeneous cluster, the load balancer load metric will be equal across nodes, even if the number of tablets is not.
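
The idea of load in proportion to capacity can be sketched in Python as follows. This is only an illustration under stated assumptions: it treats capacity as a single number per node (for example, a shard count) and load as tablets divided by capacity. How ScyllaDB actually computes scylla_load_balancer_load may differ, and the node names and values are hypothetical.

```python
def relative_load(tablet_count: int, capacity: float) -> float:
    """Tablets per unit of capacity (an assumed definition of 'load')."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return tablet_count / capacity


# Hypothetical nodes: (tablet_count, capacity).
nodes = {
    "node-a": (80, 8),    # small instance
    "node-b": (320, 32),  # large instance with 4x the capacity
}

# Tablet counts differ by 4x, but the load relative to capacity is the same.
loads = {name: relative_load(tablets, cap) for name, (tablets, cap) in nodes.items()}

if __name__ == "__main__":
    for name, load in loads.items():
        print(name, load)
```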

  • RPC delay metrics #2349

The new RPC delay graph in the RPC section shows the total round-trip time of an RPC message between the verb caller and the server.

  • Querier cache sub-panel #2471

The querier cache stores queries paused due to paging and resumes them later, reducing query startup costs.

If it misbehaves due to overload or bugs, performance can degrade.

The new querier cache section displays population, lookup rate, and miss rate.

Alternator Dashboard Change

  • Expose HTTP metrics in the Alternator dashboard #2506

Alternator relies on the HTTP protocol. There is now an HTTP section in the Alternator dashboard, currently showing open connections and new connections.

This helps identify situations where there are too few or too many connections.

CQL Dashboard Change

  • Split non-token-aware queries to show reads and writes #2468

Non-token-aware queries result in performance loss.

When viewing the non-token-aware graph, it’s helpful to distinguish whether the source is reads or writes.

The panel now displays two separate graphs: one for reads and one for writes.

  • Support new large partition columns #2483

Large partitions can lead to performance degradation.

To address this, ScyllaDB collects information about large partitions.

The updated panel now includes additional columns: dead rows and range tombstones.

OS Dashboard Change

  • Add node_netstat_Tcp_RetransSegs #2472

TCP retransmission segments may indicate a network problem.

A new graph now shows the rate of TCP retransmissions.
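
node_netstat_Tcp_RetransSegs is a counter that only increases, so the graph is based on its rate of change. The Python sketch below shows the basic idea of turning two counter samples into a per-second rate. It is a simplification, not how Prometheus computes rates internally, and the function name and sample values are hypothetical.

```python
def counter_rate(prev_value: float, prev_ts: float,
                 curr_value: float, curr_ts: float) -> float:
    """Per-second increase of a monotonic counter between two samples."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards, so assume it was reset (e.g. a reboot)
        # and count only what accumulated since the reset.
        delta = curr_value
    return delta / elapsed


if __name__ == "__main__":
    # 300 retransmitted segments over 60 seconds.
    print(counter_rate(1000, 0.0, 1300, 60.0))
```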

Advanced Dashboard Change

  • “Commit log Should use aggregation function” #2516

The commit log information was previously always aggregated using averages. This caused confusion, and in some cases, it was necessary to aggregate using other methods, such as sum.

It now uses the same aggregation functions available from the drop-down menu as the rest of the graphs.
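
The difference between aggregation functions can be seen in a small Python sketch. The per-shard values and the function names in the mapping are hypothetical and only stand in for whatever the drop-down menu offers.

```python
from statistics import mean

# Hypothetical aggregation choices, keyed by name.
AGGREGATIONS = {
    "avg": mean,
    "sum": sum,
    "min": min,
    "max": max,
}


def aggregate(values: list[float], function: str) -> float:
    """Apply the selected aggregation function to a series of values."""
    if function not in AGGREGATIONS:
        raise ValueError(f"unknown aggregation function: {function}")
    return AGGREGATIONS[function](values)


# Hypothetical commit log values reported by three shards.
per_shard = [10, 20, 30]

if __name__ == "__main__":
    for name in AGGREGATIONS:
        print(name, aggregate(per_shard, name))
```

With these values, averaging reports 20 while summing reports 60, which is why a fixed average could be misleading when the total was the quantity of interest.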

Bug Fixes

  • Fix a typo in the log message (“grafna” → “grafana”) #2519

Operational Changes

  • Set the auto dashboard refresh to 5m #2501
  • Manager versions will be based on major releases instead of minor releases. When specifying a ScyllaDB Manager 3.x version, use 3 instead of a specific minor version (such as 3.3 or 3.4).