[RELEASE] Scylla Monitoring Stack 4.4.0

The ScyllaDB team is pleased to announce the release of ScyllaDB Monitoring Stack 4.4.0.

ScyllaDB Monitoring Stack is an open-source stack for monitoring ScyllaDB Enterprise and ScyllaDB Open Source, based on Prometheus and Grafana. ScyllaDB Monitoring Stack 4.4.0 supports:

  • ScyllaDB Open Source versions 5.0, 5.1 and 5.2
  • ScyllaDB Enterprise versions 2020.x, 2021.x, 2022.x and 2023.1
  • ScyllaDB Manager 3.0.x, 3.1.x

This release focuses on two areas: adapting to the metrics changes in ScyllaDB Open Source 5.2 and the upcoming ScyllaDB Enterprise 2023.1, and reducing the load on Prometheus servers for large clusters. We advise upgrading the monitoring stack before upgrading ScyllaDB.

Version updates for Scylla Monitoring Stack 4.4.0

  • Set Prometheus version to 2.44.0
  • Set Grafana version to 9.5.2

New Information in ScyllaDB Dashboards

Overview Dashboard Changes

  • Add a scheduling_group_name filter to the overview dashboard #1940

The extended use of both internal and external service levels (scheduling groups) made it hard to read the true latency in the overview dashboard.

The new filter lets users explicitly choose which scheduling groups are shown.

[image]
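
As a rough illustration of how such a filter can narrow a query, the sketch below builds a PromQL label matcher on scheduling_group_name from a user's selection. It is a sketch under stated assumptions, not the dashboard's actual implementation: the function name and the example group names are hypothetical, and only the label name comes from this release.

```python
import re

# Assumption: group names consist only of letters, digits, underscores and
# colons, so they can be placed in a regex alternation without escaping.
_SAFE_NAME = re.compile(r"^[A-Za-z0-9_:]+$")


def scheduling_group_matcher(selected_groups):
    """Return a PromQL label matcher restricted to the selected groups.

    Hypothetical helper for illustration only.
    """
    if not selected_groups:
        # No selection: match any series with a non-empty label value.
        return 'scheduling_group_name=~".+"'
    for name in selected_groups:
        if not _SAFE_NAME.match(name):
            raise ValueError(f"unsupported scheduling group name: {name!r}")
    # Join the names into a regex alternation such as "a|b".
    return 'scheduling_group_name=~"' + "|".join(selected_groups) + '"'


# Example with illustrative group names.
print(scheduling_group_matcher(["statement", "sl:default"]))
```

The resulting matcher could be placed inside the braces of a metric selector so that only the chosen scheduling groups contribute to the latency graphs.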

  • Remove internal scheduling_group_name from p99 #1939

Internal service levels (scheduling groups) are used for internal tasks such as compaction and are typically set to run at a lower priority so that they do not interfere with user activity. They are now removed from the p99 report because they do not represent real traffic.

  • Disk size panel update #1958

The disk size panel was updated to show more clearly how much of the total available disk space is in use.

[image]

  • Update the alerts table #1930

Following the changes in Grafana’s integrated alerting, ScyllaDB Monitoring now uses Grafana’s Alert table for a clearer representation.

[image]

  • Update the Advisor section #1953

The advisor section was updated to match Grafana’s alerts system. It is now entirely alert-based.

Detailed Dashboard Changes

  • Added panels for CQL request and response sizes #1928

A new payload size section was added under the per-scheduling-group section.

It reflects the network usage for different kinds of CQL messages.

There are also estimation panels for reads and writes, which give a ballpark estimate of the average read or write message size.

The section helps identify issues that result from large messages.

[image]
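
One plausible way to arrive at such a ballpark figure is to divide the bytes transferred over an interval by the number of messages in the same interval. The sketch below assumes two monotonically increasing counters (total bytes and total messages) sampled at the start and end of the interval. The function and parameter names are hypothetical, and the dashboard's actual expression may differ.

```python
def average_message_size(bytes_start, bytes_end, count_start, count_end):
    """Estimate the average message size, in bytes, over an interval.

    Hypothetical helper: takes counter samples from the start and end of
    the interval and returns None when no messages were seen.
    """
    delta_count = count_end - count_start
    if delta_count <= 0:
        # No traffic (or a counter reset): the average is undefined.
        return None
    return (bytes_end - bytes_start) / delta_count


# Example: 4000 bytes carried by 40 messages during the interval.
print(average_message_size(1000, 5000, 10, 50))  # 100.0
```

An unusually high value points to large messages on that path, which is the kind of issue the payload size section is meant to surface.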

Advanced Dashboard

  • Show scylla_io_queue_starvation_time_sec metrics #1915

[image]

Scylla-Manager Dashboard Changes

  • Add progress of the restore process enhancement #1960

Scylla Manager 3.1 supports a new restore-from-backup implementation, which exposes how much data, in bytes, remains before the restore completes. A new panel in the Manager dashboard graphs this data.

[image]

General Changes

  • Added missing descriptions to panels #1966

As part of an effort to make the dashboard clearer, a description (a popup with an explanation) was added to the panels where it was missing.

Bug Fixes

  • Empty alert manager rules when using consul #1951
  • The links in the advisor balance section are broken #1949
  • Fix legend of IO-Queue panels #1916

Operational Changes

  • Prepare for Scylla open source 5.2 summaries #1972
  • Add a command line option to drop cas/cdc metrics #1948
  • Reduce the number of collected histograms: do not collect “internal” scheduling groups’ latency histograms #1971
  • Datadog - do not report per-shard metrics #1944
  • Change alert levels to text #1927