How do I debug latency spikes when my query patterns and application workload haven't changed, for example after a recent OS upgrade, kernel change, or ScyllaDB version upgrade?

Also, how can I find hotspots on specific CPU cores? I’d like to better understand the CPU usage and find the bottlenecks.

Context / Problem Statement

Flame Graphs help visualize CPU usage in ScyllaDB and spot the bottlenecks behind symptoms such as elevated latency or high CPU usage. They map call stacks to show which functions consume the most CPU time, letting you identify specific code paths and check for imbalances across shards (CPU cores).

Step-by-Step Instructions

A Flame Graph is built by sampling call stacks (stack traces) and showing them in a hierarchical view.

Visual Components:

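  • X-Axis (Horizontal): The width of each box shows how many samples included that function, that is, how much time was spent in it and in the functions it called. The left-to-right order does not represent the passage of time.

  • Y-Axis (Vertical): Shows the depth of the call stack. Each box is a function, and the box above it is a function it called.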
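
To make the sampling idea concrete, here is a minimal Python sketch of how sampled stacks could be collapsed and how box widths follow from the counts. The stack contents and function names are invented for illustration and are not real ScyllaDB symbols; real samples would come from a profiler.

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse identical call stacks into 'root;...;leaf' -> sample count."""
    folded = Counter()
    for stack in samples:
        folded[";".join(stack)] += 1
    return folded

def frame_widths(folded):
    """Sum samples for every call-path prefix; this total is the box width."""
    widths = Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        for depth in range(1, len(frames) + 1):
            widths[";".join(frames[:depth])] += count
    return widths

# Hypothetical samples, root frame first and leaf frame last.
samples = [
    ["main", "reactor_run", "read_query", "decompress"],
    ["main", "reactor_run", "read_query", "decompress"],
    ["main", "reactor_run", "read_query", "merge_rows"],
    ["main", "reactor_run", "compaction"],
]

folded = fold_stacks(samples)
widths = frame_widths(folded)
total = sum(folded.values())

# Print an indented text view: indentation is stack depth (Y-axis),
# the sample share is the box width (X-axis).
for path, count in sorted(widths.items()):
    depth = path.count(";")
    name = path.rsplit(";", 1)[-1]
    print(f"{'  ' * depth}{name}: {count}/{total} samples ({100 * count / total:.0f}%)")
```

In this toy input, read_query spans three of the four samples, so it would be drawn three times as wide as compaction.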

Example Use Case:

In the example below, the customer reported an elevated average read latency (red line) after a new node was provisioned and came online.

This caused a general degradation that showed up as higher disk delays and higher I/O starvation, which in turn translated into the elevated latency.

Flame Graph:

With Flame Graphs it becomes clear what is consuming the most time. The widest “flame” at the top is the function consuming the largest share of the total time.

The investigation showed that this node had very high kernel CPU times and the dominating symbol was osq_lock.
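
The same question ("which symbol dominates?") can also be answered in text form. The sketch below assumes the samples are already available in the "folded" format commonly fed to flame graph renderers, one `frame;frame;... count` entry per line. The stacks and counts are made up for illustration; they are not the data from this case.

```python
from collections import Counter

def parse_folded(lines):
    """Yield (frames, count) from 'frame1;frame2;... count' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        yield stack.split(";"), int(count)

def top_leaf_symbols(lines, n=3):
    """Rank leaf frames (where the CPU actually was) by sample count."""
    self_samples = Counter()
    total = 0
    for frames, count in parse_folded(lines):
        self_samples[frames[-1]] += count
        total += count
    return [(sym, cnt, cnt / total) for sym, cnt in self_samples.most_common(n)]

# Hypothetical folded stacks; only the leaf name osq_lock comes from the case above.
FOLDED = """\
scylla;reactor_run;submit_io;syscall_entry;mutex_lock;osq_lock 620
scylla;reactor_run;read_query;merge_rows 180
scylla;reactor_run;poll_io;syscall_entry;mutex_lock;osq_lock 110
scylla;reactor_run;compaction;write_sstable 90
""".splitlines()

top = top_leaf_symbols(FOLDED)
for symbol, count, share in top:
    print(f"{symbol}: {count} samples ({share:.0%})")
```

Counting only the leaf frame gives the time spent in the function itself, which is what makes a symbol like osq_lock stand out even when it is reached through several different call paths.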

The root cause was a change in the kernel installed on the new node, which triggered a regression in Seastar and led to the behavior described above.

Downgrading the kernel on the affected node worked as a temporary solution until a corresponding fix was added to Seastar.

Without a Flame Graph it would have been much more difficult to identify the problem and to understand why the node was slow.

Expected Outcome / Benefit

Flame Graphs turn confusing CPU profiles into a clear picture. They help you spot hotspots on specific shards (CPU cores).

Key points:

  • Focus on CPU usage per shard.

  • Hot paths show issues like I/O delays, compaction overload, network bottlenecks, etc.

  • Cross-check with Prometheus metrics for the full picture.
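
As a rough illustration of the per-shard point, the sketch below groups folded samples by thread and flags threads that received far more samples than average. It assumes that the first frame of each folded stack identifies the thread and that each shard runs on its own thread. The names `shard-N`, the counts, and the 1.5x threshold are placeholders, not ScyllaDB conventions.

```python
from collections import Counter

def samples_per_thread(folded_lines):
    """Sum sample counts by the first frame, assumed to be the thread name."""
    per_thread = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        per_thread[stack.split(";", 1)[0]] += int(count)
    return per_thread

def hot_threads(per_thread, factor=1.5):
    """Return threads whose sample count exceeds factor * mean."""
    mean = sum(per_thread.values()) / len(per_thread)
    return sorted(t for t, c in per_thread.items() if c > factor * mean)

# Hypothetical folded stacks, one placeholder thread name per shard.
FOLDED = """\
shard-0;reactor_run;read_query 250
shard-1;reactor_run;read_query 240
shard-2;reactor_run;read_query 300
shard-2;reactor_run;compaction 600
shard-3;reactor_run;read_query 210
""".splitlines()

per_thread = samples_per_thread(FOLDED)
for thread, count in sorted(per_thread.items()):
    print(f"{thread}: {count} samples")
print("hot:", hot_threads(per_thread))
```

A thread flagged this way is only a starting point; the Flame Graph for that shard and the matching per-shard metrics are still needed to explain why it is busier.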