Originally from the User Slack
@Igor_Q: hey, guys
i have a cluster of 3 datacenters, using scylla v6.0.4
it worked fine for several months, until yesterday read latency went through the roof up to 5 seconds in two datacenters
i contemplated performing a rolling restart, and after rebooting a single node everything seemed fine (read latency dropped back to appropriate values)
but today it happened again; now read latency has maxed out in all 3 datacenters
how do i diagnose the problem? i see a bunch of metrics in the monitoring stack, but i can’t make out which are the cause and which are the consequence
also, i’ve noticed that upon rebooting, the compaction queue becomes really large (40+ compactions) – aren’t compactions supposed to run in the background?
@avi: You can look at the Advanced dashboard (in per-shard mode) to see whether CPU or I/O are the bottleneck, and for which scheduling group.
Compactions aren’t supposed to be affected by restarts.
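The same per-shard check can also be run against the Prometheus server behind the Scylla Monitoring stack. A minimal sketch, assuming Prometheus listens on localhost:9090 and exposes the scylla_reactor_utilization metric with instance/shard labels (adjust both for your setup):

```python
# Minimal sketch: pull per-shard reactor utilization from the Prometheus
# instance backing Scylla Monitoring and print the hottest shards.
# The endpoint and metric name below are assumptions; adjust for your setup.
import requests

PROM = "http://localhost:9090/api/v1/query"

resp = requests.get(PROM, params={"query": "scylla_reactor_utilization"})
resp.raise_for_status()
samples = resp.json()["data"]["result"]

# Sort by utilization, highest first, and show the ten busiest shards.
samples.sort(key=lambda s: float(s["value"][1]), reverse=True)
for s in samples[:10]:
    labels = s["metric"]
    print(f'{labels.get("instance")} shard {labels.get("shard")}: '
          f'{float(s["value"][1]):.0f}% busy')
```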
@Igor_Q: > Compactions aren’t supposed to be affected by restarts.
you mean they shouldn’t spike when a node is restarted, right?
@avi: Right
Check if the compaction type is RESHAPE; we had bugs in this area where unnecessary reshapes were generated.
btw 40 compactions across the cluster isn’t a lot
@Igor_Q: > Check if the compaction type is RESHAPE, we had bugs in this area where unnecessary reshapes were generated
is there a way to see this in nodetool? i don’t see the type in nodetool compactionhistory
also, is it possible that background compactions in my case are somehow broken? after i removed all load from the cluster and manually ran compaction on each node, the problem seems to be fixed
if so, then how can i monitor for such cases? check compaction history once in a while?
@avi: The logs show the compaction type.
Hard to say what to do, I don’t have a clear image of what’s going on.
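Since nodetool compactionhistory doesn’t show the type, one rough way to check is to tally compaction types from the Scylla log. A sketch, assuming log lines of the form “[Compact ks.table …]” / “[Reshape ks.table …]” (the exact wording can differ between versions, so adapt the pattern):

```python
# Rough sketch: count Scylla compaction log lines per (type, table), since
# nodetool compactionhistory doesn't show the type. The regex assumes lines
# like "compaction - [Compact ks.table ...]" or "[Reshape ks.table ...]";
# adapt it to your version's log format. Pass a log file as the argument,
# e.g. a dump of `journalctl -u scylla-server`.
import re
import sys
from collections import Counter

PATTERN = re.compile(r"compaction.*?\[(\w+) (\S+)")  # captures type, keyspace.table

counts = Counter()
with open(sys.argv[1]) as log:
    for line in log:
        m = PATTERN.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1

for (ctype, table), n in counts.most_common(20):
    print(f"{ctype:10} {table:40} {n}")
```

A large share of Reshape entries would point at the unnecessary-reshape bugs avi mentioned.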
@Igor_Q: i can provide the necessary info if you’re willing to look into it
@avi: Post snapshots of the Advanced dashboard in per-shard mode when the event happens
@Igor_Q: this proved to be a non-trivial task, but you should be able to see the snapshot here: https://grafs.sonc.top/dashboard/snapshot/c9S7NR1Alroyni1c0FAtpEcdved0Xx2E?orgId=0
this is the first occurrence: read latency started growing up to 5 seconds at 19:13 (correlates with C++ exceptions)
this is the second occurrence: https://grafs.sonc.top/dashboard/snapshot/P7yJOe24LjU8A6iSKM0nvLz5I1rid5Vm?orgId=1
probably worth mentioning that we retry queries at most 1 time – this is due to high retransmit timeouts in our network
but this never posed a problem until the incidents
@avi: Looks like you’re out of CPU on shard 48.
Probably have an imbalanced workload (use nodetool toppartitions)
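If it helps, here is a hedged sketch of running toppartitions across the suspect nodes from one place; the hosts, keyspace, table and sampling duration below are placeholders, and the toppartitions arguments can differ between Scylla/nodetool versions, so check `nodetool help toppartitions` first:

```python
# Hedged sketch: run `nodetool toppartitions` on each node to look for a hot
# partition that could explain a single overloaded shard. Hosts, keyspace,
# table and duration are placeholders; argument syntax may vary by version.
import subprocess

HOSTS = ["10.144.65.37", "10.144.65.14"]        # placeholder node addresses
KEYSPACE, TABLE, DURATION_MS = "my_ks", "my_table", "5000"

for host in HOSTS:
    print(f"=== {host} ===")
    subprocess.run(
        ["ssh", host, "nodetool", "toppartitions", KEYSPACE, TABLE, DURATION_MS],
        check=False,   # keep going even if one node is unreachable
    )
```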
@Igor_Q: Thank you.
But how is it possible that the same shard across all nodes in two datacenters (10.200., 10.144.) is getting overloaded?
First of all, aren’t these all different shards? We have a replication factor of 1, so it seems pretty weird that the same shard 48 on both 10.144.65.37 and 10.144.65.14 is out of CPU. These are two different shards; they shouldn’t share any data, no? What am I missing here?
I also see the same picture on the “Detailed” dashboard. The annotation on “Reads per Shard – Coordinator” states: “Amount of requests served as the coordinator. Imbalances here represent dispersion at the connection level, not your data model”. Should this be understood as “the driver decided to flood shard 48 with requests on multiple nodes due to its internal logic”?
@avi Would you mind taking a look here, please?
@avi: Reactor load per shard doesn’t give enough information, use the CPU panels in the advanced dashboard to see which scheduling group uses the CPU
@Igor_Q: Didn’t you already link the corresponding panel on the Advanced tab here? I see no other significant load on these shards in this time interval.
Does “statement” scheduling group include coordination?
@avi: Statement includes coordination.
A shard-aware driver sends requests directly to the shards that own the data, so make sure you’re using one. These problems are especially noticeable with large shard counts.
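A minimal sketch of what that looks like with the Python scylla-driver (the shard-aware fork of cassandra-driver, installed with `pip install scylla-driver`); the contact point, local DC, keyspace and query are placeholders, and shard-aware routing relies on the driver knowing the partition key, hence the prepared statement:

```python
# Minimal sketch with the Python scylla-driver (shard-aware fork of
# cassandra-driver). Contact point, DC name, keyspace and query are
# placeholders for illustration.
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

profile = ExecutionProfile(
    # Token awareness routes each request to a replica that owns the data;
    # with scylla-driver it also targets the specific shard on that replica.
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="dc1")   # placeholder DC name
    )
)

cluster = Cluster(
    ["10.144.65.37"],                             # placeholder contact point
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect("my_ks")                # placeholder keyspace

# Prepared statements carry partition-key metadata, which enables shard routing.
stmt = session.prepare("SELECT * FROM my_table WHERE pk = ?")
rows = session.execute(stmt, ["some-key"])
print(rows.one())

cluster.shutdown()
```

If requests instead go through a driver or proxy that isn’t shard-aware, each request lands on whichever shard happens to serve that client connection, which matches the “dispersion at the connection level” wording in the dashboard annotation quoted above.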