Need help understanding the rlatencyp95 metrics exposed by Scylla

We are facing latency issues. While debugging in Grafana, we see that rlatencyp95 is exposed at the instance, shard, cluster, and DC levels.
When the issue occurs, rlatencyp95 at the instance level still appears to be in microseconds, but at the cluster level it spikes to seconds.
What is the difference between the instance-level and cluster-level metrics? Below is the query I used to debug the latency issue. I wanted a breakdown at the instance level to identify whether any particular instance is causing the latency.

avg(rlatencyp99{by="cluster", instance, cluster=~"scyl-test", scheduling_group_name!="streaming"} > 0) by (cluster, instance)

But after applying this query, I observed that the latency shown at the instance level is still in microseconds, while the cluster-level metric still shows a latency of 30 seconds.
Is there any official documentation that explains the scope of these metrics?

The snapshot graph above has 2 Scylla nodes.

Hi! A hot partition or other imbalance is the first thing to check. Please experiment with:

And in general it is better to debug using Scylla Monitoring, because we have a lot of specialized graphs for such things (e.g. it's better to view the whole cluster at a per-shard (vCPU) level).
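
To illustrate why a per-shard view can show an imbalance that an averaged per-instance query smooths over, here is a minimal Python sketch. The node names, shard counts, and latency values are made up for illustration; they are not taken from the cluster in this thread, and the sketch is not meant to explain the exact discrepancy reported above.

```python
from statistics import mean

# Hypothetical p95 read latencies in microseconds, keyed by (instance, shard).
# All names and values are invented for this example.
p95_us = {
    ("node-1", 0): 800.0,
    ("node-1", 1): 900.0,
    ("node-1", 2): 850.0,
    ("node-1", 3): 30_000_000.0,  # a single hot shard at 30 seconds
    ("node-2", 0): 700.0,
    ("node-2", 1): 750.0,
    ("node-2", 2): 720.0,
    ("node-2", 3): 710.0,
}


def per_instance(values, agg):
    """Group per-shard values by instance and reduce each group with agg."""
    grouped = {}
    for (instance, _shard), value in values.items():
        grouped.setdefault(instance, []).append(value)
    return {instance: agg(vals) for instance, vals in grouped.items()}


# Averaging blends the hot shard with the healthy ones on the same node.
avg_by_instance = per_instance(p95_us, mean)

# Taking the maximum keeps the worst shard visible per node.
max_by_instance = per_instance(p95_us, max)

# Looking at the raw per-shard values points at the exact shard.
hottest = max(p95_us, key=p95_us.get)

print("avg per instance:", avg_by_instance)
print("max per instance:", max_by_instance)
print("hottest (instance, shard):", hottest)
```

In this made-up data the average for node-1 is far below the value of its worst shard, while the per-shard view identifies the shard directly.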


Marcin, the problem here is that this is happening on all nodes in the same GCP project, irrespective of cluster.
So we feel it's not a workload issue. Could a network-related issue cause this?