Originally from the User Slack
@Roy: Hi, we sometimes see high latency on our cluster (99th percentile reads and writes go above 300 ms), and it goes away after a while. The volume of reads and writes at the keyspace level generally does not differ much during that time, so we are unable to clearly identify the culprit. We also see many reader_concurrency_semaphore messages in syslog, besides many occurrences of prepared_cache_eviction in the Scylla Monitoring advisor. To get a clearer idea of what the culprit is, we are thinking of enabling more detailed logging as per: https://opensource.docs.scylladb.com/stable/operating-scylla/nodetool-commands/setlogginglevel.html. Could someone suggest which loggers we should enable? Is it safe to enable debug level on all components in a highly to moderately loaded PROD cluster?
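For reference, a minimal sketch of targeted logger tuning with nodetool setlogginglevel, rather than blanket-enabling debug everywhere. The logger name used below ("database") is only an illustrative assumption, and the --help-loggers option may vary by version; check the list of loggers available on your own node before picking one:

```
# List the loggers this build knows about (option availability may vary by version).
scylla --help-loggers

# Raise one suspected component to debug instead of everything at once
# ("database" is just an illustrative logger name; pick one from the list above).
nodetool setlogginglevel database debug

# Revert to the default level once the investigation is done.
nodetool setlogginglevel database info
```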
@Botond_Dénes: Blanket-enabling all loggers to debug is a really bad idea.
You should start debugging this in monitoring, look for events correlating with the elevated timeouts. Switch the dashboards to “by shard” view, so aggregation doesn’t hide outlier shards.
@Roy: Well, I can at least see spikes in compactions, but they are there all the time, whether or not the issue is occurring. We are using the default compaction settings, without any tuning in scylla.yaml. I will have a look at the shard level in Monitoring.
@Botond_Dénes: Check the compaction scheduling group shares – do they climb to 1000?
@Roy: Hi, compaction CPU runtime spikes to 150-220% at the node level and 25-30% at the shard level, but the compaction shares stay close to 50.
@Botond_Dénes: In that case, I don’t think compaction is to blame here. ScyllaDB has schedulers that isolate scheduling groups from each other; with 50 shares, compaction will not be able to impact the statement scheduling group (which has 1000 shares).
@Roy: Also, on the node-level “details” dashboard I can see a high number of tombstone writes and cell tombstone writes.
@avi: In Advanced view, look for scheduling groups that have non-zero Task Quota Violations
@Roy: I could only see this for sl:default.
@Botond_Dénes: A common source of task quota violations is stalls. Do you see any in the logs?
Stalls are known to cause high 99% latencies.
@Roy: Good morning. Yes, I see reactor stalls and reader_concurrency_semaphore messages in syslog.
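As a quick way to confirm and quantify the stalls, a minimal sketch for pulling stall reports out of the journal; it assumes Scylla runs under systemd as scylla-server and that the stall message contains the text "Reactor stalled", which may differ slightly between versions:

```
# Show reactor stall reports together with the first lines of their backtraces.
journalctl -u scylla-server | grep -A 3 "Reactor stalled"

# Count how many stalls occurred, to correlate them with the latency spikes.
journalctl -u scylla-server | grep -c "Reactor stalled"
```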
@Botond_Dénes: There is a good chance the stalls are the direct cause of the elevated latencies.
Please open a GitHub issue with the stalls (please decode them with http://backtrace.scylladb.com/index.html).
The Build ID can be obtained from the logs; it is printed right at the beginning of startup.
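Two ways to fetch the Build ID, sketched under the assumption that Scylla runs under systemd as scylla-server and that the binary lives at /usr/bin/scylla:

```
# From the startup logs (the build-id is printed when the node starts).
journalctl -u scylla-server | grep -i "build-id" | head -n 1

# Directly from the binary, via the standard ELF build ID note.
readelf -n /usr/bin/scylla | grep "Build ID"
```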