Originally from the User Slack
@Roy: Hi, we sometimes see high latency on our cluster (99th percentile reads and writes go above 300 ms), and it goes away after a while. The volume of reads and writes at the keyspace level generally does not differ much during that time, so we are unable to clearly identify the culprit. We also see many reader_concurrency_semaphore messages in syslog, besides many occurrences of prepared_cache_eviction in the Scylla Monitoring advisor. To get a clearer idea of what the culprit is, we are thinking of enabling more detailed logging as per: https://opensource.docs.scylladb.com/stable/operating-scylla/nodetool-commands/setlogginglevel.html. Could someone suggest which loggers we should enable? Is it safe to enable debug level on all components in a highly to moderately loaded PROD cluster?
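For reference, a minimal sketch of targeted logger tuning with nodetool setlogginglevel, rather than blanket-enabling debug everywhere. The logger name used below ("database") is only an illustrative assumption, and the --help-loggers option may vary by version; check the list of loggers available on your own node before picking one:

```
# List the loggers this build knows about (option availability may vary by version).
scylla --help-loggers

# Raise one suspected component to debug instead of everything at once
# ("database" is just an illustrative logger name; pick one from the list above).
nodetool setlogginglevel database debug

# Revert to the default level once the investigation is done.
nodetool setlogginglevel database info
```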
@Botond_Dénes: Blanket-enabling all loggers to debug is a really bad idea.
You should start debugging this in monitoring, look for events correlating with the elevated timeouts. Switch the dashboards to “by shard” view, so aggregation doesn’t hide outlier shards.
@Roy: Well, I can at least see spikes in compactions, but they are there all the time, whether or not the issue is occurring. We are using the default compaction settings, without any tuning in scylla.yaml. I will have a look at the shard level in Monitoring.
@Botond_Dénes: Check the compaction scheduling group shares – do they climb to 1000?
@Roy: Hi, compaction CPU runtime spikes to 150-220% at the node level and 25-30% at the shard level, but the compaction shares stay close to 50.
@Botond_Dénes: In that case, I don’t think compaction is to blame here. ScyllaDB has schedulers that isolate scheduling groups from each other; with 50 shares, compaction will not be able to impact the statement scheduling group (which has 1000 shares).
@Roy: Also, on the node-level “details” dashboard I can see a high number of tombstone writes and cell tombstone writes.
@avi: In Advanced view, look for scheduling groups that have non-zero Task Quota Violations
@Roy: I could only see this for sl:default.
@Botond_Dénes: A common source of task quota violations is stalls. Do you see any in the logs?
Stalls are known to cause high 99% latencies.
@Roy: Good morning. Yes, I see reactor stalls and reader_concurrency_semaphore messages in syslog.
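As a quick way to confirm and quantify the stalls, a minimal sketch for pulling stall reports out of the journal; it assumes Scylla runs under systemd as scylla-server and that the stall message contains the text "Reactor stalled", which may differ slightly between versions:

```
# Show reactor stall reports together with the first lines of their backtraces.
journalctl -u scylla-server | grep -A 3 "Reactor stalled"

# Count how many stalls occurred, to correlate them with the latency spikes.
journalctl -u scylla-server | grep -c "Reactor stalled"
```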
@Botond_Dénes: There is a good chance the stalls are the direct cause of the elevated latencies.
Please open a GitHub issue with the stalls (please decode them with http://backtrace.scylladb.com/index.html).
The Build ID can be obtained from the logs; it is printed right at the beginning of startup.
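Two ways to fetch the Build ID, sketched under the assumption that Scylla runs under systemd as scylla-server and that the binary lives at /usr/bin/scylla:

```
# From the startup logs (the build-id is printed when the node starts).
journalctl -u scylla-server | grep -i "build-id" | head -n 1

# Directly from the binary, via the standard ELF build ID note.
readelf -n /usr/bin/scylla | grep "Build ID"
```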