High read latency

Hi,

I have a ScyllaDB cluster deployed on a self-hosted Kubernetes cluster with the k8s operator.
I have 3 nodes with 16 CPUs and 32 GB of memory each. Each node uses node-local storage with local-path-provisioner on SSD.

When the client service is started, read latency reaches 4 s at peak and hovers around 1-2 s on average. That is with reads at 6k ops/s, which does not seem like a lot. I found this in the logs:

INFO 2024-08-26 14:10:20,132 [shard 8:stat] reader_concurrency_semaphore - (rate limiting dropped 1 similar messages) Semaphore _read_concurrency_sem with 100/100 count and 2280960/38755368 memory resources: timed out, dumping permit diagnostics:
permits count memory table/operation/state
61 61 1299K user_activity.user_internal_ids/data-query/active/await
39 39 928K user_activity.device_internal_ids/data-query/active/await
1 0 0B user_activity.device_internal_ids/mutation-query/waiting_for_admission
132 0 0B user_activity.device_internal_ids/data-query/waiting_for_admission
151 0 0B user_activity.user_internal_ids/data-query/waiting_for_admission

384 100 2228K total

Stats:
permit_based_evictions: 15
time_based_evictions: 0
inactive_reads: 0
total_successful_reads: 298208
total_failed_reads: 2729
total_reads_shed_due_to_overload: 0
total_reads_killed_due_to_kill_limit: 0
reads_admitted: 299593
reads_enqueued_for_admission: 67237
reads_enqueued_for_memory: 0
reads_admitted_immediately: 234191
reads_queued_because_ready_list: 29875
reads_queued_because_need_cpu_permits: 7514
reads_queued_because_memory_resources: 29848
reads_queued_because_count_resources: 0
reads_queued_with_eviction: 4
total_permits: 301438
current_permits: 384
need_cpu_permits: 100
awaits_permits: 100
disk_reads: 100
sstables_read: 115

I’m not sure how to interpret that, especially the “100/100 count”. CPU usage is low and memory is not saturated. I want to blame I/O, but I can’t find any metrics to support that.

Thank you

32 GB of RAM for 16 CPUs is very little. ScyllaDB will use most of that, leaving very little for cache, which will make most of your reads go to disk. ScyllaDB has a hard limit on the number of concurrent disk reads (per shard): 100. Once you have that many disk reads, new read requests are queued. This is what we see above. Some read requests sit in this queue for so long, waiting for their turn, that they time out while waiting.
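
If you want metrics to back that up, you can pull them straight from a node's Prometheus endpoint (the same data the scylla-monitoring Grafana dashboards read). Below is a minimal sketch, assuming the default metrics port 9180 and the usual scylla_cache_* / scylla_database_* counter prefixes; the node address is hypothetical, so adjust it to your pods:

import urllib.request

# Minimal sketch: dump cache / read-queue counters from one node's metrics
# endpoint. Port 9180, the /metrics path and the metric-name prefixes are
# assumptions based on a default ScyllaDB deployment -- adjust as needed.
NODE = "10.0.0.1"  # hypothetical node/pod address
url = f"http://{NODE}:9180/metrics"

with urllib.request.urlopen(url, timeout=5) as resp:
    body = resp.read().decode()

for line in body.splitlines():
    if line.startswith("#"):
        continue  # skip HELP/TYPE comment lines
    # Cache hit vs. miss counters show how many reads fall through to sstables;
    # the database read counters include reads queued behind the concurrency limit.
    if line.startswith(("scylla_cache_", "scylla_database_")) and "read" in line:
        print(line)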

We usually provision around 8 GB of RAM per CPU. That is a much more comfortable amount, leaving room for a healthy cache, which improves latencies a lot.
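
To put rough numbers on it, here is a back-of-the-envelope sketch for the cluster above. It assumes one shard per CPU, an even memory split across shards, and about 1 GiB left to the OS; those reservations are assumptions, not how ScyllaDB actually carves up memory:

# Rough per-shard memory estimate for 32 GiB / 16 CPUs (assumptions as above).
total_ram_gib = 32
shards = 16                 # assuming one shard per CPU
os_reserved_gib = 1         # assumption: some RAM left to the OS

shard_mem = (total_ram_gib - os_reserved_gib) * 1024**3 / shards
print(f"per-shard memory: ~{shard_mem / 1024**3:.1f} GiB")      # ~1.9 GiB

# The 38755368-byte reader-semaphore memory budget in the log above works out
# to roughly 2% of this estimate.
print(f"semaphore budget share: {38_755_368 / shard_mem:.1%}")  # ~1.9%

# Compared with the ~8 GB-of-RAM-per-CPU guideline, each shard has roughly a
# quarter of the recommended memory, so cache space is correspondingly small.
print(f"recommended ~8 GiB per shard vs. actual ~{shard_mem / 1024**3:.1f} GiB")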


Thank you, I’m starting to understand some aspects better, like the read limit and the queue.

My first thought was also that there aren't enough resources for Scylla to run stably, but when I check the metrics in Grafana, Scylla uses only 8 CPUs and 25 GB of memory. As a new Scylla user it’s hard for me to know what performance I should expect from Scylla in terms of reads/writes per second given the resources I assign to the Scylla pods.

I know it depends on the actual queries, but is there a rule of thumb, like every 2 CPUs and 16 GB of RAM should give at least 1,000 reads per second, so I at least know whether the performance I’m getting is in the right ballpark?