Hi,
I have a Scylla cluster deployed on a self-hosted Kubernetes cluster with the Scylla Kubernetes operator.
I have 3 nodes with 16 CPUs and 32 GB of memory each. Each node uses node-local storage on SSD, provisioned with local-path-provisioner.
When the client service starts, read latency peaks at 4 s and hovers around 1-2 s on average. That is with reads at about 6k ops/s, which does not seem like a lot. I found this in the logs:
INFO 2024-08-26 14:10:20,132 [shard 8:stat] reader_concurrency_semaphore - (rate limiting dropped 1 similar messages) Semaphore _read_concurrency_sem with 100/100 count and 2280960/38755368 memory resources: timed out, dumping permit diagnostics:
permits count memory table/operation/state
61 61 1299K user_activity.user_internal_ids/data-query/active/await
39 39 928K user_activity.device_internal_ids/data-query/active/await
1 0 0B user_activity.device_internal_ids/mutation-query/waiting_for_admission
132 0 0B user_activity.device_internal_ids/data-query/waiting_for_admission
151 0 0B user_activity.user_internal_ids/data-query/waiting_for_admission
384 100 2228K total
Stats:
permit_based_evictions: 15
time_based_evictions: 0
inactive_reads: 0
total_successful_reads: 298208
total_failed_reads: 2729
total_reads_shed_due_to_overload: 0
total_reads_killed_due_to_kill_limit: 0
reads_admitted: 299593
reads_enqueued_for_admission: 67237
reads_enqueued_for_memory: 0
reads_admitted_immediately: 234191
reads_queued_because_ready_list: 29875
reads_queued_because_need_cpu_permits: 7514
reads_queued_because_memory_resources: 29848
reads_queued_because_count_resources: 0
reads_queued_with_eviction: 4
total_permits: 301438
current_permits: 384
need_cpu_permits: 100
awaits_permits: 100
disk_reads: 100
sstables_read: 115
I'm not sure how to interpret that, especially the "100/100 count" part. CPU usage is low and memory is not saturated; I want to blame IO, but I can't find any metrics to support that.
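To make the numbers easier to reason about, here is the quick arithmetic I did on the dump, as a Python sketch. The even spread of reads across 3 × 16 shards and my reading of the count limit as "concurrently running reads per shard" are assumptions on my part, so this is only back-of-the-envelope:

```python
# Back-of-the-envelope arithmetic on the semaphore dump above.
# Assumptions (mine, not from the logs): reads spread evenly over
# 3 nodes x 16 shards, replication fan-out ignored, and the "count"
# of 100 is the per-shard limit on concurrently running reads.

ops_per_sec = 6_000
shards = 3 * 16
per_shard_rate = ops_per_sec / shards
print(f"per-shard read rate: {per_shard_rate:.0f} reads/s")        # ~125

# Little's law: with all 100 permits busy, avg time per read = L / lambda.
print(f"implied avg read duration: {100 / per_shard_rate:.2f} s")  # ~0.8 s, same ballpark as the 1-2 s latency

# Semaphore resources at the moment of the dump.
mem_used, mem_limit = 2_280_960, 38_755_368
print(f"count: 100/100 (fully used), memory: {mem_used / mem_limit:.1%} of limit")  # ~5.9%

# The three queue reasons add up exactly to reads_enqueued_for_admission.
reasons = {"ready_list": 29_875, "need_cpu_permits": 7_514, "memory_resources": 29_848}
print(sum(reasons.values()) == 67_237)                             # True
print(f"reads that had to queue: {67_237 / 301_438:.0%} of all permits")  # ~22%
```

If I read this right, it is the count limit (concurrent reads) rather than the memory limit that is saturated, and disk_reads: 100 suggests all of the active reads are waiting on disk, which is why I suspect IO in the first place.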
Thank you