Originally from the User Slack
@Shantanu_Sachdev:
Feb 15 07:56:00 ip-172-31-129-69 scylla[1797805]: [shard 11] reader_concurrency_semaphore - (rate limiting dropped 11 similar messages) Semaphore _read_concurrency_sem with 6/100 count and 4074705/169324052 memory resources: timed out, dumping permit diagnostics:
permits count memory table/description/state
2 2 2307K traffickeyspace.alarm_data_createdate_asc_view/data-query/active/used
2 2 1635K traffickeyspace.alarm_data_createdate_asc_view/data-query/inactive
2 2 37K traffickeyspace.current_vehicle_data/data-query/inactive
1 0 0B system.local/shard-reader/waiting
10 0 0B traffickeyspace.current_vehicle_data/mutation-query/waiting
10 0 0B traffickeyspace.trips_by_vehicles/data-query/waiting
2 0 0B traffickeyspace.raw_device_data_2_2024/data-query/waiting
22 0 0B traffickeyspace.alarm_data_createdate_asc_view/data-query/waiting
5 0 0B traffickeyspace.vehicle_daily_data/data-query/waiting
35 0 0B traffickeyspace.current_vehicle_data/data-query/waiting
2 0 0B traffickeyspace.vehicle_hourly_data/data-query/waiting
1 0 0B system.peers/shard-reader/waiting
94 6 3979K total
Total: 94 permits with 6 count and 3979K memory resources
Hello All,
I am frequently getting logs like the above on the Scylla nodes in production. Can someone help me understand what exactly timed out according to this log? It seems that neither the count nor the memory threshold was reached, yet the query still timed out. What are permits, and how can I increase them so that these timeouts do not happen?
Thanks.
@Botond_Dénes: If neither the count (I/O) nor the memory limit is saturated, it will be the third (implicit) one: CPU.
Look at your monitoring; you will likely see the shard reporting this hovering at around 100% CPU usage.
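For reference, the header of the dump above already shows how far the two explicit limits are from saturation. Below is a minimal sketch that pulls those numbers out of such a line and prints the utilization; it assumes only the log format shown above and is not a Scylla tool:

```python
import re

# Example header line from the reader_concurrency_semaphore diagnostics dump above.
line = ("Semaphore _read_concurrency_sem with 6/100 count and "
        "4074705/169324052 memory resources: timed out")

# Pull out "used/limit" for the count and memory resources.
m = re.search(r"(\d+)/(\d+) count and (\d+)/(\d+) memory", line)
count_used, count_limit, mem_used, mem_limit = map(int, m.groups())

print(f"count:  {count_used}/{count_limit}  ({100 * count_used / count_limit:.0f}% used)")
print(f"memory: {mem_used}/{mem_limit}  ({100 * mem_used / mem_limit:.1f}% used)")
# With ~6% of the count and ~2.4% of the memory budget in use, neither explicit
# limit caused the timeout; the waiting permits are queued behind a saturated shard.
```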
@Shantanu_Sachdev: Yes, the load on a lot of shards frequently peaks at 100%, but the node's overall CPU never goes above 50%. How can I utilize it optimally?
@Botond_Dénes: You probably have partitions that are more popular than others. The shards hosting these popular partitions are maxed out. A scale-out or scale-up can help if the problem is multiple reasonably hot partitions: scaling out or up can move them to separate nodes/shards, spreading the load. If there is just a handful of really hot partitions, I'm afraid you will have to look into changing your data model so that you don't have such hot partitions.
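One common way to change the data model for a hot partition is to add a synthetic bucket to the partition key, so traffic for the same logical key spreads across several partitions (and therefore shards). Here is a minimal sketch using the Python driver; the table name, columns, and bucket count are hypothetical illustrations, not taken from the actual schema above:

```python
import random
from cassandra.cluster import Cluster  # works with scylla-driver / cassandra-driver

BUCKETS = 16  # hypothetical: tune to roughly how widely you want to spread one hot key

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("traffickeyspace")

# Hypothetical bucketed variant of a hot table: the partition key becomes
# ((vehicle_id, bucket)) instead of just (vehicle_id).
session.execute("""
    CREATE TABLE IF NOT EXISTS current_vehicle_data_bucketed (
        vehicle_id text,
        bucket int,
        event_time timestamp,
        payload blob,
        PRIMARY KEY ((vehicle_id, bucket), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
""")

insert = session.prepare(
    "INSERT INTO current_vehicle_data_bucketed "
    "(vehicle_id, bucket, event_time, payload) VALUES (?, ?, ?, ?)"
)

def write_event(vehicle_id, event_time, payload):
    # Writes for one hot vehicle now land on BUCKETS different partitions.
    session.execute(insert, (vehicle_id, random.randrange(BUCKETS), event_time, payload))

def read_latest(vehicle_id, limit=100):
    # Reads must fan out over all buckets and merge client-side.
    rows = []
    for b in range(BUCKETS):
        rows.extend(session.execute(
            "SELECT event_time, payload FROM current_vehicle_data_bucketed "
            "WHERE vehicle_id = %s AND bucket = %s LIMIT %s",
            (vehicle_id, b, limit)))
    return sorted(rows, key=lambda r: r.event_time, reverse=True)[:limit]
```

The trade-off is that a point read becomes BUCKETS small queries instead of one, so this only pays off when a partition is genuinely hot enough to saturate a shard.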
@Shantanu_Sachdev: Okay, scaling out would be costly for us since we are using i4i.8xlarge nodes. We will look more into the hot partitions.