Hi Team,
We are seeing a lot of "_read_concurrency_sem: wait queue overload" exceptions in the ScyllaDB logs. Is there a specific reason for this, and how can we resolve it?
Oct 9 10:35:45 Cass-1 scylla: message repeated 22 times: [ [shard 28:stat] storage_proxy - Exception when communicating with 10.0.1.4, to read from .
: _read_concurrency_sem: wait queue overload]
Oct 9 10:35:45 Cass-1 scylla: [shard 30:stat] storage_proxy - Exception when communicating with 10.0.1.4, to read from .
: _read_concurrency_sem: wait queue overload
Oct 9 10:35:45 Cass-1 scylla: message repeated 8 times: [ [shard 30:stat] storage_proxy - Exception when communicating with 10.0.1.4, to read from .
: _read_concurrency_sem: wait queue overload]
Oct 9 10:35:45 Cass-1 scylla: [shard 8:stat] storage_proxy - Exception when communicating with 10.0.1.4, to read from .
: _read_concurrency_sem: wait queue overload
Oct 9 10:35:45 Cass-1 scylla: message repeated 24 times: [ [shard 8:stat] storage_proxy - Exception when communicating with 10.0.1.4, to read from .
: _read_concurrency_sem: wait queue overload]
Oct 9 10:35:45 Cass-1 scylla: [shard 6:stat] storage_proxy - Exception when communicating with 10.0.1.4, to read from .
: _read_concurrency_sem: wait queue overload
Thanks
In general, @denesb maintains a very good, detailed doc on the reader concurrency semaphore internals at scylladb/docs/dev/reader-concurrency-semaphore.md at master · scylladb/scylladb · GitHub.
You should have received a diagnostics dump with detailed information on what happened the moment the overload manifested.
Most of the time (though not always) this happens because clients are overloading the database, so Monitoring should also be your ally here. Look at the per-shard data on foreground and background reads, the latencies, and the advanced dashboard, and take it from there.
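To build some intuition for what the error means: each shard admits reads through a semaphore with a fixed budget of permits, and reads that cannot be admitted immediately wait in a bounded queue; once both the permits and the queue are exhausted, new reads are rejected outright rather than queued. The sketch below is a toy Python model of that admission policy — the class name, numbers, and queue-promotion behavior are illustrative assumptions, not Scylla's actual internals (see the doc linked above for those):

```python
from collections import deque


class ReadConcurrencySemaphore:
    """Toy model: a fixed number of concurrent read permits plus a
    bounded wait queue. When both are exhausted, further reads are
    rejected immediately -- the analogue of 'wait queue overload'."""

    def __init__(self, permits: int, max_queue: int):
        self.permits = permits
        self.max_queue = max_queue
        self.queue: deque = deque()

    def admit(self, read_id: str) -> str:
        if self.permits > 0:
            self.permits -= 1
            return f"{read_id}: admitted"
        if len(self.queue) < self.max_queue:
            self.queue.append(read_id)
            return f"{read_id}: queued"
        raise RuntimeError("_read_concurrency_sem: wait queue overload")

    def release(self) -> None:
        # A finished read hands its permit to the oldest queued read,
        # or returns it to the pool if nothing is waiting.
        if self.queue:
            self.queue.popleft()  # promoted read now holds the permit
        else:
            self.permits += 1


sem = ReadConcurrencySemaphore(permits=2, max_queue=2)
for r in ["r1", "r2", "r3", "r4"]:
    print(sem.admit(r))  # r1/r2 admitted, r3/r4 queued
try:
    sem.admit("r5")  # permits and queue both full -> rejected
except RuntimeError as e:
    print("rejected:", e)
```

The practical takeaway is that the error is a back-pressure signal: the shard is refusing to let the wait queue grow without bound, which usually points at read load arriving faster than that shard can retire it.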
I cannot find a coredump in /var/lib/scylla/coredump/.
Any idea how to set up coredumps for ScyllaDB? I ran scylla_coredump_setup and got the following output:
root@(Cass-6)[/var/log/scylla]->> scylla_coredump_setup
kernel.core_pattern = |/lib/systemd/systemd-coredump %P %u %g %s %t 9223372036854775808 %h
kernel.core_pipe_limit = 16
fs.suid_dumpable = 2
Generating coredump to test systemd-coredump...
PID: 1041734 (bash)
UID: 0 (root)
GID: 0 (root)
Signal: 11 (SEGV)
Timestamp: Thu 2024-10-10 06:31:38 UTC (3s ago)
Command Line: /bin/bash /tmp/tmps43gavyw
Executable: /usr/bin/bash
Control Group: /user.slice/user-0.slice/session-34040.scope
Unit: session-34040.scope
Slice: user-0.slice
Session: 34040
Owner UID: 0 (root)
Boot ID: 026dc33a62f84052af04918dffe1170d
Machine ID: 299f966af56e46e5b31a8e89157e832e
Hostname: Cas05
Storage: /var/lib/systemd/coredump/core.bash.0.026dc33a62f84052af04918dffe1170d.1041734.1728541898000000 (present)
Disk Size: 544.0K
Message: Process 1041734 (bash) of user 0 dumped core.
Found module linux-vdso.so.1 with build-id: 14bef519b0ba51c4f5319b4e6db5b6f6db2ad708
Found module ld-linux-x86-64.so.2 with build-id: 4186944c50f8a32b47d74931e3f512b811813b64
Found module libc.so.6 with build-id: 490fef8403240c91833978d494d39e537409b92e
Found module libtinfo.so.6 with build-id: e22ba7829a55a0dec2201a0b6dac7ba236118561
Found module bash with build-id: 7a6408ba82a2d86dd98f1f75ac8edcb695f6fd60
Stack trace of thread 1041734:
#0 0x00007fead490575b kill (libc.so.6 + 0x4275b)
#1 0x000055a6d7165539 kill_builtin (bash + 0xb5539)
#2 0x000055a6d710398c n/a (bash + 0x5398c)
#3 0x000055a6d70fc6b4 n/a (bash + 0x4c6b4)
#4 0x000055a6d70fdb5d execute_command_internal (bash + 0x4db5d)
#5 0x000055a6d71001b8 execute_command (bash + 0x501b8)
#6 0x000055a6d70f13cb reader_loop (bash + 0x413cb)
#7 0x000055a6d70e2c46 main (bash + 0x32c46)
#8 0x00007fead48ecd90 n/a (libc.so.6 + 0x29d90)
#9 0x00007fead48ece40 __libc_start_main (libc.so.6 + 0x29e40)
#10 0x000055a6d70e2f15 _start (bash + 0x32f15)
systemd-coredump is working finely.
root@(Cass-6)[/var/log/scylla]->> ll /var/lib/scylla/coredump/
total 8
drwxr-xr-x 2 scylla scylla 4096 Aug 24 2015 ./
drwxr-xr-x 7 scylla scylla 4096 Sep 1 02:10 ../
root@(Cass-6)[/var/log/scylla]->>
Thanks
This looks fine. We delete the generated test core to save space. You can trigger a core yourself and check for its presence.
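For example, one way to confirm cores are captured end to end is to crash a throwaway process and then look it up. A sketch (assumes Linux and, for the final lookup step, that systemd-coredump is active as configured above):

```python
import shutil
import signal
import subprocess

# Spawn a throwaway process and kill it with SIGSEGV to force a core dump.
victim = subprocess.Popen(["sleep", "30"])
victim.send_signal(signal.SIGSEGV)
status = victim.wait()

# A negative return code -N means the process died from signal N.
assert status == -signal.SIGSEGV, status
print(f"process {victim.pid} died from SIGSEGV")

# If systemd-coredump is active, the dump should now be listed.
# Best-effort: coredumpctl may not be installed everywhere.
if shutil.which("coredumpctl"):
    subprocess.run(
        ["coredumpctl", "list", "--no-pager", str(victim.pid)],
        check=False,
    )
```

Whether a core file is actually written still depends on the core_pattern and limits shown in the scylla_coredump_setup output, so if `coredumpctl` shows nothing, re-check those settings first.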