Different CPU consumption by Scylla threads with different Linux kernels after the "nodetool drain" command

Hi All,

The question is about different CPU consumption by Scylla threads with different Linux kernels after the nodetool drain command.
All the results are from a single-node system, but the behavior of multi-node systems is nearly the same.
ScyllaDB 5.1.5 Open Source.
We suspect that different kernel functions are called depending on the OS kernel version, and that this explains the different behavior.
The details are below.

Questions:
Is this known behavior?
Does it work as designed?

top [-1] -H -n1 -b -p $(pidof scylla)

Linux kernels 3.x / 4.x
Ubuntu 18.04, Centos 7/8, RHEL 8.1

Threads:  12 total,   1 running,  11 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.8 us,  9.8 sy,  0.0 ni, 73.8 id,  0.0 wa,  0.0 hi,  1.6 si,  0.0 st
KiB Mem :  3861256 total,  3041724 free,   455896 used,   363636 buff/cache
KiB Swap:  4063228 total,  4063228 free,        0 used.  3171324 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
  8993 scylla    20   0   16.0t 201992  32912 R 93.3  5.2   1:46.35 scylla    <--
  8994 scylla    20   0   16.0t 201992  32912 S  0.0  5.2   0:01.36 reactor-1
  ...

Linux kernels 5.x
Ubuntu 20.04, Centos 7 (5.x kernel is installed manually)

Threads:   8 total,   0 running,   8 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  1.7 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  3990136 total,  3286472 free,   406180 used,   297484 buff/cache
KiB Swap:  4063228 total,  4063228 free,        0 used.  3354412 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
  1263 scylla    20   0   16.0t 239768  65260 S  0.0  6.0   0:02.38 scylla
  1265 scylla    20   0   16.0t 239768  65260 S  0.0  6.0   0:01.70 reactor-1
  ...

On distros where the top -1 option is available, we see that the first one or two threads are 100% busy.
The situation is slightly different in a multi-node environment:
On the drained node the reactor-1 thread consumes 100% CPU as well (two threads are 100% busy in this case),
but not on the other nodes, where only the main scylla process consumes 100% of the CPU.

strace -p $(pidof scylla) -c

Linux kernels 3.x / 4.x
Ubuntu 18.04, Centos 7/8, RHEL 8.1

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.58    4.569387           3   1359826           epoll_pwait
  0.22    0.010306          20       510           write
  0.16    0.007365           6      1115           timerfd_settime
  0.02    0.000973          10        97           timer_settime
  0.01    0.000638           6        95           rt_sigreturn
  0.00    0.000046           5         8           rt_sigprocmask
------ ----------- ----------- --------- --------- ----------------
100.00    4.588715               1361651           total

Linux kernels 5.x
Ubuntu 20.04, Centos 7 (5.x kernel is installed manually)

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 71.14    0.692684         507      1365           io_pgetevents
  6.17    0.060087          12      4635           timerfd_settime
  5.31    0.051699          27      1859           io_submit
  4.71    0.045870          57       794           write
  3.95    0.038451          20      1901       182 read
  3.46    0.033645          22      1510           membarrier
  2.78    0.027074           9      2886           rt_sigprocmask
  2.48    0.024196          14      1702           timer_settime
------ ----------- ----------- --------- --------- ----------------
100.00    0.973706                 16652       182 total

The answer is in the strace output. On the older kernels ScyllaDB falls back to using epoll to poll the kernel for I/O. This is less efficient and involves the database busy-polling the kernel, hence the 100% CPU usage even when idle.
On newer kernels, where it is available, ScyllaDB uses the AIO kernel interface to poll for I/O completion, which is much more efficient.
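
To make the difference concrete, here is a small standalone C++ sketch (not ScyllaDB or Seastar code; the function names are made up) contrasting the two styles visible in the strace output: a zero-timeout epoll probe that never sleeps, versus an AIO wait that blocks until a completion arrives.

// Standalone illustration (not ScyllaDB/Seastar code; function names are made up)
// of the two polling styles visible in the strace output above.
#include <sys/epoll.h>
#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <unistd.h>

// epoll style: probe the kernel with a zero timeout. The call returns
// immediately whether or not anything is ready, so a reactor built around it
// keeps spinning and shows ~100% CPU even when idle.
void poll_once_with_epoll(int epfd) {
    epoll_event events[128];
    int n = epoll_wait(epfd, events, 128, /*timeout_ms=*/0);
    if (n > 0) { /* dispatch the ready events */ }
    // a real reactor runs this inside its main loop, over and over
}

// AIO style: when there is nothing else to do, block in the kernel until at
// least one I/O completion arrives (min_nr = 1), releasing the CPU meanwhile.
void sleep_until_aio_completion(aio_context_t ctx) {
    io_event events[128];
    long n = syscall(SYS_io_getevents, ctx, 1L, 128L, events,
                     /*timeout=*/static_cast<void*>(nullptr));
    if (n > 0) { /* dispatch the completions */ }
}

int main() {
    int epfd = epoll_create1(0);
    poll_once_with_epoll(epfd);
    close(epfd);
    // sleep_until_aio_completion() would also need an io_setup()-created context
    // and a submitted request, so it is not invoked in this tiny example.
    return 0;
}
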

Botond, thanks for your answer.

So:

  • there is a kernel check in the ScyllaDB engine, and it makes different kernel calls depending on the kernel version
  • the observed behavior on the older kernels is not a bug in the ScyllaDB engine; on these kernels there is no way to implement the required functionality as efficiently as it is done on the newer kernels

Are these statements correct?

Yes, your assessment is correct.

To provide some more detail on the kernel check: Seastar, the application framework on top of which ScyllaDB is implemented, has various reactor backend implementations. It chooses the best one based on what the kernel it is running on supports. Currently, the following backends exist (in order of preference):

  • io-uring (guarded by compile-time flag)
  • AIO
  • epoll

Seastar will try to create the best one it can, finally falling back to epoll, which should be supported on any kernel currently still supported on any distro.
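
Plain Linux AIO has existed for a long time; what distinguishes newer kernels is, among other things, support for polling file descriptors through AIO (IOCB_CMD_POLL). A rough, hypothetical probe for that capability (not Seastar's actual detection code) could look like this:

// Hypothetical probe, not Seastar's code: check whether the kernel can poll
// file descriptors through the AIO interface (IOCB_CMD_POLL), which older
// kernels lack.
#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <poll.h>
#include <cstdio>

#ifndef IOCB_CMD_POLL
#define IOCB_CMD_POLL 5      // value from the kernel ABI; older headers omit it
#endif

bool kernel_aio_can_poll() {
    aio_context_t ctx = 0;
    if (syscall(SYS_io_setup, 1, &ctx) < 0) {
        return false;                              // no usable AIO at all
    }
    int fds[2];
    if (pipe(fds) < 0) {
        syscall(SYS_io_destroy, ctx);
        return false;
    }
    iocb cb{};
    cb.aio_lio_opcode = IOCB_CMD_POLL;             // "wake me when fds[0] is readable"
    cb.aio_fildes = fds[0];
    cb.aio_buf = POLLIN;                           // poll events are passed in aio_buf
    iocb* cbs[1] = { &cb };
    long r = syscall(SYS_io_submit, ctx, 1L, cbs);
    bool supported = (r == 1);                     // EINVAL here => no AIO poll support
    syscall(SYS_io_destroy, ctx);
    close(fds[0]);
    close(fds[1]);
    return supported;
}

int main() {
    std::puts(kernel_aio_can_poll() ? "kernel supports AIO poll (AIO backend possible)"
                                    : "no AIO poll support (would fall back to epoll)");
    return 0;
}
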
Note that there is also a command-line flag (--reactor-backend) which allows you to select the desired backend (out of those supported on your kernel). Run scylla --help-seastar to see which options are available.
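
For example, forcing the epoll backend would look roughly like this (the exact backend names can differ between versions, so check scylla --help-seastar on your system first):

scylla --reactor-backend=epoll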


Botond, thanks a lot for your detailed explanation!

Hello, we have found that reactor_backend_epoll::kernel_submit_work() calls epoll_wait with a zero timeout, and this leads to the problem of 100% CPU consumption. Could you please explain the idea behind using epoll_wait with a zero timeout here? Is it possible to release the CPU in some cases?

The event loop of Seastar (the reactor) must not be blocked; it is always looking for work (polling). Events that unblock a currently blocked fiber can come from multiple sources, therefore the event loop must not block waiting for any individual source.
With the AIO backend, the Seastar reactor has a sleep mode, which it can use when it has nothing to do (waiting for events). I do not know why this is not available with the epoll backend and why it has to resort to busy polling.
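
As a rough sketch of that constraint (this is not Seastar's implementation; poller, reactor_loop and sleep_until_any_event are made-up names): the loop polls every source without blocking, and only when all pollers report no work can the backend afford to block in the kernel, which is what the AIO backend's sleep mode provides and what a zero-timeout epoll_wait cannot.

#include <atomic>
#include <functional>
#include <vector>
#include <cstdio>

// Hypothetical reactor loop; each poller returns true if it found work to do.
struct poller {
    std::function<bool()> poll;
};

void reactor_loop(std::vector<poller>& pollers,
                  const std::function<void()>& sleep_until_any_event,
                  std::atomic<bool>& stop) {
    while (!stop) {
        bool found_work = false;
        for (auto& p : pollers) {
            // Must not block here: another source might already have work queued.
            found_work |= p.poll();
        }
        if (!found_work) {
            // Blocking is only safe if the wait wakes us up for *any* source.
            // The AIO backend can arrange that (its "sleep mode"); a
            // zero-timeout epoll_wait cannot, so that backend keeps spinning.
            sleep_until_any_event();
        }
    }
}

int main() {
    std::atomic<bool> stop{false};
    std::vector<poller> pollers{
        poller{[] { std::puts("polled: no work to do"); return false; }},
    };
    // Dummy "sleep": in a real backend this would block in the kernel;
    // here it just stops the loop so the example terminates.
    reactor_loop(pollers, [&] { stop = true; }, stop);
    return 0;
}
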

Thank you for the quick answer.