Abnormal cluster node behaviour: high CPU usage on 4 of 5 nodes

Installation details

ScyllaDB version: 6.2.2

Cluster size: 5 x 8vCPU/32GB RAM

OS (RHEL/CentOS/Ubuntu/AWS AMI): ubuntu-jammy-22.04-amd64-server-20250327

Hi, we are running ScyllaDB in k8s on dedicated EC2 instances, 5 nodes in total. Each node runs only the ScyllaDB-related pod and k8s service pods.

After one month of usage we detected increased CPU usage on 4 of the 5 nodes: the 5th node uses ~20-40% CPU, while all the others are at ~70%.

Details:

  • the workload has not changed during this period
  • all nodes receive read/write requests at almost identical RPS
  • compaction is enabled on all nodes and, based on history, runs in the background as usual on ALL nodes
  • we do not have Scylla Manager
  • we do not run repair (in our case it is not needed at all)
  • we have not set up backup/restore
  • based on the system.clients table, all nodes have the same number of connected clients: ~70
  • tombstones are set up to be cleaned up on compaction (not on repair, as is done by default)

Is there a way to see what tasks ScyllaDB performs in the background? I investigated via ‘nodetool tasks’ but did not find anything suspicious.

Datacenter: eu-central-1
========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 100.65.105.82 500.14 GB 256 ? 99c64a19-f0eb-46c3-80c4-e290f7a5fd3e eu-central-1
UN 100.65.61.129 544.89 GB 256 ? 51af2d90-587c-4951-819d-309c4ed1268e eu-central-1
UN 100.66.85.110 531.18 GB 256 ? efe0d085-9a42-4fd0-abda-fa7bc3532752 eu-central-1
UN 100.70.9.99 558.64 GB 256 ? 9cc55415-d641-450d-8b3e-d1c59c9eb047 eu-central-1
UN 100.71.73.244 544.99 GB 256 ? c6a865b9-7030-4012-8c30-fb0addb22a0e eu-central-1
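
As a quick sanity check on data distribution, the load column above can be compared across nodes. This is a minimal Python sketch, assuming the status rows are pasted in as a string and every load is reported in GB. The function names `parse_load_gb` and `deviation_from_mean` are ours for illustration and are not part of any ScyllaDB tool.

```python
# Status rows copied verbatim from the `nodetool status` output above.
STATUS_ROWS = """\
UN 100.65.105.82 500.14 GB 256 ? 99c64a19-f0eb-46c3-80c4-e290f7a5fd3e eu-central-1
UN 100.65.61.129 544.89 GB 256 ? 51af2d90-587c-4951-819d-309c4ed1268e eu-central-1
UN 100.66.85.110 531.18 GB 256 ? efe0d085-9a42-4fd0-abda-fa7bc3532752 eu-central-1
UN 100.70.9.99 558.64 GB 256 ? 9cc55415-d641-450d-8b3e-d1c59c9eb047 eu-central-1
UN 100.71.73.244 544.99 GB 256 ? c6a865b9-7030-4012-8c30-fb0addb22a0e eu-central-1
"""


def parse_load_gb(rows: str) -> dict[str, float]:
    """Map node address -> load in GB (assumption: every row reports GB)."""
    loads: dict[str, float] = {}
    for line in rows.splitlines():
        fields = line.split()
        # Expected layout: state, address, load value, load unit, tokens, ...
        if len(fields) < 4 or fields[3] != "GB":
            continue  # skip headers, blank lines and rows in other units
        loads[fields[1]] = float(fields[2])
    return loads


def deviation_from_mean(loads: dict[str, float]) -> dict[str, float]:
    """Return each node's load as a signed percentage offset from the mean."""
    mean = sum(loads.values()) / len(loads)
    return {addr: (value - mean) / mean * 100 for addr, value in loads.items()}


loads = parse_load_gb(STATUS_ROWS)
deviations = deviation_from_mean(loads)

# Print nodes from the lightest to the heaviest relative load.
for addr, pct in sorted(deviations.items(), key=lambda kv: kv[1]):
    print(f"{addr:>15}  {loads[addr]:7.2f} GB  {pct:+.1f}% vs mean")
```

On these numbers every node sits within about 7% of the mean load, so uneven data distribution alone seems unlikely to explain the CPU gap.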

Node 100.115.164.196 is the node with low CPU usage.

ScyllaDB monitoring stack dashboards

Overview:

Load: comparison of the node where CPU is ~20% (100.) versus all the others. In the last image the graph does not cover the whole period because we restarted the node:

Detailed read:

Detailed write:

k8s Node exporter metrics (random node vs node with low CPU usage):

I want to believe that the one node with low CPU consumption is the NORMAL one and all the others are stuck somewhere in a CPU-intensive task.

Will provide any extra details if needed.

k8s Node exporter metrics - ScyllaDB node with low CPU usage:

~20%(100.) = ~20%(100.115.164.196)
This is a note on the first screenshot in the post.

We are facing a similar issue. Can someone advise how to fix this?

@mflendrich take a look