Abnormal cluster's node(s) behaviour. High CPU usage on 4/5 nodes

Podhornyi · June 6, 2025, 1:55pm

Installation details

ScyllaDB version: 6.2.2

Cluster size: 5 x 8vCPU/32GB RAM

OS (RHEL/CentOS/Ubuntu/AWS AMI): ubuntu-jammy-22.04-amd64-server-20250327

Hi, we are running ScyllaDB in k8s on dedicated EC2 instances with 5 nodes. Each node hosts only the ScyllaDB-related pod and the k8s service pods.

After one month of usage, we detected increased CPU usage on 4 of the 5 nodes: the 5th node uses ~20-40% CPU, while all the others are at ~70%.

Details:

the workload has not changed during this period
all nodes receive read/write requests with almost identical RPS
compaction is enabled on all nodes and, based on history, runs in the background as usual on ALL nodes
we do not have Scylla Manager
we do not run repair (in our case it is not needed at all)
we have not set up backup/restore
based on the system.clients table, all nodes have the same number of connected clients: ~70
tombstones are set up to be cleaned up on compaction (not on repair, as is done by default)

Is there a way to see what tasks ScyllaDB performs in the background? I investigated via ‘nodetool tasks’ but did not find anything suspicious.
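One way to approach the question above is to compare how much CPU time each scheduling group accumulates between two scrapes of a node's Prometheus metrics endpoint. The sketch below is only an illustration under stated assumptions: it assumes the node exposes a per-shard counter named `scylla_scheduler_runtime_ms` with a `group` label (check this against the node's actual `/metrics` output), and the two scrapes are hypothetical sample text, not data from this cluster.

```python
import re
from collections import defaultdict

# One Prometheus text-format sample line: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def runtime_by_group(metrics_text, metric="scylla_scheduler_runtime_ms"):
    """Sum a per-shard counter across shards, keyed by its `group` label.

    The default metric name is an assumption; adjust it to what the node exposes.
    """
    totals = defaultdict(float)
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if not m or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels")))
        totals[labels.get("group", "unknown")] += float(m.group("value"))
    return dict(totals)


def busiest_groups(before, after):
    """Rank groups by the runtime they accumulated between two scrapes."""
    delta = {g: after[g] - before.get(g, 0.0) for g in after}
    return sorted(delta.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical scrapes taken some time apart (illustrative values only).
SCRAPE_1 = """\
# TYPE scylla_scheduler_runtime_ms counter
scylla_scheduler_runtime_ms{group="compaction",shard="0"} 1000
scylla_scheduler_runtime_ms{group="compaction",shard="1"} 1200
scylla_scheduler_runtime_ms{group="statement",shard="0"} 5000
scylla_scheduler_runtime_ms{group="statement",shard="1"} 5100
"""
SCRAPE_2 = """\
# TYPE scylla_scheduler_runtime_ms counter
scylla_scheduler_runtime_ms{group="compaction",shard="0"} 9000
scylla_scheduler_runtime_ms{group="compaction",shard="1"} 9200
scylla_scheduler_runtime_ms{group="statement",shard="0"} 5600
scylla_scheduler_runtime_ms{group="statement",shard="1"} 5700
"""

ranking = busiest_groups(runtime_by_group(SCRAPE_1), runtime_by_group(SCRAPE_2))
for group, delta_ms in ranking:
    print(f"{group}: +{delta_ms:.0f} ms")
```

Running the same comparison against a high-CPU node and the low-CPU node would show which group, if any, accounts for the difference.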

Datacenter: eu-central-1
========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 100.65.105.82 500.14 GB 256 ? 99c64a19-f0eb-46c3-80c4-e290f7a5fd3e eu-central-1
UN 100.65.61.129 544.89 GB 256 ? 51af2d90-587c-4951-819d-309c4ed1268e eu-central-1
UN 100.66.85.110 531.18 GB 256 ? efe0d085-9a42-4fd0-abda-fa7bc3532752 eu-central-1
UN 100.70.9.99 558.64 GB 256 ? 9cc55415-d641-450d-8b3e-d1c59c9eb047 eu-central-1
UN 100.71.73.244 544.99 GB 256 ? c6a865b9-7030-4012-8c30-fb0addb22a0e eu-central-1

Node 100.115.164.196 is the one with low CPU usage.
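The status listing above can be cross-checked with a short script. The sketch below parses the node rows as posted, computes how far apart the per-node data loads are, and looks up whether a given address appears in the listing. The parsing assumes the whitespace-separated column layout shown above; the helper names are illustrative and not part of any ScyllaDB tool.

```python
from statistics import mean

# Node rows copied from the status output above.
STATUS = """\
UN 100.65.105.82 500.14 GB 256 ? 99c64a19-f0eb-46c3-80c4-e290f7a5fd3e eu-central-1
UN 100.65.61.129 544.89 GB 256 ? 51af2d90-587c-4951-819d-309c4ed1268e eu-central-1
UN 100.66.85.110 531.18 GB 256 ? efe0d085-9a42-4fd0-abda-fa7bc3532752 eu-central-1
UN 100.70.9.99 558.64 GB 256 ? 9cc55415-d641-450d-8b3e-d1c59c9eb047 eu-central-1
UN 100.71.73.244 544.99 GB 256 ? c6a865b9-7030-4012-8c30-fb0addb22a0e eu-central-1
"""

# Conversion factors to GB (binary multiples assumed).
UNIT_TO_GB = {"KB": 1 / 1024**2, "MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}


def parse_status(text):
    """Return {address: load in GB} for every node row in the listing."""
    loads = {}
    for line in text.splitlines():
        fields = line.split()
        # Node rows start with a two-letter status/state code such as UN or DN.
        if len(fields) < 4 or len(fields[0]) != 2 or fields[0][0] not in "UD":
            continue
        address, value, unit = fields[1], fields[2], fields[3]
        loads[address] = float(value) * UNIT_TO_GB[unit]
    return loads


loads = parse_status(STATUS)
avg = mean(loads.values())
# Relative spread: gap between the largest and smallest node, as a share of the mean.
spread = (max(loads.values()) - min(loads.values())) / avg

print(f"nodes: {len(loads)}, mean load: {avg:.2f} GB, spread: {spread:.1%}")

# Address named in the post as the low-CPU node.
low_cpu_node = "100.115.164.196"
print(f"{low_cpu_node} listed in status output: {low_cpu_node in loads}")
```

A small spread in load suggests data volume alone does not explain the CPU difference; the address lookup helps confirm which row of the listing corresponds to the low-CPU node.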

ScyllaDB monitoring stack dashboards

Overview:

Podhornyi · June 6, 2025, 1:56pm

Load: comparison of the node where CPU is ~20%(100.) vs. all others. In the last image, the graph does not cover the whole period because we restarted the node:

Podhornyi · June 6, 2025, 1:57pm

Podhornyi · June 6, 2025, 1:57pm

Podhornyi · June 6, 2025, 1:59pm

Podhornyi · June 6, 2025, 1:59pm

Detailed read:

Podhornyi · June 6, 2025, 2:00pm

Detailed write:

Podhornyi · June 6, 2025, 2:01pm

k8s Node exporter metrics (random node vs. the node with low CPU usage):

I want to believe that the one node with low CPU consumption is the NORMAL one and that all the others are stuck somewhere in a CPU-intensive task.

Will provide any extra details if needed.

Podhornyi · June 6, 2025, 2:04pm

k8s Node exporter metrics - ScyllaDB node with low CPU usage:

Podhornyi · June 6, 2025, 8:25pm

~20%(100.) = ~20%(100.115.164.196)
This is a note on the first screenshot in the post.

yashwant_chandrakar · June 13, 2025, 10:13am

We are facing a similar issue. Can someone advise on how to fix this?

Gabriel · June 13, 2025, 1:53pm

@mflendrich take a look
