ScyllaDB node spikes at 00h00 UTC

Installation details
#ScyllaDB version: 5.0.5-0.20221009
#Cluster size: 6 nodes (3 in us-east-1, 3 in us-east-2), replication factor 3
#OS: debian10-base-amd64-202408061511

Hello!

Every day, shortly after 00h00 UTC, some of our ScyllaDB nodes show load/latency spikes; which nodes are affected appears to be random. Each spike lasts about 30 seconds per node, and during that period all queries to the node time out, as if the node were unavailable.

In the metrics of the EC2 instance, we see this load spike:

Compactions run throughout the day, and nothing out of the ordinary is running at that time. There is no spike in throughput either; in fact, throughput is decreasing at that time of day. We have no jobs (backup, repair, etc.) scheduled for this time frame.

Do you have any idea why this might be occurring? We are running out of ideas…

Thank you!

5.0.5 reached end-of-life a long time ago.

The problem you’re describing is likely due to fstrim running to discard unused disk space. It was replaced (in the 5.1 time frame) by online discard (see the commit "dist: scylla_raid_setup: mount XFS with online discard", scylladb/scylladb@a19d00e on GitHub). Note that upgrading won’t transition the nodes to online discard: you either have to apply the changes manually, or bootstrap new nodes and decommission the old ones.
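
To see whether a node's data volume is already mounted with online discard, one option is to look at the mount options in /proc/mounts. The following is only a minimal sketch in Python: the function name xfs_discard_status is made up for illustration, and it assumes that online discard shows up as the "discard" option on the XFS mount line.

```python
def xfs_discard_status(mounts_text):
    """Map each XFS mount point to True if it carries the 'discard' option.

    mounts_text is expected to be in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    status = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type == "xfs":
            # Options are comma-separated; match the exact token so that
            # something like 'nodiscard' is not counted as a hit.
            status[mount_point] = "discard" in options.split(",")
    return status
```

On a node, this could be called with the contents of /proc/mounts, for example xfs_discard_status(open("/proc/mounts").read()). A False value for the Scylla data directory would suggest the volume is not using online discard.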


Thank you for your reply.
We’ve checked and fstrim is disabled.
Nevertheless, upgrading the ScyllaDB version seems like a good idea.
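
For anyone wanting to repeat the fstrim check, here is a rough sketch of one way to do it with Python and systemctl. It assumes the relevant systemd units are named fstrim.timer (the distribution's own timer) and scylla-fstrim.timer (the one installed by the Scylla packages); those names, and the helper name timer_states, are assumptions to adjust for your setup.

```python
import subprocess


def timer_states(units, run=subprocess.run):
    """Return {unit: state} using 'systemctl is-enabled <unit>'.

    'run' is injectable so the function can be exercised without systemd.
    An empty answer (e.g. the unit does not exist) is reported as 'unknown'.
    """
    states = {}
    for unit in units:
        result = run(
            ["systemctl", "is-enabled", unit],
            capture_output=True,
            text=True,
        )
        states[unit] = result.stdout.strip() or "unknown"
    return states
```

Calling timer_states(["fstrim.timer", "scylla-fstrim.timer"]) on a node should report whether either timer is enabled. If both are disabled, periodic fstrim is unlikely to be what fires at 00h00 UTC, and other systemd timers or cron jobs would be the next thing to look at.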