Originally from the User Slack
@J: Hi
I have a 3 node cluster with and RF of 3 and network topology strategy. The clients have a CL of Local Quorum. When I did a rolling restart of the nodes 1 at a time I saw reads and writes dramatically drop off. My understanding was that if local quorum was in place with this that reads would still be maintained? So I’m not sure what I misunderstood or what setting I got wrong?
@Felipe_Cardeneti_Mendes: Well, yes, but you lose 1/3 of your capacity as you shut down a replica. If the remaining capacity is unable to absorb the increased demand, your latency increases.
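To make the quorum arithmetic behind this exchange concrete, here is a minimal sketch (not from the chat itself) of how many replicas a (LOCAL_)QUORUM request needs for the RF=3 setup described above:

```python
# Quorum math for the cluster in this thread: 3 nodes, RF=3, LOCAL_QUORUM.
def quorum(rf: int) -> int:
    """Replicas that must respond for a (LOCAL_)QUORUM read or write: floor(RF/2) + 1."""
    return rf // 2 + 1

rf = 3
print(quorum(rf))       # 2 replicas must answer each request
print(rf - quorum(rf))  # 1 replica can be down with quorum still achievable
```

So with one node restarting, LOCAL_QUORUM is still achievable, but every request must now be served by the two remaining replicas, which is the 1/3 capacity loss mentioned above.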
Sometimes it may also happen that you restarted too fast and the nodes came up with cold caches, so reads became more expensive because they had to go to disk instead of being served from the database cache.
These are just some examples; you probably want to look into the metrics and find out what really happened.
@J: Ok thanks. I noticed it as soon as the first node came down, so I don’t think it’s the caches. In terms of capacity, we were only doing 80k rps across the cluster when it came down. These nodes are in K8s running on i4i.8xlarge; each pod has 12 CPUs and 32 GiB RAM. Is that not enough capacity? Is there any other setting I should check?
During this I did move the CL to ONE, and it continued fine during the rest of the rolling restart. But I do see both our clients logging “cannot achieve CL LOCAL_ONE: requires 1, alive 0”, even though at no point was more than one node down.
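That “requires N, alive M” error is reported when the coordinator (or the client driver) believes fewer replicas are alive than the consistency level requires, before the request is even attempted. A hedged sketch of what that check expresses, using the RF=3 local-DC numbers from this thread (the `REQUIRED` table and function names are illustrative, not driver API):

```python
# Replicas required per consistency level for RF=3 in the local DC
# (illustrative model; not a real driver data structure).
REQUIRED = {"LOCAL_ONE": 1, "LOCAL_QUORUM": 2}

def can_attempt(cl: str, alive_replicas: int) -> bool:
    """A request is only attempted if enough replicas are believed alive."""
    return alive_replicas >= REQUIRED[cl]

# "requires 1, alive 0": even LOCAL_ONE fails when the client's view of
# the cluster shows zero live replicas for that token range -- which can
# happen transiently during a restart even if only one node is actually down.
print(can_attempt("LOCAL_ONE", 0))     # False -> the error in the logs
print(can_attempt("LOCAL_QUORUM", 2))  # True  -> normal operation
```

In other words, “alive 0” reflects the client’s (possibly stale) view of replica liveness at that moment, not necessarily that all nodes were really down.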
@Felipe_Cardeneti_Mendes: From your description that seems like enough capacity. I doubt CPU was a bottleneck, but check on the advanced dashboard whether anything pops up wrt disks or CPU.
@J: Ok thanks, let me look into that