Originally from the User Slack
@J: Hi
I have a 3 node cluster with and RF of 3 and network topology strategy. The clients have a CL of Local Quorum. When I did a rolling restart of the nodes 1 at a time I saw reads and writes dramatically drop off. My understanding was that if local quorum was in place with this that reads would still be maintained? So I’m not sure what I misunderstood or what setting I got wrong?
@Felipe_Cardeneti_Mendes: Well, yes, but you lose 1/3 of your capacity as you shut down a replica. If the remaining capacity is unable to absorb the increased demand, your latency increases.
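To make the quorum arithmetic behind this exchange concrete, here is a minimal sketch (not from the chat itself) of how many replicas a (LOCAL_)QUORUM request needs for the RF=3 setup described above:

```python
# Quorum math for the cluster in this thread: 3 nodes, RF=3, LOCAL_QUORUM.
def quorum(rf: int) -> int:
    """Replicas that must respond for a (LOCAL_)QUORUM read or write: floor(RF/2) + 1."""
    return rf // 2 + 1

rf = 3
print(quorum(rf))       # 2 replicas must answer each request
print(rf - quorum(rf))  # 1 replica can be down with quorum still achievable
```

So with one node restarting, LOCAL_QUORUM is still achievable, but every request must now be served by the two remaining replicas, which is the 1/3 capacity loss mentioned above.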
Sometimes it may also happen that you restarted too fast and the nodes came up with cold caches, so reads became more expensive because they had to go to disk instead of being served from the database cache.
These are just some examples; you probably want to look into the metrics and find out what really happened.
@J: Ok thanks. I noticed it as soon as the first node came down, so I don’t think it’s the caches. In terms of capacity, we were only doing 80k rps across the cluster when it came down. These nodes are in K8s running on i4i.8xlarge; each pod has 12 CPUs and 32 GiB RAM. Is that not enough capacity? Is there any other setting I should check?
During this I did move the CL to ONE, and it continued fine during the rest of the rolling restart. But I do see both our clients logging “cannot achieve CL LOCAL_ONE: requires 1, alive 0”, even though at no point was more than one node down.
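That “requires N, alive M” error is reported when the coordinator (or the client driver) believes fewer replicas are alive than the consistency level requires, before the request is even attempted. A hedged sketch of what that check expresses, using the RF=3 local-DC numbers from this thread (the `REQUIRED` table and function names are illustrative, not driver API):

```python
# Replicas required per consistency level for RF=3 in the local DC
# (illustrative model; not a real driver data structure).
REQUIRED = {"LOCAL_ONE": 1, "LOCAL_QUORUM": 2}

def can_attempt(cl: str, alive_replicas: int) -> bool:
    """A request is only attempted if enough replicas are believed alive."""
    return alive_replicas >= REQUIRED[cl]

# "requires 1, alive 0": even LOCAL_ONE fails when the client's view of
# the cluster shows zero live replicas for that token range -- which can
# happen transiently during a restart even if only one node is actually down.
print(can_attempt("LOCAL_ONE", 0))     # False -> the error in the logs
print(can_attempt("LOCAL_QUORUM", 2))  # True  -> normal operation
```

In other words, “alive 0” reflects the client’s (possibly stale) view of replica liveness at that moment, not necessarily that all nodes were really down.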
@Felipe_Cardeneti_Mendes: From your description that seems like enough capacity. I doubt CPU was a bottleneck, but check on the advanced dashboard whether anything pops up wrt disks or CPU.
@J: Ok thanks, let me look into that