Originally from the User Slack
@saranya_R_B: Hello Scylla Team,
Scylla version: 6.0, keyspaces enabled with tablets.
We have configured two data centers with three nodes each, but one data center has completely failed and cannot be recovered. We’ve tried using removenode with the --ignore-dead-nodes option and the replace_node_at_first_boot procedure, but neither has been successful in removing or adding new nodes. Instead, we’re encountering the following error. Can anyone help with this?
"there is no raft quorum, total voters count 6, alive voters count 3, dead voters
f-390e10ddc662] raft operation [read_barrier] timed out, there is no raft quorum, total voters count 6, alive voters count 3, dead voters"
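For reference, the two procedures mentioned above are typically invoked along these lines. This is a minimal sketch: the host IDs are taken from the nodetool status output below, and the exact syntax should be checked against the Scylla 6.0 docs.
```
# Attempt to remove one dead node, listing the other nodes that are also
# known to be dead (host IDs as reported by `nodetool status`).
nodetool removenode \
  --ignore-dead-nodes 26c1a8d3-e30f-496e-8509-d658ad758eda,fcde1ff7-05f4-44bd-b3d6-496f7a409ece \
  7deb42d7-dd1d-4848-97d0-88462cf442ae

# Or, on a fresh node intended to take over a dead node's place,
# set the replacement target in scylla.yaml before its first boot:
#   replace_node_at_first_boot: 7deb42d7-dd1d-4848-97d0-88462cf442ae
```
Both paths fail here for the same underlying reason: every topology change is a Raft operation, and the cluster below has lost its Raft quorum.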
@Piotr_Smaroń: cc @Kamil_Braun
@Botond_Dénes: What do you mean exactly by “one data center has completely failed”? Did you lose all nodes in that DC?
@saranya_R_B: yes all nodes are down in one dc. we are running multi-dc architecture .
Datacenter: SSD_datacenter
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.92.3.5 94.17 GB 256 ? a86af8b2-7b56-44a9-80b6-c3fb1852ad01 rack_ssd
UN 10.92.3.6 93.28 GB 256 ? b4362683-532d-4185-a5b9-e2b8acb0658e rack_ssd
UN 10.92.3.7 93.92 GB 256 ? e1284d67-ca46-4e4b-a396-75fdd8142e61 rack_ssd
Datacenter: local_ssd_dc
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 10.92.3.8 81.08 GB 256 ? 7deb42d7-dd1d-4848-97d0-88462cf442ae local_ssd_rack
DN 10.92.3.9 82.17 GB 256 ? 26c1a8d3-e30f-496e-8509-d658ad758eda local_ssd_rack
DN 10.92.3.10 82.79 GB 256 ? fcde1ff7-05f4-44bd-b3d6-496f7a409ece local_ssd_rack
We are using local SSD for reads and persistent disk for writes. All nodes in local_ssd_dc went down.
Removal of node:
We’ve tried using removenode with the --ignore-dead-nodes option and the replace_node_at_first_boot procedure, but neither has been successful in removing or adding new nodes. Instead, we’re encountering the same error.
@Botond_Dénes: If you have two DCs, then you ran into the problem of Raft majority loss. This is why it is advisable to always have an odd number of DCs, so that a majority is always left after losing one.
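The numbers in the error line up with this: a Raft majority of 6 voters is floor(6/2) + 1 = 4, and only 3 voters are alive. A minimal sketch of the arithmetic:
```
# Raft needs a strict majority of voters: floor(total/2) + 1.
total=6; alive=3
quorum=$(( total / 2 + 1 ))   # 4
if (( alive >= quorum )); then
  echo "quorum OK"
else
  echo "no quorum ($alive alive < $quorum required)"
fi
# With two equal-sized DCs, losing either DC leaves exactly half the voters,
# which is always below the majority threshold.
```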
@avi: You need to run the Raft recovery procedure to reassemble the cluster.
@Kamil_Braun: https://opensource.docs.scylladb.com/branch-6.0/troubleshooting/handling-node-failures.html#manual-recovery-procedure
Ah, actually it won’t work with tablets:
> The manual recovery procedure is not supported if tablets are enabled on any of your keyspaces. In such a case, you need to restore from backup.
Data Distribution with Tablets | ScyllaDB Docs
Restore from a Backup and Incremental Backup | ScyllaDB Docs
In this case, the only way is to set up a new cluster and move the data over, unfortunately.
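A very rough sketch of that “new cluster and move the data over” path, assuming the surviving DC’s data is intact. Keyspace and table names, paths, and the target IP below are placeholders; the restore docs linked above are the authoritative procedure, and Scylla Manager backup/restore is usually the preferred route.
```
# 1. On a surviving node, dump the schema and snapshot the keyspaces you need.
cqlsh 10.92.3.5 -e "DESC SCHEMA" > schema.cql
nodetool snapshot -t migrate my_keyspace          # my_keyspace is a placeholder

# 2. Recreate the schema on the new cluster.
cqlsh NEW_NODE_IP -f schema.cql

# 3. Stage the snapshot files under a <keyspace>/<table> directory and stream
#    them into the new cluster (repeat for every table and every surviving node).
mkdir -p /tmp/load/my_keyspace/my_table
cp /var/lib/scylla/data/my_keyspace/my_table-*/snapshots/migrate/* /tmp/load/my_keyspace/my_table/
sstableloader -d NEW_NODE_IP /tmp/load/my_keyspace/my_table/
```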