Originally from the User Slack
@saranya_R_B: Hello Scylla Team,
Scylla version: 6.0, keyspaces enabled with tablets.
We have configured two data centers with three nodes each, but one data center has completely failed and cannot be recovered. We’ve tried using removenode with the --ignore-dead-nodes option and the replace_node_at_first_boot procedure, but neither has been successful in removing or adding new nodes. Instead, we’re encountering the following error. Can anyone help with this?
"there is no raft quorum, total voters count 6, alive voters count 3, dead voters
f-390e10ddc662] raft operation [read_barrier] timed out, there is no raft quorum, total voters count 6, alive voters count 3, dead voters"
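For reference, the two procedures mentioned above are typically invoked along these lines. This is a minimal sketch: the host IDs are taken from the nodetool status output below, and the exact syntax should be checked against the Scylla 6.0 docs.
```
# Attempt to remove one dead node, listing the other nodes that are also
# known to be dead (host IDs as reported by `nodetool status`).
nodetool removenode \
  --ignore-dead-nodes 26c1a8d3-e30f-496e-8509-d658ad758eda,fcde1ff7-05f4-44bd-b3d6-496f7a409ece \
  7deb42d7-dd1d-4848-97d0-88462cf442ae

# Or, on a fresh node intended to take over a dead node's place,
# set the replacement target in scylla.yaml before its first boot:
#   replace_node_at_first_boot: 7deb42d7-dd1d-4848-97d0-88462cf442ae
```
Both paths fail here for the same underlying reason: every topology change is a Raft operation, and the cluster below has lost its Raft quorum.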
@Piotr_Smaroń: cc @Kamil_Braun
@Botond_Dénes: What do you mean exactly by “one data center has completely failed”? Did you lose all nodes in that DC?
@saranya_R_B: yes all nodes are down in one dc. we are running multi-dc architecture .
Datacenter: SSD_datacenter
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.92.3.5 94.17 GB 256 ? a86af8b2-7b56-44a9-80b6-c3fb1852ad01 rack_ssd
UN 10.92.3.6 93.28 GB 256 ? b4362683-532d-4185-a5b9-e2b8acb0658e rack_ssd
UN 10.92.3.7 93.92 GB 256 ? e1284d67-ca46-4e4b-a396-75fdd8142e61 rack_ssd
Datacenter: local_ssd_dc
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 10.92.3.8 81.08 GB 256 ? 7deb42d7-dd1d-4848-97d0-88462cf442ae local_ssd_rack
DN 10.92.3.9 82.17 GB 256 ? 26c1a8d3-e30f-496e-8509-d658ad758eda local_ssd_rack
DN 10.92.3.10 82.79 GB 256 ? fcde1ff7-05f4-44bd-b3d6-496f7a409ece local_ssd_rack
We are using local SSD for reads and persistent disk for writes. All nodes in local_ssd_dc went down.
Removal of node:
We’ve tried using removenode with the --ignore-dead-nodes option and the replace_node_at_first_boot procedure, but neither has been successful in removing or adding new nodes. Instead, we’re encountering the same error.
@Botond_Dénes: If you have two DCs, then you ran into the problem of Raft majority loss. This is why it is advisable to always have an odd number of DCs, so that a majority is always left after losing one.
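The numbers in the error line up with this: a Raft majority of 6 voters is floor(6/2) + 1 = 4, and only 3 voters are alive. A minimal sketch of the arithmetic:
```
# Raft needs a strict majority of voters: floor(total/2) + 1.
total=6; alive=3
quorum=$(( total / 2 + 1 ))   # 4
if (( alive >= quorum )); then
  echo "quorum OK"
else
  echo "no quorum ($alive alive < $quorum required)"
fi
# With two equal-sized DCs, losing either DC leaves exactly half the voters,
# which is always below the majority threshold.
```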
@avi: You need to run the Raft recovery procedure to reassemble the cluster.
@Kamil_Braun: https://opensource.docs.scylladb.com/branch-6.0/troubleshooting/handling-node-failures.html#manual-recovery-procedure
Ah, actually it won’t work with tablets:
> The manual recovery procedure is not supported if tablets are enabled on any of your keyspaces. In such a case, you need to restore from backup.
Data Distribution with Tablets | ScyllaDB Docs
Restore from a Backup and Incremental Backup | ScyllaDB Docs
In this case, the only way is to set up a new cluster and move the data over, unfortunately.
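A very rough sketch of that “new cluster and move the data over” path, assuming the surviving DC’s data is intact. Keyspace and table names, paths, and the target IP below are placeholders; the restore docs linked above are the authoritative procedure, and Scylla Manager backup/restore is usually the preferred route.
```
# 1. On a surviving node, dump the schema and snapshot the keyspaces you need.
cqlsh 10.92.3.5 -e "DESC SCHEMA" > schema.cql
nodetool snapshot -t migrate my_keyspace          # my_keyspace is a placeholder

# 2. Recreate the schema on the new cluster.
cqlsh NEW_NODE_IP -f schema.cql

# 3. Stage the snapshot files under a <keyspace>/<table> directory and stream
#    them into the new cluster (repeat for every table and every surviving node).
mkdir -p /tmp/load/my_keyspace/my_table
cp /var/lib/scylla/data/my_keyspace/my_table-*/snapshots/migrate/* /tmp/load/my_keyspace/my_table/
sstableloader -d NEW_NODE_IP /tmp/load/my_keyspace/my_table/
```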