Originally from the User Slack
@Thomas_Foubert: Hi, I just encountered a bootstrap failure and now a node is stuck in the 'none'
state. It can't be replaced at boot time:
init - Startup failed: std::runtime_error (the topology coordinator rejected request to join the cluster: Cannot replace node 729e606c-f8c6-47f9-b460-3b60bccd0fd5 because it is not in the 'normal' state)
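(The replace was presumably requested the usual way for 6.x, by pointing the new node at the dead node's host ID in scylla.yaml before its first boot; the exact option shown here is an assumption about how it was configured:)
replace_node_first_boot: 729e606c-f8c6-47f9-b460-3b60bccd0fd5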
and it can't be removed with nodetool removenode:
error executing POST request to http://localhost:10000/storage_service/remove_node with parameters {"host_id": "729e606c-f8c6-47f9-b460-3b60bccd0fd5"}: remote replied with status code 500 Internal Server Error:
std::runtime_error (removenode: node 729e606c-f8c6-47f9-b460-3b60bccd0fd5 is in 'none' state. Wait for it to be in 'normal' state)
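(For reference, this is the plain removenode path. The nodetool command below is presumably what was run, and the curl line is a sketch of the equivalent REST call taken from the error text, assuming the default API port 10000 and host_id passed as a query parameter. Both get rejected because removenode only accepts nodes that are in the 'normal' state:)
nodetool removenode 729e606c-f8c6-47f9-b460-3b60bccd0fd5
curl -X POST "http://localhost:10000/storage_service/remove_node?host_id=729e606c-f8c6-47f9-b460-3b60bccd0fd5"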
Here's a small dump of the raft tables:
cqlsh> select server_id from system.raft_state where group_id = 1d5d7bb0-9cee-11ef-951f-55f4ab3ca193;
server_id
--------------------------------------
1555dc83-08e8-4498-8d9a-dc415a1a1c3a
24ac70d2-70ef-4fc9-9c63-7d3ee5374a92
6861e576-a526-4205-867e-6c207f1a5fea
94ea4f93-9345-4a06-9ef3-7f14f458434d
(4 rows)
cqlsh> select * from system.cluster_status;
peer | dc | host_id | load | owns | status | tokens | up
-----------------+-------------+--------------------------------------+--------------+----------+---------------+--------+-------
***.85 | datacenter1 | 1555dc83-08e8-4498-8d9a-dc415a1a1c3a | 271557489304 | 0.313048 | NORMAL | 256 | True
***.166 | datacenter1 | 24ac70d2-70ef-4fc9-9c63-7d3ee5374a92 | | null | BOOTSTRAPPING | 0 | True
***.163 | null | null | | null | null | 0 | False
***.83 | datacenter1 | 94ea4f93-9345-4a06-9ef3-7f14f458434d | 254815552550 | 0.336257 | NORMAL | 256 | True
***.84 | datacenter1 | 6861e576-a526-4205-867e-6c207f1a5fea | 262956317438 | 0.350695 | NORMAL | 256 | True
(5 rows)
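(The per-node state that the topology coordinator is complaining about is kept in system.topology on 6.x; a query along these lines should show the stuck node as 'none'. The column names are from memory, so treat this as a sketch:)
cqlsh> select host_id, node_state from system.topology;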
cluster is 6.2.1
This failure happened while I was adding 2 new nodes to an existing cluster after a cluster upgrade from 6.1 to 6.2
The other node is still joining and seems to be going well
EDIT: I remember manipulating raft and system tables to forcefully remove a node when I was first experimenting with Scylla, but I can’t find the doc anymore
@Kamil_Braun: please open an issue and post logs from the topology coordinator node and the node that failed to bootstrap, and ping me on github (@kbr-scylla)
> The other node is still joining and seems to be doing it well
I didn't notice this sentence before. Given that, I suspect that the 'none' node will most likely be removed automatically once the other joining node finishes. If that doesn't happen, then you should open the issue.
@Thomas_Foubert: It didn't happen, so I put the cluster into Raft recovery mode, wiped the topology tables, and performed a rolling restart as per https://opensource.docs.scylladb.com/stable/troubleshooting/handling-node-failures.html#manual-recovery-procedure
And it worked
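(For anyone hitting the same thing: the linked procedure is the authoritative reference, but from memory the entry point is setting the recovery flag on every node before the rolling restart, roughly as below. The group 0 and topology table truncation that follows is version-specific, so take the exact list of tables from the doc:)
cqlsh> UPDATE system.scylla_local SET value = 'recovery' WHERE key = 'group0_upgrade_state';
cqlsh> SELECT value FROM system.scylla_local WHERE key = 'group0_upgrade_state';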