Originally from the User Slack
@Thomas_Foubert: Hi, I just encountered a bootstrap failure and now a node is stuck in the 'none'
state. It can't be replaced at boot time:
init - Startup failed: std::runtime_error (the topology coordinator rejected request to join the cluster: Cannot replace node 729e606c-f8c6-47f9-b460-3b60bccd0fd5 because it is not in the 'normal' state)
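(The replace was presumably requested the usual way for 6.x, by pointing the new node at the dead node's host ID in scylla.yaml before its first boot; the exact option shown here is an assumption about how it was configured:)
replace_node_first_boot: 729e606c-f8c6-47f9-b460-3b60bccd0fd5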
and it can't be removed with nodetool removenode:
error executing POST request to http://localhost:10000/storage_service/remove_node with parameters {"host_id": "729e606c-f8c6-47f9-b460-3b60bccd0fd5"}: remote replied with status code 500 Internal Server Error:
std::runtime_error (removenode: node 729e606c-f8c6-47f9-b460-3b60bccd0fd5 is in 'none' state. Wait for it to be in 'normal' state)
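(For reference, this is the plain removenode path. The nodetool command below is presumably what was run, and the curl line is a sketch of the equivalent REST call taken from the error text, assuming the default API port 10000 and host_id passed as a query parameter. Both get rejected because removenode only accepts nodes that are in the 'normal' state:)
nodetool removenode 729e606c-f8c6-47f9-b460-3b60bccd0fd5
curl -X POST "http://localhost:10000/storage_service/remove_node?host_id=729e606c-f8c6-47f9-b460-3b60bccd0fd5"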
Here's a small dump of the raft tables:
cqlsh> select server_id from system.raft_state where group_id = 1d5d7bb0-9cee-11ef-951f-55f4ab3ca193;
server_id
--------------------------------------
1555dc83-08e8-4498-8d9a-dc415a1a1c3a
24ac70d2-70ef-4fc9-9c63-7d3ee5374a92
6861e576-a526-4205-867e-6c207f1a5fea
94ea4f93-9345-4a06-9ef3-7f14f458434d
(4 rows)
cqlsh> select * from system.cluster_status;
peer | dc | host_id | load | owns | status | tokens | up
-----------------+-------------+--------------------------------------+--------------+----------+---------------+--------+-------
***.85 | datacenter1 | 1555dc83-08e8-4498-8d9a-dc415a1a1c3a | 271557489304 | 0.313048 | NORMAL | 256 | True
***.166 | datacenter1 | 24ac70d2-70ef-4fc9-9c63-7d3ee5374a92 | | null | BOOTSTRAPPING | 0 | True
***.163 | null | null | | null | null | 0 | False
***.83 | datacenter1 | 94ea4f93-9345-4a06-9ef3-7f14f458434d | 254815552550 | 0.336257 | NORMAL | 256 | True
***.84 | datacenter1 | 6861e576-a526-4205-867e-6c207f1a5fea | 262956317438 | 0.350695 | NORMAL | 256 | True
(5 rows)
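(The per-node state that the topology coordinator is complaining about is kept in system.topology on 6.x; a query along these lines should show the stuck node as 'none'. The column names are from memory, so treat this as a sketch:)
cqlsh> select host_id, node_state from system.topology;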
cluster is 6.2.1
This failure happened while I was adding 2 new nodes to an existing cluster after a cluster upgrade from 6.1 to 6.2
The other node is still joining and seems to be going well
EDIT: I remember manipulating raft and system tables to forcefully remove a node when I was first experimenting with Scylla, but I can’t find the doc anymore
@Kamil_Braun: please open an issue and post logs from the topology coordinator node and the node that failed to bootstrap, and ping me on github (@kbr-scylla)
> The other node is still joining and seems to be doing it well
I didn't notice this sentence before. Given that, I suspect that the 'none' node will most likely be removed automatically once the other joining node finishes. If that doesn't happen, then you should open the issue.
@Thomas_Foubert: It didn't happen, so I put the cluster into Raft recovery mode, wiped the topology tables, and performed a rolling restart as per https://opensource.docs.scylladb.com/stable/troubleshooting/handling-node-failures.html#manual-recovery-procedure
And it worked
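(For anyone hitting the same thing: the linked procedure is the authoritative reference, but from memory the entry point is setting the recovery flag on every node before the rolling restart, roughly as below. The group 0 and topology table truncation that follows is version-specific, so take the exact list of tables from the doc:)
cqlsh> UPDATE system.scylla_local SET value = 'recovery' WHERE key = 'group0_upgrade_state';
cqlsh> SELECT value FROM system.scylla_local WHERE key = 'group0_upgrade_state';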