Raft Upgrade stuck waiting for ghost nodes

Womsel · November 27, 2024, 1:22pm

Installation details
ScyllaDB version: 6.1.3
Cluster size: 7 nodes (6 nodes + 1 node, in different DCs)
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 22

Today I finally started enabling Raft by hitting the API endpoint with curl:

curl -X POST "http://curhost:10000/storage_service/raft_topology/upgrade"
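For anyone following along: the same path can also be queried with GET to see how far the upgrade has progressed. Below is a minimal Python sketch of such a check. It assumes the REST API listens on port 10000 as in the curl call above and that the GET variant returns a short status string; the function name `upgrade_status` and the injectable `fetch` parameter are made up for illustration and are not part of ScyllaDB.

```python
import json
import urllib.request


def upgrade_status(host="localhost", port=10000, fetch=None):
    """Return the Raft topology upgrade status reported by one node.

    Assumption: a GET on the same path used for the POST above returns
    a short status string, possibly JSON-quoted.
    """
    url = f"http://{host}:{port}/storage_service/raft_topology/upgrade"

    if fetch is None:
        # Default transport: a plain HTTP GET with a timeout.
        def fetch(target):
            with urllib.request.urlopen(target, timeout=10) as resp:
                return resp.read().decode()

    body = fetch(url).strip()
    try:
        # Strip JSON quoting if the node answers with a JSON string.
        return json.loads(body)
    except json.JSONDecodeError:
        # Otherwise hand back the raw text unchanged.
        return body
```

Calling `upgrade_status("curhost")` on each node in turn should show which nodes have not finished yet.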

Since nothing seemed to happen, I did a rolling restart, which in the end resulted in:

Nov 27 10:32:24 o-2 scylla[632625]: [shard 0:strm] raft_topology - waiting for all nodes to finish upgrade to raft schema

The last node, however, wasn’t able to finish and instead keeps logging:

Nov 27 13:18:25 o-backup scylla[3576536]:  [shard 0:strm] raft_group0_upgrade - future<> service::raft_group0::wait_for_all_nodes_to_finish_upgrade(abort_source &): failed to resolve IP addresses of some of the cluster members ([e16a9c96-d8a0-47fe-8044-37be077f45b9, 8a627941-2f40-47ad-8e5d-6f6e891ab85d, d728fc9d-81ca-4f34-ab5b-3b0858144c61])
Nov 27 13:18:25 o-backup scylla[3576536]:  [shard 0:strm] raft_group0_upgrade - future<> service::raft_group0::wait_for_all_nodes_to_finish_upgrade(abort_source &): sleeping for 16s seconds before retrying..
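To see which of the unresolved host IDs really are ghosts, it can help to compare them against the host IDs the cluster currently reports. Here is a rough Python sketch that pulls the IDs out of log lines like the ones above and subtracts every ID found in a second text, for example saved `nodetool status` output. The function names are hypothetical, and the sketch only assumes that host IDs appear as plain lowercase UUIDs in both texts.

```python
import re

# Host IDs are printed as lowercase UUIDs in the log lines above.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def unresolved_host_ids(log_text):
    """Collect host IDs from 'failed to resolve IP addresses' log lines.

    IDs are returned in first-seen order, without duplicates.
    """
    ids = []
    for line in log_text.splitlines():
        if "failed to resolve IP addresses" not in line:
            continue
        for host_id in UUID_RE.findall(line):
            if host_id not in ids:
                ids.append(host_id)
    return ids


def ghost_ids(log_text, status_text):
    """Return unresolved IDs that do not occur anywhere in status_text.

    status_text is assumed to list one host ID per live or known node,
    such as saved `nodetool status` output.
    """
    known = set(UUID_RE.findall(status_text))
    return [h for h in unresolved_host_ids(log_text) if h not in known]
```

Feeding it the journal output of the stuck node plus the status output of a healthy node gives the list of IDs that exist only in the stuck node's view.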

Now, to get rid of those unavailable, non-replaceable ghost nodes, I tried using

removenode --ignore-dead-nodes 8a627941-2f40-47ad-8e5d-6f6e891ab85d,d728fc9d-81ca-4f34-ab5b-3b0858144c61,e16a9c96-d8a0-47fe-8044-37be077f45b9 e16a9c96-d8a0-47fe-8044-37be077f45b9

but that only results in

error executing POST request to http://localhost:10000/storage_service/remove_node with parameters {"ignore_nodes": "8a627941-2f40-47ad-8e5d-6f6e891ab85d,d728fc9d-81ca-4f34-ab5b-3b0858144c61,e16a9c96-d8a0-47fe-8044-37be077f45b9", "host_id": "e16a9c96-d8a0-47fe-8044-37be077f45b9"}: remote replied with status code 500 Internal Server Error:
std::runtime_error (removenode is not allowed at this time - the node is still in the process of upgrading to raft topology)

So, the question is: How do I get rid of those ghost nodes?

Womsel · November 28, 2024, 2:06pm

Answering my own question here:

For some odd reason, the 2nd DC still had information about already removed nodes, which made the Raft upgrade impossible.

I had to follow the ScyllaDB documentation, in particular the “Manual Recovery Procedure” section, which allowed me to cancel the Raft migration and remove the ghost nodes. After that, everything went the normal (and rather quick) way.
