Installation details
#ScyllaDB version: 6.2
#Cluster size: 12 nodes x 3 dc
#OS: Ubuntu 24.10
Hi.
I had issues with a couple of nodes, which led to them being down and unable to be removed or replaced.
SELECT * FROM system.topology_requests WHERE done = False ALLOW FILTERING ;
id | done | end_time | error | initiating_host | request_type | start_time
--------------+-------+----------+-------+-----------------+--------------+-----------
cd9b84c0-... | False | null | null | 59853b48-... | replace | null
524e7f0c-... | False | null | null | a48691bf-... | remove | null
8a5185e6-... | False | null | null | a48691bf-... | remove | null
acc4f340-... | False | null | null | 094374c3-... | replace | null
I don't have the exact steps that led to this; one of them ran out of disk space and I've tried to replace it with the other.
As I understand it, they are blocking other cluster operations, and removing records from the topology_requests table will not help.
How can I remove/cancel stuck topology requests?
Also, I have keyspaces with tablets enabled, so manual recovery is not an option.
It would help if you attached logs.
I have two guesses from your description:
- if you have more than one node which is lost, you have to call removenode/replace on all of them (in parallel, not waiting for completion; see the sketch after this list). This is so that those nodes are marked as "permanently dead" and topology operations don't try to synchronize with them. If there is a regular node which is down, topology operations will be stuck trying to synchronize with it.
- if you lost a majority of nodes, the group0 quorum is lost, and topology operations will be stuck on commits to Raft. In that case, you have to perform the group0 recovery procedure first.
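As a minimal sketch of the first case (host IDs are placeholders; check nodetool removenode --help on your version for the exact syntax): issue removenode for each lost node without waiting for the previous one to complete, each time listing the other still-dead node(s) in --ignore-dead-nodes, e.g. from two separate shells:
nodetool removenode --ignore-dead-nodes <host_id_of_dead_node_B> <host_id_of_dead_node_A>
nodetool removenode --ignore-dead-nodes <host_id_of_dead_node_A> <host_id_of_dead_node_B>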
Before logs, the contents of system.topology and system.cluster_status would also help.
Hi. Thank you for the reply.
two nodes are down, yes
I don't think removenode can take more than one host ID
removenode was executed with --ignore-dead-nodes for both
subsequent attempts result in "Concurrent request for removal already in progress" for both
no, quorum is not lost, I have 12 nodes in each of the 3 DCs
does restoring group0 imply "manual recovery"?
because, per the documentation, it is incompatible with tablets being enabled
I also have a ghost node, which might be the cause of this: Ghost node in none state
I have an ungodly amount of logs.
What should I look for and on which node?
I have yet to find anything of practical value.
Maybe we don't need logs. First, let's check the status of the topology. On one of the nodes:
select * from system.cluster_status;
select * from system.topology;
> removenode was executed with --ignore-dead-nodes for both
> subsequent attempts result in "Concurrent request for removal already in progress" for both
That's fine, the nodes should still be marked as excluded (this can be observed in the "ignore_nodes" column of system.topology).
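For reference, a quick way to look at that column (assuming the column name above matches your version's schema; adjust the query if it differs):
select key, ignore_nodes from system.topology;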
> no, quorum is not lost, I have 12 nodes in each of the 3 DCs
> does restoring group0 imply "manual recovery"?
> because, per the documentation, it is incompatible with tablets being enabled
Yes, but it's not needed here.
If you have a ghost node, then indeed it could block progress. You can manually fix it like this:
- on every alive node, execute in cqlsh:
delete from system.topology where key = 'topology' and host_id = <hostid_of_ghost_node>;
delete from system.peers where peer = <ip_of_ghost_node>;
- then do a rolling restart
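A rolling restart just means restarting the nodes one at a time. Assuming a systemd-managed install, on each node in turn something like:
sudo systemctl restart scylla-server
nodetool status
and wait until the restarted node is reported as UN before moving on to the next one.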
Observe whether things start moving. If they do, the node on which the topology coordinator runs should start logging progress. The leader runs on the node which was the last to log "raft_group0 - gaining leadership". On topology updates, so also when tablet migration is moving forward, it will log "raft_topology - updating topology state:".
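For example, assuming Scylla logs to the systemd journal under the scylla-server unit (adjust if your logs go elsewhere), you could locate the current leader and then follow topology progress with:
journalctl -u scylla-server | grep 'raft_group0 - gaining leadership' | tail -n 1
journalctl -u scylla-server -f | grep 'raft_topology - updating topology state'
Run the first command on each node; the one with the most recent match is where the topology coordinator runs.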