Installation details
#ScyllaDB version: 6.2
#Cluster size: 12 nodes x 3 dc
#OS: Ubuntu 24.10
Hi.
I had issues with a couple of nodes, which led to them being down and unable to be removed or replaced.
SELECT * FROM system.topology_requests WHERE done = False ALLOW FILTERING ;
id | done | end_time | error | initiating_host | request_type | start_time
--------------+-------+----------+-------+-----------------+--------------+-----------
cd9b84c0-... | False | null | null | 59853b48-... | replace | null
524e7f0c-... | False | null | null | a48691bf-... | remove | null
8a5185e6-... | False | null | null | a48691bf-... | remove | null
acc4f340-... | False | null | null | 094374c3-... | replace | null
I don't have the exact steps that led to this; one of them ran out of disk space and I've tried to replace it with the other.
As I understand it, they are blocking other cluster operations, and removing records from the topology_requests table will not help.
How can I remove/cancel stuck topology requests?
Also, I have keyspaces with tablets enabled, so manual recovery is not an option.
It would help if you attached logs.
I have two guesses from your description:
- if you have more than one node which is lost, you have to call removenode/replace on all of them (in parallel, not waiting for completion; see the sketch after this list). This is so that those nodes are marked as "permanently dead" and topology operations don't try to synchronize with them. If there is a regular node which is down, topology operations will be stuck trying to synchronize with it.
- if you lost a majority of nodes, the group0 quorum is lost, and topology operations will be stuck on commits to Raft. In that case, you have to perform the group0 recovery procedure first.
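As a minimal sketch of the first case (host IDs are placeholders; check nodetool removenode --help on your version for the exact syntax): issue removenode for each lost node without waiting for the previous one to complete, each time listing the other still-dead node(s) in --ignore-dead-nodes, e.g. from two separate shells:
nodetool removenode --ignore-dead-nodes <host_id_of_dead_node_B> <host_id_of_dead_node_A>
nodetool removenode --ignore-dead-nodes <host_id_of_dead_node_A> <host_id_of_dead_node_B>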
Before logs, the contents of system.topology and system.cluster_status would also help.
Hi. Thank you for the reply.
two nodes are down, yes
I don't think removenode can take more than one host ID
removenode was executed with --ignore-dead-nodes for both
subsequent attempts result in "Concurrent request for removal already in progress" for both
no, quorum is not lost, I have 12 nodes in each of the 3 DCs
does restoring group0 imply "manual recovery"?
because, per the documentation, it is incompatible with tablets being enabled
I also have a ghost node, which might be the cause of this: Ghost node in none state
I have an ungodly amount of logs.
What should I look for and on which node?
I have yet to find anything of practical value.
Maybe we don't need logs. First, let's check the status of the topology. On one of the nodes:
select * from system.cluster_status;
select * from system.topology;
> removenode was executed with --ignore-dead-nodes for both
> subsequent attempts result in "Concurrent request for removal already in progress" for both
That's fine, the nodes should still be marked as excluded (this can be observed in the "ignore_nodes" column of system.topology).
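For reference, a quick way to look at that column (assuming the column name above matches your version's schema; adjust the query if it differs):
select key, ignore_nodes from system.topology;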
> no, quorum is not lost, I have 12 nodes in each of the 3 DCs
> does restoring group0 imply "manual recovery"?
> because, per the documentation, it is incompatible with tablets being enabled
Yes, but it's not needed here.
If you have a ghost node, then indeed it could block progress. You can manually fix it like this:
- on every alive node, execute in cqlsh:
delete from system.topology where key = 'topology' and host_id = <hostid_of_ghost_node>;
delete from system.peers where peer = <ip_of_ghost_node>;
- then do a rolling restart
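A rolling restart just means restarting the nodes one at a time. Assuming a systemd-managed install, on each node in turn something like:
sudo systemctl restart scylla-server
nodetool status
and wait until the restarted node is reported as UN before moving on to the next one.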
Observe whether things start moving. If they do, the node on which the topology coordinator runs should start logging progress. The leader runs on the node which was the last to log "raft_group0 - gaining leadership". On topology updates, so also when tablet migration is moving forward, it will log "raft_topology - updating topology state:".
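For example, assuming Scylla logs to the systemd journal under the scylla-server unit (adjust if your logs go elsewhere), you could locate the current leader and then follow topology progress with:
journalctl -u scylla-server | grep 'raft_group0 - gaining leadership' | tail -n 1
journalctl -u scylla-server -f | grep 'raft_topology - updating topology state'
Run the first command on each node; the one with the most recent match is where the topology coordinator runs.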