Version: Scylla 6.1.1-0.20240814.8d90b817660a
Topology: 12 nodes, single DC (us-east), single rack, Ec2Snitch. Raft-based topology. Tablets not in use.
State: All 12 UN, single schema version, quiet (no compactions, no streams, no active repair, no tasks in any task_manager module).
What happened
- Started at 10 nodes → added 2 (A and B) → ran fleet-wide nodetool cleanup.
- Node A (one of the two newly added nodes) ran its disk full during cleanup and crashed. Recovered by deleting the commitlog and immediately stopping compactions on startup to free space.
- Tried nodetool stop CLEANUP: compactions paused, but new cleanup work gradually crept back in and disk usage resumed climbing toward exhaustion. Eventually identified a parent cleanup task in task_manager and aborted it:
  curl -s localhost/task_manager/list_module_tasks/compaction
  curl -X POST localhost/task_manager/abort_task/<task_id>
  That stopped it.
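For anyone scripting this: the parent cleanup task can be picked out of the task list by type. A sketch, assuming jq is available and that the task-list JSON carries task_id and type fields; the sample JSON below is illustrative, not real output:

```shell
# Illustrative sample of the task list; real output comes from the
# list_module_tasks endpoint shown above. The task_id/type field names
# are assumptions about the JSON shape.
tasks='[{"task_id":"aaaa-1111","type":"major compaction"},
        {"task_id":"bbbb-2222","type":"cleanup compaction"}]'

# Select the id(s) of cleanup-type tasks, i.e. the ones to abort:
cleanup_ids=$(printf '%s' "$tasks" | jq -r '.[] | select(.type | test("cleanup")) | .task_id')
echo "$cleanup_ids"
```

Each id found this way can then be POSTed to the abort_task endpoint as above.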
- After the cleanup incident, attempts to add a new node hung indefinitely: never reached UJ. Tried multiple times; each attempt left an additional stale entry behind.
- Cleaned up ghost host_ids (a side effect of the retries) on every alive node, per the stuck-topology-requests thread:
  DELETE FROM system.topology WHERE key = 'topology' AND host_id = <ghost_host_id>;
  (system.peers was already clean for those IPs.)
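A sketch of how that delete can be scripted across nodes. The node IPs and ghost host_id here are placeholders, and the loop only builds the command strings so they can be reviewed before being run by hand:

```shell
GHOST_HOST_ID="<ghost_host_id>"   # placeholder: the stale host_id to remove
NODES="10.0.0.1 10.0.0.2"         # placeholder: every alive node's IP

# Build the per-node cqlsh commands; review, then run them manually.
cmds=""
for ip in $NODES; do
  cmds="${cmds}cqlsh $ip -e \"DELETE FROM system.topology WHERE key = 'topology' AND host_id = ${GHOST_HOST_ID};\"
"
done
printf '%s' "$cmds"
```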
- Rolling restart of all 12 nodes.
- After the rolling restart, a fresh cleanup task reappeared in task_manager on node A and had to be aborted again via the same endpoint. Cleanup work appears to persist across restarts.
- Attempted to add a new node again. Same hang.
Current stuck state
Joining node log ends at:
raft_topology - join: request to join placed, waiting for the response from the topology coordinator
Coordinator log for this join — exactly two lines, nothing after:
raft_topology - received request to join from host_id: <new_node_hostid>
raft_topology - placed join request for <new_node_hostid>
No raft_topology - updating topology state: lines since the rolling restart; the coordinator appears idle.
system.topology_requests: 3 rows, done=False, start_time=null, error=null, all initiating_host=leader. Two from earlier failed attempts, one current. (Table has 6 columns on this version — no request_type.)
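For completeness, those rows were read with a plain query; the column names are the ones visible in the rows above, and id is assumed to be the table's key column on this version:

```cql
SELECT id, initiating_host, done, start_time, error
FROM system.topology_requests;
```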
system.topology: 12 normal, 1 none (current joiner), 36 left.
Ghost gossip: 2 IPs, RPC_READY:0, frozen heartbeats. Neither removal path works for them:
- nodetool removenode <host_id> → host id … is not found in the cluster
- POST /gossiper/assassinate/?unsafe=true → Unable to calculate tokens for (topology rows already deleted)
Question
Our goal is simply to get back to a state where we can add nodes. What’s the recommended path from here?
Two obstacles, both appearing to date from the original cleanup incident:
- Topology coordinator is idle. Since the rolling restart it has not logged raft_topology - updating topology state:. New join requests are placed but never begin processing. The queued done=False rows in system.topology_requests (including the two stale ones from failed retries) look suspicious. Could those be blocking the queue, and if so, is there a supported way to drain them short of full Raft RECOVERY mode?
- Cleanup work respawns on restart. Aborting the parent task in task_manager stops the current cleanup, but a fresh one reappears on node startup. Where is the pending-cleanup state persisted, and can it be drained?
Are these related, and is there a targeted recovery path that gets us back to accepting node joins?