Version: Scylla 6.1.1-0.20240814.8d90b817660a
Topology: 12 nodes, single DC (us-east), single rack, Ec2Snitch. Raft-based topology. Tablets not in use.
State: All 12 UN, single schema version, quiet (no compactions, no streams, no active repair, no tasks in any task_manager module).
What happened
- Started at 10 nodes → added 2 (A and B) → ran fleet-wide nodetool cleanup.
- Node A (one of the two newly added nodes) ran its disk full during cleanup and crashed. Recovered by deleting the commitlog and immediately stopping compactions on startup to free space.
- Tried nodetool stop CLEANUP: compactions paused, but new cleanup work gradually crept back in and disk usage resumed climbing toward exhaustion. Eventually identified a parent cleanup task in task_manager and aborted it:
  curl -s localhost/task_manager/list_module_tasks/compaction
  curl -X POST localhost/task_manager/abort_task/<task_id>
  That stopped it.
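For anyone scripting this: the parent cleanup task can be picked out of the task list by type. A sketch, assuming jq is available and that the task-list JSON carries task_id and type fields; the sample JSON below is illustrative, not real output:

```shell
# Illustrative sample of the task list; real output comes from the
# list_module_tasks endpoint shown above. The task_id/type field names
# are assumptions about the JSON shape.
tasks='[{"task_id":"aaaa-1111","type":"major compaction"},
        {"task_id":"bbbb-2222","type":"cleanup compaction"}]'

# Select the id(s) of cleanup-type tasks, i.e. the ones to abort:
cleanup_ids=$(printf '%s' "$tasks" | jq -r '.[] | select(.type | test("cleanup")) | .task_id')
echo "$cleanup_ids"
```

Each id found this way can then be POSTed to the abort_task endpoint as above.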
- After the cleanup incident, attempts to add a new node hung indefinitely: never reached UJ. Tried multiple times; each attempt left an additional stale entry behind.
- Cleaned up ghost host_ids (a side effect of the retries) on every alive node, per the stuck-topology-requests thread:
  DELETE FROM system.topology WHERE key = 'topology' AND host_id = <ghost_host_id>;
  (system.peers was already clean for those IPs.)
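A sketch of how that delete can be scripted across nodes. The node IPs and ghost host_id here are placeholders, and the loop only builds the command strings so they can be reviewed before being run by hand:

```shell
GHOST_HOST_ID="<ghost_host_id>"   # placeholder: the stale host_id to remove
NODES="10.0.0.1 10.0.0.2"         # placeholder: every alive node's IP

# Build the per-node cqlsh commands; review, then run them manually.
cmds=""
for ip in $NODES; do
  cmds="${cmds}cqlsh $ip -e \"DELETE FROM system.topology WHERE key = 'topology' AND host_id = ${GHOST_HOST_ID};\"
"
done
printf '%s' "$cmds"
```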
- Rolling restart of all 12 nodes.
- After the rolling restart, a fresh cleanup task reappeared in task_manager on node A and had to be aborted again via the same endpoint. Cleanup work appears to persist across restarts.
- Attempted to add a new node again. Same hang.
Current stuck state
Joining node log ends at:
raft_topology - join: request to join placed, waiting for the response from the topology coordinator
Coordinator log for this join — exactly two lines, nothing after:
raft_topology - received request to join from host_id: <new_node_hostid>
raft_topology - placed join request for <new_node_hostid>
No raft_topology - updating topology state: lines since the rolling restart; the coordinator appears idle.
system.topology_requests: 3 rows, done=False, start_time=null, error=null, all initiating_host=leader. Two from earlier failed attempts, one current. (Table has 6 columns on this version — no request_type.)
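For completeness, those rows were read with a plain query; the column names are the ones visible in the rows above, and id is assumed to be the table's key column on this version:

```cql
SELECT id, initiating_host, done, start_time, error
FROM system.topology_requests;
```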
system.topology: 12 normal, 1 none (current joiner), 36 left.
Ghost gossip: 2 IPs, RPC_READY:0, frozen heartbeats. Neither removal path works for them:
- nodetool removenode <host_id> → host id … is not found in the cluster
- POST /gossiper/assassinate/?unsafe=true → Unable to calculate tokens for (topology rows already deleted)
Question
Our goal is simply to get back to a state where we can add nodes. What’s the recommended path from here?
Two obstacles, both appearing to date from the original cleanup incident:
- Topology coordinator is idle. Since the rolling restart it has not logged raft_topology - updating topology state:. New join requests are placed but never begin processing. The queued done=False rows in system.topology_requests (including the two stale ones from failed retries) look suspicious. Could those be blocking the queue, and if so, is there a supported way to drain them short of full Raft RECOVERY mode?
- Cleanup work respawns on restart. Aborting the parent task in task_manager stops the current cleanup, but a fresh one reappears on node startup. Where is the pending-cleanup state persisted, and can it be drained?
Are these related, and is there a targeted recovery path that gets us back to accepting node joins?