Replacing a dead (DN) node with the same IP, syncing data

Guy · August 22, 2024, 8:55am

Originally from the User Slack

@Dominik_Mankowski: Is it normal that, during a replace operation (https://opensource.docs.scylladb.com/branch-5.4/operating-scylla/procedures/cluster-management/replace-dead-node.html), the node replacing the dead node (the new node has the same IP as the old one) shows status UN in nodetool status, even though it hasn’t synced all the data? For example, nodetool status on the new node looks like this (all the other nodes in the cluster also report this node as UN, whereas I expected UJ):
root@scylladb-drp-test-p-0:~# nodetool status
Datacenter: az_we_dc1
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  172.26.108.8  ?          256          ?       fafffa99-40ab-49f4-9c3e-864cb4fd1ad3  rack1
UN  172.26.108.6  ?          256          ?       7129e789-afc4-4211-ba32-57cf776e4550  rack1
UN  172.26.108.7  ?          256          ?       39d14830-9a6c-42c1-917b-97abbd6edc0a  rack1
UN  172.26.108.4  ?          256          ?       a2018d48-80e7-4b3b-b65e-f16bedb648f2  rack1
UN  172.26.108.5  0 bytes    256          ?       null                                  rack1
Scylla 5.4.9

@avi: @Kamil_Braun do you know?

@Felipe_Cardeneti_Mendes: It is normal because the node isn’t joining. It was dead (DN), and now you are replacing it with another. So think about it this way: the node has already joined, and you are bringing it back up.

@Dominik_Mankowski: > It is normal because the node isn’t joining.
This contradicts what the metrics/dashboard showed (the node was reported as Joining)

@Felipe_Cardeneti_Mendes: Well, then that’s a monitoring issue. “Replacing” would probably be more accurate

@Kamil_Braun: in short: don’t use replace-with-same-IP, because it’s dangerous and has a bunch of these stupid quirks
one recent issue we found with replace-with-same-IP: https://github.com/scylladb/scylladb/issues/19975

GitHub: Failure during replace-with-same-IP leaves the node without STATUS application_state (permanently), and token_metadata inconsistent (until restart) (applies to gossiper / “node-ops” based topology changes) · Issue #19975 · scylladb/scylladb

well, if it completes, then it will be fine
but generally, try avoiding it, use replace-with-different-IP instead

@Dominik_Mankowski: @Kamil_Braun thanks for the hint. Would it be ok if we first removed the dead node (nodetool removenode) from the cluster (scale in), and then simply added it back to the cluster (scale out) with the same IP?

@Kamil_Braun: yes, but it will take 2x as much time as replace, because the data streaming phase will have to be done twice
even more, since at the end you should run cleanup (IIRC cleanup is not really necessary if you use replace, but it is if you use remove + add)

@Dominik_Mankowski: got it, thanks
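
The check from the original question (a node reported as UN while its Load is still 0 bytes and its Host ID is null) can be scripted. The following is only a rough sketch in Python: it parses text shaped like the nodetool status output quoted above and lists such nodes. The function names `parse_status` and `suspicious_nodes` are made up for illustration, and treating "0 bytes" or a null Host ID as a warning sign is an assumption drawn from this thread, not an official ScyllaDB indicator.

```python
import re

# Sample copied from the nodetool status output in the question.
SAMPLE = """\
Datacenter: az_we_dc1
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  172.26.108.8  ?          256          ?       fafffa99-40ab-49f4-9c3e-864cb4fd1ad3  rack1
UN  172.26.108.6  ?          256          ?       7129e789-afc4-4211-ba32-57cf776e4550  rack1
UN  172.26.108.7  ?          256          ?       39d14830-9a6c-42c1-917b-97abbd6edc0a  rack1
UN  172.26.108.4  ?          256          ?       a2018d48-80e7-4b3b-b65e-f16bedb648f2  rack1
UN  172.26.108.5  0 bytes    256          ?       null                                  rack1
"""

# One node row: two-letter status, address, load ("?" or "<number> <unit>"),
# tokens, owns, host ID, rack. The layout is assumed from the sample above.
ROW = re.compile(
    r"^(?P<status>[UD][NLJM])\s+"
    r"(?P<address>\S+)\s+"
    r"(?P<load>\?|\d+(?:\.\d+)?\s+\S+)\s+"
    r"(?P<tokens>\d+)\s+"
    r"(?P<owns>\S+)\s+"
    r"(?P<host_id>\S+)\s+"
    r"(?P<rack>\S+)\s*$"
)


def parse_status(text):
    """Return one dict per node row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


def suspicious_nodes(rows):
    """Addresses shown as UN whose Host ID is null or whose Load is exactly 0.

    Heuristic only (an assumption based on this thread).
    """
    flagged = []
    for row in rows:
        zero_load = row["load"].split()[0] == "0"
        if row["status"] == "UN" and (row["host_id"] == "null" or zero_load):
            flagged.append(row["address"])
    return flagged


if __name__ == "__main__":
    for address in suspicious_nodes(parse_status(SAMPLE)):
        print(f"{address}: reported UN, but Load/Host ID suggest it is still syncing")
```

In practice the text would come from running nodetool status on one of the nodes rather than from a hard-coded sample. A Load of "?" is deliberately not flagged, since every node in the quoted output shows it.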
