Originally from the User Slack
@Shobhana: Hi,
I have a 3-node cluster on Ubuntu 24.04. All of them are running Scylla version 6.2.2-0.20241128.c6ef055e9c3b. Each of them has 600+ GB of data (as observed from `df -h`). We use a replication factor of 2 for all keyspaces.
The cloud provider hosting this cluster warned that the disk on one of the nodes is about to fail and asked us to replace it. As of now, nodetool status shows all 3 nodes as Up and Normal. So I want to add a new node to the cluster and decommission the old one. I followed the instructions in Replace a Running Node in a ScyllaDB Cluster and have added the new node, but I am not sure when I can decommission the old node.
nodetool status shows all 4 are Up and Normal:
-- Address Load Tokens Owns Host ID Rack
UN 10.7.76.12 549.13 GB 256 ? cd40950f-eb9b-41f1-b51e-e4423abe64a6 Rack1
UN 10.7.76.24 535.67 GB 256 ? baa1733e-0eec-49ae-9f59-750d71f7f1cd Rack1
UN 10.7.76.25 549.97 GB 256 ? fa90b7bd-36af-4723-975a-febc77a66fd8 Rack1
UN 10.7.76.79 191.55 GB 256 ? 92f7189f-c265-4828-914a-427cb6befc12 Rack1
However, nodetool netstats shows an error:
~# nodetool netstats
Mode: NORMAL
Tablet migration-maps-index-0 7de8bb30-001a-11f1-9a23-e7ba21636979
/10.7.76.24
Receiving 1 files, 0 bytes total. Already received 1 files, 627034336 bytes total
rxnofile 627034336/627034336 bytes(100%) received from idx:0/10.7.76.24
error running operation: rjson::error (JSON assert failed on condition ‘false’, at: 0x617acae 0x4e1f4cd 0x517663e 0x523135d 0x526681b 0x6051466)
I see that the load is redistributing, since the load on the new node keeps increasing. How long should I wait before I can decommission the old node?
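One rough way to answer "when is it safe?" is to wait until `nodetool netstats` reports `Mode: NORMAL` with no files still being received. The check below is only a sketch: the parsing pattern and the sample output are assumptions based on the netstats output pasted above, not an official interface.

```shell
#!/bin/sh
# Sketch: decide whether streaming looks finished, based on `nodetool netstats`
# output. The sample text below is hypothetical; on a real node you would feed
# in "$(nodetool netstats)" instead.

streams_pending() {
  # True (exit 0) if the output still shows one or more files being received.
  echo "$1" | grep -Eq 'Receiving [1-9][0-9]* files'
}

sample_output="Mode: NORMAL
Receiving 1 files, 0 bytes total. Already received 1 files, 627034336 bytes total"

if streams_pending "$sample_output"; then
  echo "streams still running - wait before decommission"
else
  echo "no active streams - consider running: nodetool decommission"
fi
```

In practice you would poll this in a loop and only run `nodetool decommission` on the old node once no streams have been pending for a while.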
@avi: First, this version is long out of support.
Second, that page has two procedures, which did you follow?
@Shobhana: Thanks @avi.
I followed the first procedure. I added a new node and decommissioned the old node after checking that the load was stable from the nodetool status response.
The document says the next step is to run nodetool cleanup on all the remaining nodes in the cluster. I assume this has to be run one node at a time. Should it also be run on the newly added node? Are there any tips to make it complete faster without impacting the production workload?
@avi: It needs to run on all nodes, including the new node. It can be run in parallel, but running it sequentially reduces impact.
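The sequential approach above can be sketched as a simple loop. The node list is taken from the `nodetool status` output earlier in the thread, and the `ssh` invocation is an assumption about how the nodes are reached; adapt it to your own access method.

```shell
#!/bin/sh
# Sketch: run `nodetool cleanup` one node at a time to limit the impact on the
# production workload. Node addresses copied from the status output above.
NODES="10.7.76.12 10.7.76.24 10.7.76.25 10.7.76.79"

for node in $NODES; do
  echo "running cleanup on $node"
  # ssh "$node" nodetool cleanup   # uncomment on a real cluster; each
  #                                # invocation blocks until cleanup finishes,
  #                                # so the nodes are processed sequentially
done
```

Because `nodetool cleanup` only drops data the node no longer owns, skipping a node just leaves stale data on disk; it is safe to re-run later.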