New(?) Data Resurrection Without Cleanups

The Adding a New Node Into an Existing ScyllaDB Cluster (Out Scale) documentation changed for 5.2, and it now says there is a chance of data resurrection if cleanups are not run in a timely manner.

Is this related to the data resurrection caused by not repairing often enough, or are they unrelated? My understanding in the past was that cleanups were just to recover disk space, so I'm trying to understand what, if anything, changed, and what the full risks are of not running cleanup between node additions and removals.

The primary goal of cleanup is avoiding data resurrection. Consider a write, W1, which is written to a node, N1. After a new node, Nx, is added, N1 no longer owns W1. If no cleanup is run, this data will stay on N1. Some time down the line, W1 is deleted and the tombstone is garbage collected. Then at some point Nx is removed from the cluster and the ownership of W1 comes back to N1. Remember that this write was deleted earlier, but the tombstone was garbage collected. Since there is no longer any newer entry for W1, the old value that N1 still holds becomes the latest value, and the deleted data is resurrected.
To avoid this we run cleanup, which ensures that no such stale data lingers on nodes after token movement. Freeing up disk space is a secondary, albeit also important aspect.
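To make the sequence above concrete, here is a minimal toy simulation in Python. It is only a sketch of the reasoning, not how ScyllaDB is implemented: the `Node` class, the `scenario` function, the single key "k" and the integer timestamps are all made up for illustration, and streaming, tombstone garbage collection and cleanup are each reduced to a line or two.

```python
from dataclasses import dataclass, field

TOMBSTONE = object()  # hypothetical marker standing in for a delete


@dataclass
class Node:
    name: str
    data: dict = field(default_factory=dict)  # key -> (timestamp, value)

    def write(self, key, ts, value):
        # Last-write-wins: keep the entry with the highest timestamp.
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def gc_tombstones(self):
        # Toy garbage collection: drop every tombstone.
        self.data = {k: v for k, v in self.data.items() if v[1] is not TOMBSTONE}

    def cleanup(self, owned_keys):
        # Toy cleanup: drop every key this node no longer owns.
        self.data = {k: v for k, v in self.data.items() if k in owned_keys}

    def read(self, key):
        entry = self.data.get(key)
        if entry is None or entry[1] is TOMBSTONE:
            return None
        return entry[1]


def scenario(run_cleanup: bool):
    n1 = Node("N1")
    n1.write("k", 1, "W1")            # W1 is written to N1

    nx = Node("Nx")                   # Nx is added and takes ownership of "k"
    nx.write("k", 1, "W1")            # bootstrap streams the data to Nx
    if run_cleanup:
        n1.cleanup(owned_keys=set())  # N1 owns nothing for "k" any more

    nx.write("k", 2, TOMBSTONE)       # the delete only reaches the current owner
    nx.gc_tombstones()                # later, the tombstone is garbage collected

    # Nx is removed: whatever it holds is streamed back to N1.
    for key, (ts, value) in nx.data.items():
        n1.write(key, ts, value)
    return n1.read("k")


print("without cleanup:", scenario(run_cleanup=False))
print("with cleanup:   ", scenario(run_cleanup=True))
```

Without the cleanup step, N1 still holds its stale copy of W1 and serves it once ownership returns. With cleanup, the stale copy is gone before the tombstone is collected, so nothing is left to resurrect.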

Thanks for the info Botond. I assume this has always been a possibility then, and not something new?

This has some impact on our process around AWS node decommissions that we’ll have to figure out. AWS gives around 2 weeks' notice that a node will be decommissioned, so we bootstrap a new node into the cluster, decommission the old one, then worry about cleanups. With this process we would need to insert the cleanups in the middle and get them done within the 2 week window, which could be cutting it close on some clusters (our largest has 39 nodes).

Do you have any recommendation on how you would handle that situation? Decommissioning the node first would fix it, though then we have to undersize the cluster for a short window instead of oversizing it.

Doing bootstrap + decommission in quick succession, then doing cleanup after should be fine. Just don’t delay running cleanup too long.

Also, maybe look into the replace operation. With replace, no cleanup is needed, although it has a drawback: the cluster temporarily loses a replica while the replace is going on, so QUORUM reads are more susceptible to failing.
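As a rough illustration of that drawback, the arithmetic below shows how much headroom QUORUM has with and without one replica missing. The replication factor of 3 is an assumption for the example (the thread does not state one), and the helper names are made up.

```python
def quorum(rf: int) -> int:
    # QUORUM needs a majority of replicas: floor(RF / 2) + 1.
    return rf // 2 + 1


def tolerable_failures(rf: int, replicas_up: int) -> int:
    # How many more replicas can fail before QUORUM can no longer be met.
    return replicas_up - quorum(rf)


rf = 3  # assumed replication factor, for illustration only
print("quorum size:", quorum(rf))
print("headroom, all replicas up:", tolerable_failures(rf, replicas_up=3))
print("headroom, during replace: ", tolerable_failures(rf, replicas_up=2))
```

Under this assumption, a token range served by the node being replaced has no spare replica until the replace finishes, so one more unavailable replica for that range is enough to fail QUORUM requests.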

Interesting, I’m fairly sure that is new as well. We used to use the dead node replacement procedure, but ran into issues with data loss (on 4.X, before RBNO was enabled for replace actions). I can’t find the old docs, but I have notes saying the bootstrap/decommission approach was recommended at the time, so we switched to that.

If replace is considered a valid alternative then we may switch back. Thank you.

Yes, replace was made safe by using RBNO for it. This change was made in 4.6, where we enabled RBNO by default for replace. See ScyllaDB Open Source 4.6 - ScyllaDB.