Scylla 5.2 Load and Stream

Hello,

I am trying to understand this feature: Nodetool refresh | ScyllaDB Docs

What I don’t understand is: if you are going from 6 nodes (RF=3) to 4 nodes (RF=2), do you need to load data from all 6 nodes even though the replication factor in the 6-node cluster is 3? And if we do need to load data from all 6 nodes into the 4-node cluster, is there any risk of running out of space in the new cluster?

Excellent question. I covered Load and Stream specifics in https://www.scylladb.com/2023/09/18/5-more-intriguing-scylladb-capabilities-you-might-have-overlooked/ , so you may also want to check that out.

If you are going from 6 nodes (RF=3) to 4 nodes (RF=2), do you need to load data from all 6 nodes even though the replication factor in the 6-node cluster is 3?

Copying data from all 6 nodes does indeed seem like overkill, but the real answer is: it depends.

Are you dual-writing to both clusters? Do you expect all data present in the source cluster to match its target? Also, do you use NetworkTopologyStrategy and spread the data across 3 AZs?

If yes, then you can start dual-writing, run a repair job, and once that repair finishes, snapshot your data from a single AZ and copy it over. With RF=3 spread across 3 AZs, each AZ holds a full replica of the data, so snapshots from one AZ are enough. Both clusters should be in sync afterwards.
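
For illustration, here is a minimal sketch of that flow. The keyspace ks, table t, snapshot tag, host name, and table UUID placeholder are all hypothetical; adjust them to your own layout:

```sh
# On each source node in the chosen AZ: repair first, then snapshot
nodetool repair ks
nodetool snapshot -t migration ks

# Copy the snapshot SSTables into the target table's upload directory
# (path shown is the default Scylla data layout)
scp /var/lib/scylla/data/ks/t-<table-uuid>/snapshots/migration/* \
    target-node:/var/lib/scylla/data/ks/t-<table-uuid>/upload/

# On the target node: load the SSTables and stream each partition
# to its owning replicas, regardless of the old cluster's token layout
nodetool refresh ks t --load-and-stream
```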

If we do need to load data from all 6 nodes into the 4-node cluster, is there any risk of running out of space in the new cluster?

All SSTable data is going to get streamed to its replicas, so you may want to let compaction catch up as you go through each Load and Stream step rather than loading everything at once.
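
One way to pace this, again assuming the hypothetical ks and t names from above:

```sh
# Load one node's data, then let the target cluster settle before the next batch
nodetool refresh ks t --load-and-stream

# Watch pending compactions drain before moving on
nodetool compactionstats

# Keep an eye on disk headroom on the target nodes
df -h /var/lib/scylla
```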

@felipemendes good input. Should we add it to the docs?

Do I need to disable compaction during load and stream, and then re-enable it after each load and stream completes?

Not really. You may want to prevent tombstones from being garbage-collected though, in case their gc_grace_seconds happen to expire during the migration. You can do so by setting tombstone_gc to repair mode. See Preventing Data Resurrection with Repair Based Tombstone Garbage Collection - ScyllaDB
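
The setting is applied per table; for example, with the same hypothetical ks.t table:

```sh
# Only purge tombstones that are older than the table's last full repair
cqlsh -e "ALTER TABLE ks.t WITH tombstone_gc = {'mode': 'repair'};"
```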

@tzach, that’s definitely a good idea. 🙂 Ping me if anything comes up.