Originally from the User Slack
@Terence_Liu: Have been reading this doc https://opensource.docs.scylladb.com/stable/operating-scylla/procedures/cluster-management/add-node-to-cluster.html. By my understanding, if I've just ingested data into a big single-node cluster and backed up the relevant keyspace/tables to the cloud, I should restore that backup to one node in the prod cluster, add two more nodes to make it RF=3, and wait for streaming from the first node to the other two.
Can I do this instead - restore the same backup to all three nodes, and boot them up together to avoid the streaming process? Assuming all three nodes have identical data, this should be possible?
I understand it’s harder when the original backup spans more than one host, because it’s much harder to know how the key ranges are distributed. I assume nodetool refresh will help in this case?
A ScyllaDB node will ignore any partitions in the SSTables that are not assigned to it, for example when the SSTables were copied from a different node.
@Pete_Aven: Hi Terence, you should be able to boot up an empty 3 node cluster, then follow the restore process for a table. It consists of:
• Create table schema on empty cluster
• Copy all the table data into the table's upload directory - usually /var/lib/scylla/data/keyspacename/tablename-uuid/upload/ (copy the table data from the single node to each node in the new cluster - a 1:3 copy)
• Then run nodetool refresh -- keyspacename tablename
https://opensource.docs.scylladb.com/stable/operating-scylla/nodetool-commands/refresh.html
This ingests the backup into a running cluster, which can even be serving traffic (especially writes) while the refresh runs.
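The restore steps above can be sketched as a shell script. All names here (keyspace, table, backup path, schema.cql, the table directory's uuid) are hypothetical placeholders, not taken from the thread; by default the script only records and prints the commands, so nothing runs against a live node unless you set DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Sketch of the per-node restore steps above. All names are placeholders --
# substitute your own keyspace, table, and backup location. With DRY_RUN=1
# (the default) commands are only recorded and printed, not executed.
set -euo pipefail
DRY_RUN="${DRY_RUN:-1}"
CMDS=""
run() { CMDS+="+ $*"$'\n'; [ "${DRY_RUN}" = "1" ] || "$@"; }

KEYSPACE="mykeyspace"                               # placeholder
TABLE="mytable"                                     # placeholder
BACKUP_DIR="/mnt/backup/${KEYSPACE}/${TABLE}"       # wherever the backup lives
UPLOAD_DIR="/var/lib/scylla/data/${KEYSPACE}/${TABLE}-uuid/upload"  # uuid is per-table

# 1. Recreate the table schema on the empty cluster (schema.cql is assumed
#    to hold the CREATE KEYSPACE / CREATE TABLE statements from the backup).
run cqlsh -f schema.cql

# 2. Copy the backed-up SSTable files into the upload directory
#    (repeat this copy on every node in the new cluster -- the 1:3 copy).
run cp "${BACKUP_DIR}"/* "${UPLOAD_DIR}/"

# 3. Load everything in upload/ into the running node; this is node-local.
run nodetool refresh -- "${KEYSPACE}" "${TABLE}"

printf '%s' "${CMDS}"
```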
@Terence_Liu: Thank you Pete. Do I need to execute nodetool refresh once on each of the nodes, or will running it once on any node cause every node to load the SSTables from its upload folder?
Hi @Pete_Aven. If I do a 1:3 copy (RF=3), will this speed up ingestion by bypassing the load-and-stream process? Or will it actually create more load because each node needs to duplicate the streams to other nodes over essentially the same data?
@Pete_Aven: @Terence_Liu Following the suggestion above, there should be no streaming. You copy the table data over to each node, and you run nodetool refresh on each node; all refresh operations are node-local. Then you run repair when all that's done, and the cluster will be in sync.
@Terence_Liu: thank you!