Hi, I have several Scylla clusters that have been running in production for multiple years. I realized last year that we still use the default SimpleStrategy for replication on all of our data. This is obviously a bad choice for production, and I have switched over the keyspaces I could using the recommended procedure.
The problem, however, is that our largest keyspaces are hundreds of TBs, and some exploratory repairs give me an estimate of ~40 hours for the whole procedure. My estimates for the smaller keyspaces ran over by almost 2x, so this could take up to 80 hours. I cannot get a downtime window that long, because it would require a full company outage. So, I'm exploring other options.
The main alternatives I've been able to come up with are not great, so I'm hoping there is some Scylla feature I'm missing that would make these easier, or some other method to accomplish this:
- Create whole new cluster and move data over
  - Doubles cluster cost until completed
  - Requires mirroring the data, likely with a custom app since multiple datacenters would handle this in any normal situation
  - Still requires a downtime at the end for a final sync, but likely smaller window
  - All apps can be switched over at once, fairly easily
- Create new keyspaces on the current cluster, copy data over
  - Requires x2 disk space
  - Process would be something like this:
    - Create new keyspace with correct settings
    - Snapshot old keyspaces
    - Copy snapshots into new keyspace
    - Run repairs
      - My assumption is that I would still need to do the repairs from the recommended approach, but I could do it while the normal keyspace is being used
        - If this is true, then it still requires a large final repair to finalize, so may run over on time as well
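
To make the second option more concrete, here is a rough Python sketch of the steps as I picture them. It only builds the CQL statement and the list of commands; it does not run anything. The keyspace names (`app`, `app_v2`), the table name (`events`), the datacenter name (`dc1`), the replication factor, and the snapshot tag are all placeholders I made up for illustration. I am also assuming the tables get recreated in the new keyspace with the same schema before the snapshot files are copied into each table's `upload` directory, and that `nodetool refresh` is what picks them up.

```python
from typing import Dict, List


def create_keyspace_cql(keyspace: str, dc_replicas: Dict[str, int]) -> str:
    """Build a CREATE KEYSPACE statement that uses NetworkTopologyStrategy."""
    # One 'datacenter': replication_factor pair per datacenter.
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in sorted(dc_replicas.items()))
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


def copy_plan(old_ks: str, new_ks: str, tables: List[str], tag: str) -> List[str]:
    """List the per-node steps: snapshot, copy files, refresh, then repair."""
    steps = [f"nodetool snapshot -t {tag} {old_ks}"]
    for table in tables:
        # Manual step: the table must already exist in the new keyspace
        # with the same schema before the files are copied.
        steps.append(
            f"# copy snapshot '{tag}' files of {old_ks}.{table} "
            f"into the upload directory of {new_ks}.{table}"
        )
        steps.append(f"nodetool refresh {new_ks} {table}")
    # Repair the new keyspace while the old one keeps serving traffic.
    steps.append(f"nodetool repair {new_ks}")
    return steps


if __name__ == "__main__":
    # Placeholder names; substitute the real keyspace, tables, and datacenter.
    print(create_keyspace_cql("app_v2", {"dc1": 3}))
    for step in copy_plan("app", "app_v2", ["events"], "migrate"):
        print(step)
```

The snapshot, refresh, and repair commands would have to run on every node, and anything written to the old keyspace after the snapshot is not covered, which is why I expect a final sync or repair to still be needed.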