Hi, I have several Scylla clusters that have been running in production for multiple years. I realized last year that we still use the default SimpleStrategy
for replication on all of our data. This is obviously a bad choice for production, and I have switched over what keyspaces I could using the recommended procedure.
The problem I have, however, is that our largest keyspaces are hundreds of TBs, and just running some exploratory repairs gives me an estimate of ~40 hours for the whole procedure. My estimates for the smaller keyspaces ran over by almost 2x, so this could take up to 80 hours. I cannot get that long of a downtime, because it would require a full company outage. So, I’m exploring other options.
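For reference, the procedure I followed for the smaller keyspaces looks roughly like this (keyspace name, DC name, and replication factor are just placeholders for illustration):

```
# Switch the keyspace from SimpleStrategy to NetworkTopologyStrategy
# (adjust the DC name and replication factor to match the real topology)
cqlsh -e "
  ALTER KEYSPACE my_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};
"

# Then repair the keyspace on every node so all replicas actually hold
# the data for their (possibly new) ranges
nodetool repair my_keyspace
```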
The main alternatives I’ve been able to come up with are not great, so I’m hoping there is some feature of Scylla that I’m missing that would make this easier, or some other method to accomplish it:
- Create a whole new cluster and move the data over
  - Doubles cluster cost until completed
  - Requires mirroring the data, likely with a custom app, since in any normal situation multiple datacenters would handle this
  - Still requires a downtime at the end for a final sync, but likely a smaller window
  - All apps can be switched over at once, fairly easily
- Create new keyspaces on the current cluster and copy the data over
  - Requires 2x the disk space
  - The process would be something like this (rough commands sketched after this list):
    - Create the new keyspace with the correct settings
    - Snapshot the old keyspaces
    - Copy the snapshots into the new keyspace
    - Run repairs
  - My assumption is that I would still need the repairs from the recommended approach, but I could run them while the normal keyspace is still being used
    - If this is true, then it still requires a large final repair to finalize, so it may run over on time as well
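To make the second option concrete, here is roughly what I imagine that process looking like per table. The keyspace/table names are placeholders, the data directory layout assumes the default /var/lib/scylla, and the actual table directory names include per-table UUIDs:

```
# 1. Create the new keyspace with the correct replication settings,
#    and recreate the same table schemas under it
cqlsh -e "
  CREATE KEYSPACE my_keyspace_nts
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};
"

# 2. Snapshot the old keyspace on every node
nodetool snapshot -t pre_switch my_keyspace

# 3. On each node, copy (or hard-link) the snapshot sstables into the new
#    table's upload directory
cp /var/lib/scylla/data/my_keyspace/my_table-<uuid>/snapshots/pre_switch/* \
   /var/lib/scylla/data/my_keyspace_nts/my_table-<uuid>/upload/

# 4. Load the copied sstables, then repair while the old keyspace keeps serving
nodetool refresh my_keyspace_nts my_table
nodetool repair my_keyspace_nts
```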
In addition to my ideas here, I got the following from Felipe on Slack:
- Update writes to CL = ALL, run the initial repair, and live with the reduced performance until the strategy switch; then change reads to CL = ALL until all data has been repaired again
  - My thought here is that this could alternatively cut the downtime roughly in half: the initial repair runs live and keeps the data consistent until the downtime for the strategy switch and final repair
- Migrate to a new cluster with scylla-migrator (or load-and-stream with 5.1), write to both clusters, and switch over whenever possible (rough sketch below)
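My understanding of the load-and-stream variant (5.1+) is that it would let me skip matching tokens between clusters: I’d place the old sstables in the target table’s upload directory on the new cluster and let Scylla stream each partition to whichever nodes own it. Something like the following, with names/paths illustrative and the exact flag something I still need to verify:

```
# Copy sstables from the old cluster into the target table's upload directory
# on a node of the new cluster, then load and stream them to the right owners
nodetool refresh my_keyspace my_table --load-and-stream
```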
What is the nodetool status output of that problematic cluster? (You can also ping it to me (Lubos) on Slack to keep it private.)
Lots of things can be done (or can be skipped) if the layout of the cluster is prepared for the switch.
So Garrett has 3 racks, lots of nodes, most nodes with only 15% disk free, and is running Scylla 4.5 with STCS as the compaction strategy.
Garrett also uses nodetool repair -pr for repairs.
In such a case, having Scylla Enterprise with ICS and Scylla Manager (SM) would mitigate the situation a bit.
Anyhow, Felipe’s idea is safe.
If you feel adventurous, however, you can go the route of repairing the cluster first; then, if you have SM, you can resume the repair after writes are paused (or run the repair anew).
Then, after the ALTER, if you resume writes they should be consistent.
But reads will still be inconsistent: roughly 11% of the data will be at RF=1 until you repair it.
So until this repair is done, reads will only be eventually consistent despite using QUORUM.
If you can live with that for the duration of the second repair, the write downtime can ideally be limited to a single repair (or to the repair resumed from SM).
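Roughly, that adventurous sequence would look like this (commands are illustrative; with SM you would pause/resume the repair task instead of rerunning nodetool, and the DC name/RF are placeholders):

```
# 1. Repair the keyspace while it is still taking traffic
nodetool repair my_keyspace        # on every node, or an SM repair task

# 2. Pause writes in the applications, then rerun (or resume) the repair so the
#    last writes are consistent on all replicas
nodetool repair my_keyspace

# 3. Switch the replication strategy
cqlsh -e "
  ALTER KEYSPACE my_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};
"

# 4. Resume writes - new writes land on the new replica sets and stay consistent

# 5. Repair once more to populate the ranges that moved; until this finishes,
#    QUORUM reads on those ranges are only eventually consistent
nodetool repair my_keyspace
```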
Repair can be sped up by using NullCompactionStrategy, assuming there is enough disk space to accommodate the streamed repair data and the writes for the duration of the repair, which with 15% free won’t fly.
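Per table, that would be something like the following sketch (table name is a placeholder, and remember to switch back to the normal compaction strategy once the repair is done):

```
# Disable automatic compaction on the table for the duration of the repair
cqlsh -e "
  ALTER TABLE my_keyspace.my_table
  WITH compaction = {'class': 'NullCompactionStrategy'};
"

# After the repair (and strategy switch) completes, restore the usual strategy
cqlsh -e "
  ALTER TABLE my_keyspace.my_table
  WITH compaction = {'class': 'SizeTieredCompactionStrategy'};
"
```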
Anyhow, I am curious how Garrett will decide.