We have ScyllaDB multi-DC clusters, and our network provider's SLA for packet loss on the cross-DC link is <= 0.1% averaged over a month. In the worst case, that loss budget could be concentrated so that the packet drop rate reaches close to 3% on a single day of the month. If our cluster (at least RF=3 in each DC, CL=LOCAL_QUORUM) handles 500,000 writes/s, we suspect this worst-case scenario could be a problem for ScyllaDB, couldn't it? We would also like to know whether the ScyllaDB team has any basic or advanced cross-DC network requirements, e.g. for packet drop rate, etc.
We are asking because we faced a similar problem recently: the packet drop rate on the cross-DC link reached close to 0.2% for about an hour, we then saw many hints written in both DCs, and the cluster spent almost a day replaying all the hints generated in that one hour. Our cluster currently handles only 5,000 writes/s, so we worry that as the write rate increases this problem will become more serious. There is a --max-hint-window-in-ms setting in ScyllaDB, but we are not sure whether it would help in this case. We also considered triggering a whole-cluster repair after such an incident, but since repair produces more cross-DC traffic and repairing a multi-DC cluster takes much longer, during the repair period we could hit another packet-drop spike and generate even more hints (we recently ran into this as well).
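To make that worst case concrete, here is the rough back-of-envelope calculation behind our 3% figure (our own napkin math, not an official ScyllaDB number):

```python
# Rough worst-case estimate: the whole monthly loss budget spent in one day.
monthly_avg_drop_rate = 0.001   # 0.1% SLA, averaged over the month
days_in_month = 30

# If every dropped packet of the month happens on a single day, the drop rate
# observed on that day can be up to ~30x the monthly average.
worst_case_daily_rate = monthly_avg_drop_rate * days_in_month
print(f"worst-case single-day drop rate: {worst_case_daily_rate:.1%}")  # ~3.0%
```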
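For reference, this is the kind of rough estimate we use to reason about hint volume. The per-write hint count and the fraction of failed remote deliveries are pure assumptions for illustration (a timed-out remote write is not the same thing as a raw dropped packet, since TCP retransmission absorbs most single drops):

```python
# Very rough hint-volume estimate for a cross-DC drop-rate spike.
# Assumptions (hypothetical, for illustration only):
#   - every write that fails to reach a remote replica in time produces one hint
#   - remote_replicas is the number of replicas in the other DC (RF=3 there)
#   - failed_fraction is the share of remote deliveries that actually time out,
#     which is NOT the raw packet drop rate

def estimate_hints(writes_per_sec, spike_seconds, remote_replicas=3, failed_fraction=0.002):
    """Estimate how many hints accumulate during a cross-DC packet-drop spike."""
    return writes_per_sec * spike_seconds * remote_replicas * failed_fraction

for rate in (5_000, 500_000):
    hints = estimate_hints(rate, spike_seconds=3600)
    print(f"{rate:>7} writes/s -> ~{hints:,.0f} hints for a 1-hour spike")
```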
So what is the ScyllaDB team's recommendation in this situation?
Given your situation, it seems reasonable to assume that you are connected not only through an unreliable network link but also a slow one, since hints took a long time to be replayed and repair can likewise take long.
Ideally, hint replay shouldn't take more than a few minutes, so it is definitely strange that it took almost a day.
Occasional network failures aren't a problem per se, given that you can - and SHOULD - always repair. You have the flexibility to break a repair task down per keyspace/table/replica/token range/etc., so even if a repair fails (and note that ScyllaDB Manager has retry mechanisms), you should still run it to completion on a regular basis.
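As a minimal sketch of what "breaking a repair down" can look like (assuming you drive plain nodetool yourself rather than ScyllaDB Manager; the keyspace and table names are made up for illustration):

```python
# Minimal sketch: run primary-range repairs one table at a time so that a
# failure only forces you to re-run a small unit of work.
# Keyspace/table names below are hypothetical examples.
import subprocess

KEYSPACE = "my_keyspace"            # hypothetical
TABLES = ["events", "users"]        # hypothetical

for table in TABLES:
    # -pr repairs only the primary ranges owned by this node; run the same
    # loop on every node (or let ScyllaDB Manager schedule and retry it).
    cmd = ["nodetool", "repair", "-pr", KEYSPACE, table]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```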
You’ve asked:
So what is the ScyllaDB team's recommendation in this situation?
Ideally, run your database behind a network you can trust and that is fast enough for your needs, and handle the occasional hiccups, which will happen anyway.
Thanks for your response. The cross-DC link we are using has an SLA guaranteeing that the average packet drop rate over a month is <= 0.1%, but there is no guarantee for shorter periods, which means the rate could go above 0.1% (as in the worst-case scenario I described earlier) as long as the monthly average stays below 0.1%. Cross-DC bandwidth is not a problem at the moment: our multi-DC ScyllaDB cluster uses at most 1 Gbps on the cross-DC link, and the available bandwidth on that link is at least 10 Gbps. Our data centers are on the US East and West coasts, so the link latency is about 100 ms.
As you can see, a large number of hints were written in about an hour, and ScyllaDB then replayed them ("Hints sent") at a much slower pace. We already checked the other graphs and confirmed there were no CQL write errors within each DC, and the cluster load was relatively low (<= 25%) during that period, so the hints written here are mostly due to cross-DC sync. Originally we assumed that, because these are cross-DC hints (with higher latency) and ScyllaDB replays them in the background at a lower priority, the replay would take that much longer. Is this understanding incorrect?
Ideally, hint replay shouldn't take more than a few minutes, so it is definitely strange that it took almost a day.
This matches our experience when a packet-drop spike lasts only a few seconds or minutes, but is it still true when a cross-DC packet-drop spike lasts more than an hour in a multi-DC environment?
Ideally, run your database behind a network you can trust and that is fast enough for your needs, and handle the occasional hiccups, which will happen anyway.
Occasional hiccups are not what we are concerned about; we would like to know whether ScyllaDB has more specific network requirements, such as a maximum packet drop rate, etc., when running a multi-DC ScyllaDB cluster at write rates like 500,000 or 1,000,000 writes/s. Sorry, but "unreliable" or "fast enough" is a little too vague for us.
Well, as you explained before, the local DC nodes aren't even overloaded, so there is little reason to lower the hint replay priority. It shouldn't take a day.
This matches our experience when a packet-drop spike lasts only a few seconds or minutes, but is it still true when a cross-DC packet-drop spike lasts more than an hour in a multi-DC environment?
Hint replay to a remote DC will definitely take longer given the RTT latency, yet a day still seems excessive. How long does a repair typically take, and how large are your tables? Did you check the latency of a cross-DC query? (This may be worth a try; you should expect the average to be around your RTT plus some overhead, but nothing very far from it.) You may also shut down a remote node for a while and check whether the situation reproduces.
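For example, here is a hedged sketch of a cross-DC latency probe using the Python driver; the remote-node address and the sample count are placeholders, and the trivial system.local read is just a stand-in for your own queries:

```python
# Rough cross-DC latency probe: connect to a node in the remote DC and time a
# trivial read from the client's side.
import time
from cassandra.cluster import Cluster

REMOTE_DC_NODE = "10.0.1.10"        # hypothetical remote-DC node address

cluster = Cluster([REMOTE_DC_NODE])
session = cluster.connect()

samples = []
for _ in range(20):                 # arbitrary number of samples
    start = time.perf_counter()
    session.execute("SELECT release_version FROM system.local")
    samples.append((time.perf_counter() - start) * 1000)

print(f"avg: {sum(samples)/len(samples):.1f} ms, max: {max(samples):.1f} ms")
cluster.shutdown()
```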
we would like to know whether ScyllaDB has more specific network requirements, such as a maximum packet drop rate, etc., when running a multi-DC ScyllaDB cluster at write rates like 500,000 or 1,000,000 writes/s. Sorry, but "unreliable" or "fast enough" is a little too vague for us.
No, we don't, beyond recommending a network of 10 Gbps or more. Our architecture accepts the fact that failures and partitions can occur, and the cluster should be able to recover from them. I'd still check on the network side of things, since you clearly stated that repairs also take a significant amount of time.
Should you still feel stuck, please follow up with an issue and provide thorough details on your setup, your ScyllaDB version, and Prometheus data covering the timeframe of the reported incident.
How long does a repair typically take, and how large are your tables?
A whole-cluster repair (7 nodes in total) takes ~6.5 hours, and the data size per replica is 1.4 TB.
Did you check the latency of a cross-DC query? (This may be worth a try; you should expect the average to be around your RTT plus some overhead, but nothing very far from it.)
We don't run cross-DC queries directly, but we do monitor cross-DC data sync, and it meets our expectation of roughly "RTT plus some overhead", as you put it.