Suppose there is a failure in a datacenter. For example, take a cluster with 6 replicas, where every datacenter owns 2 replicas. If the 2 replicas in one datacenter suddenly go down without any data stream, can Scylla detect an AZ-level failure?
A best practice is to treat an Availability Zone (AZ) as a rack.
ScyllaDB uses the NetworkTopologyStrategy and supports rack awareness via Snitch configuration (like GossipingPropertyFileSnitch).
When you define racks within a datacenter, Scylla tries to place replicas in distinct racks to improve resilience. If you have more replicas than racks, some racks will hold more than one replica.
So if:
1 datacenter = 1 region
1 rack = 1 availability zone
And you have 2 replicas per DC, with each replica placed in a different AZ (rack)
Then yes, ScyllaDB will:
Try to distribute replicas across those 2 AZs (racks).
Detect that a node in a rack (AZ) is down via gossip.
Continue serving requests as long as your consistency level allows it.
Consider a failure like this in the cluster I described above: a write request is sent to the 6 replicas, two in each AZ. Suddenly, both nodes in az-1 go down without responding, and one node in az-2 goes down without responding. In this case, a write at QUORUM consistency level across the 6 replicas is bound to fail. Would it be retried, I mean, sent to other nodes?
And consider another case: what if one node suddenly goes down in each AZ?
TL;DR: you should use LOCAL_QUORUM instead of QUORUM. QUORUM also consumes quite significant bandwidth and has a performance impact, because it needs to read/write across DCs.
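To make the numbers concrete, here is a minimal Python sketch of the quorum arithmetic for the two failure scenarios asked about. It only models the counting (quorum = floor(RF / 2) + 1), not Scylla's actual coordinator logic. The layout (3 AZs with 2 replicas each) is taken from the question, and the helper names `quorum` and `can_satisfy` are made up for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def can_satisfy(required: int, live_replicas: int) -> bool:
    """A request can succeed only if enough replicas are alive."""
    return live_replicas >= required


# Assumed layout from the question: 3 AZs, 2 replicas in each, 6 in total.
replicas_per_az = {"az-1": 2, "az-2": 2, "az-3": 2}
total_rf = sum(replicas_per_az.values())

# QUORUM counts replicas across the whole cluster.
global_quorum = quorum(total_rf)

# Scenario 1: both replicas in az-1 and one replica in az-2 are down.
live_scenario_1 = total_rf - 3
# Scenario 2: one replica is down in each of the three AZs.
live_scenario_2 = total_rf - 3

print("QUORUM needs:", global_quorum)
print("scenario 1 ok:", can_satisfy(global_quorum, live_scenario_1))
print("scenario 2 ok:", can_satisfy(global_quorum, live_scenario_2))

# LOCAL_QUORUM counts only replicas in the coordinator's local DC.
# Assumption: each AZ is modelled as its own DC with RF = 2.
local_rf = 2
local_quorum = quorum(local_rf)
print("LOCAL_QUORUM needs:", local_quorum, "of", local_rf)
```

In both scenarios only 3 of the 6 replicas are alive, which is below the QUORUM of 4. Note also that with only 2 replicas in the local DC, LOCAL_QUORUM needs both of them, so that layout cannot tolerate a single local replica being down.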
Then, for 3 DCs and 3 Racks, create the keyspaces as below -