Could Scylla identify AZ-level failures?

Installation details
#ScyllaDB version:
#Cluster size:
os (RHEL/CentOS/Ubuntu/AWS AMI):

Suppose there is a failure in one datacenter. For example, take a cluster with 6 replicas, which means every datacenter owns 2 replicas. If the 2 replicas in one datacenter suddenly go down without streaming any data, can Scylla recognize this as an AZ-level failure?

A best practice is to treat an Availability Zone (AZ) as a rack.

ScyllaDB uses the NetworkTopologyStrategy and supports rack awareness via Snitch configuration (like GossipingPropertyFileSnitch).

When you define racks within a datacenter, Scylla tries to spread replicas across racks to improve resilience, assuming you have more replicas than racks.

So if:

  • 1 datacenter = 1 region
  • 1 rack = 1 availability zone
  • And you have 2 replicas per DC, with each replica placed in a different AZ (rack)

Then yes, ScyllaDB will:

  • Try to distribute replicas across those 2 AZs (racks).
  • Detect that a node in a rack (AZ) is down via gossip.
  • Continue serving requests as long as your consistency level allows it.

Consider a failure like this, with the cluster I described above: a write request is sent to 6 nodes, so the two nodes in each AZ get the request. Suddenly, both nodes in az-1 go down without responding, and one node in az-2 also goes down without responding. In this case, a write with QUORUM consistency across 6 replicas is bound to fail. Would it retry, I mean, send the request to other nodes?

And what about another case, where one node suddenly goes down in each AZ?

TL;DR: you should use LOCAL_QUORUM instead of QUORUM. QUORUM also consumes significant bandwidth and has a performance impact, because it needs to read and write across DCs.

Then, for 3 DCs and 3 racks, create the keyspace as below:

CREATE KEYSPACE my_app_keyspace
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DC1': 3,
  'DC2': 3,
  'DC3': 3
};

For the above setup, LOCAL_QUORUM = 2 (a majority of the 3 replicas in the local DC), so each DC can tolerate the loss of one replica.
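To make the arithmetic concrete, here is a rough Python sketch of the quorum math for the cases discussed in this thread. It only models the majority rule (floor(RF / 2) + 1) and counts live replicas; the helper names `quorum` and `write_succeeds` are made up for illustration and are not part of any Scylla API or driver. It says nothing about retries or how the coordinator behaves.

```python
def quorum(replicas: int) -> int:
    """Majority of a replica set: floor(replicas / 2) + 1."""
    return replicas // 2 + 1


def write_succeeds(alive: int, required: int) -> bool:
    """A write meets its consistency level when enough replicas are alive to ack it."""
    return alive >= required


# Setup from the question: 6 replicas in total, 2 per AZ/DC, written at QUORUM.
total_rf = 6
required_quorum = quorum(total_rf)  # 4 of 6 replicas must acknowledge

# Case 1: both nodes in az-1 and one node in az-2 are down, so 3 replicas remain.
case1 = write_succeeds(alive=total_rf - 3, required=required_quorum)

# Case 2: one node down in each of the 3 AZs, so again 3 replicas remain.
case2 = write_succeeds(alive=total_rf - 3, required=required_quorum)

# Suggested setup: RF = 3 per DC, written at LOCAL_QUORUM (only the local DC counts).
local_rf = 3
required_local = quorum(local_rf)  # 2 of 3 local replicas must acknowledge

# One local replica down leaves 2 alive, which still meets LOCAL_QUORUM.
case3 = write_succeeds(alive=local_rf - 1, required=required_local)

print(required_quorum, case1, case2)
print(required_local, case3)
```

Under these assumptions, both failure cases from the question leave only 3 of 6 replicas, below the QUORUM of 4, while RF = 3 per DC with LOCAL_QUORUM still succeeds with one replica down in the local DC.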

Normally I would not recommend a replication factor higher than 3.
