Hello Scylla experts! My team is trying out Scylla Cloud for the first time, and we're already impressed with the immediate performance improvements on the read path. We're heavy users of LWTs and atomic batches in C*, and we're noticing regular timeouts when attempting large batches that all start with one CAS operation followed by 10k INSERT INTO statements for different keys (Cassandra timeout during CAS write query at consistency SERIAL). These seem to happen more frequently in Scylla than with DataStax Astra. Our use case for atomic batches is essentially to “fence” older writers; this isn't exactly it, but you can think of it as each write operation having a unique “generation”, where the first statement of the batch is
UPDATE foo SET generation = 10 WHERE partitionKey = '0' IF generation < 10;
to ensure that no writes with a generation greater than 10 have succeeded previously. All other statements are just basic INSERT INTOs with no CAS operation. In steady state there should only be a single writer to a partition (hence the fencing!), which means contention should be nearly zero. A few questions:
- How do LWTs work in tandem with atomic batches? Is Paxos run once for the entire batch (if there’s only one CAS statement)? Any kind of documentation here would be super useful.
- Would we be better off with more, smaller batches (e.g. breaking up one write of 10k events into 100 batches of 100 events, with a CAS at the start of each batch)?
- Are there performance considerations for having many concurrent LWTs if they don’t contend with the same partition key?
- Is there a better way to do fence writers than LWTs with a generation id?
Note that I haven’t upgraded to the Scylla driver yet (I’m just keeping the old DSE driver in our code); if it’s possible that this could help, I can easily do that (it just involves some CI/CD toil).
*transcript from a discussion on the user Slack channel
Upgrading to the Scylla drivers would help: the Python and Java drivers are LWT-aware and prefer sending LWT writes to the primary replica, which reduces Paxos conflicts.
The main performance consideration is something called disk write DMA alignment, which comes from the underlying filesystem. By default it’s 4K, which means every LWT-related write to the commit log takes at least 4K of commit log space. If you have low concurrency (i.e. few requests per shard), that could mean a lot of write amplification (writeamp).
In other words, unless your workload is exclusively using LWT, you don’t need to worry.
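To make the alignment point concrete, here is a small sketch (my own illustration, not Scylla code) of how a commit-log write gets padded to the alignment boundary; the 4096-byte value is the default mentioned above, and the 200-byte mutation size is just an assumed example:

```python
# Sketch: effect of DMA alignment on commit-log footprint.
# ALIGN mirrors the 4K default described above; the real value
# depends on the underlying filesystem.
ALIGN = 4096

def padded_size(nbytes: int, align: int = ALIGN) -> int:
    """Round a write up to the next alignment boundary (ceiling division)."""
    return -(-nbytes // align) * align

# A small (assumed ~200-byte) LWT mutation still occupies a full 4K:
print(padded_size(200))        # -> 4096
# The same 200 bytes amortized across a 100-statement batch:
print(padded_size(200 * 100))  # -> 20480
```

So a lone low-concurrency LWT pays the full 4K per write, while batched writes amortize the padding, which is why this mostly matters for LWT-heavy, low-concurrency workloads.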
I don’t know if there is a better way to do barriers than the one with a generation id you described. I don’t fully understand the scenario; to evaluate it, I would need to see all queries to this data and all the nuances of how the results are used. The way you describe it - to ensure that no writes with a generation greater than 10 have succeeded previously - does not make a lot of sense to me, since another write can take place right after the conditional statement and succeed before the results of the conditional statement are delivered to the client.
I don’t know what you mean by atomic batches. We call batches that have at least one conditional statement in them a conditional batch. There is only one Paxos round for such a batch. Hope this helps.
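For readers following along, the shape of such a conditional batch can be sketched like this - a hypothetical helper that composes the CQL text, with the fence statement first and plain INSERTs after. The events table and its eventId/payload columns are made up for illustration; only the foo fence statement comes from the thread:

```python
# Sketch: composing the CQL for a conditional batch as described in the
# thread: one CAS fence statement followed by plain INSERTs.
# The "events" table and its columns are hypothetical.
def conditional_batch(partition_key: str, generation: int, rows: list) -> str:
    stmts = [
        f"UPDATE foo SET generation = {generation} "
        f"WHERE partitionKey = '{partition_key}' IF generation < {generation};"
    ]
    for event_id, payload in rows:
        stmts.append(
            f"INSERT INTO events (partitionKey, eventId, payload) "
            f"VALUES ('{partition_key}', {event_id}, '{payload}');"
        )
    return "BEGIN BATCH\n  " + "\n  ".join(stmts) + "\nAPPLY BATCH;"

print(conditional_batch('0', 10, [(1, 'a'), (2, 'b')]))
```

In real application code you would use the driver’s prepared statements and batch APIs (e.g. BatchStatement in the Python driver) rather than interpolating strings, but the statement ordering is the same.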
This is super helpful, thanks!
The way you describe it - to ensure that no writes with generation greater than 10 have succeeded previously - is not making a lot of sense to me, since just another write can take place right after the conditional statement
By “atomic batch” I meant “conditional batch”: all writes to Scylla would go through these conditional batches, and the first statement of each batch is the CAS, so I imagine only one writer can “win” unless I’m misunderstanding something? (e.g. imagine node A attempts to write with generation id 10 while node B attempts to write with generation id 9 for the same key - if the batch from A goes through first, then the batch from B should fail). We use the barrier on every write.
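The node A / node B race above can be illustrated with a toy in-memory model of the fence semantics (just an illustration of the intended compare-and-set behavior, not Scylla code):

```python
# Toy in-memory model of the generation fence (not Scylla code):
# the CAS succeeds only if the stored generation is below the proposed
# one, and the rest of the batch is applied atomically with it.
class Partition:
    def __init__(self):
        self.generation = 0
        self.rows = []

    def conditional_batch(self, generation: int, inserts: list) -> bool:
        # The fence: "IF generation < <proposed>"
        if not (self.generation < generation):
            return False  # batch rejected, no rows applied
        self.generation = generation
        self.rows.extend(inserts)
        return True

p = Partition()
# Node A's batch (generation 10) lands first and wins:
assert p.conditional_batch(10, ["write from node A"]) is True
# Node B's stale batch (generation 9) then fails the fence:
assert p.conditional_batch(9, ["write from node B"]) is False
assert p.rows == ["write from node A"]
```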
We call batches that have at least one conditional statement in them a conditional batch. There is only 1 Paxos round for such batch.
That’s good news for us! I suspect that means, to answer my second question, it would hypothetically be better to do fewer, larger conditional batches than more, smaller ones? Each batch is likely to far exceed 4K, so hopefully the alignment consideration isn’t too big of a deal, but I’ll do some investigation to confirm that. Are there any specific metrics we should keep an eye on to figure out whether we’re poorly utilizing the commitlog (hitting write amplification)? I guess so long as we’re below the node’s IOPS limit we should be good from that perspective…
yes, if you have a batch whose first statement is:
UPDATE foo SET generation = 10 WHERE partitionKey = '0' IF generation < 10;
the entire batch is conditional and it goes through a single Paxos round. It still means some writeamp (write amplification), since the data is written four times instead of two: first into system.paxos and its commit log, then into the base table and its commit log, instead of just the base table and its commit log. I think, though, that overall it’s going to be quite efficient.
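To put rough numbers on the "fewer larger batches" question, here is a back-of-the-envelope sketch comparing one 10k-statement batch with 100 batches of 100, counting Paxos rounds and alignment-padded commit-log bytes. The ~200-byte mutation size is an assumption, and this models only the alignment padding, not the doubled system.paxos writes described above:

```python
# Back-of-the-envelope comparison of batching schemes.
# Assumptions: 4K alignment (the default discussed above) and an
# assumed average mutation size of 200 bytes.
ALIGN = 4096
BYTES_PER_MUTATION = 200

def batching_cost(total_stmts: int, batch_size: int):
    """Return (paxos_rounds, aligned commit-log bytes) for a scheme."""
    batches = total_stmts // batch_size
    per_batch = -(-(batch_size * BYTES_PER_MUTATION) // ALIGN) * ALIGN
    return batches, batches * per_batch

print(batching_cost(10_000, 10_000))  # one big batch
print(batching_cost(10_000, 100))     # 100 batches of 100
```

Under these assumptions the single big batch pays one Paxos round and slightly less padding, while the 100-batch scheme pays 100 Paxos rounds for the same data - consistent with the "fewer, larger batches" intuition, at the cost of bigger timeout blast radius per batch.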
As for poorly using the commit log: if you’re writing in batches like the above, I don’t think this is relevant to you. What I would keep an eye on is how much space your commit log takes overall and how many available segments there are over time. I believe both metrics are present in our stock Grafana monitoring.