Originally from the User Slack
@Andres: How does pagination state account for concurrent inserts, if at all? I am assuming it doesn’t, so the pagination state can become “stale” and queries relying on it can return duplicates and/or skip data. Is that correct?
@Felipe_Cardeneti_Mendes: > ScyllaDB aims to provide partition-level write isolation, which means that reads must not see only parts of a write made to a given partition, but either all or nothing. To support this, the cache and memtables use MVCC internally. Multiple versions of partition data, each holding incremental writes, may exist in memory and later get merged.
https://www.scylladb.com/2018/07/26/how-scylla-data-cache-works/
@Andres: So as long as the partition is still in the cache layer then it will be fine? But if it has been evicted before the next page is grabbed that’s when issues may occur?
@Felipe_Cardeneti_Mendes: no, due to MVCC. The read will retrieve whatever existed the moment it gets executed
@Andres: I am having a hard time understanding how it solves paging, but perhaps I am just not seeing something obvious.
But let’s say:
- A client needs to query all rows in partition P, table T.
- To avoid overwhelming the client and the scylla node, paging is used to only return the first 5000 rows and the paging state is returned. So far everything makes sense and I can understand how MVCC avoids any concurrency issues.
- An insert on partition P, table T is done such that the new row’s clustering key makes it the new 4000th row.
- In the next iteration, the client uses the paging state from before to request the next page for the same table/partition. At this point, what does Scylla return? Does it skip the first 5000 rows (and thus duplicate a result, because there is a new 4000th row), or does it keep enough state to know that it needs to skip 5001 rows because one of the rows came in after the first paging state was produced?
The same question applies if the modification had been a deletion (either manual or through TTL).
@Felipe_Cardeneti_Mendes: We can’t move backwards on a clustering slice. The paging state holds the last position returned, and the read continues from there when the application requests the next page.
In other words, if you have a page size of 1, have read 10 non-consecutive clustering rows, and then insert a new clustering row that sorts before what has already been returned, it will just be skipped, but it will be available on a subsequent read.
On the other hand, if you insert data into a part of the clustering slice that is yet to be paged through, and the insert happens before the client requests the next page, then you’ll see the data.
If this doesn’t address your question, you should be able to mimic it fairly easily with a page size of 1, playing with a second cqlsh session open while tracing your initial read.
@avi: Pagination is one of many things that break partition isolation
An insert on an already-passed row will obviously not be seen. An insert on a page that hasn’t been fetched yet may or may not be seen.
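To make the position-based behaviour described above concrete, here is a minimal sketch in Python. It is not ScyllaDB code and does not use any driver: it assumes a partition can be modelled as a sorted list of clustering keys and that the paging state is simply the last key returned. The names `fetch_page`, `rows`, and `state` are hypothetical. Because it models a single in-memory list, it only shows the case where an insert ahead of the cursor lands before the next page request; it does not model the "may or may not be seen" timing across replicas.

```python
from bisect import bisect_right, insort


def fetch_page(rows, page_size, last_key=None):
    """Return (page, paging_state) for a sorted list of clustering keys.

    The paging state here is just the last clustering key returned
    (an assumption for illustration), so the next page resumes strictly
    after that key instead of skipping a fixed number of rows.
    """
    start = 0 if last_key is None else bisect_right(rows, last_key)
    page = rows[start:start + page_size]
    state = page[-1] if page else last_key
    return page, state


# Partition P modelled as sorted clustering keys.
rows = [10, 20, 30, 40, 50, 60]

# First page: the client receives three rows and a paging state.
page1, state = fetch_page(rows, 3)

# Concurrent writes arrive between page requests.
insort(rows, 15)  # sorts before the position already passed
insort(rows, 45)  # sorts after the position already passed

# Second page resumes after the saved position, not after "3 rows".
page2, state = fetch_page(rows, 3, state)

# A brand-new read starts from the beginning and sees the earlier insert.
fresh, _ = fetch_page(rows, 3)

print(page1)  # [10, 20, 30]
print(page2)  # [40, 45, 50]
print(fresh)  # [10, 15, 20]
```

In this model, key 15 is neither duplicated nor returned during the paged read, because the cursor had already passed its position; key 45 is returned because it sits ahead of the cursor when the next page is requested. No row is returned twice, which is the difference from an offset-based "skip N rows" scheme.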