Best way to Fetch N rows in ScyllaDB: Count, Limit or Paging

I have a use case where I utilize ScyllaDB to limit users’ actions in the past 24h. Let’s say the user is only allowed to make an order 3 times in the last 24h. I am using ScyllaDB’s TTL and counting the number of records in the table to achieve this. I am also using https://github.com/spaolacci/murmur3 to get the hash for the partition key.

However, I would like to know the most efficient way to query the table. I have a few queries whose behavior I’d like to understand better and compare (please correct me if any of my statements are wrong):

  1. using count()

count() will perform a full-scan query, meaning that it may read more records than necessary from the table.

SELECT COUNT(1) FROM orders WHERE hash_id=? AND user_id=?;

  2. using limit

limit will only limit the number of records returned to the client, meaning it will still read all records that match its predicates but only limit the ones returned.

SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?;

  3. using paging

I’m a bit new to this, but if I read the docs correctly, it should only query until it receives the first N records, without having to query the whole table. So if I limit the page size to the number of records I want to fetch and only query the first page, would it work correctly? And will it have a consistent result?

docs: Paging | ScyllaDB Docs

My query still uses limit, but utilizes the driver to achieve this with https://github.com/gocql/gocql:

iter := conn.Query("SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?", hashID, userID, 3).PageSize(3).PageState(nil).Iter()

Please let me know if my analysis was correct and which method would be best to choose.

*The question was asked on Stack Overflow by Radhian Amri

Your client should always use paging - otherwise, you risk adding pressure to the query coordinator, which may introduce latency and memory fragmentation. If you use the Scylla Monitoring stack (and if you don’t, you should!), refer to the CQL Optimization dashboard and - more specifically - to the Paged Queries panel.

Now, to your question. Your example seems a bit minimalist for what you actually want to achieve, and - even if it is not - we have to consider such a setup at scale. For example, there may be one tenant allowed to place 3 orders within a day, but another tenant allowed to place 1 million orders within a week.

If the above assumption is correct - and given the options you have presented - you are better off using LIMIT with paging. The reason is that there are some particular problems with the description you’ve given:

  • First, you want to retrieve N records within a particular time frame, but your queries don’t specify such a time frame
  • Second, either COUNT or LIMIT will initiate a partition scan, and it is unclear how a hash_id + user_id combination alone can determine the number of records within a time frame.

Of course, it may be that I am wrong, but I’d like to suggest different approaches which may or may not apply to you and to your use case.

  • Consider a timestamp component as part of the clustering key. This will allow you to avoid full partition scans with queries such as:

SELECT something FROM orders WHERE hash_id=? AND user_id=? AND ts >= ? AND ts < ?;
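
For illustration, one possible layout for such a table (the column names and the 24-hour TTL are assumptions on my part), where the timestamp clustering column keeps the range query above within a single partition:

CREATE TABLE orders (
    hash_id bigint,
    user_id text,
    ts timestamp,
    order_id uuid,    -- keeps rows unique if two orders share the same timestamp
    PRIMARY KEY ((hash_id, user_id), ts, order_id)
) WITH default_time_to_live = 86400;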

  • If the above is not applicable, perhaps a Counter Table would suit your needs? You could simply increment a counter after an order is placed, and afterward query the counter table, as in:

SELECT count FROM counter_table WHERE hash_id=? AND user_id=? AND date=?;
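
For illustration, a possible counter table layout (the names here are assumptions on my part) - the application bumps the counter whenever an order is placed and then reads it back with the query above:

CREATE TABLE counter_table (
    hash_id bigint,
    user_id text,
    date date,
    count counter,
    PRIMARY KEY ((hash_id, user_id), date)
);

UPDATE counter_table SET count = count + 1 WHERE hash_id = ? AND user_id = ? AND date = ?;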

I hope that helps!

*The answer was provided on Stack Overflow by Felipe Mendes

I have a few points I want to add to what Felipe wrote already:

First, you don’t need to hash the partition key yourself. You can use anything you want for the partition key, even consecutive numbers. The partition key doesn’t need to be random-looking. Scylla will internally hash the partition key to improve the load balancing. You don’t need to know or care which hashing algorithm ScyllaDB uses, but interestingly, it’s a variant of murmur3 too (which is not identical to the one you used - it’s a modified algorithm originally picked by the Cassandra developers).
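
For instance (an illustrative sketch, not something from the question), the table could simply use user_id as its partition key and let ScyllaDB handle the hashing:

CREATE TABLE orders (
    user_id text,
    ts timestamp,
    order_id uuid,
    PRIMARY KEY (user_id, ts, order_id)
) WITH default_time_to_live = 86400;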

Second, you should know - and decide whether you care - that the limit you are trying to enforce is not a hard limit in the face of concurrent operations: Imagine that the given partition already has two records - and now two concurrent record-addition requests come in. Both can check that there are just two records, decide it’s fine to add a third - and then, when both add their record, you end up with four records.
You’ll need to decide whether it is fine for you that a lucky user can get in 4 requests in a day, or whether it’s a disaster. Note that theoretically you can get even more than 4 - if the user manages to send N requests at exactly the same time, they may be able to get 2+N records into the database (but in the usual case, they won’t manage to get many superfluous records). If you want 3 to be a hard limit, you’ll probably need to change your solution - perhaps to one based on LWT rather than TTL.
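
For example, here is a minimal sketch of one possible LWT-based approach (the table and column names are my own, and it uses a fixed daily window rather than a rolling 24-hour TTL):

CREATE TABLE order_quota (
    hash_id bigint,
    user_id text,
    day date,
    used int,
    PRIMARY KEY ((hash_id, user_id), day)
);

-- First order of the day:
INSERT INTO order_quota (hash_id, user_id, day, used) VALUES (?, ?, ?, 1) IF NOT EXISTS;

-- Subsequent orders: read `used`, then attempt a compare-and-set against the value just read
-- (shown here with literal values); if [applied] comes back false, another request won the
-- race, so re-read and retry or reject.
UPDATE order_quota SET used = 2 WHERE hash_id = ? AND user_id = ? AND day = ? IF used = 1;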

Third, I want to note that there is no important performance difference between COUNT and LIMIT when you know a priori that there will only be up to 3 (or perhaps, as explained above, 4 or some other similarly small number) results. If you can assume that the SELECT only yields three or fewer results, and never a thousand, then it doesn’t really matter whether you retrieve them or count them - just do whichever is more convenient for you.
In any case, I think that paging is not a good solution for your needs. For such short results you can just use the default page size - you’ll never reach it anyway - and paging also hints to the server that you will likely continue reading on the next page, so it caches the buffers it needs to do that, whereas in this case you know you’ll never continue after the first three results. So in short, don’t use any special paging setup here - just use the default page size (which is 1MB), and it will never be reached anyway.

*The answer was provided on Stack Overflow by Nadav Har’El