Best way to Fetch N rows in ScyllaDB: Count, Limit or Paging

Your client should always use paging - otherwise, you risk adding pressure to the query coordinator, which may introduce latency and memory fragmentation. If you use the Scylla Monitoring stack (and if you don’t, you should!), refer to the CQL Optimization dashboard and - more specifically - to the Paged Queries panel.

Now, to your question. It seems that your example is a bit minimal for what you actually want to achieve and, even if it is not, we have to consider such a set-up at scale. For example, one tenant may be allowed to place 3 orders within a day, while another tenant may be allowed to place 1 million orders within a week.

If the above assumption is correct - and given the options you have laid out - you are better off using LIMIT with paging (a sketch follows the list below). The reason is that there are a couple of problems with the description you’ve given:

  • First, you want to retrieve N records within a particular time frame, but your queries don’t specify such a time frame.
  • Second, either COUNT or LIMIT will initiate a partition scan, and it is unclear how a hash_id + user_id lookup alone can determine the number of records within a time frame.
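
If you do proceed with LIMIT plus paging, a minimal sketch of such a query - assuming a hypothetical per-tenant cap of 3 orders and the same columns as your example - could be:

SELECT something FROM orders WHERE hash_id=? AND user_id=? LIMIT 3;

The driver will still page through the results, and LIMIT caps how many rows the coordinator needs to read in total.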

Of course, I may be wrong, but I’d like to suggest a couple of different approaches which may or may not apply to your use case.

  • Consider a timestamp component as part of the clustering key. This will allow you to avoid full partition scans with queries such as:

SELECT something FROM orders WHERE hash_id=? AND user_id=? AND ts >= ? AND ts < ?;
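
To illustrate, here is a minimal schema sketch - the table and column names are assumptions, not taken from your data model - where ts is a clustering column, so rows are stored in time order within each (hash_id, user_id) partition and the query above only reads the matching slice:

CREATE TABLE orders (
    hash_id   text,
    user_id   text,
    ts        timestamp,
    something text,
    -- (hash_id, user_id) is the partition key; ts is the clustering key
    PRIMARY KEY ((hash_id, user_id), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

CLUSTERING ORDER BY (ts DESC) is optional; it simply makes the most recent orders the cheapest rows to read.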

  • If the above is not applicable, perhaps a Counter Table would suit your needs? You could simply increment a counter after an order is placed and, afterwards, query the counter table as in:

SELECT count FROM counter_table WHERE hash_id=? AND user_id=? AND date=?;
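
For completeness, a hypothetical definition and update for such a counter table (again, the names are assumptions) could look like:

CREATE TABLE counter_table (
    hash_id text,
    user_id text,
    date    date,
    count   counter,
    PRIMARY KEY ((hash_id, user_id), date)
);

-- increment once whenever an order is placed
UPDATE counter_table SET count = count + 1 WHERE hash_id=? AND user_id=? AND date=?;

Keep in mind that counter updates are not idempotent, so a retried write can over- or under-count.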

I hope that helps!

*The answer was provided on Stack Overflow by Felipe Mendes*