Installation details
ScyllaDB version: 6.2.1
Data model:
CREATE TABLE data(
p TEXT,
c TIMESTAMP,
v BLOB,
PRIMARY KEY(p, c)
)
Queries: Give me the first and the last value for each partition.
- SELECT * FROM data WHERE p IN (‘p1’, ‘p2’, ‘p3’) PER PARTITION LIMIT 1
Here is we can get the first point for each partition and don’t care about “global” ordering, so resulted rows won’t be sorted by clustering key and natural order will be applied only locally within each partition. Result may look like this:
p1 2024-11-01 00:35:00
p2 2024-11-01 00:30:00
p3 2024-11-01 00:31:00
- SELECT * FROM data WHERE p IN (‘p1’, ‘p2’, ‘p3’) ORDER BY c DESC PER PARTITION LIMIT 1
It’s not possible to use IN and ORDER BY together with paging enabled. The reason is that scylla applies order “localy” within each partition, but also applies it “globally” across the resulted rows that makes scylla store result in-memory to apply “global” sorting. But if I don’t care about “global” ordering and only wanted to specify ordering within each partition that introduce performance overhead. Result may look like this:
p1 2024-11-27 23:35:00
p2 2024-11-27 23:30:00
p3 2024-11-27 23:31:00
Question: Does scylla supports “per partition ordering” only? And if not, are there any plans to support it?
The short answer is that ScyllaDB doesn’t support this and I’m not aware of any plans to support it either.
Note that making the ordering restricted to a single partition will still not make it possible to execute this query: a partition can be arbitrarily big and to order it according to any other order than the clustering one, we still have to read it all into memory, which might not be possible.
1 Like
Thanks for the answer, but I still didn’t get why we need read whole result in memory when we are sorting by clustering key, but in reversed order? Let’s have a look at the following queries:
- Get me first row for each partition and ignore sorting across the final result
SELECT * FROM data WHERE p IN (‘p1’, ‘p2’, ‘p3’)
PER PARTITION LIMIT 1
We don’t need to read everything in the memory, since we don’t need to sort the final result, right? To follow further examples, let’s introduce new keyword “PER PARTITION ORDER”, so we can write the same query in more explicit way
SELECT * FROM data WHERE p IN (‘p1’, ‘p2’, ‘p3’)
PER PARTITION LIMIT 1
PER PARTITION ORDER DEFAULT
- Get me first row for each partition and perform sorting across the final result
SELECT * FROM data WHERE p IN (‘p1’, ‘p2’, ‘p3’)
PER PARTITION LIMIT 1
ORDER BY c ASC
We do need to read everything in memory, because we want the final result to be sorted.
- And now I want to make the same query like the first one, but in reversed order. So, it should return the last row for each partition, but the final result not necessary to be sorted. With suggested keyword it will look like that
SELECT * FROM data WHERE p IN (‘p1’, ‘p2’, ‘p3’)
PER PARTITION LIMIT 1
PER PARTITION ORDER REVERSED
But instead I need to do this that introduce overhead and only possible to do with pagging disabled
SELECT * FROM data WHERE p IN (‘p1’, ‘p2’, ‘p3’)
PER PARTITION LIMIT 1
ORDER by c DESC
You are right, examples (1) and (3) do not require reading all results into memory, as the end results are partially sorted only.
I think this PER PARTITION ORDER
would be possible to implement.
2 Likes
Thanks, sounds great. Do I need to create feature request on the github for that? Or this feature may be considered only for the Enterprise version?
Feel free to create a feature request issue in the open-source repository.