Question: I can still find lots of information about large partition problems, and I can also see this information in the ScyllaDB Monitoring dashboard. Are large partitions still a problem in ScyllaDB 6.2.1, and should I manually do time bucketing? It seems your team did optimizations in that regard: https://youtu.be/n7ljgazbxzA?si=TB2IhDFz-cQT9Uxq&t=37
We recently added a ScyllaDB University topic that deals exactly with Large Partitions; see here.
Regarding manual time bucketing, this topic might help you.
If you have more specific questions, feel free to ask them here.
Thanks for the answer. More specifically, I'm interested in a time-series long-term-history use case (10 years of data, for example). A typical query is getting the data for a sensor_id within a time range, and let's say we have 10M sensors overall:
CREATE TABLE data (
    sensor_id text,
    time timestamp,
    value double,
    PRIMARY KEY (sensor_id, time, value)
) WITH CLUSTERING ORDER BY (time ASC)
  AND compaction = { 'class' : 'TimeWindowCompactionStrategy' }
  AND tombstone_gc = { 'mode' : 'immediate' };
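A typical range query against this schema would look something like the following (the sensor id and the date range are just placeholders):

-- Fetch all samples for one sensor within a time range;
-- time is the first clustering column, so the range restriction is allowed.
SELECT time, value
FROM data
WHERE sensor_id = 'sensor-42'
  AND time >= '2024-01-01'
  AND time <  '2024-02-01';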
When using TWCS, compaction is never done across windows, so by default after 1 day we'll have 1 SSTable for that day. Therefore large partitions should not be a problem for compaction, since the maximum amount of data that we're compacting is 1 day. In other words, the whole partition may hold petabytes of data (for 10 years), but within 1 SSTable we are storing the data only for 1 day, so it should not be a problem. Is that right?
I did not get the relation between large partitions and SSTables (https://youtu.be/9HBEDVswLQM?si=3mBmXjJZ3180paRC&t=383). Let's say I'm using TWCS and the data for 1 day will be compacted into one big SSTable. Then let's look at two use cases: a big number of small partitions vs. a small number of big partitions. An SSTable is not written per partition key; it contains many partition keys. So, at the end of the day, TWCS produces one SSTable which contains all partitions, and the overall size of that SSTable does not depend on the number of partition keys. In other words, if I need to write 100GB of data for 1 day, it does not matter whether I have 1 partition, 100 partitions or 1M partitions, because the whole data for that 1 day will be stored in 1 SSTable file anyway.
Yes, ScyllaDB has done a lot of work to better handle large partitions: they should not crash ScyllaDB anymore, and the performance penalty of working with large partitions was reduced.
That said, large partitions above a certain size are still problematic. What this certain size is exactly, I don't know; let's say that if you have partitions in the gigabytes, you should start thinking about bucketing. In general, if you know you have large partitions, watch out for any sign of latency degradation; that is the best way to determine when to take action.
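One practical way to watch for this, besides the monitoring dashboard you mentioned, is ScyllaDB's system.large_partitions table, which records partitions that crossed the large-partition warning threshold when their SSTables were written. A quick check from cqlsh could look like this (the exact set of columns varies between versions, so selecting everything is the safest form):

-- Lists partitions that exceeded the configured large-partition warning threshold,
-- including the owning keyspace/table, the partition key and its size.
SELECT * FROM system.large_partitions;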
As for large partitions vs. SSTables: the problem here is that ScyllaDB currently doesn't cut SSTables mid-partition. Some compaction strategies want to control the size of the SSTables they create, for example LCS or ICS (enterprise-only). These compaction strategies struggle with large partitions, because a single partition will always be in a single SSTable, so they lose control over the size of the SSTables. To my knowledge, TWCS doesn't suffer from this problem, and as you said, a partition's data is scattered among the windows, further reducing any chance of this problem popping up.
In other words, the whole partition may hold petabytes of data (for 10 years), but within 1 SSTable we are storing the data only for 1 day, so it should not be a problem. Is that right?
I hope this is just a fictitious example. This results in 365 * 10 * shard-count windows, which is too much. Keep your bucket count reasonable.
And just add a corresponding composite key so partitions don't grow indefinitely; it is an append-only use case after all, so you can compute the bucket deterministically anyway…
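A minimal sketch of what such a schema could look like, assuming a monthly bucket computed by the application from the event timestamp (the table name data_bucketed, the month column and the bucket granularity are just illustrative; pick the granularity based on the per-sensor write rate so partitions stay well below the gigabyte range):

CREATE TABLE data_bucketed (
    sensor_id text,
    month text,          -- e.g. '2024-06', derived deterministically from time by the application
    time timestamp,
    value double,
    PRIMARY KEY ((sensor_id, month), time, value)
) WITH CLUSTERING ORDER BY (time ASC, value ASC)
  AND compaction = { 'class' : 'TimeWindowCompactionStrategy' }
  AND tombstone_gc = { 'mode' : 'immediate' };

A time-range query then touches one partition per bucket in the range, e.g. WHERE sensor_id = ? AND month = '2024-06' AND time >= ... AND time < ..., issued once per month covered by the requested range.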