Manual bucketing and large partitions problem

Installation details
ScyllaDB version: 6.2.1

Question: I can still find lots of information about large partition problems, and the ScyllaDB Monitoring dashboard also reports them. Are large partitions still a problem in ScyllaDB 6.2.1, and should I still do manual time bucketing? It seems your team made optimizations in that regard: https://youtu.be/n7ljgazbxzA?si=TB2IhDFz-cQT9Uxq&t=37

We recently added a ScyllaDB University topic that deals exactly with Large Partitions, see here.
Regarding manual time bucketing, this topic might help you.
If you have more specific questions, feel free to ask them here.


Thanks for the answer. More specifically, I'm interested in a long-term time series history use case (10 years of data, for example). A typical query fetches the data for a sensor_id within a time range, and let's say we have 10M sensors overall:

CREATE TABLE data
  (
    sensor_id    text,
    time         timestamp,
    value        double,
    PRIMARY KEY (sensor_id, time)
  ) WITH CLUSTERING ORDER BY (time ASC)
    AND compaction = { 'class' : 'TimeWindowCompactionStrategy' }
    AND tombstone_gc = { 'mode' : 'immediate' };

I found lots of articles about “time bucketing”, e.g.:
https://www.scylladb.com/glossary/cassandra-time-series-data-modeling/
https://www.scylladb.com/2019/08/20/best-practices-for-data-modeling/
KairosDB also does time bucketing (the row width argument)
I also watched the video on ScyllaDB University and have the following questions:

  1. With TWCS, compaction is never done across windows. So, by default, after one day we’ll have one SSTable for that day. Therefore large partitions should not be a problem for compaction, since the maximum amount of data we compact at once is one day’s worth. In other words, the whole partition may hold petabytes of data (for 10 years), but within one SSTable we store only one day of data, so it should not be a problem. Is that right?
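For context, this is how I understand the window would be made explicit (compaction_window_unit and compaction_window_size are the standard TWCS options; one day here is just the default I’m assuming):

    ALTER TABLE data WITH compaction = {
        'class' : 'TimeWindowCompactionStrategy',
        'compaction_window_unit' : 'DAYS',
        'compaction_window_size' : 1
    };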

  2. I did not get the relation between large partitions and SSTables (https://youtu.be/9HBEDVswLQM?si=3mBmXjJZ3180paRC&t=383). Let’s say I’m using TWCS and the data for one day is compacted into one big SSTable. Then consider two use cases: a big number of small partitions vs. a small number of big partitions. An SSTable is not written per partition key; it contains many partition keys. So at the end of the day TWCS produces one SSTable that contains all the partitions, and the overall size of that SSTable does not depend on the number of partition keys. In other words, if I need to write 100GB of data in one day, it does not matter whether I have 1 partition, 100 partitions, or 1M partitions, because the whole day’s data will be stored in one SSTable file anyway.

  3. Also, since your team made optimizations for large partitions (https://youtu.be/n7ljgazbxzA?si=TB2IhDFz-cQT9Uxq&t=37), I’m wondering whether I still need to worry about manual bucketing.
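For reference, by “manual bucketing” I mean something like the following sketch, where a month bucket (the column name and the monthly granularity are just an illustration) becomes part of the partition key, so one partition holds at most one month of data per sensor:

    CREATE TABLE data_bucketed
      (
        sensor_id    text,
        month        text,       -- e.g. '2025-01', computed by the application
        time         timestamp,
        value        double,
        PRIMARY KEY ((sensor_id, month), time)
      ) WITH CLUSTERING ORDER BY (time ASC)
        AND compaction = { 'class' : 'TimeWindowCompactionStrategy' };

A time-range query then has to be issued once per bucket that overlaps the requested range.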