I need to bucket my data so it fits in partitions. I'm using time-based bucketing, and the current model is as follows: I count runs of consecutive unused buckets. For example, bucket 10 with counter 100 means buckets 10-110 are unused; bucket 3 with counter 0 means only bucket 3 is unused.
So in the DB it's stored as: bucket 10 counter 10, bucket 22 counter 1, bucket 50 counter 40, bucket 92 counter 8.
Thus, the used buckets are 0-9, 21, 24-49, and 91.
But since I'm paginating, I only need 5 buckets at a time. This model saves storage, but I can't fetch 5 used buckets from the current bucket offset with a DB query alone; I'd need to write business logic for that, and I suspect that will impact RAM.
Is there a better way to track unused buckets such that I can also retrieve the used buckets to query against?
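To make that concrete, here's a minimal sketch (in Python) of the business logic I mean, using the example data above. `next_used_buckets` and the in-memory run list are illustrative names of mine, not from any real schema, and this assumes the per-key run table is small enough to load into memory:

```python
def next_used_buckets(unused_runs, start, count):
    """Return the next `count` used bucket ids >= `start`.

    unused_runs: list of (bucket, counter) pairs sorted by bucket,
    where (b, c) marks buckets b..b+c inclusive as unused.
    """
    result = []
    bucket = start
    i = 0
    # Skip runs that end entirely before our starting offset.
    while i < len(unused_runs) and unused_runs[i][0] + unused_runs[i][1] < start:
        i += 1
    while len(result) < count:
        if i < len(unused_runs):
            b, c = unused_runs[i]
            if b <= bucket <= b + c:
                # Inside an unused run: jump past the whole run.
                bucket = b + c + 1
                i += 1
                continue
        result.append(bucket)
        bucket += 1
    return result

# Example data from above: unused runs (10,10), (22,1), (50,40), (92,8).
runs = [(10, 10), (22, 1), (50, 40), (92, 8)]
print(next_used_buckets(runs, 0, 5))   # -> [0, 1, 2, 3, 4]
print(next_used_buckets(runs, 8, 5))   # -> [8, 9, 21, 24, 25]
```

Since it only walks the run list once per page, memory cost is the run table itself, not the number of buckets.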
I am probably missing some details, but here goes a long shot.
IIUC you are using an auxiliary table for that. Wouldn't simply clustering by used buckets work instead? That way, you just read whatever comes back as part of a scan.
There are pros and cons to that. I'm using time-based bucketing, so the simplest approach would be to store each bucket once it's used, every week. But for high-traffic clusters, which are always seeing inserts, that would be a waste of storage.
We track unused buckets because we expect the volume of inserts per partition key to be greater than 0 for the majority of clients. Only a few people will leave buckets unused, by creating a stream, forgetting about it, and never inserting anything.
So tracking unused buckets only saves some storage.
Hope that makes sense?
Yeah, I think I get it now. You’ll definitely need to make some choices there.
Ideally, you could perhaps never bucket these outliers if you had the option to. I.e., if it's users running a free trial, chances are they will forget about it, while regular users will benefit from bucketing.
Feel free to share the schema and describe the use case in question if you feel you need more assistance.
For example, take Discord's time-based partitioning. They track empty buckets per channel so they know which buckets are worth querying for channel messages, instead of iterating over every bucket.
More specifically, this:
We noticed Cassandra was running 10 second “stop-the-world” GC constantly but we had no idea why. We started digging and found a Discord channel that was taking 20 seconds to load. The Puzzles & Dragons Subreddit public Discord server was the culprit. Since it was public we joined it to take a look. To our surprise, the channel had only 1 message in it. It was at that moment that it became obvious they deleted millions of messages using our API, leaving only 1 message in the channel.
If you have been paying attention you might remember how Cassandra handles deletes using tombstones (mentioned in Eventual Consistency). When a user loaded this channel, even though there was only 1 message, Cassandra had to effectively scan millions of message tombstones (generating garbage faster than the JVM could collect it).
We solved this by doing the following:
- We lowered the lifespan of tombstones from 10 days down to 2 days because we run Cassandra repairs (an anti-entropy process) every night on our message cluster.
- We changed our query code to track empty buckets and avoid them in the future for a channel. This meant that if a user caused this query again then at worst Cassandra would be scanning only in the most recent bucket.
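That second fix (tracking empty buckets and skipping them on later queries) can be sketched roughly like this. This is a hypothetical illustration in Python, not Discord's actual code; `EmptyBucketCache` and all other names here are invented:

```python
class EmptyBucketCache:
    """Remembers which buckets were observed empty, per channel,
    so repeat queries never re-scan their tombstones."""

    def __init__(self):
        self._empty = {}  # channel_id -> set of bucket ids known to be empty

    def mark_empty(self, channel_id, bucket):
        self._empty.setdefault(channel_id, set()).add(bucket)

    def is_empty(self, channel_id, bucket):
        return bucket in self._empty.get(channel_id, set())


def buckets_to_scan(cache, channel_id, candidate_buckets):
    """Filter out buckets previously observed to be empty."""
    return [b for b in candidate_buckets if not cache.is_empty(channel_id, b)]


cache = EmptyBucketCache()
cache.mark_empty("pad-subreddit", 3)
cache.mark_empty("pad-subreddit", 4)
print(buckets_to_scan(cache, "pad-subreddit", [1, 2, 3, 4, 5]))  # -> [1, 2, 5]
```

At worst, a query against a heavily-deleted channel scans only buckets it hasn't seen empty yet, which matches the "only the most recent bucket" behaviour described above.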
I decided to follow in their footsteps and do the same. I tried count-based bucketing, but that requires me to set up Kafka and run weekly background jobs checking every channel (in Discord's terminology) in existence for its current count. So, not as low-maintenance as I needed it to be. That's why I was looking for a method that lets me query efficiently without much storage waste, with a decent bucketing strategy that's low maintenance.