Originally from the User Slack
@golang_learner: Hi,
We are trying to delete a big partition by giving just the partition key, but no rows get deleted because Scylla restarts itself in our test environment. I tried to delete a partition with around 1 million rows. We have a much more powerful cluster in our production environment, but we want to find the best practice, and we came up with two solutions I'd like to discuss with you:
1- Delete the partition chunk by chunk (the clustering key is a single uuid column, so we can split the uuid space into multiple ranges).
2- Split the single table into many tables that each contain just one partition, and when we want to delete a partition, drop the corresponding table instead.
Solution 1 seems okay, but we found solution 2 much more convenient to implement. However, it would mean creating many tables, and we think that could be overhead for Scylla (we are talking about more than 1000 tables).
Which solution do you recommend? How do you delete big partitions in Scylla?
Thanks a lot.
@Felipe_Cardeneti_Mendes: Well, neither 1 nor 2. Scylla shouldn't restart itself; is it running out of memory? The best approach is to correct the restart situation, then simply delete the partition and run a major compaction to get the data away.
@Jesper_Lundgren: With Cassandra I've had to update tombstone_threshold when doing partition deletes, because with range deletes the number of tombstones stays low compared to the regular data. Deletes are not instant: a delete is a tombstone write, and the data is eventually removed in the background. Reads won't return deleted data, but reclaiming disk space takes time.
And it also depends on the compaction strategy.
@golang_learner: @Felipe_Cardeneti_Mendes Hi, actually it does not restart itself, but it gives me a NoHostAvailable error when I delete the big partition with 1 million rows.
@Jesper_Lundgren Thanks, Jesper. We are using Time Window Compaction Strategy (TWCS) for this table.
We do not care about reclaiming disk space for now. We just want to safely delete a big partition in a table that could reach 100 million rows.
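For reference, the tombstone_threshold mentioned above is a per-table compaction subproperty. A minimal sketch of changing it, assuming the members table and the TWCS strategy from this thread (the value is only an example, and subproperty support can vary by compaction strategy and version):
```cql
-- Lower the tombstone-to-data ratio at which an SSTable becomes a candidate
-- for tombstone compaction (the common default is 0.2; 0.1 is just an example).
-- The strategy class is re-specified because ALTER replaces the compaction options.
ALTER TABLE members
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'tombstone_threshold': '0.1'
    };
```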
@Felipe_Cardeneti_Mendes: Yeah this shouldn’t happen
@golang_learner: Then what should I do? Does the delete operation trigger compaction or something else that makes Scylla use a lot of memory, causing the NoHostAvailable error? I checked that there are 2 GB left in the container Scylla is running in. Can I say that if I allocate more memory, such as 8 GB, this NoHostAvailable error should not be thrown?
@Felipe_Cardeneti_Mendes: Nope, a delete is just the write of a tombstone marker. Compactions are a later process. Is it some kind of LWT statement? Do you have an index/MV involved? Those two can involve a read-before-write, which could cause a bad_alloc. You probably want to trace it to understand what's going on.
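In cqlsh, a single statement can be traced by toggling session tracing around it (the statement below is only a placeholder for the delete being investigated):
```cql
-- cqlsh session tracing: the trace of every statement run while tracing is ON
-- is printed after its result.
TRACING ON;
DELETE FROM members WHERE country = 'UK' AND city = 'london';  -- placeholder statement
TRACING OFF;
```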
@golang_learner: Our delete statement is like: DELETE FROM members WHERE country = 'UK' AND city = 'london';
I reviewed the logs from when we got the error; they started with:
[shard 1] seastar_memory - oversized allocation: 2052096 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues.
and ended with:
[shard 1] storage_proxy - exception during mutation write to 10.18.64.146: std::bad_alloc (std::bad_alloc)
We plan to delete the partition chunk by chunk so that each statement targets a smaller set of rows.
The queries would be:
DELETE FROM members WHERE country = 'UK' AND city = 'london' AND member_id >= 00000000-0000-0000-0000-000000000000 AND member_id <= 3fffffff-ffff-ffff-ffff-ffffffffffff;
DELETE FROM members WHERE country = 'UK' AND city = 'london' AND member_id >= 40000000-0000-0000-0000-000000000000 AND member_id <= 7fffffff-ffff-ffff-ffff-ffffffffffff;
…
This way we got no errors, but would you recommend it? We are not sure whether it is best practice.
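Taken together, the partition delete and the range deletes above imply a data model along these lines. This is only a hypothetical reconstruction; the column types and table options are assumptions based on the thread:
```cql
-- Assumed schema: (country, city) as the partition key, a single uuid
-- clustering column, and TWCS as mentioned earlier in the thread.
CREATE TABLE members (
    country   text,
    city      text,
    member_id uuid,
    -- ... other event columns ...
    PRIMARY KEY ((country, city), member_id)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy'};
```
With such a layout, DELETE ... WHERE country = ? AND city = ? removes a whole partition with a single partition tombstone, while adding member_id bounds turns it into a range delete inside that partition.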
@Felipe_Cardeneti_Mendes: Yeap, bad_alloc is basically an out-of-memory situation. The question is why… Is this table backed by a view or an index? A range delete working, as opposed to simply a partition delete, seems to point to a read-before-write, which would explain why you would see bad_alloc's and a 2M allocation…
The recommendation would be to resolve the OOM situation and just run a normal delete. Adding more memory helps; if that still doesn't address the problem, then maybe open an issue so we can understand it better.
In any case, it is not like range deletes will crash your cluster or anything, so if you are comfortable with them, do it and run a major compaction after. But be warned this isn't a standard approach; you are just working around a problem.
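Whether a table is backed by a view or an index can be checked from the schema tables; a minimal sketch, with a placeholder keyspace name:
```cql
-- Materialized views in the keyspace and the base table each one is built on.
SELECT view_name, base_table_name
FROM system_schema.views
WHERE keyspace_name = 'my_keyspace';

-- Secondary indexes in the same keyspace.
SELECT table_name, index_name
FROM system_schema.indexes
WHERE keyspace_name = 'my_keyspace';
```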
@golang_learner: Yes, we defined a materialized view for our main table.
Is there any Scylla config we can increase so that we do not get this bad_alloc error?
Our Scylla version is 4.5.3 open source.
@Felipe_Cardeneti_Mendes: Not really. One thing to try, if you can, would be to drop the view, run the delete, and recreate it. Or add more memory. Or do your workaround.
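A minimal sketch of the drop-and-recreate approach, assuming a hypothetical view named members_by_member_id; the actual view name and definition must match your own:
```cql
-- Drop the view so the partition delete does not generate view updates.
DROP MATERIALIZED VIEW IF EXISTS members_by_member_id;

-- Delete the large partition on the base table.
DELETE FROM members WHERE country = 'UK' AND city = 'london';

-- Recreate the view; this definition is only an example.
CREATE MATERIALIZED VIEW members_by_member_id AS
    SELECT * FROM members
    WHERE member_id IS NOT NULL AND country IS NOT NULL AND city IS NOT NULL
    PRIMARY KEY (member_id, country, city);
```
Keep in mind that recreating the view triggers a full view build from the base table, which can take a long time on a large table.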
@golang_learner: We had deployed ScyllaDB with "--smp 2 --memory 2G --overprovisioned 1 --developer-mode 1" in our test environment. When I re-deployed it with "--memory 14G", the "seastar_memory - oversized allocation: 2052096 bytes" warning was still logged, but there was no bad_alloc error.
Can we say that the "seastar_memory - oversized allocation: 2052096 bytes" warning is normal when deleting a partition with 1 million rows? Since the deletion succeeded and there was no bad_alloc error, I think we can say that, even though we saw the warning.
@Felipe_Cardeneti_Mendes: Yeap. Given that you have a view, a delete on the base table needs to propagate the deletes to the underlying view (which involves a read-before-write), and given that you have a large partition, a large allocation is expected to some extent.
See? Much easier than range deletes. That's how it should be.
@golang_learner: Our production environment is much more powerful than our test environment, and we do not set any memory limit, so ScyllaDB uses all the available memory. We may not expect any bad_alloc errors, but we are writing a lot of events, around 5 million per minute. Do the writes consume a lot of memory?
@Felipe_Cardeneti_Mendes: No, writes are cheap. But you're right that a delete is effectively a write, so a delete paired with a view can show some large allocations; still, it shouldn't get to the point of a bad_alloc if you have sufficient memory.
@golang_learner: Okay. We will delete the whole partition with sufficient memory. We will be ready for timeout exceptions (deleting a partition may take more than 30 seconds) with retries in our Kafka consumers.
Thanks a lot. I appreciate it.
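As a side note on the timeout concern above: recent ScyllaDB versions also support a per-statement timeout as a CQL extension, which can be raised for an unusually long delete instead of relying only on client-side retries. A sketch, assuming the extension is available in your version (the 2-minute value is just an example):
```cql
-- Scylla CQL extension: raise the server-side timeout for this statement only.
DELETE FROM members USING TIMEOUT 120s
WHERE country = 'UK' AND city = 'london';
```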