Data model for frequent deletes with partition key

Hello!

I'd like to ask for a suggestion on how to rethink my data model for a use case where I'm using ScyllaDB, if anyone can help me out.

I'm using ScyllaDB to store the data generated by running multiple scripts. They all run in parallel, and each one generates a different amount of data. Up until now I've been using the run_id of each script run as the partition key, which turned out to be a bad choice: since the scripts generate different amounts of data, some partitions are much bigger than others, and some nodes in my cluster have run out of disk space because of that.

What I want to know is how to deal with this, keeping in mind that:

  • at the end of each script run I want to be able to delete all of the data created by that run
  • randomly throughout the script's runtime, I want to access the entries written so far
    These use cases are problematic because, from what I've seen, I need to provide the full partition key for both SELECT and DELETE queries.
    I've been thinking of using a hash value, unique for each inserted entry, as the partition key. However, then it would be impossible to delete all of the entries created by a run without also storing all of those hash values somewhere, since DELETE needs the full partition key.

My question is: what is the best way to handle DELETEs like this, which need to target data from multiple partitions, especially when the partition key has high cardinality? Thanks :smiley:

Hey Liana!

I can't picture your full data model, but here's an idea: maybe you can use a partition key with more than one column.

CREATE TABLE scripts (
    run_id int,
    script_name text,
    // other fields
    ts timestamp,
    PRIMARY KEY ((run_id, script_name), ts)
);
From what I understood, your model puts everything for a run inside the partition keyed by run_id. It would also help if you could share a snippet of your data model (it doesn't need to be the real one, just something that shows the problem you're facing).

Cheers!

hello! thanks and ofc, here is how the table is created:

CREATE TABLE scripts (
    run_id text,
    request_id text,
    request_info frozen<request_info_type>,
    PRIMARY KEY (run_id, request_id)
);

What I need to do throughout the script run is:

SELECT request_info FROM scripts WHERE run_id = <run_id> AND request_id = <request_id>;

and at the end I need to be able to do

DELETE FROM scripts WHERE run_id = <run_id>;

The values of the request_id column are hashes that would spread the data really well as a partition key; however, with them I wouldn't be able to target the DELETE at all the data saved throughout a run.
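Just to illustrate (made-up values below): if request_id were part of the partition key, the cleanup at the end of a run would turn into one DELETE per entry, each needing a hash value I'd have to store somewhere:

-- hypothetical schema with PRIMARY KEY ((run_id, request_id))
DELETE FROM scripts WHERE run_id = 'run_1' AND request_id = 'a1b2c3';
DELETE FROM scripts WHERE run_id = 'run_1' AND request_id = 'd4e5f6';
-- ...one statement per request_id written during the run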

Doesn't your environment have something else you could merge into the partition key to keep it traceable?

  • If you’re running scripts, they probably have a name or something similar you could use.
  • Also, you can use CDC to listen to your table's changes, do whatever you want with them, and then erase that data right after (see the snippet below).
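Enabling CDC on the table is a one-line change; a minimal sketch, assuming the table keeps its current name:

ALTER TABLE scripts WITH cdc = {'enabled': true};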

But you could also add a timeuuid field as a clustering column and use run_id and request_id as your partition key, i.e. ((run_id, request_id), ts_uuid), so you would never have duplicates during the run and it would still be traceable.
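Roughly something like this, reusing your columns (ts_uuid is just a placeholder name):

CREATE TABLE scripts (
    run_id text,
    request_id text,
    ts_uuid timeuuid,
    request_info frozen<request_info_type>,
    PRIMARY KEY ((run_id, request_id), ts_uuid)
);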

Hello, as I mentioned previously, what could be merged into the partition key to really increase the cardinality and split the data better is the value of the request_id column. But that won't help me with targeting all the data of a script in the DELETE at the end.
The name of the script is what's being used as the partition key now, as I mentioned, but since scripts generate different amounts of data, that's not a good partition key: the partitions end up really different in size.
I'm not really sure how CDC would help me in this situation.
Also, I don't have problems with duplicate entries, so adding a timeuuid won't help me either; it wouldn't help with the SELECTs or the DELETEs.

One more idea:
If you do not have many scripts, you can create a table per script and use the run_id as the partition key of that table.
Once you are done with the script, drop the table.
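A minimal sketch of the idea, with a made-up per-script table name:

-- one table per script; run_id stays the partition key inside it
CREATE TABLE scripts_my_script (
    run_id text,
    request_id text,
    request_info frozen<request_info_type>,
    PRIMARY KEY (run_id, request_id)
);

-- when the script is finished, drop its table instead of issuing DELETEs
DROP TABLE scripts_my_script;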

Thank you! This sounds good, I think I might try it, but I have one more question: will having multiple tables be problematic in any way? Because I'll eventually need to run a couple of hundred scripts, and I'm not sure whether that will badly affect the DB.

You are right that the DB wasn't designed for a very large number of tables, but a couple of hundred should work.