I’d like some suggestions on how to rethink my data model for a use case where I’m using ScyllaDB, if anyone can help me out.
I’m using ScyllaDB to store the data generated by running multiple scripts. They all run in parallel and generate different amounts of data. Up until now I’ve been using the run_id of each script run as the partition key, which turned out to be a bad choice: since the runs generate different amounts of data, some partitions are much bigger than others, and some nodes in my cluster have run out of disk space because of that.
What I want to know is how to deal with this, keeping in mind that:
at the end of each script I want to be able to delete all of the data created by the run
at random points throughout the script’s runtime, I want to be able to access the entries that have been written
These use cases are problematic because, from what I’ve seen, I need to be able to provide the full partition key for both SELECT and DELETE queries.
I’ve been thinking of using a hash value, unique for each inserted entry, as the partition key. However, at the end of the run it would be impossible to delete all of the entries created without storing all of these hash values somewhere, since DELETE needs the full partition key value.
My question is: what is the best way to handle DELETEs like this that need to target data across multiple partitions, especially when the partition key has high cardinality? Thanks
From what I understand of your modeling, you’re putting everything inside the partition keyed by run_id. It would also be good if you could share a snippet of your data model (it doesn’t need to be the real one, just something that shows the problem you’re facing).
Throughout the run I need to be able to do
SELECT request_info FROM scripts WHERE run_id = <run_id> AND request_id = <request_id>;
and at the end I need to be able to do
DELETE FROM scripts WHERE run_id = <run_id>;
The values in the request_id column are hashes that would be really good at spreading the data if used as the partition key; however, with request_id as the partition key I won’t be able to correctly target the DELETE at all of the data saved throughout the run.
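For reference, the table looks roughly like this (simplified; the types here are just placeholders):

CREATE TABLE scripts (
    run_id text,
    request_id text,
    request_info text,
    PRIMARY KEY (run_id, request_id)  -- run_id is the partition key, request_id the clustering key
);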
Doesn’t your developer environment have something that could be merged into the partition key to keep it traceable?
If you’re running scripts, they probably have a name or something similar.
Also, you can use CDC to listen to your table’s changes, do whatever you need with them, and then erase that data right after.
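For example, something like this enables CDC on the table (assuming it’s called scripts):

-- enable CDC; the changes then become readable from the auto-created
-- log table scripts_scylla_cdc_log
ALTER TABLE scripts WITH cdc = {'enabled': true};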
But you can also add a timeuuid field as a clustering column and use run_id and request_id together as your partition key, i.e. ((run_id, request_id), ts_uuid), so you will never have duplicates during the run and the data stays traceable.
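A sketch of that schema (types are just examples):

CREATE TABLE scripts (
    run_id text,
    request_id text,
    ts_uuid timeuuid,
    request_info text,
    -- composite partition key spreads the data; ts_uuid orders entries within each partition
    PRIMARY KEY ((run_id, request_id), ts_uuid)
);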
Hello, as I mentioned previously, what could be merged into the partition key to really increase the cardinality and split the data better are the values from the request_id column. But that won’t really help me with targeting all of a script’s data in the DELETE at the end.
The script name is what’s being used as the partition key now, as I mentioned, but since the scripts generate different amounts of data, that’s not a good partition key: the partitions end up being very different in size.
I’m not really sure how CDC might help me in this situation.
Also, I don’t have problems with duplicated entries, so adding a timeuuid won’t help me either. It won’t help with either the SELECTs or the DELETEs.
One more idea:
If you do not have many scripts, you can create a table per script and use run_id as the partition key of that table.
Once you are done with the script, drop the table.
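A rough sketch of that approach, with script_foo as a placeholder name:

-- one table per script, partitioned by run_id
CREATE TABLE IF NOT EXISTS script_foo (
    run_id text,
    request_id text,
    request_info text,
    PRIMARY KEY (run_id, request_id)
);

-- when the script is done, this removes all of its data at once
DROP TABLE IF EXISTS script_foo;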
Thank you! This sounds good and I think I might try it, but I have one more question: will having multiple tables be problematic in any way? I’ll eventually need to run a couple of hundred scripts, and I’m not sure whether that will negatively affect the database.