Originally from the User Slack
@Andres: I have an issue where subgroups of my data have vastly different volumes and this volume disparity isn’t something we can change.
A common query I need to do is “what is the latest state for all active ‘agents’ in a subgroup,” so I need to design my tables so that searching by just the subgroup is efficient, meaning it should be the partition key:
CREATE TABLE agents (
    subgroup text,
    agent_id uuid,
    /* snip: agent metadata columns */
    PRIMARY KEY (subgroup, agent_id)
);
However, this leads to hot/large partitions for the larger subgroups, which have orders of magnitude more agents than our smaller subgroups.
If I make agent_id the partition key, then I no longer have a hot partition (each agent writes at about the same rate), but I lose the ability to quickly query for all agents in a subgroup (queries for smaller subgroups still have to sift through the data of the larger subgroups).
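For context, a minimal sketch of that alternative layout; the table name agents_by_id is illustrative, not something from the thread:

/* Alternative layout: agent_id as the sole partition key. Writes spread
   evenly, but a per-subgroup query no longer has a partition key to route
   by and degenerates into a filtering scan. */
CREATE TABLE agents_by_id (
    subgroup text,
    agent_id uuid,
    /* snip: agent metadata columns */
    PRIMARY KEY (agent_id)
);

SELECT * FROM agents_by_id WHERE subgroup = 'some-subgroup' ALLOW FILTERING;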
I considered maybe making separate tables or keyspaces per subgroup (since I never need to search across subgroups), but it seems like having a dynamic number of tables is an anti-pattern (https://forum.scylladb.com/t/any-restriction-on-number-of-tables-in-a-keyspace/482). Do you all have any ideas or alternatives?
@Felipe_Cardeneti_Mendes: Thinking out loud: maybe hash the agent_id UUID (a 16-byte value) and map it to a corresponding bucket.
When you need to query a specific agent, you already know which bucket it lives in.
When you want to scan all agents, simply read from all buckets concurrently. Find the bucket count where ((subgroup, bucket), agent_id) gives the right balance.
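A minimal sketch of what that bucketed layout could look like, assuming 16 buckets and a client-side hash of the agent_id; the table name agents_bucketed and the bucket count are illustrative, not from the thread:

/* Bucketed layout: the bucket is computed client-side and stays fixed
   for the lifetime of the agent. */
CREATE TABLE agents_bucketed (
    subgroup text,
    bucket   int,       /* e.g. hash(agent_id) % 16, computed by the client */
    agent_id uuid,
    /* snip: agent metadata columns */
    PRIMARY KEY ((subgroup, bucket), agent_id)
);

/* Point lookup: the client recomputes the bucket from the agent_id,
   so it always knows which partition to read. */
SELECT * FROM agents_bucketed
 WHERE subgroup = 'some-subgroup' AND bucket = 7 AND agent_id = ?;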
@Andres: So keep a static set of bucket identifiers, so that if I need to query for all agents I iterate through all the buckets client-side, doing one query per bucket?
@Felipe_Cardeneti_Mendes: Yeap, this should spread the load accordingly. The problem is that you can't be too aggressive with the number of buckets, as you otherwise risk making the smaller subgroups expensive to read; but if you are too conservative, you won't resolve the existing problem.
Another option would be to apply this only to the outliers (in a corresponding table) and transition subgroups as needed.
Scanning should be fast since you query all buckets in parallel.
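To make the scan path concrete, this is roughly what the client-side fan-out looks like against the illustrative 16-bucket table above; each statement is issued concurrently and the results are merged application-side:

SELECT * FROM agents_bucketed WHERE subgroup = 'some-subgroup' AND bucket = 0;
SELECT * FROM agents_bucketed WHERE subgroup = 'some-subgroup' AND bucket = 1;
/* ... one query per bucket, all in flight at once ... */
SELECT * FROM agents_bucketed WHERE subgroup = 'some-subgroup' AND bucket = 15;

Sixteen small parallel reads replace one read against a single oversized partition.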
@Andres: It’s an interesting idea. It’s unfortunate that, unless I apply it only to outliers, even the small subgroups pay the “cost” of buckets, but it may be worth it. I’d have to think further about how painful, if at all, it would be to apply it only to outliers and then transition subgroups as needed.
I was also considering maintaining a Redis sorted set mapping each subgroup to its agent UUIDs. That way I could use the agent UUID as the partition key, because I would already know the UUIDs via the Redis cache.
@Felipe_Cardeneti_Mendes: Yeap, a reverse lookup table also works. Scanning an entire subgroup with too many agents may be slower, though: instead of scanning each bucket individually, you will now query each agent. You will need to decide what works and scales best for your use case.
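As a hedged sketch, the same reverse-lookup idea can also be kept inside ScyllaDB rather than Redis; the table name agents_by_subgroup is illustrative, and the second step assumes the agents_by_id table sketched earlier:

/* Reverse lookup: maps a subgroup to its agent ids only. Rows are tiny,
   so even large subgroups produce comparatively small partitions, though
   a very large subgroup still lands on a single partition. */
CREATE TABLE agents_by_subgroup (
    subgroup text,
    agent_id uuid,
    PRIMARY KEY (subgroup, agent_id)
);

/* Step 1: fetch the ids for a subgroup. */
SELECT agent_id FROM agents_by_subgroup WHERE subgroup = 'some-subgroup';

/* Step 2: fetch each agent's latest state from the table keyed by agent_id. */
SELECT * FROM agents_by_id WHERE agent_id = ?;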
@Andres: Since it will be a query using the IN operator instead of a query per agent, it shouldn’t be that much slower, right? I was imagining one query per few thousand agents, similar to paging.
@Felipe_Cardeneti_Mendes: > Since it will be a query using the IN operator instead of a query per agent, it shouldn’t be that much slower, right? I was imagining one query per few thousand agents, similar to paging.
It works, but it may create shard contention under some circumstances, so beware. A query (IN or not) is handled by a single coordinator shard. When you ask for many keys on potentially different shards, this coordinator needs to coordinate with the other replicas. Best to test and ensure it matches your P99 latency requirements.
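For illustration, the two query shapes being compared, again assuming the illustrative agents_by_id table:

/* One IN query: the whole list is handled by a single coordinator shard,
   which fans out to the replicas owning each key; with thousands of keys,
   that one shard can become the bottleneck. */
SELECT * FROM agents_by_id WHERE agent_id IN ?;

/* One small prepared query per agent, issued asynchronously: a shard-aware
   driver routes each request directly to the shard owning that key,
   spreading the coordination work. */
SELECT * FROM agents_by_id WHERE agent_id = ?;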