Recommendations for partitioning imbalanced data

Hi all,

My team is considering ScyllaDB (the cloud version) to reduce cost and improve efficiency when querying a couple of our larger datasets. Since we know in advance exactly the kinds of queries we want to make, Scylla seems like a great option, but I am having trouble coming up with a schema that won’t lead to hot partitions.

Context:

We deploy a lot of agents that report their health status and other data about where they are installed. One of our customers has hundreds of thousands of agents running, others have tens of thousands, and many have just a few hundred. The most common query we’ll want to make is “information about all running agents’ health for customer X”. The word “running” there is highly subjective; my plan was to adjust the TTL of the tables we create to distinguish agents that are running from those that haven’t checked in for a while. I don’t foresee a scenario where we would ever want to query across different customers (at least at the ScyllaDB level).
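To make the TTL idea concrete, here is a minimal sketch of a heartbeat write against the kind of table discussed below; the 24-hour window, the ‘customer-x’ value, and the uuid are illustrative placeholders:

-- Each heartbeat rewrites the row with a TTL, so agents that stop
-- checking in simply expire out of the table. The uuid stands in
-- for the reporting agent's id.
INSERT INTO agents (customer_name, agent_id)
VALUES ('customer-x', 123e4567-e89b-12d3-a456-426614174000)
USING TTL 86400;

Alternatively, setting WITH default_time_to_live = 86400 on the table applies the TTL to every write without passing USING TTL each time.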

So a naive schema might look something like:

CREATE TABLE agents (
  customer_name text,
  agent_id uuid,
  /* snip agent metadata */,
  PRIMARY KEY (customer_name, agent_id)
);

This lets us read a single partition to fetch all agents for a customer, but the customer with hundreds of thousands of agents becomes one large, hot partition.
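For reference, the read here is a single-partition query (‘customer-x’ is a placeholder):

SELECT * FROM agents WHERE customer_name = 'customer-x';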

Another naive schema might be:

CREATE TABLE agents (
  customer_name text,
  agent_id uuid,
  /* snip agent metadata */,
  PRIMARY KEY ((customer_name, agent_id))
);

Now the partition key is customer + agent id, so we no longer have large, hot partitions, but we lose the ability to efficiently query all agents for a customer, since we need the agent_id to reach a partition.
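Concretely, fetching all agents for a customer under this schema degenerates into a scan over the whole table, something like:

-- Filtering on only part of the partition key forces a full scan:
SELECT * FROM agents WHERE customer_name = 'customer-x' ALLOW FILTERING;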

I’ve also considered making separate tables or separate keyspaces per customer, but as far as I can tell a dynamic number of keyspaces/tables is an anti-pattern, because there is a non-negligible cost to maintaining each table. (I am not allowed to post links, but I’ve seen other discussions about it in this forum as well as in a Cassandra blog.)

Do you all have any advice or suggestions on how best to design a schema for use cases like this, where one subset of the “same” data is significantly larger than the rest?

I wanted to recommend using the second schema (PRIMARY KEY ((customer_name, agent_id))) and creating a materialized view on it, but then I realized the MV would have the same hot-partition problem as your first table proposal. At that point it is better to just go with the first proposal, which is simpler.
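For completeness, the MV I had in mind would have looked roughly like this (the view name is made up), and it makes the problem visible: the view is partitioned by customer_name alone, recreating the single giant partition:

CREATE MATERIALIZED VIEW agents_by_customer AS
  SELECT * FROM agents
  WHERE customer_name IS NOT NULL AND agent_id IS NOT NULL
  PRIMARY KEY (customer_name, agent_id);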

Maybe you can tweak it slightly to deal with the hot partition problem:

CREATE TABLE agents (
  customer_name text,
  agent_group int,
  agent_id uuid,
  /* snip agent metadata */,
  PRIMARY KEY ((customer_name, agent_group), agent_id)
);

Where agent_group is derived from agent_id. E.g., if agent_id were a monotonically increasing counter, you could make agent_group = agent_id % 1000, dispersing each customer’s agents across 1000 different partitions; since your agent_id is a uuid, you would hash it instead and take the hash modulo the bucket count. There will still be some partitions that are larger and hotter than others, but the difference will be far less drastic.
You can think of a more sophisticated method to map agent_id to agent_group; this was just a simple example.
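To sketch the read path under that schema (a bucket count of 8 here purely for legibility; the application computes agent_group from agent_id on every write):

-- Fetch all of a customer's agents by fanning out over the buckets,
-- either one query per bucket in parallel from the application, or
-- with IN on the partition key:
SELECT * FROM agents
WHERE customer_name = 'customer-x'
  AND agent_group IN (0, 1, 2, 3, 4, 5, 6, 7);

Parallel per-bucket queries usually spread the load better than a single IN query, which funnels everything through one coordinator.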
