Hi all,
My team is considering ScyllaDB (the cloud version) to improve cost and efficiency for reads and queries on a couple of our larger datasets. Since we know in advance exactly what kinds of queries we want to make, Scylla seems like a great option, but I am having a bit of trouble coming up with a schema that won’t lead to hot partitions.
Context:
We deploy a lot of agents that report their health status and other data about where they are installed. One of our customers has hundreds of thousands of agents running, others have tens of thousands, and many have just a few hundred. The most common query we’ll want is “the health of all running agents for customer X”. The word “running” there is somewhat subjective; my plan was to tune the TTL on the tables we create to decide what counts as running vs. what hasn’t checked in for a while. I don’t foresee a scenario where we would ever want to query across different customers (at least at the ScyllaDB level).
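Concretely, what I had in mind for the TTL part is something like this, applied to the tables sketched below (just a sketch; the 600 seconds and the status column are placeholders):
ALTER TABLE agents WITH default_time_to_live = 600;
-- each check-in is an upsert, which rewrites the row with a fresh TTL:
INSERT INTO agents (customer_name, agent_id, status)
VALUES ('customer_x', 123e4567-e89b-12d3-a456-426614174000, 'healthy');
-- an agent that stops reporting expires after 600s and no longer counts as “running”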
So a naive schema might look something like:
CREATE TABLE agents (
    customer_name text,
    agent_id uuid,
    /* snip agent metadata */
    PRIMARY KEY (customer_name, agent_id)
);
This means we can easily get into a single partition to fetch all agents for a customer, but it also creates a hot, oversized partition for the customer with hundreds of thousands of agents.
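For concreteness, the query we care about under this schema is a single-partition read (sketch, with a hypothetical customer name):
SELECT * FROM agents WHERE customer_name = 'customer_x';
-- one partition, but for our biggest customer it holds hundreds of thousands of rows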
Another naive schema might be:
CREATE TABLE agents (
    customer_name text,
    agent_id uuid,
    /* snip agent metadata */
    PRIMARY KEY ((customer_name, agent_id))
);
Now the partition key is customer + agent id, so we no longer have hot or oversized partitions, but we’ve lost the ability to efficiently query all agents for a customer, since we need the agent_id to get into a partition.
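To spell out what we lose: with the composite partition key, only the point lookup is efficient, and the per-customer listing degenerates into a full scan (again a sketch with hypothetical values):
-- efficient: full partition key is specified
SELECT * FROM agents
WHERE customer_name = 'customer_x'
  AND agent_id = 123e4567-e89b-12d3-a456-426614174000;
-- inefficient: partial partition key forces a scan of every partition
SELECT * FROM agents
WHERE customer_name = 'customer_x'
ALLOW FILTERING;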
I’ve also considered separate tables or separate keyspaces per customer, but as far as I can tell, a dynamic number of keyspaces/tables is an anti-pattern because there is a non-negligible cost to maintaining each table (I am not allowed to post links, but I’ve seen this discussed elsewhere in this forum as well as on a Cassandra blog).
Do you all have any advice or suggestions on how best to design a schema for use cases like this, where one subset of the “same” data is significantly larger than the rest?