Originally from the User Slack
@Bohdan_Smal: Hello everyone,
I have a somewhat general question. Could you please advise on the potential risks of not using a clustering key?
Currently, we have a table where the partition key is a combination of brand and client_id, and the clustering key is transaction_id. We have observed that we could achieve better performance and more even data distribution if we don’t use a clustering key. Instead, we would make the partition key the combination of transaction_id, brand, and client_id.
This way, each transaction becomes a separate partition, ensuring no imbalance in partition distribution across nodes, even if some clients have more transactions. However, we are not fully aware of the potential risks associated with having a large number of partitions.
Can anyone explain the possible risks of this approach?
Thank you!
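To illustrate the distribution concern in the question, here is a minimal Python sketch. It is a toy model, not ScyllaDB code: the cluster size `NODES`, the `node_for` helper, the MD5-modulo hashing (a stand-in for the real token-based partitioner), and the workload with one heavy client are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

NODES = 3  # hypothetical cluster size


def node_for(partition_key: tuple) -> int:
    """Map a partition key to a node by hashing it (stand-in for the real partitioner)."""
    digest = hashlib.md5(repr(partition_key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NODES


# Hypothetical workload: one heavy client with 9000 transactions,
# plus 100 light clients with 10 transactions each.
transactions = [("acme", "client-0", f"tx-{i}") for i in range(9000)]
transactions += [
    ("acme", f"client-{c}", f"tx-{c}-{i}")
    for c in range(1, 101)
    for i in range(10)
]


def rows_per_node(key_of):
    """Count how many rows land on each node for a given choice of partition key."""
    counts = Counter(node_for(key_of(t)) for t in transactions)
    return [counts.get(n, 0) for n in range(NODES)]


# Partition key (brand, client_id): all rows of a client share one partition.
by_client = rows_per_node(lambda t: (t[0], t[1]))

# Partition key (transaction_id, brand, client_id): one partition per transaction.
by_transaction = rows_per_node(lambda t: (t[2], t[0], t[1]))

print("rows per node, partitioned by client:     ", by_client)
print("rows per node, partitioned by transaction:", by_transaction)
```

In this model the heavy client's 9000 rows always land on a single node under the first key, while the second key spreads rows roughly evenly. A real cluster adds replication and per-shard distribution, which the sketch ignores.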
@Karol_Baryła: I’m not sure what the performance implications of a large number of partitions are. What comes to mind is usability. With such a schema you can’t, for example:
• Select all transactions for a given user without using ALLOW FILTERING, which makes the query much slower.
• Use LWT / Batches with LWT to atomically update several transactions for a given user.
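The first usability point can be sketched with a toy Python model of the two schemas. This is not ScyllaDB's API: the dictionaries, the sample rows, and the helper names `client_transactions_a` and `client_transactions_b` are hypothetical, and the full scan stands in loosely for what a filtering query has to do.

```python
from collections import defaultdict

# Hypothetical sample rows: (brand, client_id, transaction_id, amount)
rows = [
    ("acme", "c1", "t1", 10),
    ("acme", "c1", "t2", 25),
    ("acme", "c2", "t3", 40),
]

# Schema A: partition key (brand, client_id), clustering key transaction_id.
schema_a = defaultdict(dict)
for brand, client, tx, amount in rows:
    schema_a[(brand, client)][tx] = amount

# Schema B: partition key (transaction_id, brand, client_id), no clustering key.
schema_b = {(tx, brand, client): amount for brand, client, tx, amount in rows}


def client_transactions_a(brand, client):
    """One partition lookup; rows are returned ordered by the clustering key."""
    partition = schema_a.get((brand, client), {})
    return sorted(partition.items())


def client_transactions_b(brand, client):
    """The client alone does not identify a partition, so every partition is scanned."""
    scanned = 0
    found = []
    for (tx, b, c), amount in schema_b.items():
        scanned += 1
        if (b, c) == (brand, client):
            found.append((tx, amount))
    return sorted(found), scanned


print(client_transactions_a("acme", "c1"))
print(client_transactions_b("acme", "c1"))
```

Under Schema A the lookup touches only the one partition for that client. Under Schema B the same question requires visiting every partition in the table, so the cost grows with total table size rather than with the number of transactions the client has.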
@Bohdan_Smal: Got it, thank you for your response. Have a great day!
@avi: In fact, having more, smaller partitions is better than having fewer, larger ones. So if you don’t need a clustering key for sorting and grouping, don’t use it.
@Bohdan_Smal: thank you)