Hello, guys!
If I put 1 trillion records in one table, with each record being a separate partition, is this a good schema design? What problems would it cause?
In other words, in a large-scale cluster, do we need to control the number of partitions for a single table?
When I have a lot of small partitions, does that mean read efficiency will be poor, for example because of problems building the Bloom filters? How should I tune table-level parameters, such as the compaction strategy, for good performance?
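For reference, here is a minimal CQL sketch of the kind of schema I am asking about (the keyspace, table, and column names are just placeholders):

```cql
-- Hypothetical table: the partition key is unique per record,
-- so every one of the ~1 trillion rows ends up in its own partition.
CREATE TABLE demo_ks.events (
    id      bigint,   -- unique per record => one partition per record
    payload text,
    PRIMARY KEY (id)
);
```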
ScyllaDB loves many partitions; I would even say that many small partitions are the sweet spot for ScyllaDB. This is mostly because having many partitions means that data is well distributed among the nodes and shards in the cluster. Uneven data distribution leads to hot shards and hot nodes, which slow the cluster down.
On the other hand, having partitions that are tiny can have some side effects. You may find that not as many of them fit in the cache as you would expect, because there is a constant per-partition size overhead for partitions stored in memory. View update generation during repair, for tables that have materialized views or secondary indexes attached, can also get quite slow with many tiny partitions.
All that said, you should not see any problems with storing a huge number of small partitions in ScyllaDB. Like I said above, this is the sweet spot for ScyllaDB, and there is no limit (theoretical or hard) on how many partitions you can have.
Choosing the compaction strategy is more a question of what workloads you have than of how your data is organized.
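For example, if the table turns out to be read-heavy, you could switch it to a different strategy with a plain ALTER TABLE. This is only an illustrative sketch; the table name is the placeholder from the question above, and the right strategy class depends entirely on your workload:

```cql
-- Illustrative only: pick the class based on the workload, e.g.
-- LeveledCompactionStrategy is often considered for read-heavy tables,
-- while the default SizeTieredCompactionStrategy suits write-heavy ones.
ALTER TABLE demo_ks.events
WITH compaction = {'class': 'LeveledCompactionStrategy'};
```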
Do we have any parameters to control this per-partition overhead?
No, this is just the overhead of the C++ objects involved in storing and organizing partitions. Nothing we can do about that.