A user complained that a table with a huge number of very short partitions was surprisingly big - perhaps as much as 4 times larger than if the same data were stored in a modest number of large partitions.
Let's look at a simple example. Consider two tables, each with three integer columns `p`, `c`, and `x`:
* In table1, `p` is the partition key, `c` is the clustering key, `x` is a regular column.
* In table2, `(p,c)` is a compound partition key, `x` is a regular column.
We write a million rows to each table, with `(p, c, x) = (1, i, 1)` for one million i's.
So both tables have exactly the same data - table1 is one partition with one million rows, and table2 is one million partitions - but both have exactly the same rows.
It turns out that after compaction, the size of table1's sstables is 8.2 MB, and the size of table2 is 34 MB. table2 is more than 4 times larger than table1! To understand why, let's look at the size of the individual sstable components:
1. For table1, almost the entire 8.2 MB size is the "Data" component. The "Index" component is almost empty (just one partition).
2. For table2, the "Data" component is 12.6 MB, the "Index" component is 20 MB, and the "Filter" is 1.2 MB.
It is not surprising that table2's Data component is slightly larger than table1's (12.6 MB vs 8.2 MB) - after all, the individual partitions do have some overhead (e.g., a tombstone), and this overhead is noticeable when the partitions are so tiny (just a single integer). It's also not surprising that the Bloom filter (the "Filter" file) takes more space when we have many partitions. But what is really surprising, and really frustrating, is the size of the Index file, which is almost twice as large as the Data file - something we didn't expect.
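A quick back-of-the-envelope calculation, using only the component sizes measured above, makes the imbalance concrete: the Index costs more bytes per partition than the data itself.

```python
# Per-partition cost of each sstable component for table2, computed from
# the measurements quoted above (12.6 MB Data, 20 MB Index, 1.2 MB Filter
# for one million partitions).
data_bytes = 12.6e6
index_bytes = 20e6
filter_bytes = 1.2e6
partitions = 1_000_000

print(f"Data per partition:   {data_bytes / partitions:.1f} bytes")    # 12.6
print(f"Index per partition:  {index_bytes / partitions:.1f} bytes")   # 20.0
print(f"Filter per partition: {filter_bytes / partitions:.1f} bytes")  # 1.2
```

So each tiny partition pays roughly 20 bytes of Index overhead for about 12.6 bytes of actual data.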
The idea I want to propose in this issue is that when partitions are very short, it would be better not to have an Index file at all. The Summary file could point directly into the Data file instead of the Index file.
I don't know what the threshold for dropping the Index file should be - for larger partitions, it may still be useful. Maybe we can write the Data and Index files as we do today and, after the fact, if we notice that the Index is larger than the Data (or even larger than half of the Data), delete the Index component and rewrite the Summary.
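The after-the-fact check could look something like the sketch below. This is purely illustrative - `SSTableComponents` and `should_drop_index` are made-up names, and the 0.5 threshold is just the "half of Data" guess from above, not a tuned value.

```python
# Hypothetical sketch of the proposed heuristic: after an sstable is
# written the usual way, decide whether its Index component is worth
# keeping. All names here are invented for illustration.
from dataclasses import dataclass

@dataclass
class SSTableComponents:
    data_size: int   # size of the "Data" component, in bytes
    index_size: int  # size of the "Index" component, in bytes

def should_drop_index(sst: SSTableComponents, threshold: float = 0.5) -> bool:
    """True if the Index file costs more than `threshold` of the Data file."""
    return sst.index_size > sst.data_size * threshold

# The sizes measured for table2 above: the Index would be dropped.
table2 = SSTableComponents(data_size=12_600_000, index_size=20_000_000)
print(should_drop_index(table2))  # True
```

After dropping the Index the Summary would need to be rewritten to point directly into Data, as proposed above.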
Instead of dropping the Index file, an even more efficient thing to do could be to make the Index a level between Data and Summary - in other words, the Index would hold a sample of Data's keys: not every partition as today, but not as sparse a sample as the Summary - something in the middle. But this would require more work to implement.
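To illustrate the "middle level" idea, here is a toy sketch: keep every Nth partition key with its position in Data, and have a lookup scan forward in Data from the nearest sampled key. The function names and the sampling interval are illustrative, not a proposed implementation.

```python
# Toy model of an intermediate-density index: every Nth partition key is
# kept with its byte position in the Data file. A lookup binary-searches
# the sample, then would scan forward in Data from the returned position.
import bisect

def build_sampled_index(partition_keys, positions, every_nth=16):
    """Keep (key, Data-file position) for every Nth partition."""
    return [(partition_keys[i], positions[i])
            for i in range(0, len(partition_keys), every_nth)]

def lookup_start(sampled, key):
    """Position in Data from which to start scanning for `key`."""
    keys = [k for k, _ in sampled]
    i = bisect.bisect_right(keys, key) - 1
    return sampled[max(i, 0)][1]

# 1000 toy partitions, pretending each occupies 20 bytes in Data.
keys = list(range(1000))
positions = [k * 20 for k in keys]
sampled = build_sampled_index(keys, positions, every_nth=16)
print(len(sampled))               # 63 entries instead of 1000
print(lookup_start(sampled, 40))  # 640: position of sampled key 32
```

The index shrinks by the sampling factor, at the cost of scanning at most N-1 extra partitions in Data per lookup.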
This use case, of very short partitions, may seem artificial, but a real user encountered it with a materialized view - the user had a base table with reasonably-long partitions, but then had a view with a compound partition key, where each row of data was in its own partition - and each partition was also very short. The user didn't even realize that this very-short-partitions case was happening, but was surprised that the view was 4 times larger than the base table.
Code for the tests described above:
```python
import time

import pytest

# Helpers from the test framework: unique_name() generates a unique table
# name, and the nodetool module wraps "nodetool flush" / "nodetool compact".
from util import unique_name
import nodetool

@pytest.fixture(scope="function")
def table1(cql, test_keyspace):
    t = f'{test_keyspace}.{unique_name()}'
    cql.execute(f'CREATE TABLE {t}(p int, c int, x int, PRIMARY KEY (p, c))')
    yield t
    cql.execute(f'DROP TABLE {t}')

@pytest.fixture(scope="function")
def table2(cql, test_keyspace):
    t = f'{test_keyspace}.{unique_name()}'
    cql.execute(f'CREATE TABLE {t}(p int, c int, x int, PRIMARY KEY ((p, c)))')
    yield t
    cql.execute(f'DROP TABLE {t}')

# Table with a single 1-million-row partition; each row has, in addition
# to the key, a single int.
def test_single_partition(cql, table1):
    table = table1
    stmt = cql.prepare(f"INSERT INTO {table} (p, c, x) VALUES (1, ?, 1)")
    for i in range(1000000):
        cql.execute(stmt, [i])
        if i % 100000 == 0:
            print(i)
    nodetool.flush(cql, table)
    nodetool.compact(cql, table)
    nodetool.flush(cql, table)
    nodetool.compact(cql, table)
    print('going to sleep\n')
    time.sleep(10000)

# Table with one million single-row partitions (no clustering key);
# each row additionally has an int value.
def test_million_partitions(cql, table2):
    table = table2
    stmt = cql.prepare(f"INSERT INTO {table} (p, c, x) VALUES (1, ?, 1)")
    for i in range(1000000):
        cql.execute(stmt, [i])
        if i % 100000 == 0:
            print(i)
    nodetool.flush(cql, table)
    nodetool.compact(cql, table)
    nodetool.flush(cql, table)
    nodetool.compact(cql, table)
    print('going to sleep\n')
    time.sleep(10000)
```