A user complained that a table with a huge number of very short partitions was surprisingly big - perhaps as much as 4 times larger than if the same data were stored in a modest number of large partitions.
Let's look at a simple example. Consider two tables, each with three integer columns `p`, `c`, and `x`:
* In table1, `p` is the partition key, `c` is the clustering key, `x` is a regular column.
* In table2, `(p,c)` is a compound partition key, `x` is a regular column.
We write a million rows to each table, with `(p, c, x) = (1, i, 1)` for one million i's.
So both tables have exactly the same data - table1 is one partition with one million rows, and table2 is one million partitions - but both have exactly the same rows.
It turns out that after compaction, the size of table1's sstables is 8.2 MB, and the size of table2 is 34 MB. table2 is more than 4 times larger than table1! To understand why, let's look at the size of the individual sstable components:
1. For table1, almost the entire 8.2 MB size is the "Data" component. The "Index" component is almost empty (just one partition).
2. For table2, the "Data" component is 12.6 MB, the "Index" component is 20 MB, and the "Filter" is 1.2 MB.
It is not surprising that table2's Data component is slightly larger than table1's (12.6 MB vs 8.2 MB) - after all, the individual partitions do have some overhead (e.g., a tombstone), and this overhead is noticeable when the partitions are so tiny (just a single integer). It's also not surprising that the Bloom filter (the "Filter" file) takes more space when we have many partitions. But what is really surprising, and really frustrating, is the size of the Index file, which is almost twice as large as the Data file - something we didn't expect.
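A quick back-of-the-envelope calculation, using only the component sizes measured above, makes the imbalance concrete: the Index costs more bytes per partition than the data itself.

```python
# Per-partition cost of each sstable component for table2, computed from
# the measurements quoted above (12.6 MB Data, 20 MB Index, 1.2 MB Filter
# for one million partitions).
data_bytes = 12.6e6
index_bytes = 20e6
filter_bytes = 1.2e6
partitions = 1_000_000

print(f"Data per partition:   {data_bytes / partitions:.1f} bytes")    # 12.6
print(f"Index per partition:  {index_bytes / partitions:.1f} bytes")   # 20.0
print(f"Filter per partition: {filter_bytes / partitions:.1f} bytes")  # 1.2
```

So each tiny partition pays roughly 20 bytes of Index overhead for about 12.6 bytes of actual data.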
The idea I want to propose in this issue is that when partitions are very short, it would be better not to have an Index file at all. The Summary file could point directly into the Data file instead of the Index file.
I don't know what the threshold for dropping the Index file should be - for larger partitions, it may still be useful. Maybe we can write the Data and Index files as we do today and, after the fact, if we notice that the Index is larger than the Data (or even larger than half of the Data), delete the Index component and rewrite the Summary.
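The after-the-fact check could look something like the sketch below. This is purely illustrative - `SSTableComponents` and `should_drop_index` are made-up names, and the 0.5 threshold is just the "half of Data" guess from above, not a tuned value.

```python
# Hypothetical sketch of the proposed heuristic: after an sstable is
# written the usual way, decide whether its Index component is worth
# keeping. All names here are invented for illustration.
from dataclasses import dataclass

@dataclass
class SSTableComponents:
    data_size: int   # size of the "Data" component, in bytes
    index_size: int  # size of the "Index" component, in bytes

def should_drop_index(sst: SSTableComponents, threshold: float = 0.5) -> bool:
    """True if the Index file costs more than `threshold` of the Data file."""
    return sst.index_size > sst.data_size * threshold

# The sizes measured for table2 above: the Index would be dropped.
table2 = SSTableComponents(data_size=12_600_000, index_size=20_000_000)
print(should_drop_index(table2))  # True
```

After dropping the Index the Summary would need to be rewritten to point directly into Data, as proposed above.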
Instead of dropping the Index file, an even more efficient thing to do could be to make the Index a level between Data and Summary - in other words, the Index would hold a sample of Data's keys: not every partition as today, but not as sparse a sample as the Summary - something in the middle. But this would require more work to implement.
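To illustrate the "middle level" idea, here is a toy sketch: keep every Nth partition key with its position in Data, and have a lookup scan forward in Data from the nearest sampled key. The function names and the sampling interval are illustrative, not a proposed implementation.

```python
# Toy model of an intermediate-density index: every Nth partition key is
# kept with its byte position in the Data file. A lookup binary-searches
# the sample, then would scan forward in Data from the returned position.
import bisect

def build_sampled_index(partition_keys, positions, every_nth=16):
    """Keep (key, Data-file position) for every Nth partition."""
    return [(partition_keys[i], positions[i])
            for i in range(0, len(partition_keys), every_nth)]

def lookup_start(sampled, key):
    """Position in Data from which to start scanning for `key`."""
    keys = [k for k, _ in sampled]
    i = bisect.bisect_right(keys, key) - 1
    return sampled[max(i, 0)][1]

# 1000 toy partitions, pretending each occupies 20 bytes in Data.
keys = list(range(1000))
positions = [k * 20 for k in keys]
sampled = build_sampled_index(keys, positions, every_nth=16)
print(len(sampled))               # 63 entries instead of 1000
print(lookup_start(sampled, 40))  # 640: position of sampled key 32
```

The index shrinks by the sampling factor, at the cost of scanning at most N-1 extra partitions in Data per lookup.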
This use case, of very short partitions, may seem artificial, but a real user encountered it with a materialized view - the user had a base table with reasonably-long partitions, but then had a view with a compound partition key, where each row of data was in its own partition - and each partition was also very short. The user didn't even realize that this very-short-partitions case was happening, but was surprised that the view was 4 times larger than the base table.
Code for the tests described above:
```python
import time

import pytest

# Helpers from the test framework: unique_name() generates a unique table
# name, and the nodetool module wraps "nodetool flush" / "nodetool compact".
from util import unique_name
import nodetool

@pytest.fixture(scope="function")
def table1(cql, test_keyspace):
    t = f'{test_keyspace}.{unique_name()}'
    cql.execute(f'CREATE TABLE {t}(p int, c int, x int, PRIMARY KEY (p, c))')
    yield t
    cql.execute(f'DROP TABLE {t}')

@pytest.fixture(scope="function")
def table2(cql, test_keyspace):
    t = f'{test_keyspace}.{unique_name()}'
    cql.execute(f'CREATE TABLE {t}(p int, c int, x int, PRIMARY KEY ((p, c)))')
    yield t
    cql.execute(f'DROP TABLE {t}')

# Table with a single 1-million-row partition; each row has, in addition
# to the key, a single int.
def test_single_partition(cql, table1):
    table = table1
    stmt = cql.prepare(f"INSERT INTO {table} (p, c, x) VALUES (1, ?, 1)")
    for i in range(1000000):
        cql.execute(stmt, [i])
        if i % 100000 == 0:
            print(i)
    nodetool.flush(cql, table)
    nodetool.compact(cql, table)
    nodetool.flush(cql, table)
    nodetool.compact(cql, table)
    print('going to sleep\n')
    time.sleep(10000)

# Table with one million single-row partitions (no clustering key);
# each row additionally has an int value.
def test_million_partitions(cql, table2):
    table = table2
    stmt = cql.prepare(f"INSERT INTO {table} (p, c, x) VALUES (1, ?, 1)")
    for i in range(1000000):
        cql.execute(stmt, [i])
        if i % 100000 == 0:
            print(i)
    nodetool.flush(cql, table)
    nodetool.compact(cql, table)
    nodetool.flush(cql, table)
    nodetool.compact(cql, table)
    print('going to sleep\n')
    time.sleep(10000)
```