Big rows from multiple SSTables are not compacted into one?

Installation details
ScyllaDB version: 5.4
Cluster size: 6 nodes
OS (RHEL/CentOS/Ubuntu/AWS AMI): CentOS

I updated an item by adding attributes to make it a big row, but the SSTables do not compact into one. Could you please help me figure out why they do not merge into one? The large-rows information is shown below:

cqlsh> select * from system.large_rows;

 keyspace_name                     | table_name             | sstable_name                                | row_size | partition_key | clustering_key | compaction_time
-----------------------------------+------------------------+---------------------------------------------+----------+---------------+----------------+---------------------------------
 alternator_obj_meta_general_qlj_2 | obj_meta_general_qlj_2 | me-3gop_1eb9_1sij42kpajyiqabklc-big-Data.db | 87420304 |   example_obj |     example_bi | 2025-03-19 18:06:46.851000+0000
 alternator_obj_meta_general_qlj_2 | obj_meta_general_qlj_2 | me-3got_0oev_0m2r42kpajyiqabklc-big-Data.db | 52502825 |   example_obj |     example_bi | 2025-03-23 08:47:27.066000+0000
 alternator_obj_meta_general_qlj_2 | obj_meta_general_qlj_2 | me-3gov_03xb_3ndxs2kpajyiqabklc-big-Data.db | 16134442 |   example_obj |     example_bi | 2025-03-25 01:24:53.256000+0000

(3 rows)
cqlsh> 
cqlsh> 
cqlsh> select * from system.large_cells;

 keyspace_name                     | table_name             | sstable_name                                | cell_size | partition_key | clustering_key | column_name | collection_elements | compaction_time
-----------------------------------+------------------------+---------------------------------------------+-----------+---------------+----------------+-------------+---------------------+---------------------------------
 alternator_obj_meta_general_qlj_2 | obj_meta_general_qlj_2 | me-3gop_1eb9_1sij42kpajyiqabklc-big-Data.db |  87420279 |   example_obj |     example_bi |      :attrs |              116645 | 2025-03-19 18:06:45.504000+0000
 alternator_obj_meta_general_qlj_2 | obj_meta_general_qlj_2 | me-3got_0oev_0m2r42kpajyiqabklc-big-Data.db |  52502800 |   example_obj |     example_bi |      :attrs |               70025 | 2025-03-23 08:47:24.623000+0000
 alternator_obj_meta_general_qlj_2 | obj_meta_general_qlj_2 | me-3gov_03xb_3ndxs2kpajyiqabklc-big-Data.db |  16134417 |   example_obj |     example_bi |      :attrs |               21496 | 2025-03-25 01:24:52.070000+0000

(3 rows)
cqlsh> 

The bash script I used for the test is like this:

#!/bin/bash

# Generate a fixed 1 KiB random alphanumeric value, reused for every attribute.
head /dev/urandom | tr -dc A-Za-z0-9 | head -c 1024 > new_value_data2.txt

for i in {1..500000}; do
    newValue=$(cat new_value_data2.txt)

    attrName="new_attribute2_$i"

    echo "Value to be set for ${attrName}: ${newValue}"

    # Add one more attribute to the same item on each iteration, so the row keeps growing.
    sudo /usr/local/bin/aws dynamodb update-item \
        --table-name obj_meta_general_qlj_2 \
        --key '{"obj": {"S": "example_obj"}, "bi": {"S": "example_bi"}}' \
        --update-expression "SET #newAttr = :newValue" \
        --expression-attribute-names '{"#newAttr": "'"${attrName}"'"}' \
        --expression-attribute-values '{":newValue": {"S": "'"${newValue}"'"}}' \
        --endpoint-url http://10.224.0.6:8000

    echo "Iteration $i completed with attribute name: $attrName."
done

The system.large_* tables reflect the SSTables as they are on the disk. If ScyllaDB is not compacting the SSTables together, the entries in system.large_* also won’t be merged.

You can force all the SSTables to be compacted into a single one with nodetool compact (a major compaction), but this is unnecessary.
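For reference, a major compaction of the table from the large_rows output above would look something like this (run on each node; keyspace and table names are taken from that output):

nodetool compact alternator_obj_meta_general_qlj_2 obj_meta_general_qlj_2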


In fact, I am trying to reproduce a failure that resulted in a node restart, which was caused by compacting a big partition. So if there is a big partition or row and a compaction is triggered, will it cause a bad_alloc?

Compacting a large partition should not cause any bad_alloc; ScyllaDB doesn’t read entire partitions into memory. Very big rows, on the other hand, can cause problems because ScyllaDB reads the entire row into memory. If the row is big enough to take too much memory, it will certainly cause problems.
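For what it’s worth, large partitions are tracked separately from large rows; assuming the standard system tables, you can inspect them the same way as above:

cqlsh> select * from system.large_partitions;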

So the large-rows warning is just there to warn users, but does nothing to prevent the problem, right? Maybe there should be some limit to avoid the bad_alloc exception caused by reading a big row all at once?

It is hard to implement such limitations. Such large rows can build up over time, with small individual writes adding up to a large row. If the database refuses to read the row once it becomes large, this will block compaction. The proper solution would be to not read the whole row into memory, but that is a lot of complex work for this edge case. So for now we have the large_rows table, and the user is expected to keep an eye on large rows and take action before/after they become a problem.
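As an illustration only (not an official tool), a small script along these lines could be used to keep an eye on system.large_rows; the 50 MiB threshold is an arbitrary example value:

#!/bin/bash
# Sketch: report rows in system.large_rows bigger than an example threshold.
THRESHOLD=$((50 * 1024 * 1024))   # 50 MiB, arbitrary example value

cqlsh -e "SELECT keyspace_name, table_name, sstable_name, row_size FROM system.large_rows;" |
while IFS='|' read -r ks tbl sst size; do
    size=$(echo "$size" | tr -d '[:space:]')
    # Keep only data lines whose row_size column is numeric and above the threshold.
    if [[ "$size" =~ ^[0-9]+$ ]] && (( size > THRESHOLD )); then
        echo "Large row: keyspace=$(echo "$ks" | xargs), table=$(echo "$tbl" | xargs), size=${size} bytes"
    fi
done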


Thanks for your nice answer. Are there any metrics that could help us get large-row information more conveniently?

I think the large_* tables are also exported to monitoring, although they are an optional (opt-in) feature.
@Amnon_Heiman can you point us to the documentation on this?