Big rows from multiple SSTables are not compacted into one?

Installation details
ScyllaDB version: 5.4
Cluster size: 6 nodes
OS (RHEL/CentOS/Ubuntu/AWS AMI): CentOS

I updated an item by adding attributes to make it a big row, but the SSTables do not compact into one. Could you please help me figure out why they do not merge into one? The large-rows information is shown below:

cqlsh> select * from system.large_rows;

 keyspace_name                     | table_name             | sstable_name                                | row_size | partition_key | clustering_key | compaction_time
-----------------------------------+------------------------+---------------------------------------------+----------+---------------+----------------+---------------------------------
 alternator_obj_meta_general_qlj_2 | obj_meta_general_qlj_2 | me-3gop_1eb9_1sij42kpajyiqabklc-big-Data.db | 87420304 |   example_obj |     example_bi | 2025-03-19 18:06:46.851000+0000
 alternator_obj_meta_general_qlj_2 | obj_meta_general_qlj_2 | me-3got_0oev_0m2r42kpajyiqabklc-big-Data.db | 52502825 |   example_obj |     example_bi | 2025-03-23 08:47:27.066000+0000
 alternator_obj_meta_general_qlj_2 | obj_meta_general_qlj_2 | me-3gov_03xb_3ndxs2kpajyiqabklc-big-Data.db | 16134442 |   example_obj |     example_bi | 2025-03-25 01:24:53.256000+0000

(3 rows)
cqlsh> 
cqlsh> 
cqlsh> select * from system.large_cells;

 keyspace_name                     | table_name             | sstable_name                                | cell_size | partition_key | clustering_key | column_name | collection_elements | compaction_time
-----------------------------------+------------------------+---------------------------------------------+-----------+---------------+----------------+-------------+---------------------+---------------------------------
 alternator_obj_meta_general_qlj_2 | obj_meta_general_qlj_2 | me-3gop_1eb9_1sij42kpajyiqabklc-big-Data.db |  87420279 |   example_obj |     example_bi |      :attrs |              116645 | 2025-03-19 18:06:45.504000+0000
 alternator_obj_meta_general_qlj_2 | obj_meta_general_qlj_2 | me-3got_0oev_0m2r42kpajyiqabklc-big-Data.db |  52502800 |   example_obj |     example_bi |      :attrs |               70025 | 2025-03-23 08:47:24.623000+0000
 alternator_obj_meta_general_qlj_2 | obj_meta_general_qlj_2 | me-3gov_03xb_3ndxs2kpajyiqabklc-big-Data.db |  16134417 |   example_obj |     example_bi |      :attrs |               21496 | 2025-03-25 01:24:52.070000+0000

(3 rows)
cqlsh> 

The bash script I used for the test is like this:

#!/bin/bash

# Generate a fixed 1 KiB random alphanumeric value, reused for every attribute.
head /dev/urandom | tr -dc A-Za-z0-9 | head -c 1024 > new_value_data2.txt

for i in {1..500000}; do
    newValue=$(cat new_value_data2.txt)

    attrName="new_attribute2_$i"

    echo "Value to be set for ${attrName}: ${newValue}"

    # Add one more attribute to the same item on each iteration, so the row keeps growing.
    sudo /usr/local/bin/aws dynamodb update-item \
        --table-name obj_meta_general_qlj_2 \
        --key '{"obj": {"S": "example_obj"}, "bi": {"S": "example_bi"}}' \
        --update-expression "SET #newAttr = :newValue" \
        --expression-attribute-names '{"#newAttr": "'"${attrName}"'"}' \
        --expression-attribute-values '{":newValue": {"S": "'"${newValue}"'"}}' \
        --endpoint-url http://10.224.0.6:8000

    echo "Iteration $i completed with attribute name: $attrName."
done

The system.large_* tables reflect the SSTables as they are on the disk. If ScyllaDB is not compacting the SSTables together, the entries in system.large_* also won’t be merged.

You can force all the SSTables to be compacted into a single one with nodetool compact (a major compaction), but this is unnecessary.
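For reference, a major compaction of the table from the large_rows output above would look something like this (run on each node; keyspace and table names are taken from that output):

nodetool compact alternator_obj_meta_general_qlj_2 obj_meta_general_qlj_2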


In fact, I am trying to reproduce a failure that resulted in a node restart, which was caused by compacting a big partition. So if there is a big partition or row and a compaction is triggered, will it cause a bad_alloc?

Compacting a large partition should not cause any bad_alloc; ScyllaDB doesn’t read entire partitions into memory. Very big rows, on the other hand, can cause problems because ScyllaDB reads the entire row into memory. If the row is big enough to take too much memory, it will certainly cause problems.
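For what it’s worth, large partitions are tracked separately from large rows; assuming the standard system tables, you can inspect them the same way as above:

cqlsh> select * from system.large_partitions;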

So the large-rows warning is just there to warn users, but does nothing to prevent the problem, right? Maybe there should be some limit to avoid the bad_alloc exception caused by reading a big row all at once?

It is hard to implement such limitations. Such large rows can build up over time, with small individual writes adding up to a large row. If the database refuses to read the row once it becomes large, this will block compaction. The proper solution would be to not read the whole row into memory, but that is a lot of complex work for this edge case. So for now we have the large_rows table, and the user is expected to keep an eye on large rows and take action before/after they become a problem.
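As an illustration only (not an official tool), a small script along these lines could be used to keep an eye on system.large_rows; the 50 MiB threshold is an arbitrary example value:

#!/bin/bash
# Sketch: report rows in system.large_rows bigger than an example threshold.
THRESHOLD=$((50 * 1024 * 1024))   # 50 MiB, arbitrary example value

cqlsh -e "SELECT keyspace_name, table_name, sstable_name, row_size FROM system.large_rows;" |
while IFS='|' read -r ks tbl sst size; do
    size=$(echo "$size" | tr -d '[:space:]')
    # Keep only data lines whose row_size column is numeric and above the threshold.
    if [[ "$size" =~ ^[0-9]+$ ]] && (( size > THRESHOLD )); then
        echo "Large row: keyspace=$(echo "$ks" | xargs), table=$(echo "$tbl" | xargs), size=${size} bytes"
    fi
done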


Thanks for your nice answer. Are there any metrics that could help us get large-row information more conveniently?

I think the large_* tables are also exported to monitoring, although they are an optional (opt-in) feature.
@Amnon_Heiman can you point us to the documentation on this?