Scylla constantly flushes memtables and runs a huge number of compactions

Installation details

#ScyllaDB version: 6.1.4
#Cluster size: 6
#OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 22.04/24.04 (varies between nodes); ScyllaDB runs in Docker containers (image: https://hub.docker.com/layers/scylladb/scylla/6.1.4/images/sha256-a507e50f703662580230d54269876d491c9bb7f6110a1a95af6341b74e86154c)
#Limits: 16 GiB memory, 12 CPU cores
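
For clarity, here is a minimal sketch of how limits like these are typically applied, both at the container level and via Scylla's own resource flags (container name and exact values are illustrative, not a copy of our deployment):

# Docker enforces the cgroup limits; the --smp/--memory arguments after the image name
# are passed to Scylla itself so that it sizes its shards and memory accordingly
docker run -d --name scylla-node-1 \
  --cpus 12 --memory 16g \
  scylladb/scylla:6.1.4 \
  --smp 12 --memory 14G --overprovisioned 1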

Hello

TL;DR

I’m facing quite strange behavior from Scylla: it seems that Scylla doesn’t keep any data in memtables and instantly flushes every new write into a new SSTable, which triggers a compaction as soon as there are at least 2 new equally-sized SSTables. This leads to performance issues (~20% of queries time out), resource shortages (fairly frequent OOMs, CPU throttling), and overall cluster overload.

On the other hand, we have several production Scylla clusters in our company running in the same or very similar environments, and there are no issues with them at all. Data models, configuration, workload patterns, and RPS are very similar too. So either something here is going completely wrong, or I’m just missing something obvious. Either way, I need your help - any hints on the causes of such behavior would be much appreciated. Details follow.

Details

Data model: there are 8 tables in the keyspace; the general schema is:

CREATE TABLE IF NOT EXISTS keyspace.entity_a
(
    entity_a_id uuid PRIMARY KEY,
    some_column_a text,
    some_column_b timestamp,
    ... more text and timestamp columns ...
) WITH
    caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND compaction = {'class' : 'SizeTieredCompactionStrategy'};

The compaction strategy is the default SizeTieredCompactionStrategy with default options. Earlier I tried increasing the tombstone_compaction_interval option, but it didn’t improve anything.
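
The change itself was roughly the following (the exact interval value here is illustrative, not the one I actually used):

# Bump tombstone_compaction_interval on one of the tables via cqlsh
cqlsh -e "ALTER TABLE keyspace.entity_a WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'tombstone_compaction_interval': '864000'};"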

Workload: a typical read-modify-write pattern - read an entity, update a few fields, and save it back - so the numbers of reads and writes are expected to be roughly equal.
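
In CQL terms, each request is roughly the following pair (column names are taken from the schema above, the UUID is a placeholder):

# Read the entity, then write back a couple of updated fields
cqlsh -e "SELECT * FROM keyspace.entity_a WHERE entity_a_id = 123e4567-e89b-12d3-a456-426614174000;"
cqlsh -e "UPDATE keyspace.entity_a SET some_column_a = 'new value', some_column_b = toTimestamp(now())
          WHERE entity_a_id = 123e4567-e89b-12d3-a456-426614174000;"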

Symptoms: the first signs of trouble were alerts about high memory and CPU consumption from our monitoring system. After that, random nodes started crashing periodically with exit code 139 (i.e. SIGSEGV, so segfaults, I guess).
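
To rule out kernel OOM kills being mistaken for crashes, I can run checks along these lines on the affected hosts (container name is a placeholder):

# Exit code 139 = 128 + SIGSEGV(11); 137 would usually indicate the kernel OOM killer
docker inspect --format '{{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}' scylla-node-1
# The kernel log should contain oom-killer records if the cgroup limit was actually hit
dmesg -T | grep -iE 'oom|killed process'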

Searching through Scylla’s logs, I found only the following types of errors (some info is replaced with placeholders):

2025-03-15T06:21:11+03:00 WARN  2025-03-15 06:21:11,800 [shard  4:strm] storage_proxy - Failed to apply mutation from {IP}#4: logalloc::bad_alloc (failed to refill emergency reserve of 30 (have 23 free segments)) 

---

2025-03-15T06:21:11+03:00 ERROR 2025-03-15 06:21:11,819 [shard 11:stmt] sstable - failed reading index for /var/lib/scylla/data/keyspace/entity-9fbaa2c0ea1b11ef9b1971c775e4da32/me-3gol_09ah_26nsg2mgniaba2ww9j-big-Data.db: logalloc::bad_alloc (failed to refill emergency reserve of 30 (have 23 free segments))

---

2025-03-15T06:21:11+03:00 terminate called after throwing an instance of 'logalloc::bad_alloc'

---

2025-03-14T20:32:13+03:00 WARN  2025-03-14 20:32:13,799 [shard  5:strm] storage_proxy - Failed to apply mutation from {IP}#5: std::_Nested_exception<std::runtime_error> (frozen_mutation::unfreeze_gently(): failed unfreezing mutation pk{0010ffa76327f52e44c18a1fa694bb3fab53} of keyspace.entity): std::runtime_error (IDL frame truncated: expected to have at least 4 bytes, got 0)

Further investigation showed that the scheduler’s task queue is always quite large (around 50-100 tasks at all times) and that Scylla is constantly running compactions (50-100 compactions in flight at all times according to the compaction manager). This agrees with Scylla’s compaction log records - the logs show roughly 200k (!) successful compactions per table per day on average.
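
For reference, the kind of checks I used to observe and count compactions looks roughly like this (the REST endpoint name and the log message text are my best guess for 6.1.x, and the container name is a placeholder, so treat this as a sketch):

# Currently running compactions as seen by the node
nodetool compactionstats
# Roughly the same data from Scylla's REST API (default port 10000)
curl -s http://localhost:10000/compaction_manager/compactions
# Rough per-day count of finished compactions for one table from the container logs;
# the exact log message wording may differ between versions
docker logs scylla-node-1 2>&1 | grep '2025-03-17' | grep -c 'Compacted.*entity_a'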

Running nodetool tablestats gives the following data (I’m showing only one table - other tables’ stats are very similar):

Keyspace : keyspace
	Read Count: 6478065
	Read Latency: 4.983646351186658E-03 ms
	Write Count: 20058411
	Write Latency: 2.666766574879735E-05 ms
	Pending Flushes: 51
		Table: entity_a
		SSTable count: 231
		SSTables in each level: [231/4]
		Space used (live): 4502553151
		Space used (total): 4502553151
		Space used by snapshots (total): 0
		Off heap memory used (total): 32058036
		SSTable Compression Ratio: 0.9
		Number of partitions (estimate): 21219500
		Memtable cell count: 0
		Memtable data size: 0
		Memtable off heap memory used: 0
		Memtable switch count: 2129661
		Local read count: 1260855
		Local read latency: 2.517 ms
		Local write count: 3654451
		Local write latency: 0.021 ms
		Pending flushes: 2
		Percent repaired: 0.0
		Bloom filter false positives: 9187
		Bloom filter false ratio: 0.12558
		Bloom filter space used: 27063252
		Bloom filter off heap memory used: 27063240
		Index summary off heap memory used: 4994796
		Compression metadata off heap memory used: 0
		Compacted partition minimum bytes: 87
		Compacted partition maximum bytes: 642
		Compacted partition mean bytes: 322
		Average live cells per slice (last five minutes): 0.0
		Maximum live cells per slice (last five minutes): 0
		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0
		Dropped Mutations: 0

And here come the first oddities:

  • Read and write counts were expected to be approximately equal, but writes actually exceed reads roughly threefold - does Scylla count only successful reads?
  • The memtable switch count (over 2 million) is huge compared to the SSTable count and the number of partitions
  • There is no in-memory data at all - memtable cell count, data size, and off-heap memory usage are all 0 (see the metrics check sketched after this list)
  • The bloom filter false-positive ratio (~0.126) is probably higher than expected - not sure if it’s really a problem
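
To double-check the third point, memtable and dirty-memory state can also be pulled straight from the Prometheus exporter; a sketch of the check (default exporter port 9180; the grep pattern is deliberately broad because I’m not sure of the exact metric names):

# Dump memtable / dirty-memory / flush related metrics from one node
curl -s http://localhost:9180/metrics | grep -E 'memtable|dirty|flush' | head -50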

nodetool sstableinfo on the same table reports something like the following for the vast majority of SSTables:

      0 :
              data size : 97
            filter size : 12
             index size : 20
                  level : 0
                   size : 6087
             generation : 3gon_0wq0_5htww1y59nn2zhk0e2
                version : me
              timestamp : 2025-03-17T11:46:48Z
    extended properties :
             compression_parameters :
                        sstable_compression : org.apache.cassandra.io.compress.LZ4Compressor

And here comes the next set of oddities:

  • The majority of SSTables are quite small in total size
  • The total size is much larger than the sum of the data, index, and filter sizes (see the on-disk check sketched after this list)
  • The SSTables can be split into several large groups by timestamp - every SSTable within a group has the same (or a very close) timestamp value
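
The on-disk component sizes can be cross-checked against the sstableinfo output like this (the path follows the directory layout seen in the error log above; the exact table directory name is a placeholder):

# List individual SSTable components and the total size of the table directory
ls -l /var/lib/scylla/data/keyspace/entity_a-*/me-*-big-*.db
du -sh /var/lib/scylla/data/keyspace/entity_a-*/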

And finally nodetool cfhistograms reports the following for this table:

Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)                  
50%             1.00             17.00             56.00               310                12
75%             2.00             25.75             96.75               372                14
95%             3.00             48.15          88205.70               446                17
98%             4.00             58.72         234367.68               446                17
99%             4.00             63.29         328035.80               446                17
Min             0.00              2.00             11.00                87                 3
Max            12.00             68.00         401697.00               642                24

Taking all of the collected data into account, my working hypothesis is that memtables are being flushed almost immediately, producing a constant stream of tiny SSTables, but I’m not quite sure.
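
One more thing I plan to verify is what Scylla actually decided to use at startup (shard count, per-shard memory), since a mismatch between the container limits and what Scylla detects could contribute both to the memory pressure and to the flush behavior. A sketch of that check (container name is a placeholder, and the grep is deliberately broad because I don’t remember the exact wording of the startup lines):

# Look for the startup lines where Scylla reports the CPUs and memory it will use
docker logs scylla-node-1 2>&1 | grep -iE 'smp|shards|memory' | head -20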

Please let me know if you have any clues about this behavior that could help me track down the root cause of the cluster’s problems, or if there is additional data I could collect to help with this investigation.

Thank you in advance!