Installation details
ScyllaDB version: 6.1.4
Cluster size: 6
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 22.04 / 24.04 (varies by node); ScyllaDB runs in Docker containers (image: https://hub.docker.com/layers/scylladb/scylla/6.1.4/images/sha256-a507e50f703662580230d54269876d491c9bb7f6110a1a95af6341b74e86154c)
Limits: memory - 16 GiB, CPU - 12 cores
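For reference, this is roughly how the effective limits can be cross-checked on a node (the container name scylla-node1 is a placeholder; the official image forwards extra run arguments such as --smp and --memory to the scylla binary, so the process command line shows what Scylla was actually told about its resources):

CONTAINER=scylla-node1   # placeholder container name

# Cgroup limits Docker enforces on the container (bytes / nano-CPUs).
docker inspect -f '{{.HostConfig.Memory}} {{.HostConfig.NanoCpus}}' "$CONTAINER"

# Command lines of the processes inside the container - shows whether
# --smp / --memory were passed to scylla or it auto-detected its resources.
docker top "$CONTAINER"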
Hello
TL;DR
I’m facing quite strange Scylla behavior: it looks like Scylla doesn’t keep any data in memtables and instantly flushes every new write into a new SSTable, which triggers a compaction run as soon as there are at least two new, equally sized SSTables - i.e. almost constantly. This leads to performance problems (~20% of queries time out), resource shortage (frequent OOMs, CPU throttling), and overall cluster overload.
On the other hand, we have several production Scylla clusters in our company running in the same or very similar environments, and they have no issues at all. Their data models, configuration, workload patterns, and RPS are very close too. So either something here is going completely wrong, or I’m missing something obvious. Either way, I need your help - any hints about the causes of such behavior would be greatly appreciated. The details follow.
Details
Data model: there are 8 tables in the keyspace; the general schema is:
CREATE TABLE IF NOT EXISTS keyspace.entity_a
(
    entity_a_id uuid PRIMARY KEY,
    some_column_a text,
    some_column_b timestamp,
    -- ... more text and timestamp columns ...
) WITH
    caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND compaction = {'class': 'SizeTieredCompactionStrategy'};
The compaction strategy is the default SizeTieredCompactionStrategy with default options. Earlier I tried increasing the tombstone_compaction_interval option, but it didn’t improve anything; the kind of statement used is sketched below.
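For reference, inspecting and changing the compaction options was done with plain cqlsh statements along these lines (the interval value below is illustrative, not the exact one I used):

# Show the effective table options, including the compaction settings.
cqlsh -e "DESCRIBE TABLE keyspace.entity_a"

# Example of the kind of change that was tried (the value shown is illustrative).
cqlsh -e "ALTER TABLE keyspace.entity_a WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'tombstone_compaction_interval': 86400
};"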
Workload: a typical read-modify-write pattern - read an entity, update some fields, and save it back - so the numbers of reads and writes are expected to be approximately equal.
Symptoms: the first signs of trouble were high memory and CPU consumption alerts from the monitoring system. After that, random nodes started crashing periodically with exit code 139 (128 + 11, i.e. SIGSEGV).
Searching through Scylla’s logs (roughly as shown after the excerpts), I found only the following types of errors (some info is replaced with placeholders):
2025-03-15T06:21:11+03:00 WARN 2025-03-15 06:21:11,800 [shard 4:strm] storage_proxy - Failed to apply mutation from {IP}#4: logalloc::bad_alloc (failed to refill emergency reserve of 30 (have 23 free segments))
---
2025-03-15T06:21:11+03:00 ERROR 2025-03-15 06:21:11,819 [shard 11:stmt] sstable - failed reading index for /var/lib/scylla/data/keyspace/entity-9fbaa2c0ea1b11ef9b1971c775e4da32/me-3gol_09ah_26nsg2mgniaba2ww9j-big-Data.db: logalloc::bad_alloc (failed to refill emergency reserve of 30 (have 23 free segments))
---
2025-03-15T06:21:11+03:00 terminate called after throwing an instance of 'logalloc::bad_alloc'
---
2025-03-14T20:32:13+03:00 WARN 2025-03-14 20:32:13,799 [shard 5:strm] storage_proxy - Failed to apply mutation from {IP}#5: std::_Nested_exception<std::runtime_error> (frozen_mutation::unfreeze_gently(): failed unfreezing mutation pk{0010ffa76327f52e44c18a1fa694bb3fab53} of keyspace.entity): std::runtime_error (IDL frame truncated: expected to have at least 4 bytes, got 0)
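The excerpts above were collected by grepping the container logs, roughly like this (the container name is a placeholder; on nodes where Scylla logs to journald, journalctl works the same way):

CONTAINER=scylla-node1   # placeholder container name

# Pull warnings, errors and abort messages out of the node's logs.
docker logs "$CONTAINER" 2>&1 | grep -E 'WARN|ERROR|terminate called'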
Further investigation showed that there is always a fairly large scheduler task queue (around 50-100 tasks at all times) and that Scylla is constantly running compactions (50-100 compactions in flight at any moment according to the compaction manager). This agrees with Scylla’s compaction log records: on average there are about 200k (!) successful compactions per table per day.
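The in-flight numbers come from the compaction manager, and the per-day totals from counting compaction completions in the logs; roughly like this (the exact wording of the completion log line may differ between versions, and scylla-node1 is a placeholder):

# Pending and active compactions as seen by the compaction manager.
nodetool compactionstats

# Rough count of completed compactions for one table in the current log
# (assumes the completion line contains "Compacted" plus the table name).
docker logs scylla-node1 2>&1 | grep 'Compacted' | grep -c 'entity_a'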
Running nodetool tablestats gives the following data (I’m showing only one table; the other tables’ stats are very similar):
Keyspace : keyspace
Read Count: 6478065
Read Latency: 4.983646351186658E-03 ms
Write Count: 20058411
Write Latency: 2.666766574879735E-05 ms
Pending Flushes: 51
Table: entity_a
SSTable count: 231
SSTables in each level: [231/4]
Space used (live): 4502553151
Space used (total): 4502553151
Space used by snapshots (total): 0
Off heap memory used (total): 32058036
SSTable Compression Ratio: 0.9
Number of partitions (estimate): 21219500
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 2129661
Local read count: 1260855
Local read latency: 2.517 ms
Local write count: 3654451
Local write latency: 0.021 ms
Pending flushes: 2
Percent repaired: 0.0
Bloom filter false positives: 9187
Bloom filter false ratio: 0.12558
Bloom filter space used: 27063252
Bloom filter off heap memory used: 27063240
Index summary off heap memory used: 4994796
Compression metadata off heap memory used: 0
Compacted partition minimum bytes: 87
Compacted partition maximum bytes: 642
Compacted partition mean bytes: 322
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
And here come the first oddities:
- Read and write counts were expected to be approximately equal, but they actually differ by roughly 3x (≈6.5M reads vs. ≈20M writes) - does Scylla count only successful reads?
- The memtable switch count (2129661) is huge compared to the SSTable and partition counts
- There is no data in memtables at all (memtable cell count, data size, and off-heap usage are all zero) - see the check sketched right after this list
- The bloom filter false positive ratio (0.126) seems higher than expected - not sure whether it’s really a problem
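A minimal sketch of how the empty-memtable observation can be double-checked, assuming the default Prometheus metrics port 9180 is reachable on the node and nodetool is run inside (or against) the container; exact metric names may differ between versions:

# Memtable / dirty-memory related gauges from the node's metrics endpoint.
curl -s http://localhost:9180/metrics | grep -Ei 'memtable|dirty' | grep -v '^#'

# Poll the memtable counters every few seconds to see how quickly they are flushed.
watch -n 5 "nodetool tablestats keyspace.entity_a | grep -E 'Memtable|Pending flushes'"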
nodetool sstableinfo on the same table reports something quite similar to the following for the vast majority of SSTables:
0 :
data size : 97
filter size : 12
index size : 20
level : 0
size : 6087
generation : 3gon_0wq0_5htww1y59nn2zhk0e2
version : me
timestamp : 2025-03-17T11:46:48Z
extended properties :
compression_parameters :
sstable_compression : org.apache.cassandra.io.compress.LZ4Compressor
And here comes the next batch of oddities:
- The vast majority of SSTables have very small total sizes (a few KiB each)
- The total size is much larger than the sum of the data, index, and filter sizes
- The SSTables can be split into several large groups by timestamp - every SSTable within a given group has the same (or a very close) timestamp (see the listing sketch right after this list)
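The timestamp grouping is also visible directly on disk; a sketch of the listing, assuming the standard data directory layout inside the container (the container name is a placeholder and the directory's UUID suffix differs per node, hence the glob):

CONTAINER=scylla-node1   # placeholder container name

# List the table's SSTable data files with sizes and modification times,
# sorted by time, to see the groups of equally-timestamped small files.
docker exec "$CONTAINER" bash -c \
  'ls -lh --time-style=long-iso /var/lib/scylla/data/keyspace/entity_a-*/*-Data.db | sort -k6,7'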
And finally, nodetool cfhistograms reports the following for this table:
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 1.00 17.00 56.00 310 12
75% 2.00 25.75 96.75 372 14
95% 3.00 48.15 88205.70 446 17
98% 4.00 58.72 234367.68 446 17
99% 4.00 63.29 328035.80 446 17
Min 0.00 2.00 11.00 87 3
Max 12.00 68.00 401697.00 642 24
Taking all of the collected data into account, my current hypothesis is that memtables are being flushed almost instantly, producing a constant stream of very small SSTables, but I’m not quite sure.
Please let me know if you have any clues about this behavior that could help me find the root cause of the cluster’s problems, or if there is any additional data I could collect to help the investigation.
Thank you in advance!