Originally from the User Slack
@Erik-Jan_van_de_Wal**:** Hi all,
I’m running into intermittent read errors under load on a 3-node Scylla 2025.1 cluster:
Jul 31 16:01:02 learner01 taskset[121958]: Error read_moving_average_data: Failed to fetch the first page of the result: Database returned an error: Not enough nodes responded to the read request in time to satisfy required consistency level (consistency: Quorum, received: 1, required: 2, data_present: false)
Cluster hardware: (servers are about 8 years old)
Node 1: 40 cores / 164 GB RAM / 20 TB disk (DELL PowerEdge R720)
Node 2: 20 cores / 164 GB RAM / 20 TB disk (DELL PowerEdge R720)
Node 3: 56 cores / 64 GB RAM / 20 TB disk (Supermicro)
The servers are in the same rack, as is the machine that runs the application that uses Scylla.
Keyspace:
NetworkTopologyStrategy, RF = 3
Failing Table Schema:
CREATE TABLE ... (
    moving_average_num bigint,
    moving_average double,
    price double,
    hour_identifier bigint,
    timestamp bigint,
    PRIMARY KEY ((moving_average_num, hour_identifier), timestamp)
) WITH CLUSTERING ORDER BY (timestamp ASC)
    AND compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1'
    }
    AND compression = {
        'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'
    };
Query Pattern:
SELECT price, timestamp, moving_average, hour_identifier, moving_average_num
FROM keyspace.moving_average
WHERE moving_average_num = ? AND hour_identifier = ?;
Query function (Rust):
pub async fn get_moving_average_by_data_hour_identifier(
    &self,
    hour_identifier: i64,
    moving_average_num: i64,
) -> Result<Vec<MovingAverage>> {
    // Bound the number of in-flight queries with a global semaphore.
    let _permit = GLOBAL_QUERY_SEMAPHORE.acquire().await?;
    let prepared_values: (i64, i64) = (moving_average_num, hour_identifier);
    // Execute the prepared statement with paging and stream the typed rows.
    let mut rows_stream: TypedRowStream<MovingAverage> = self
        .session
        .execute_iter(
            self.get_moving_average_by_data_hour_identifier.clone(),
            prepared_values,
        )
        .await?
        .rows_stream::<MovingAverage>()?;
    // Collect the whole partition into memory before returning it.
    let mut result: Vec<MovingAverage> = Vec::new();
    while let Some(next_row_res) = rows_stream.next().await {
        result.push(next_row_res?);
    }
    Ok(result)
}
Load Profile:
• App is a simulator running 8–10 jobs, each with max 15 sub-simulations in parallel
• Average: ~7k ops/sec, peaks at ~10k ops/sec
• Errors occur mostly under high read pressure
Identified bottlenecks: CPU, disk
Node 3 Diagnostics (Problem Node):
total_successful_reads: 1,469,616
total_failed_reads: 3,234,429
reads_enqueued: 3,655,955
reads_queued_count: 3,610,289
permits: 100/100
Disk layout:
/dev/sda3 (LVM):
- / (100G)
- /var/lib/scylla/commitlog (50G)
- /var/lib/scylla/hints (25G)
- /var/lib/scylla/view_hints (10G)
- /var/lib/scylla/saved_caches (25G)
/dev/sdb1 (18.2TB):
Note: scylla_io_setup failed on node 3 during install; I bypassed this by explicitly only testing /var/lib/scylla/data and /var/lib/scylla/commitlog (since commitlog, hints, view_hints, and saved_caches are on the same disk).
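For reference, a manual iotune run restricted to just those two directories would be a sketch along these lines (the flags are the same ones scylla_io_setup passes in the error output further below; the cpuset list is omitted here):
sudo /usr/bin/iotune --format envfile --options-file /etc/scylla.d/io.conf --properties-file /etc/scylla.d/io_properties.yaml --evaluation-directory /var/lib/scylla/data --evaluation-directory /var/lib/scylla/commitlog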
The problem mostly looks like node 3: it cannot keep up, so node 1 and node 2 have to pick up its work, and eventually node 1 and node 2 crash as well.
See the screenshot (the blue line is node3)
Anything that pushes me in the right direction is helpful; I have exhausted everything I know so far. I do understand that node 3 is the weakest link, and of course the first thing that needs to be solved is scylla_io_setup, but I am stuck here.
This is the error from scylla_io_setup
coconut@scylla-03:~$ sudo scylla_io_setup
[sudo] password for coconut:
tuning /sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/host0/target0:2:1/0:2:1:0/block/sdb/sdb1
tuning /sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/host0/target0:2:1/0:2:1:0/block/sdb
already tuned: /sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/host0/target0:2:1/0:2:1:0/block/sdb/queue/nomerges
tuning /sys/devices/virtual/block/dm-1
tuning: /sys/devices/virtual/block/dm-1/queue/nomerges 2
tuning /sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/host0/target0:2:0/0:2:0:0/block/sda/sda3
tuning /sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/host0/target0:2:0/0:2:0:0/block/sda
already tuned: /sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/host0/target0:2:0/0:2:0:0/block/sda/queue/nomerges
tuning /sys/devices/virtual/block/dm-2
tuning: /sys/devices/virtual/block/dm-2/queue/nomerges 2
tuning /sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/host0/target0:2:0/0:2:0:0/block/sda/sda3
tuning /sys/devices/virtual/block/dm-4
tuning: /sys/devices/virtual/block/dm-4/queue/nomerges 2
tuning /sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/host0/target0:2:0/0:2:0:0/block/sda/sda3
tuning /sys/devices/virtual/block/dm-5
tuning: /sys/devices/virtual/block/dm-5/queue/nomerges 2
tuning /sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/host0/target0:2:0/0:2:0:0/block/sda/sda3
INFO 2025-07-31 20:34:48,499 seastar - Reactor backend: io_uring
INFO 2025-07-31 20:34:49,051 [shard 0:main] iotune - /var/lib/scylla/view_hints passed sanity checks
INFO 2025-07-31 20:34:49,052 [shard 0:main] iotune - Disk parameters: max_iodepth=916 disks_per_array=1 minimum_io_size=4096
INFO 2025-07-31 20:34:49,056 [shard 0:main] iotune - Filesystem parameters: read alignment 512, write alignment 4096
Starting Evaluation. This may take a while…
Measuring sequential write bandwidth: ERROR 2025-07-31 20:36:12,578 [shard 0:main] seastar - Exiting on unhandled exception: std::system_error (error system:28, No space left on device)
ERROR:root:Command '['/usr/bin/iotune', '--format', 'envfile', '--options-file', '/etc/scylla.d/io.conf', '--properties-file', '/etc/scylla.d/io_properties.yaml', '--evaluation-directory', '/var/lib/scylla/data', '--evaluation-directory', '/var/lib/scylla/commitlog', '--evaluation-directory', '/var/lib/scylla/hints', '--evaluation-directory', '/var/lib/scylla/view_hints', '--evaluation-directory', '/var/lib/scylla/saved_caches', '--cpuset', '1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,29,30,31,32,33,34,35,36,37,38,39,40,41,43,44,45,46,47,48,49,50,51,52,53,54,55']' returned non-zero exit status 1.
ERROR:root:['/var/lib/scylla/data', '/var/lib/scylla/commitlog', '/var/lib/scylla/hints', '/var/lib/scylla/view_hints', '/var/lib/scylla/saved_caches'] did not pass validation tests, it may not be on XFS and/or has limited disk space.
This is a non-supported setup, and performance is expected to be very bad.
For better performance, placing your data on XFS-formatted directories is required.
To override this error, enable developer mode as follow:
sudo /opt/scylladb/scripts/scylla_dev_mode_setup --developer-mode 1
It is only on Scylla node 3; it has the same hard drive capacity as the others. I have reformatted the disks many times already.
The partitions are formatted as XFS:
└─sda3 LVM2_member LVM2 001 Ct4FCa-TwgT-4DjE-7dBR-6cgn-0b0B-MomIEC
├─ubuntu--vg-ubuntu--lv ext4 1.0 60fe3131-3982-44ab-8698-b3895c152ca0 79.5G 14% /
├─ubuntu--vg-scylla_commitlog xfs addf9c73-d96a-43f3-80ab-cf7b40de31c2 20.8G 58% /var/lib/scylla/commitlog
├─ubuntu--vg-scylla_hints xfs 8249b669-f41e-4b2e-aabd-79fefe11a55f 24.4G 2% /var/lib/scylla/hints
├─ubuntu--vg-scylla_caches xfs 03a5ab30-3099-4b9a-b7b9-3bce7b6aca39
├─ubuntu--vg-scylla_view_hints xfs c8581434-4c36-4fcd-b287-ba9238f7e101 9.7G 2% /var/lib/scylla/view_hints
└─ubuntu--vg-scylla_saved_caches xfs 23c34a60-8c32-4f10-9478-89616cc79a96 24.4G 2% /var/lib/scylla/saved_caches
sdb
└─sdb1 xfs b00edbbb-08b8-4e7e-a77c-dfd81755f618 17.8T 2% /var/lib/scylla/data
@Felipe_Cardeneti_Mendes**:** Hm… there are quite a few problems here.
Node 3 has less memory per vCPU than the rest of the nodes: Node 1 ~4 GB/vCPU, Node 2 ~8 GB/vCPU, Node 3 ~1 GB/vCPU.
Your iotune (what scylla_io_setup calls) fails due to:
INFO 2025-07-31 20:34:49,051 [shard 0:main] iotune - /var/lib/scylla/view_hints passed sanity checks
INFO 2025-07-31 20:34:49,052 [shard 0:main] iotune - Disk parameters: max_iodepth=916 disks_per_array=1 minimum_io_size=4096
INFO 2025-07-31 20:34:49,056 [shard 0:main] iotune - Filesystem parameters: read alignment 512, write alignment 4096
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: ERROR 2025-07-31 20:36:12,578 [shard 0:main] seastar - Exiting on unhandled exception: std::system_error (error system:28, No space left on device)
Given the sanity check is on view_hints, I suppose that's the directory it failed on. But why do you have a VG for each directory, rather than simply having a single /var/lib/scylla? If these are slow disks, then maybe keep just the commitlog.
The 2% utilization under /var/lib/scylla/data also looks strange - is it all of your data, or is it simply not persisting any data? If the data is different from the other nodes, then the read could be failing because a large partition could be trying to read-repair to that node.
@Erik-Jan_van_de_Wal**:** I thought it would be wise to keep my actual data on a separate disk (the 20 TB) and keep the other data on the primary (OS) disk, as recommended in the config file.
@Felipe_Cardeneti_Mendes**:** if at all possible, I'd suggest you try to keep similar shard counts and memory ratios per vCPU. Hmm
@Erik-Jan_van_de_Wal**:** the 2% utilization is probably correct. I only have about 5 million records.
I need to work with the hardware I have (woohoo startups). Would it be better if I virtualize (docker) Scylla per server to make them identical?
@Felipe_Cardeneti_Mendes**:** > I thought it would be wise to keep my actual data on a separate disk (the 20 TB) and keep the other data on the primary (OS) disk, as recommended in the config file
Yeah, if these disks are NVMes you should be good with just a single mount. The separate filesystem recommendation for the commitlog is mainly for when the disks are known to be slow, which is not the case with NVMes (hints are rarely used, and we never use caches).
@Erik-Jan_van_de_Wal**:** Right. The disks are magnetic disks. Not SSDs
@Felipe_Cardeneti_Mendes**:** > Would it be better if I virtualize (docker) Scylla per server to make them identical?
Or just reduce the shard count (/etc/scylla.d/cpuset.conf). You can also reduce memory in /etc/scylla.d/memory.conf.
We often use ~8 GB per vCPU. 4 GB works, but I wouldn't go lower than that. The more memory, the more caching space you have, plus headroom for metadata.
oh ok, with magnetic, definitely keep the commitlog, but drop the rest, you don't need these LVs.
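A minimal sketch of the two override files mentioned above, assuming the stock package templates (the variable names and core IDs here are assumptions - check the comments shipped in each file on your install):
# /etc/scylla.d/cpuset.conf - restrict Scylla to 8 cores
CPUSET="--cpuset 0-7"
# /etc/scylla.d/memory.conf - cap Scylla's memory
MEM_CONF="--memory 64G"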
@Erik-Jan_van_de_Wal**:** got it!
so I take the weakest link and make sure the other servers are locked to that config too: 8 GB/vCPU
@Felipe_Cardeneti_Mendes**:** also, increase the commitlog LV size to about the memory size you have, otherwise recycling its segments may be too frequent.
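For illustration, growing the commitlog LV and its XFS filesystem in place could look like this (LV name taken from the lsblk output above; the 64G target is only an example matching the node's RAM):
sudo lvextend -L 64G /dev/ubuntu-vg/scylla_commitlog
sudo xfs_growfs /var/lib/scylla/commitlog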
yeah, 8 vCPUs should work fine for that amount of RAM
one last note before you move forward - since you're on 2025.1 - IF you are using tablets, then you'd need to replace the node. Or just recreate the cluster with the right configs if you are ok with losing the data.
@Erik-Jan_van_de_Wal**:** Thanks @Felipe_Cardeneti_Mendes, I opted out of tablets so I should be good with reconfiguring them one by one
@Felipe_Cardeneti_Mendes**:** @avi drops a tear silently
@Erik-Jan_van_de_Wal**:** aaaaaw i’m sorry @avi
@Felipe_Cardeneti_Mendes could I also pass the cpu and mem config via SCYLLA_ARGS in /etc/default/scylla-server?
@Felipe_Cardeneti_Mendes**:** yes
all of these are sourced by the systemd scylla-server unit, so make sure you don't have duplicates and the command line doesn't get messed up. Other than that, you have the flexibility to adjust it as you see fit.
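A rough sketch of that route (the values are illustrative, and any flags already present in SCYLLA_ARGS on the node should be kept rather than overwritten):
# /etc/default/scylla-server
SCYLLA_ARGS="--smp 8 --memory 64G"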
@Erik-Jan_van_de_Wal**:** Thanks!
@Felipe_Cardeneti_Mendes so I did what you suggested:
• Removed the LVMs
• Increased the commitlog partition
• Configured each server to use 8 CPUs / 64 GB RAM
And I still get the following error on only one (the 3rd) server:
Aug 01 03:04:09 scylla-03 scylla[107456]: [shard 0:sl:d] reader_concurrency_semaphore - (rate limiting dropped 736 similar messages) Semaphore sl:default with 100/100 count and 2861400/155398963 memory resources: timed out, dumping permit diagnostics:
Trigger permit: count=1, memory=23979, table=data.price_0_0002, operation=data-query, state=active/await
Identified bottleneck(s): CPU, disk
permits count memory table/operation/state
34 34 1168K data.regression/data-query/active/await
42 42 1071K data.moving_average/data-query/active/await
24 24 556K data.price_0_0002/data-query/active/await
24 0 0B data.price_0_0002/data-query/waiting_for_admission
117 0 0B data.moving_average/data-query/waiting_for_admission
46 0 0B data.regression/data-query/waiting_for_admission
287 100 2794K total
Stats:
permit_based_evictions: 0
time_based_evictions: 0
inactive_reads: 0
total_successful_reads: 272
total_failed_reads: 13848
total_reads_shed_due_to_overload: 0
total_reads_killed_due_to_kill_limit: 0
reads_admitted: 3409
reads_enqueued_for_admission: 14397
reads_enqueued_for_memory: 0
reads_admitted_immediately: 10
reads_queued_because_ready_list: 3
reads_queued_because_need_cpu_permits: 1296
reads_queued_because_memory_resources: 0
reads_queued_because_count_resources: 13098
reads_queued_with_eviction: 0
total_permits: 14407
current_permits: 287
need_cpu_permits: 100
awaits_permits: 100
disk_reads: 100
sstables_read: 138
As you can see in the screenshot, blue is completely flatlined. The problem so far is that when I have a low req/s everything seems to go well, but when I get to 5K+ req/s stuff starts to break.
Any other advice? I tried lowering num_tokens to 128, but that doesn't change anything.
The load is also low.
@Felipe_Cardeneti_Mendes**:** what are the io_properties.yaml values compared to the other nodes?
@Erik-Jan_van_de_Wal**:** server_1:
coconut@scylla-01:/etc/scylla.d$ cat io_properties.yaml
disks:
- mountpoint: /var/lib/scylla/commitlog
read_iops: 232
read_bandwidth: 84852776
write_iops: 1198
write_bandwidth: 209791600
- mountpoint: /var/lib/scylla/data
read_iops: 231
read_bandwidth: 191295152
write_iops: 1305
write_bandwidth: 254318288
server_2:
coconut@scylla-02:/etc/scylla.d$ cat io_properties.yaml
disks:
- mountpoint: /var/lib/scylla/commitlog
read_iops: 248
read_bandwidth: 87394640
write_iops: 1222
write_bandwidth: 210555792
- mountpoint: /var/lib/scylla/data
read_iops: 234
read_bandwidth: 92834344
write_iops: 1275
write_bandwidth: 252885664
server_3 (problem server):
coconut@scylla-03:/etc/scylla.d$ cat io_properties.yaml
disks:
- mountpoint: /var/lib/scylla/commitlog
read_iops: 245
read_bandwidth: 162483360
write_iops: 3557
write_bandwidth: 213447312
- mountpoint: /var/lib/scylla/data
read_iops: 129
read_bandwidth: 174533520
write_iops: 4058
write_bandwidth: 226609440
@Felipe_Cardeneti_Mendes**:** Yeah, it's definitely disk. The semaphore dump earlier shows all read permits are consumed (100/100) with reads in progress. Once the queue gets full, other requests accumulate until they time out or we kill them due to overload.
Is it possible most reads from n1/n2 are served from cache? You can check the misses/hits in the Cache section within the Detailed panel.
It is probably wise to see the IOPS metrics on the Advanced dashboard for your classes (memtable, compaction, sl:default) and compare these nodes.
My wild guess is that your reads on N3 are saturating its IOPS, and the course of action (aside from using faster disks) would be to tune the system to use as few IOPS as possible. This would involve chunk_length_kb (https://www.scylladb.com/2017/08/01/compression-chunk-sizes-scylla/) depending on your average read size, bloom filters, and (at the extreme) SSTable summary ratios (https://medium.com/agoda-engineering/exploring-scylla-disk-performance-and-optimizations-at-agoda-65a6dcdd6fe7).
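As an example, shrinking the compression chunk for the failing table could be sketched as below (the 4 KB value is only illustrative, and depending on the Scylla version the option key may be chunk_length_in_kb or chunk_length_kb):
ALTER TABLE keyspace.moving_average WITH compression = {
    'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor',
    'chunk_length_in_kb': 4
};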
You can also play with different Seastar settings - like io-latency-goal-ms (https://www.scylladb.com/2023/07/17/top-mistakes-with-scylladb-storage/):
Defines how long requests to disk should take. Defaults to 1.5 * task-quota-ms. The value should be greater than a single request's latency, as it allows for more requests to be dispatched simultaneously. For spinning disks with an average latency of 10ms, increasing the latency goal to at least 50ms should allow for some concurrency.
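If you experiment with that, one hedged sketch (reusing the SCYLLA_ARGS route discussed earlier; the 50 ms figure comes from the spinning-disk guidance just quoted) would be adding the flag to the server arguments:
SCYLLA_ARGS="--smp 8 --memory 64G --io-latency-goal-ms 50"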
Lastly, triple-check that the number of CPUs/shards stands as you configured it - /usr/lib/scylla/seastar-cpu-map.sh -n scylla
@Erik-Jan_van_de_Wal**:** Thanks. I will research further and dive deeper into the documentation. I think this is a case of "never let a software engineer handle database engineering".
We can run our tests and potentially move to live. I've built an L1 cache between the application and the data and it seems to hold (I also removed N3 from the cluster and set RF=2 - not ideal, but for now we need to work with what we've got).
Thanks again for the quick response and helpful insights @Felipe_Cardeneti_Mendes!