Facing errors in repair jobs

I am getting the error below in the logs while a repair job is running. It causes the repair job to run for many hours.
Error:

Jul 9 05:55:58 nodename scylla: [shard 25] repair - repair[f06d168e-664f-4b3f-9717-5abc0e4cbf01]: shard=25, keyspace=time_series_interaction, cf=click_logged_in, range=(-769389720434297988, -551877587955844925], got error in row level repair: std::runtime_error (put_row_diff: Repair follower=10.x.x.x failed in put_row_diff hanlder, status=0)
Jul 9 05:55:58 nodename scylla: [shard 28] repair - repair[f06d168e-664f-4b3f-9717-5abc0e4cbf01]: shard=28, keyspace=time_series_interaction, cf=click_logged_in, range=(-769389720434297988, -551877587955844925], got error in row level repair: std::runtime_error (put_row_diff: Repair follower=10.x.x.x failed in put_row_diff hanlder, status=0)
Jul 9 05:55:58 nodename scylla: [shard 11] repair - repair[f06d168e-664f-4b3f-9717-5abc0e4cbf01]: shard=11, keyspace=time_series_interaction, cf=click_logged_in, range=(-7693897

Jul 9 06:04:41 nodename scylla: [shard 0] repair - repair[f06d168e-664f-4b3f-9717-5abc0e4cbf01]: repair_tracker run failed: std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is closed), shard 1: seastar::rpc::closed_error (connection is closed), shard 2: seastar::rpc::closed_error (connection is closed), shard 3: seastar::rpc::closed_error (connection is closed), shard 4: seastar::rpc::closed_error (connection is closed), shard 5: seastar::rpc::remote_verb_error (seastar::nested_exception), shard 6: seastar::rpc::closed_error (connection is closed), shard 7: seastar::rpc::closed_error (connection is closed), shard 8: seastar::rpc::closed_error (connection is closed), shard 9: seastar::rpc::closed_error (connection is closed), shard 10: seastar::rpc::closed_error (connection is closed), shard 11: seastar::rpc::remote_verb_error (seastar::nested_exception), shard 12: seastar::rpc::closed_error (connection is closed), shard 13: seastar::rpc::closed_error (connection is closed), shard 14: seastar::rpc::closed_error (connection is closed), shard 15: seastar::rpc::closed_error (connection is closed), shard 16: seastar::rpc::closed_erro

Please share as many details as possible: ScyllaDB version, cluster size, OS, and hardware details.
Did you check the Monitoring dashboards?

ScyllaDB version: 5.2.14
Cluster size: 1.5 TB across 2 nodes, with RF=2.
OS: Ubuntu 20.04 (Focal)
Hardware: GCP n2-highmem-32 with 16 NVMe disks
The monitoring dashboards show spikes in read and write latencies.

It looks like the error happened on one of the peer nodes. Please check the logs of the other nodes participating in the repair, to see the real error.
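One way to scan a peer's log for the underlying error is to tally the exception types per subsystem. The following is a minimal sketch, assuming the syslog line format seen in the excerpts in this thread (`[shard N] subsystem - message`); the function name `summarize` and the log path passed on the command line are illustrative, not part of any ScyllaDB tooling.

```python
import re
import sys
from collections import Counter

# Assumed line shape, taken from the excerpts in this thread:
#   Jul 12 01:54:00 nodename scylla: [shard 3] storage_proxy - <message>
LINE_RE = re.compile(r"\[shard (?P<shard>\d+)\] (?P<subsystem>\w+) - (?P<message>.*)")
# C++ exception type names such as std::bad_alloc or seastar::rpc::closed_error
EXC_RE = re.compile(r"\b(?:std|seastar)(?:::\w+)+")


def summarize(lines):
    """Count (subsystem, exception type) pairs found in the given log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a per-shard Scylla log line
        for exc in EXC_RE.findall(match.group("message")):
            counts[(match.group("subsystem"), exc)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python summarize_errors.py /path/to/exported/scylla.log
    with open(sys.argv[1], errors="replace") as log_file:
        for (subsystem, exc), n in summarize(log_file).most_common():
            print(f"{n:6d}  {subsystem:15s}  {exc}")
```

Running this against the log of each node that took part in the repair makes it easier to see which exception dominates on which node, rather than reading the wrapped lines by eye.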

Errors from the peer node:
Jul 12 01:53:54 nodename scylla: [shard 2] repair - repair[b3fbc952-171d-4420-9a82-fb7981a32d9f]: Started to repair 1 out of 1 tables in keyspace=, table=, table_id=c195a300-baaa-11ee-b69d-54c54cd21d49, repair_reason=repair
Jul 12 01:53:54 nodename scylla: [shard 17] repair - repair[b3fbc952-171d-4420-9a82-fb7981a32d9f]: Started to repair 1 out of 1 tables in keyspace=, table=, table_id=c195a300-baaa-11ee-b69d-54c54cd21d49, repair_reason=repair
Jul 12 01:53:54 nodename scylla: [shard 28] repair - repair[b3fbc952-171d-4420-9a82-fb7981a32d9f]: Started to repair 1 out of 1 tables in keyspace=, table=, table_id=c195a300-baaa-11ee-b69d-54c54cd21d49, repair_reason=repair
Jul 12 01:53:54 nodename scylla: [shard 5] repair - repair[b3fbc952-171d-4420-9a82-fb7981a32d9f]: Started to repair 1 out of 1 tables in keyspace=, table=, table_id=c195a300-baaa-11ee-b69d-54c54cd21d49, repair_reason=repair
Jul 12 01:53:54 nodename scylla: [shard 20] repair - repair[b3fbc952-171d-4420-9a82-fb7981a32d9f]: Started to repair 1 out of 1 tables in keyspace=, table=, table_id=c195a300-baaa-11ee-b69d-54c54cd21d49, repair_reason=repair
Jul 12 01:53:54 nodename scylla: [shard 9] repair - repair[b3fbc952-171d-4420-9a82-fb7981a32d9f]: Started to repair 1 out of 1 tables in keyspace=, table=, table_id=c195a300-baaa-11ee-b69d-54c54cd21d49, repair_reason=repair
Jul 12 01:53:58 nodename scylla: [shard 24] storage_proxy - Exception when communicating with 10.x.x.x, to read from .click_anon: std::bad_alloc
Jul 12 01:53:59 nodename scylla: [shard 17] compaction - [Compact system.compaction_history 9b0d8f00-3ff1-11ef-ae9c-8bda48236624] Compacting [/var/lib/scylla/data/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/me-183887-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/me-183857-big-Data.db:level=0:origin=compaction]
Jul 12 01:53:59 nodename scylla: [shard 17] compaction - [Compact system.compaction_history 9b0d8f00-3ff1-11ef-ae9c-8bda48236624] Compacted 2 sstables to [/var/lib/scylla/data/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/me-183917-big-Data.db:level=0]. 299kB to 192kB (~64% of original) in 107ms = 2MB/s. ~1024 total partitions merged to 844.
Jul 12 01:53:59 nodename scylla: [shard 7] repair - repair[b3fbc952-171d-4420-9a82-fb7981a32d9f]: shard=7, keyspace=, cf=, range=(-2136115943339218610, -2031250356567146432], got error in row level repair: seastar::rpc::remote_verb_error (std::bad_alloc)
Jul 12 01:54:00 nodename scylla: [shard 20] repair - repair[b3fbc952-171d-4420-9a82-fb7981a32d9f]: shard=20, keyspace=, cf=, range=(-2136115943339218610, -2031250356567146432], got error in row level repair: seastar::rpc::remote_verb_error (std::bad_alloc)
Jul 12 01:54:00 nodename scylla: [shard 3] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:00 nodename scylla: [shard 3] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:00 nodename scylla: [shard 18] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:00 nodename scylla: [shard 27] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:00 nodename scylla: [shard 18] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:00 nodename scylla: [shard 20] repair - repair[b3fbc952-171d-4420-9a82-fb7981a32d9f]: shard=20, keyspace=, cf=, range=(-2141427493649988622, -2136115943339218610], got error in row level repair: seastar::rpc::remote_verb_error (std::bad_alloc)
Jul 12 01:54:00 nodename scylla: [shard 18] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:00 nodename scylla: [shard 3] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:00 nodename scylla: [shard 18] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:00 nodename scylla: [shard 26] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:01 nodename scylla: [shard 18] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:01 nodename scylla: [shard 27] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:01 nodename scylla: [shard 26] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:01 nodename scylla: [shard 26] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:01 nodename scylla: [shard 3] repair - repair[b3fbc952-171d-4420-9a82-fb7981a32d9f]: shard=3, keyspace=, cf=, range=(-2141427493649988622, -2136115943339218610], got error in row level repair: seastar::rpc::remote_verb_error (std::bad_alloc)
Jul 12 01:54:01 nodename scylla: [shard 3] repair - repair[b3fbc952-171d-4420-9a82-fb7981a32d9f]: shard=3, keyspace=, cf=, range=(-2136115943339218610, -2031250356567146432], got error in row level repair: seastar::rpc::remote_verb_error (std::bad_alloc)
Jul 12 01:54:02 nodename scylla: [shard 10] repair - repair[b3fbc952-171d-4420-9a82-fb7981a32d9f]: shard=10, keyspace=, cf=, range=(-2141427493649988622, -2136115943339218610], got error in row level repair: seastar::rpc::remote_verb_error (std::bad_alloc)
Jul 12 01:54:02 nodename scylla: [shard 4] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:02 nodename scylla: [shard 26] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:02 nodename scylla: [shard 26] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:02 nodename scylla: [shard 27] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:02 nodename scylla: [shard 23] repair - repair[b3fbc952-171d-4420-9a82-fb7981a32d9f]: shard=23, keyspace=, cf=, range=(-2141427493649988622, -2136115943339218610], got error in row level repair: seastar::rpc::remote_verb_error (std::bad_alloc)
Jul 12 01:54:03 nodename scylla: [shard 0] repair - repair[b3fbc952-171d-4420-9a82-fb7981a32d9f]: shard=0, keyspace=, cf=, range=(-2136115943339218610, -2031250356567146432], got error in row level repair: std::runtime_error (put_row_diff: Repair follower=10.x.x.x failed in put_row_diff hanlder, status=0)
Jul 12 01:54:03 nodename scylla: [shard 18] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:03 nodename scylla: [shard 27] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc
Jul 12 01:54:03 nodename scylla: message repeated 2 times: [ [shard 27] storage_proxy - Exception when communicating with 10.x.x.x, to read from .: std::bad_alloc]
Jul 12 01:54:03 nodename scylla: [shard 24] storage_proxy - Exception when communicating with 10.x.x.x, to read from .click_anon: std::bad_alloc
Jul 12 01:54:03 nodename scylla: [shard 17] compaction - [Compact .click_anon 9db30c80-3ff1-11ef-ae9c-8bda48236624] Compacting [/var/lib/scylla/data//click_anon-c576fbe0baaa11eeb69d54c54cd21d49/me-588917-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data//click_anon-c576fbe0baaa11eeb69d54c54cd21d49/me-588887-big-Data.db:level=0:origin=compaction]
Jul 12 01:54:03 nodename scylla: [shard 17] compaction - [Compact . 9dba1160-3ff1-11ef-ae9c-8bda48236624] Compacting [/var/lib/scylla/data//-bdb00460baaa11eeb69d54c54cd21d49/me-577157-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data//-bdb00460baaa11eeb69d54c54cd21d49/me-577127-big-Data.db:level=0:origin=compaction]

We also tried raising the repair job intensity from 1 to 2 and then to 4, which resulted in a restart of the scylla-server service.

I see std::bad_alloc errors in the logs. This is the reason the repair failed.
Increasing the repair intensity makes this problem worse, which is why you had a restart (crash).

Check the Bloom Filter memory usage (percentage) graph, on the Detailed dashboard in monitoring. What does it show?


The spikes are from when repair was running.

There is a fix for this in 5.2.18. Upgrade to either the latest 5.2, or better, to 5.4 or 6.0 (5.2 is not supported anymore).

So you are saying it is because of a bug that we hit on 5.2.14?

The bug was not introduced in 5.2.14 (as far as I remember), but it was fixed in 5.2.18.
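To check a fleet of nodes against the fixed release, the version comparison can be scripted. This is a minimal sketch, not an official tool: `parse_version` and `has_repair_fix` are hypothetical helper names, and the sketch assumes that release lines newer than 5.2 (such as 5.4 and 6.0, recommended above) also carry the fix.

```python
import re

# Fixed release, as stated in this thread.
FIXED_IN = (5, 2, 18)


def parse_version(text):
    """Extract the leading major.minor.patch numbers from a version string."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def has_repair_fix(running_version):
    """Return True if the version is 5.2.18 or newer.

    Assumption: any release that sorts after 5.2.18 (for example 5.4.x
    or 6.0.x) also contains the fix.
    """
    return parse_version(running_version) >= FIXED_IN


for version in ("5.2.14", "5.2.18", "5.4.0", "6.0.1"):
    print(version, has_repair_fix(version))
```

The version string of a running node can be obtained with `scylla --version` and passed to `has_repair_fix`; any build suffix after the three numeric components is ignored by the parser.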

Thanks Botond_Denes. This was helpful.