Originally from the User Slack
@Gopinath_M: Hi, I upgraded my cluster from ScyllaDB 4.6.11 to 5.0.13, and now while running repairs I am seeing the errors below in the logs. Do we know why we are seeing these errors, and in which ScyllaDB version are they fixed?
Jan 29 12:20:17 ip-96-xx-xx-xx7 scylla: [shard 17] repair - Failed to read a fragment from the reader, keyspace=keyspace1, table=standard1, range=[{-8806459041241889168, end}, {-8745173525132896173, end}]: seastar::named_semaphore_timed_out (Semaphore timed out: _streaming_concurrency_sem)
Jan 29 12:20:17 ip-96-xx-xx-xx7 scylla: [shard 20] reader_concurrency_semaphore - Semaphore _streaming_concurrency_sem with 10/10 count and 4445898/159677153 memory resources: timed out, dumping permit diagnostics:
permits  count  memory  table/description/state
23       5      3912K   keyspace1.standard1/repair-meta/active/unused
4        4      341K    keyspace1.standard1/shard-reader/inactive
1        1      89K     keyspace1.standard1/shard-reader/active/used
2        0      0B      keyspace1.standard1/shard-reader/evicted
16       0      0B      keyspace1.standard1/shard-reader/waiting

46       10     4342K   total

Total: 46 permits with 10 count and 4342K memory resources
@dor: Hmm, it could be that ScyllaDB is deliberately slowing you down to protect against OOM. @Botond_Dénes what do you say?
Do you have a large partition or a large collection? (We have a system table that tracks them.)
@Botond_Dénes: @Gopinath_M do you have a mixed shard-count cluster?
These kinds of timeouts during repair are a known problem when repairing nodes with different shard counts. You can retry the repair, maybe with reduced concurrency (see the sketch below).
The best workaround is to avoid having nodes with different shard counts in your cluster.
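For the "reduced concurrency" retry, if repairs are driven by Scylla Manager (an assumption; the thread does not say how repairs are run), the repair load can be lowered with its intensity and parallelism options. A minimal sketch, with a hypothetical cluster name:

# Retry the repair with lower load: fewer token ranges repaired at a time
# and fewer nodes repairing in parallel. "my-cluster" is a placeholder.
sctool repair -c my-cluster --intensity 1 --parallel 1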
Note that even if all your instances are the same, a different --cpuset or --smp command-line argument passed to ScyllaDB can also cause nodes to have different shard counts.
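One way to confirm whether shard counts actually differ between nodes is to count the distinct shard labels each node reports on its metrics endpoint. A minimal sketch, assuming the default ScyllaDB Prometheus metrics port 9180 is reachable and that node1/node2/node3 are placeholder hostnames:

# Count the distinct shards each node exposes via its Prometheus metrics.
# The numbers should match across the whole cluster.
for host in node1 node2 node3; do
  echo -n "$host: "
  curl -s "http://$host:9180/metrics" | grep -oE 'shard="[0-9]+"' | sort -u | wc -l
done

If the counts differ, the cluster has mixed shard counts and these repair timeouts become more likely.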
@Gopinath_M: we have 3 datacenters. 1 datacenter is in amazon with 32 cpu and 2 other datacenter are local using physical servers having 96 cpu
Physical server:
grep -c processor /proc/cpuinfo
96
AWS instance:
grep -c processor /proc/cpuinfo
32
So is there any fix for this, or will all ScyllaDB versions have this problem with mixed nodes/instances?
@Botond_Dénes: All current versions suffer from this. We are working on a fix, in the form of tablets, to be released in 6.0 (if everything goes according to plan).
This fix involves a complete refactoring of how we replicate data between nodes, which is to say it won't be backported.
@Gopinath_M: okay.
So if we have only physical servers with the same CPU count in all datacenters, or only AWS instances with the same CPU count in all datacenters, these errors will go away, correct? Only mixed nodes will cause problems?
@Botond_Dénes: Yes, if all nodes have the same shard count, the problem will go away.
Note that having a mixed shard count is not a guarantee that this error will appear; it also depends on luck (or the lack of it). Having large nodes (many CPUs) makes hitting this error more likely.
@Gopinath_M: Okay, thanks @Botond_Dénes
I see 2 more errors. Are these also because of the same node configuration mentioned above? These errors are seen in both 4.6.11 and 5.0.13. @Botond_Dénes
Jan 29 20:18:23 dxxxx-xx-0030 scylla[3690240]: [shard 28] repair - repair id [id=1, uuid=127be646-2ec8-49e8-beae-8628c32714a5] on shard 28, keyspace=keyspace1, cf=standard1, range=(6833118242024033935, 7669135147572082703], got error in row level repair: std::runtime_error (timedout)
Jan 29 20:26:26 dxxxx-xx-0030 scylla[3690240]: [shard 0] repair - repair_tracker run for repair id [id=1, uuid=127be646-2ec8-49e8-beae-8628c32714a5] failed: std::runtime_error ({shard 0: std::runtime_error (Failed to repair for keyspace=keyspace1, cf=standard1, range=(-7485188833725882615, -5911405872063621954]), shard 1: std::runtime_error (Failed to repair for keyspace=keyspace1, cf=standard1, range=(-inf, -8806459041241889168]),
@Botond_Dénes: Yes, this is probably the same thing. The error is re-reported at different levels.
@Gopinath_M: I see, okay, thanks very much for your quick response.
@Botond_Dénes: This is the lowest-level error:
Jan 29 12:20:17 ip-96-xx-xx-xx7 scylla: [shard 20] reader_concurrency_semaphore - Semaphore _streaming_concurrency_sem with 10/10 count and 4445898/159677153 memory resources: timed out, dumping permit diagnostics:
The keyword here is _streaming_concurrency_sem. This semaphore is only used by repair and streaming, and it should never time out. We only have a very generous 10-minute (or 30-minute) timeout to break out of situations where no progress is being made.
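To check whether a node is hitting this timeout, the diagnostics dump can be pulled out of the system journal. A minimal sketch, assuming ScyllaDB runs under systemd as the scylla-server unit:

# Show recent _streaming_concurrency_sem timeouts together with their
# permit diagnostics dump (journald keeps the dump on the same log line).
journalctl -u scylla-server --since "1 hour ago" | grep '_streaming_concurrency_sem'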
@Gopinath_M: okay