Issues with repair not functioning properly in ScyllaDB version 5.2.6

kdhyun2 · June 9, 2025, 5:35am

Installation details
#ScyllaDB version: 5.2.6
#Cluster size: 1 DC/ 9 Nodes
os (RHEL/CentOS/Ubuntu/AWS AMI):AWS AMI

Environment Details

Issue: nodetool repair -pr appears to be stuck during system_traces keyspace repair
Symptom: No new repair logs generated after certain point, repair process seems frozen
Keyspace: system_traces
Suspected cause: Shard 9 not completing (only shards 0-8, 10-13 visible in logs)

Current Situation

I executed nodetool repair -pr and the repair process was progressing normally through the system_traces keyspace. However, the repair seems to have gotten stuck and is no longer generating any log entries.

Log Analysis

From the repair logs, I can see:

Completed Shards:

Shards 0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13 all completed successfully
All show round_nr_fast_path_already_synced=1280 indicating data was already in sync
All report tx_row_nr=0, rx_row_nr=0 (no data transfer required)

Missing Shard:

Shard 9 is conspicuously absent from the completion logs
This appears to be where the repair process is stuck

Last Activity:

INFO 2025-06-09 04:50:11,245 [shard 6] repair - repair[a2b617a4-288d-497f-8091-82fdfcafa581]: shard 0 completed successfully, keyspace=system_traces

After this timestamp, no further repair progress logs are being generated.

Questions

How can I identify what’s happening with shard 9?
- Are there specific log patterns I should look for?
- Any monitoring commands to check shard-specific repair status?
What are the safe approaches to resolve this stuck repair?
- Should I wait longer for the repair to potentially complete?
- Are there non-disruptive ways to help the stuck shard progress?
- What’s the safest recovery approach that won’t impact cluster stability?
Diagnostic steps:
- What additional information should I collect to troubleshoot this?
- Are there specific metrics or logs that would help identify the root cause?
Recovery options:
- Can I resume the repair from where it left off?
- Should I restart the repair process entirely?
- Any way to exclude or specifically target the problematic shard?

Additional Context

This is a primary range repair (-pr option used)
The repair was progressing normally until it reached this point
All completed shards show healthy statistics with fast-path optimization working
No obvious error messages in the visible logs

Request

I’m looking for guidance on:

How to diagnose what’s happening with the missing shard 9
Best practices for handling stuck repair operations
Safe recovery procedures to get the repair process back on track

Any insights or similar experiences would be greatly appreciated!

Has anyone encountered similar shard-specific repair hanging issues? What was your resolution approach?

Gabriel · June 13, 2025, 2:47pm

Hi @kdhyun2

It looks like you are running a very old ScyllaDB version,
I’d suggest upgrading to a more recent version and see if the issues is resolved.

In addition, I’d suggest to you use Scylla Manager for repairs and backups.

kdhyun2 · June 18, 2025, 7:42am

I solved this problem. I could do repair after restarting node each other. I’m not sure how they solved it.

Topic		Replies	Views
Facing Issue on nodetool rebuild : keeps getting stuck for a long time ScyllaDB troubleshooting , nodetool	1	150	July 10, 2024
Trying to setup a 2 node multi dc cluster for the first time... Seed node comes online fine, second node gets stuck repairing tables and constantly in a state of UJ ScyllaDB open-source , troubleshooting , multi-dc	1	215	March 4, 2024
Error while repairing system keyspace, which keyspaces should be repaired? ScyllaDB error-message , administration , repair	0	24	May 4, 2025
Facing errors in repair jobs ScyllaDB troubleshooting , repair	10	174	July 24, 2024
Error when running repairs with a mixed shard-count cluster ScyllaDB troubleshooting , repair	0	212	April 3, 2024