Installation details
#ScyllaDB version: 5.2.6
#Cluster size: 1 DC / 9 nodes
#OS (RHEL/CentOS/Ubuntu/AWS AMI): AWS AMI
Environment Details
- Issue: `nodetool repair -pr` appears to be stuck during the system_traces keyspace repair
- Symptom: no new repair log entries are generated after a certain point; the repair process seems frozen
- Keyspace: system_traces
- Suspected cause: shard 9 not completing (only shards 0-8 and 10-13 are visible in the logs)
Current Situation
I executed `nodetool repair -pr`, and the repair process was progressing normally through the system_traces keyspace. However, it seems to have gotten stuck and is no longer generating any log entries.
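For what it's worth, this is how I checked that the repair is still registered as in flight on the coordinator node. I'm using the local REST API on port 10000; the `active_repair` path is what I found referenced in the docs, so please correct me if that's not the right endpoint on 5.2.6:

```
# Check whether ScyllaDB still considers a repair to be in progress on this node.
# Port 10000 is the default REST API port; the endpoint path is my assumption
# based on the troubleshooting docs.
curl -s "http://127.0.0.1:10000/storage_service/active_repair/"

# The nodetool process that started the repair is also still running:
pgrep -af "nodetool repair"
```

If there is a better way to confirm repair liveness per shard, I'd love to hear it.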
Log Analysis
From the repair logs, I can see:
Completed Shards:
- Shards 0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13 all completed successfully
- All show `round_nr_fast_path_already_synced=1280`, indicating the data was already in sync
- All report `tx_row_nr=0, rx_row_nr=0` (no data transfer required)
Missing Shard:
- Shard 9 is conspicuously absent from the completion logs
- This appears to be where the repair process is stuck
Last Activity:
INFO 2025-06-09 04:50:11,245 [shard 6] repair - repair[a2b617a4-288d-497f-8091-82fdfcafa581]: shard 0 completed successfully, keyspace=system_traces
After this timestamp, no further repair progress logs are being generated.
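For reference, this is roughly how I extracted the per-shard completion lines; the log location is assumed from a standard AWS AMI / systemd install (`scylla-server` unit), so adjust if your logs go elsewhere:

```
# Pull all log lines for this repair ID from the systemd journal.
journalctl -u scylla-server --since "2025-06-09" \
  | grep "a2b617a4-288d-497f-8091-82fdfcafa581"

# List which shards have reported completion for system_traces.
# Matching "shard N completed" avoids also catching the "[shard X]" reactor
# prefix at the start of each line. Shard 9 never shows up in this output.
journalctl -u scylla-server --since "2025-06-09" \
  | grep "a2b617a4-288d-497f-8091-82fdfcafa581" \
  | grep -oE "shard [0-9]+ completed" \
  | sort -u
```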
Questions
- How can I identify what’s happening with shard 9?
  - Are there specific log patterns I should look for?
  - Any monitoring commands to check shard-specific repair status?
- What are the safe approaches to resolve this stuck repair?
  - Should I wait longer for the repair to potentially complete?
  - Are there non-disruptive ways to help the stuck shard progress?
  - What’s the safest recovery approach that won’t impact cluster stability?
- Diagnostic steps:
  - What additional information should I collect to troubleshoot this?
  - Are there specific metrics or logs that would help identify the root cause?
- Recovery options (my current fallback plan is sketched right after this list):
  - Can I resume the repair from where it left off?
  - Should I restart the repair process entirely?
  - Any way to exclude or specifically target the problematic shard?
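For context, the only abort mechanism I've found so far is the REST call below, followed by re-running the repair scoped to the one keyspace that was in progress. I'd appreciate confirmation on whether this is safe on 5.2.6, or whether the endpoint I've assumed here is even correct:

```
# Abort the in-flight repair on the coordinator node (endpoint name taken from
# the ScyllaDB troubleshooting docs -- please correct me if this is wrong for 5.2.6).
curl -s -X POST "http://127.0.0.1:10000/storage_service/force_terminate_repair"

# Then re-run the repair restricted to the affected keyspace, rather than
# repeating the whole primary-range run:
nodetool repair -pr system_traces
```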
Additional Context
- This is a primary range repair (`-pr` option used)
- The repair was progressing normally until it reached this point
- All completed shards show healthy statistics, with the fast-path optimization working (I've added a repair metrics snapshot after this list)
- No obvious error messages in the visible logs
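In case it helps with diagnosis, this is how I'm planning to capture the repair-related metrics. Port 9180 is the default Prometheus endpoint on my nodes, and since I'm not sure of the exact metric names, I'm just filtering on the prefix:

```
# Grab all repair-related counters from the node's Prometheus endpoint.
# 9180 is the default metrics port on my install; filtering on the
# scylla_repair_ prefix rather than guessing individual metric names.
curl -s "http://127.0.0.1:9180/metrics" | grep "^scylla_repair_"
```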
Request
I’m looking for guidance on:
- How to diagnose what’s happening with the missing shard 9
- Best practices for handling stuck repair operations
- Safe recovery procedures to get the repair process back on track
Any insights or similar experiences would be greatly appreciated!
Has anyone encountered similar shard-specific repair hanging issues? What was your resolution approach?