Issues with repair not functioning properly in ScyllaDB version 5.2.6

Installation details
ScyllaDB version: 5.2.6
Cluster size: 1 DC / 9 nodes
OS (RHEL/CentOS/Ubuntu/AWS AMI): AWS AMI

Environment Details

  • Issue: nodetool repair -pr appears to be stuck during system_traces keyspace repair
  • Symptom: No new repair logs generated after certain point, repair process seems frozen
  • Keyspace: system_traces
  • Suspected cause: Shard 9 not completing (only shards 0-8, 10-13 visible in logs)

Current Situation

I executed nodetool repair -pr and the repair process was progressing normally through the system_traces keyspace. However, the repair seems to have gotten stuck and is no longer generating any log entries.

Log Analysis

From the repair logs, I can see:

Completed Shards:

  • Shards 0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13 all completed successfully
  • All show round_nr_fast_path_already_synced=1280 indicating data was already in sync
  • All report tx_row_nr=0, rx_row_nr=0 (no data transfer required)

Missing Shard:

  • Shard 9 is conspicuously absent from the completion logs
  • This appears to be where the repair process is stuck
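To confirm which shards never reported completion, rather than scanning the log by eye, a small script can help. This is only a minimal sketch: it assumes every completion message follows the same format as the log line quoted in this post, and that the node runs 14 shards (0-13), as the completed shard list suggests. The name `missing_shards` and the sample lines are made up for illustration and are not part of ScyllaDB.

```python
import re

# Matches the completion message format seen in the log excerpt in this post.
# Assumption: all completion messages for a repair follow this same format.
COMPLETED_RE = re.compile(
    r"repair\[(?P<repair_id>[0-9a-f-]+)\]: "
    r"shard (?P<shard>\d+) completed successfully, "
    r"keyspace=(?P<keyspace>\w+)"
)


def missing_shards(log_lines, repair_id, keyspace, shard_count):
    """Return sorted shard ids that have no completion message for this repair."""
    completed = set()
    for line in log_lines:
        m = COMPLETED_RE.search(line)
        if m and m["repair_id"] == repair_id and m["keyspace"] == keyspace:
            # Use the shard number from the message body, not the
            # "[shard N]" prefix, since the two can differ.
            completed.add(int(m["shard"]))
    return sorted(set(range(shard_count)) - completed)


if __name__ == "__main__":
    rid = "a2b617a4-288d-497f-8091-82fdfcafa581"
    # Synthetic sample lines (not real logs): every shard except 9 completes.
    sample = [
        f"INFO  [shard {s}] repair - repair[{rid}]: "
        f"shard {s} completed successfully, keyspace=system_traces"
        for s in range(14)
        if s != 9
    ]
    print(missing_shards(sample, rid, "system_traces", shard_count=14))
```

In practice the lines would come from the node's own log (for example, piped in from the system journal), and `shard_count` should match the number of shards the node actually runs.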

Last Activity:

INFO 2025-06-09 04:50:11,245 [shard 6] repair - repair[a2b617a4-288d-497f-8091-82fdfcafa581]: shard 0 completed successfully, keyspace=system_traces

After this timestamp, no further repair progress logs are being generated.
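To put a number on "no further progress", the time since the last repair log line can be computed. The sketch below assumes the timestamp layout of the line quoted above (`LEVEL YYYY-MM-DD HH:MM:SS,mmm [shard N] repair - ...`) and that the current time is taken in the same timezone as the log. The function names are hypothetical.

```python
import re
from datetime import datetime

# Assumption: repair log lines start like the one quoted above, e.g.
# "INFO 2025-06-09 04:50:11,245 [shard 6] repair - ..."
REPAIR_LINE_RE = re.compile(
    r"^\w+\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+\[shard\s*\d+\]\s+repair - "
)


def last_repair_activity(log_lines):
    """Return the newest timestamp found on a repair log line, or None."""
    latest = None
    for line in log_lines:
        m = REPAIR_LINE_RE.match(line)
        if not m:
            continue
        # %f accepts the 3-digit millisecond field and scales it to microseconds.
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if latest is None or ts > latest:
            latest = ts
    return latest


def idle_seconds(log_lines, now):
    """Seconds between the last repair log line and `now` (same timezone assumed)."""
    latest = last_repair_activity(log_lines)
    return None if latest is None else (now - latest).total_seconds()
```

A long idle time on its own does not prove the repair is hung, but it gives a concrete figure to include when asking for help.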

Questions

  1. How can I identify what’s happening with shard 9?

    • Are there specific log patterns I should look for?
    • Any monitoring commands to check shard-specific repair status?
  2. What are the safe approaches to resolve this stuck repair?

    • Should I wait longer for the repair to potentially complete?
    • Are there non-disruptive ways to help the stuck shard progress?
    • What’s the safest recovery approach that won’t impact cluster stability?
  3. Diagnostic steps:

    • What additional information should I collect to troubleshoot this?
    • Are there specific metrics or logs that would help identify the root cause?
  4. Recovery options:

    • Can I resume the repair from where it left off?
    • Should I restart the repair process entirely?
    • Any way to exclude or specifically target the problematic shard?

Additional Context

  • This is a primary range repair (-pr option used)
  • The repair was progressing normally until it reached this point
  • All completed shards show healthy statistics with fast-path optimization working
  • No obvious error messages in the visible logs

Request

I’m looking for guidance on:

  • How to diagnose what’s happening with the missing shard 9
  • Best practices for handling stuck repair operations
  • Safe recovery procedures to get the repair process back on track

Any insights or similar experiences would be greatly appreciated!

Has anyone encountered similar shard-specific repair hanging issues? What was your resolution approach?

Hi @kdhyun2

It looks like you are running a very old ScyllaDB version.
I’d suggest upgrading to a more recent version and checking whether the issue is resolved.

In addition, I’d suggest using Scylla Manager for repairs and backups.

I solved this problem. The repair worked after I restarted the nodes one after another. I’m not sure why that fixed it.
