Originally from the User Slack
@Matheus_Salvia: I’ve been having a lot of trouble running repairs on my cluster (Scylla OSS 6.2).
Repairs constantly go over 100% progress until they eventually crash. Also, these tables take weeks to repair. See attached.
Any ideas what can be going wrong here and how to debug it?
+----------+---------+----------+------------+
| Keyspace | Table   | Progress | Duration   |
+----------+---------+----------+------------+
| sessions | table1  | 100%     | 51m11s     |
| sessions | table2  | 100%     | 20m51s     |
| sessions | table3  | 100%     | 26m6s      |
| sessions | table4  | 100%     | 35m1s      |
| sessions | table5  | 100%     | 50m1s      |
| sessions | table6  | 100%     | 55m2s      |
| sessions | table7  | 100%     | 1h0m1s     |
| sessions | table8  | 170%/5%  | 469h8m19s  |
| sessions | table9  | 100%     | 52m8s      |
| sessions | table10 | 100%     | 1h15m21s   |
| sessions | table11 | 100%     | 40m12s     |
| sessions | table12 | 188%     | 335h36m37s |
| sessions | table13 | 100%     | 51m54s     |
| sessions | table14 | 189%     | 314h16m23s |
| sessions | table15 | 100%     | 35m5s      |
| sessions | table16 | 100%     | 46m44s     |
| sessions | table17 | 100%     | 45m1s      |
| sessions | table18 | 100%     | 1h7m44s    |
| sessions | table19 | 100%     | 50m4s      |
| sessions | table20 | 100%     | 1h52m47s   |
| sessions | table21 | 100%     | 58m39s     |
| sessions | table22 | 100%     | 55m2s      |
| sessions | table23 | 100%     | 45m9s      |
| sessions | table24 | 100%     | 36m7s      |
| sessions | table25 | 100%     | 40m1s      |
| sessions | table26 | 131%     | 1080h7m53s |
| sessions | table27 | 100%     | 46m5s      |
+----------+---------+----------+------------+
I’m running the repair with: sctool repair -c default/scylla -s now --intensity 2
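For anyone triaging a similar report, the rows that stand out in the progress output above can be picked out mechanically. The following is a minimal Python sketch, not part of Scylla Manager: it assumes the four pipe-separated columns are keyspace, table, progress, and duration, that the first percentage in a value like 170%/5% is the completed share, and it uses a hypothetical 24-hour threshold for "too slow".

```python
import re

# A few rows copied from the progress output above.
REPORT = """\
| sessions | table7 | 100% | 1h0m1s |
| sessions | table8 | 170%/5% | 469h8m19s |
| sessions | table12 | 188% | 335h36m37s |
| sessions | table14 | 189% | 314h16m23s |
| sessions | table26 | 131% | 1080h7m53s |
| sessions | table27 | 100% | 46m5s |
"""

DURATION_RE = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def parse_duration(text):
    """Convert a duration such as '469h8m19s' to seconds."""
    match = DURATION_RE.fullmatch(text)
    if not text or match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def suspicious_tables(report, max_hours=24):
    """Return (table, percent, hours) for rows over 100% or slower than max_hours.

    max_hours is a hypothetical threshold chosen for illustration.
    """
    flagged = []
    for line in report.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 4:
            continue  # border line or unrelated text
        _keyspace, table, progress, duration = cells
        percent = re.match(r"(\d+)%", progress)
        if percent is None:
            continue  # header row, if one is present
        # Assumption: the first percentage is the completed share.
        done = int(percent.group(1))
        hours = parse_duration(duration) / 3600
        if done > 100 or hours > max_hours:
            flagged.append((table, done, round(hours, 1)))
    return flagged


if __name__ == "__main__":
    for table, done, hours in suspicious_tables(REPORT):
        print(f"{table}: {done}% after {hours}h")
```

Run against the full output, this would single out table8, table12, table14, and table26, which are the natural candidates for a manual repair.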
@avi: Try running repair on the slowest table with nodetool repair, to start decoupling components from the problem
@Guy: Hey @Matheus_Salvia did you figure this out?
@Matheus_Salvia: Not yet, still investigating. It looks like Scylla Manager wasn’t touching some nodes, judging by the logs from a manual repair I started on a small table:
root@scylla-dc1-us-east-1a-1:/# nodetool repair --keyspace sessions --table user_session_history
[2025-04-02 17:05:41,023] Starting repair command #178, repairing 1 ranges for keyspace sessions (parallelism=SEQUENTIAL, full=true)
[2025-04-02 17:05:41,023] Repair session 178
[2025-04-02 17:11:18,443] Repair session 178 finished
root@scylla-dc1-us-east-1a-2:~# nodetool repair --keyspace sessions --table user_session_history
[2025-04-02 17:05:43,998] Starting repair command #1, repairing 1 ranges for keyspace sessions (parallelism=SEQUENTIAL, full=true)
[2025-04-02 17:05:43,998] Repair session 1
[2025-04-02 17:21:23,729] Repair session 1 finished
On one node the repair session number was 178 and on the other it was 1, so I assume repair was never run on that second node before. What’s funnier, this table showed as 100% in the Scylla Manager logs, so I wouldn’t have expected a problem here.
I.e. this isn’t one of the big problematic tables that take a million hours; it’s a pretty small one that Scylla Manager showed as 100% done.
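To make the comparison between the two nodes concrete, the nodetool output above can be reduced to a repair command number and a wall-clock duration per node. This is only a sketch in Python that parses the lines as pasted. It assumes the number after "#" is a per-node counter of repair commands, and since that counter may reset when a node restarts, a low value is a hint rather than proof that repair never ran there.

```python
import re
from datetime import datetime

# nodetool output copied from the two nodes above.
NODE_1 = """\
[2025-04-02 17:05:41,023] Starting repair command #178, repairing 1 ranges for keyspace sessions (parallelism=SEQUENTIAL, full=true)
[2025-04-02 17:05:41,023] Repair session 178
[2025-04-02 17:11:18,443] Repair session 178 finished
"""
NODE_2 = """\
[2025-04-02 17:05:43,998] Starting repair command #1, repairing 1 ranges for keyspace sessions (parallelism=SEQUENTIAL, full=true)
[2025-04-02 17:05:43,998] Repair session 1
[2025-04-02 17:21:23,729] Repair session 1 finished
"""

LINE_RE = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] (.*)")


def summarize(output):
    """Return (repair command number, duration in seconds) for one nodetool run."""
    number = started = finished = None
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # shell prompt or unrelated text
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        message = match.group(2)
        start = re.match(r"Starting repair command #(\d+)", message)
        if start:
            number, started = int(start.group(1)), stamp
        elif message.endswith("finished"):
            finished = stamp
    if number is None or finished is None:
        raise ValueError("no complete repair run found")
    return number, (finished - started).total_seconds()


if __name__ == "__main__":
    for name, output in (("node 1", NODE_1), ("node 2", NODE_2)):
        number, seconds = summarize(output)
        print(f"{name}: repair command #{number}, {seconds:.0f}s")
```

On the pasted output this gives roughly 337 seconds for the node at command #178 and roughly 940 seconds for the node at command #1, so the node with the low counter also took almost three times as long on the same small table.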