Scylla-manager backup failure

Hello All,

A backup to Azure from Scylla Manager fails when we run a dry-run backup with “sctool backup -c Cluster -L azure:scylladb-backup --dry-run”.

However, when we run the location check on each Scylla cluster node with
scylla-manager-agent check-location --location azure:scylladb-backup, it does not return any error.
The same is true for scylla-manager-agent check-location --debug --location azure:scylla-db

What we have done so far:

  1. Ran the check-location command with --debug on all cluster nodes; every node returned without any error.

  2. On every node, the debug output showed a temporary file being written to the Azure location and then deleted successfully.

The problem is that when the command is executed from Scylla Manager (“sctool backup -c Cluster -L azure:scylladb-backup --dry-run”), we get the error below:

Jun 13 08:55:52 scylladb scylla-manager[1957]: {"L":"ERROR","T":"2024-06-13T08:55:52.685Z","N":"backup","M":"Failed to access location from node","node":"XXX.XX.XX.XX","location":"azure:scylladb-backup","error":"17X.XX.XX.XX: giving up after 2 attempts: after 30s: context deadline"

Kindly help with steps or ways to troubleshoot and resolve this issue.

Hi @Mike ,

scylla-manager-agent reads the configuration directly from /etc/scylla-manager-agent/scylla-manager-agent.yaml when it’s executed. Therefore, any changes you make to the configuration file are automatically reflected when you call the scylla-manager-agent CLI.

The Scylla Manager server performs location checks through the agent running as a service on the node. Whenever you make changes to scylla-manager-agent.yaml, you must restart the service to apply the recent configuration changes.

Have you updated the location in the .yaml file without restarting the service afterwards? If so, please restart scylla-manager-agent.service on all nodes.

sudo systemctl restart scylla-manager-agent

Br,


Hello @Karol_Kokoszka ,

Thank you for the feedback.

Yes, the agent service has been restarted on all nodes, but the backup to Azure still fails and we still get the following errors in the logs:

Aug 29 10:20:14 scylladb-u scylla-manager[306555]: {"L":"INFO","T":"2024-08-29T10:20:14.933+0100","N":"cluster.client","M":"HTTP retry backoff","operation":"OperationsCheckPermissions","wait":"1s","error":"after 30s: context deadline exceeded","_trace_id":"1fFp5ltHQnmQd2F38Sy8fg"}

Aug 29 10:20:14 scylladb-u scylla-manager[306555]: message repeated 2 times: [ {"L":"INFO","T":"2024-08-29T10:20:14.933+0100","N":"cluster.client","M":"HTTP retry backoff","operation":"OperationsCheckPermissions","wait":"1s","error":"after 30s: context deadline exceeded","_trace_id":"1fFp5ltHQnmQd2F38Sy8fg"}]

Aug 29 10:20:45 scylladb-u scylla-manager[306555]: {"L":"INFO","T":"2024-08-29T10:20:45.936+0100","N":"backup","M":"Location check FAILED","host":"1XX.XX.XX.XXX","location":"azure:scyllatestbackup","error":"giving up after 2 attempts: after 30s: context deadline exceeded","_trace_id":"1fFp5ltHQnmQd2F38Sy8fg"}

Aug 29 10:20:45 scylladb-u scylla-manager[306555]: {"L":"ERROR","T":"2024-08-29T10:20:45.936+0100","N":"backup","M":"Failed to access location from node.
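For anyone triaging similar output, here is a minimal Python sketch (not part of Scylla Manager) that pulls the JSON payload out of syslog-prefixed lines like the ones above and lists the entries that carry an "error" field. The function names are made up for illustration. It assumes each line holds one JSON object after the syslog prefix, and it skips lines that were cut off, like the last one above.

```python
import json

_DECODER = json.JSONDecoder()


def parse_manager_log(line):
    """Return the JSON payload of one scylla-manager syslog line, or None."""
    start = line.find("{")
    if start == -1:
        return None  # no JSON payload on this line
    # Forum software often turns straight quotes into curly ones; undo that.
    payload = line[start:].replace("\u201c", '"').replace("\u201d", '"')
    try:
        # raw_decode tolerates trailing text such as the "]" that syslog
        # appends to "message repeated N times: [ ... ]" lines.
        record, _end = _DECODER.raw_decode(payload)
    except json.JSONDecodeError:
        return None  # truncated or malformed payload
    return record if isinstance(record, dict) else None


def summarize_errors(lines):
    """Collect (timestamp, logger, message, error) for records with an error."""
    summary = []
    for line in lines:
        record = parse_manager_log(line)
        if record and "error" in record:
            summary.append(
                (record.get("T"), record.get("N"), record.get("M"), record["error"])
            )
    return summary
```

Feeding it the output of journalctl for the scylla-manager unit, line by line, should make it easier to see which operation (here OperationsCheckPermissions) is the one timing out.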

What else can we check and try, could it be an issue with the Azure location/access?

@Mike
The error above suggests that the Scylla Manager server cannot reach the agents on the nodes.
Can you run sctool status and paste the output? See the Status page in the ScyllaDB docs.
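Since the timeout is on the manager-to-agent call, a plain TCP reachability test run from the Scylla Manager host can help rule out firewall or routing problems. Below is a rough Python sketch. It assumes the agent listens on port 10001, which I believe is the default (check scylla-manager-agent.yaml if the port was changed). can_connect is a hypothetical helper, not an sctool feature. A successful connect only shows that the port is open; it says nothing about whether the agent itself can reach Azure.

```python
import socket


def can_connect(host, port=10001, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example (replace with the real node addresses):
#   for node in ["10.0.0.1", "10.0.0.2", "10.0.0.3"]:
#       print(node, can_connect(node))
```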

Hello @Karol_Kokoszka, please see the sctool status output below:

+----+----------+----------+---------------+------------+------+---------+--------+-------+--------------------------------------+
|    | CQL      | REST     | Address       | Uptime     | CPUs | Memory  | Scylla | Agent | Host ID                              |
+----+----------+----------+---------------+------------+------+---------+--------+-------+--------------------------------------+
| UN | UP (0ms) | UP (1ms) | 1XX.XX.XX.1XX | 116h19m19s | 8    | 62.797G | 5.2.6  | 3.3.0 | b8f1375x-8855-4f02-913e-b093fcaff1c3 |
| UN | UP (0ms) | UP (0ms) | 1XX.XX.XX.2XX | 572h41m1s  | 8    | 62.788G | 5.2.6  | 3.3.0 | ad57011x-6597-427a-9f42-15390fa7c379 |
| UN | UP (0ms) | UP (0ms) | 1XX.XX.XX.3XX | 1315h1m7s  | 8    | 62.788G | 5.2.6  | 3.3.0 | b3577f6x-dc40-4330-bb7a-0612724f68db |
+----+----------+----------+---------------+------------+------+---------+--------+-------+--------------------------------------+

@Mike Could you share the scylla-manager-agent logs as well?
I don't yet know what the cause might be,
but the call to the manager agent is definitely timing out.

To see more detail, you can also raise the log level by adding this top-level setting to scylla-manager-agent.yaml:

logger:
  level: debug

Hello @Karol_Kokoszka,
Thanks for the help, I appreciate it.
We finally resolved the issue (it was caused by permissions on the cloud side).
Could you explain why our backup, which is now running, is so slow? Are there parameters in the Scylla Manager YAML file we could tune, or metrics we could monitor, to speed up the backup? See the status of our running cluster backup to Azure below. It has been running for more than 3 days now!

scylladbmanager:~# sctool task progress -c Cluster backup/e42a9107-1cc5-423b-8c9a-e5e30b0bd118
Command “progress” is deprecated, use sctool backup|repair progress instead.

Run: 373c1d8b-6c82-11ef-b5c7-005056a70a02
Status: RUNNING (uploading data)
Start time: 06 Sep 24 19:00:00 UTC
Duration: 86h57m26s
Progress: 32%
Snapshot Tag: sm_20240906190002UTC
Datacenters:

  • data_center

+------------+----------+--------+---------+--------------+--------+
| Host       | Progress | Size   | Success | Deduplicated | Failed |
+------------+----------+--------+---------+--------------+--------+
| 1XX.38.1.3 | 18%      | 6.059T | 1.091T  | 0            | 0      |
| 1XX.38.1.4 | 60%      | 6.470T | 3.885T  | 0            | 0      |
| 1XX.38.1.8 | 18%      | 6.327T | 1.154T  | 0            | 0      |
+------------+----------+--------+---------+--------------+--------+
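To put numbers on "slow", the figures in the progress output above can be turned into an average upload rate and a rough time-to-finish per node. This is a back-of-the-envelope Python sketch with illustrative helper names. It assumes the T suffix means TiB, that the whole 86h57m26s counts as upload time (the duration may also cover earlier stages of the run, so real rates could be somewhat higher), and that the rate stays constant.

```python
import re

MIB_PER_TIB = 1024 * 1024


def parse_duration(text):
    """Convert a duration such as '86h57m26s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def average_rate_mib_s(uploaded_tib, elapsed_s):
    """Average upload rate in MiB/s over the elapsed time."""
    return uploaded_tib * MIB_PER_TIB / elapsed_s


def hours_remaining(size_tib, uploaded_tib, elapsed_s):
    """Hours left if the average rate so far stays constant."""
    rate_tib_per_s = uploaded_tib / elapsed_s
    return (size_tib - uploaded_tib) / rate_tib_per_s / 3600


# (Size, Success) in TiB per host, copied from the progress table above.
hosts = {
    "1XX.38.1.3": (6.059, 1.091),
    "1XX.38.1.4": (6.470, 3.885),
    "1XX.38.1.8": (6.327, 1.154),
}
elapsed = parse_duration("86h57m26s")

for host, (size, done) in hosts.items():
    rate = average_rate_mib_s(done, elapsed)
    left = hours_remaining(size, done, elapsed)
    print(f"{host}: {rate:.1f} MiB/s average, about {left:.0f} h remaining")

total_done = sum(done for _size, done in hosts.values())
print(f"cluster: {average_rate_mib_s(total_done, elapsed):.1f} MiB/s average")
```

With these inputs the cluster-wide average works out to roughly 20 MiB/s, and 1XX.38.1.4 has uploaded more than three times as much as either of the other two nodes in the same time, so the per-node difference is worth looking into as well as the overall rate.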
