Scylla Manager Agent cannot connect to Scylla API in Kubernetes

We have Scylla running in a Kubernetes cluster, deployed using helmfile.

Last Friday, we upgraded our scylla pods by bumping their CPU. Since then one of the pods has been stuck in CrashLoopBackOff with the following error in scylla-manager-agent:

│ {"L":"INFO","T":"2023-05-03T21:15:43.928Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp :10000: connect: connection refused","_trace_id":"TgRuXmPiTB27V │
│ {"L":"INFO","T":"2023-05-03T21:15:45.928Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp :10000: connect: connection refused","_trace_id":"TgRuXmPiTB27V │
│ {"L":"INFO","T":"2023-05-03T21:15:47.932Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp :10000: connect: connection refused","_trace_id":"TgRuXmPiTB27V │
│ {"L":"INFO","T":"2023-05-03T21:15:49.935Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp :10000: connect: connection refused","_trace_id":"TgRuXmPiTB27V │
│ {"L":"INFO","T":"2023-05-03T21:15:51.936Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp :10000: connect: connection refused","_trace_id":"TgRuXmPiTB27V │
│ {"L":"INFO","T":"2023-05-03T21:15:53.936Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp :10000: connect: connection refused","_trace_id":"TgRuXmPiTB27V │
│ {"L":"INFO","T":"2023-05-03T21:15:55.937Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp :10000: connect: connection refused","_trace_id":"TgRuXmPiTB27V │
│ {"L":"INFO","T":"2023-05-03T21:15:57.937Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp :10000: connect: connection refused","_trace_id":"TgRuXmPiTB27V │
│ {"L":"ERROR","T":"2023-05-03T21:15:59.938Z","M":"Bye","error":"server init: no connection to Scylla API, make sure that Scylla server is running and api_address and api_port are set cor │
│                                                                                                                                                                                           │
│ STARTUP ERROR: server init: no connection to Scylla API, make sure that Scylla server is running and api_address and api_port are set correctly in config file [/etc/scylla-manager-agent │
│                                                                                                                                                                                           │
│ Stream closed EOF for temporal-scylla/scylla-us-west-2-ebs-a-0 (scylla-manager-agent)

Today another one of our pods fell over with the same error logs in the manager agent.

The scylla logs also show:

│ E0502 16:41:16.025289       1 sidecar/probes.go:169] "healthz probe: can't connect to Scylla API" err="dial tcp [::1]:10000: connect: connection refused" Service="temporal-scylla/scylla-us-west-2-ebs-a-2"                                                                          │
│ E0502 16:41:26.025360       1 sidecar/probes.go:169] "healthz probe: can't connect to Scylla API" err="dial tcp [::1]:10000: connect: connection refused" Service="temporal-scylla/scylla-us-west-2-ebs-a-2"                                                                          │
│ E0502 16:41:36.025397       1 sidecar/probes.go:169] "healthz probe: can't connect to Scylla API" err="dial tcp [::1]:10000: connect: connection refused" Service="temporal-scylla/scylla-us-west-2-ebs-a-2"                                                                          │
│ E0502 16:41:46.024892       1 sidecar/probes.go:169] "healthz probe: can't connect to Scylla API" err="dial tcp [::1]:10000: connect: connection refused" Service="temporal-scylla/scylla-us-west-2-ebs-a-2"                                                                          │
│ E0502 16:41:56.025542       1 sidecar/probes.go:169] "healthz probe: can't connect to Scylla API" err="dial tcp [::1]:10000: connect: connection refused" Service="temporal-scylla/scylla-us-west-2-ebs-a-2"                                                                          │
│ E0502 16:41:57.913886       1 sidecar/controller.go:152] syncing key 'temporal-scylla/scylla-us-west-2-ebs-a-2' failed: can't sync the HostID annotation: can't get HostID: can't get local HostID: dial tcp [::1]:10000: connect: connection refused                                 │
│ E0502 16:42:06.025584       1 sidecar/probes.go:169] "healthz probe: can't connect to Scylla API" err="dial tcp [::1]:10000: connect: connection refused" Service="temporal-scylla/scylla-us-west-2-ebs-a-2"                                                                          │
│ E0502 16:42:16.025182       1 sidecar/probes.go:169] "healthz probe: can't connect to Scylla API" err="dial tcp [::1]:10000: connect: connection refused" Service="temporal-scylla/scylla-us-west-2-ebs-a-2"

What can we do to get these pods to connect to the Scylla Server?

I also see these errors in the scylla container:

│ JMX is enabled to receive remote connections on port: 7199                                                                                                                                                                                                                            │
│ E0503 21:57:00.614729       1 sidecar/controller.go:152] syncing key 'temporal-scylla/scylla-us-west-2-us-west-2a-1' failed: can't sync the HostID annotation: can't get HostID: can't get local HostID: dial tcp [::1]:10000: connect: connection refused                            │
│ 2023-05-03 21:57:01,059 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)                                                                                                                                                           │
│ 2023-05-03 21:57:01,060 INFO spawned: 'scylla' with pid 85                                                                                                                                                                                                                            │
│ 2023-05-03 21:57:01,060 INFO success: scylla-housekeeping entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)                                                                                                                                               │
│ 2023-05-03 21:57:01,061 INFO success: scylla-jmx entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)                                                                                                                                                        │
│ 2023-05-03 21:57:01,061 INFO success: scylla-node-exporter entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)                                                                                                                                              │
│ 2023-05-03 21:57:01,061 INFO success: sshd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)                                                                                                                                                              │
│ Scylla version 5.0.3-0.20220907.b9a61c8e9 with build-id 7be266d2954825cdf843c744de04a0443a8f156c starting ...                                                                                                                                                                         │
│ command used: "/usr/bin/scylla --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --developer-mode=1 --smp 6 --overprovisioned --listen-address :: --rpc-address :: --seed-provider-parameters seeds=fdbc:7863:fdbf::4df9 --broadcast-address fdbc:78 │
│ parsed command line options: [log-to-syslog, (positional) 0, log-to-stdout, (positional) 1, default-log-level, (positional) info, network-stack, (positional) posix, developer-mode: 1, smp, (positional) 6, overprovisioned, listen-address: ::, rpc-address: ::, seed-provider-para │
│ Could not initialize seastar: std::runtime_error (insufficient processing units)                                                                                                                                                                                                      │
│ 2023-05-03 21:57:01,262 INFO exited: scylla (exit status 1; not expected)                                                                                                                                                                                                             │
│ E0503 21:57:03.176938       1 sidecar/controller.go:152] syncing key 'temporal-scylla/scylla-us-west-2-us-west-2a-1' failed: can't sync the HostID annotation: can't get HostID: can't get local HostID: dial tcp [::1]:10000: connect: connection refused                            │
│ 2023-05-03 21:57:03,265 INFO spawned: 'scylla' with pid 96                                                                                                                                                                                                                            │
│ Scylla version 5.0.3-0.20220907.b9a61c8e9 with build-id 7be266d2954825cdf843c744de04a0443a8f156c starting ...                                                                                                                                                                         │
│ command used: "/usr/bin/scylla --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --developer-mode=1 --smp 6 --overprovisioned --listen-address :: --rpc-address :: --seed-provider-parameters seeds=fdbc:7863:fdbf::4df9 --broadcast-address fdbc:78 │
│ parsed command line options: [log-to-syslog, (positional) 0, log-to-stdout, (positional) 1, default-log-level, (positional) info, network-stack, (positional) posix, developer-mode: 1, smp, (positional) 6, overprovisioned, listen-address: ::, rpc-address: ::, seed-provider-para │
│ Could not initialize seastar: std::runtime_error (insufficient processing units)                                                                                                                                                                                                      │
│ 2023-05-03 21:57:03,458 INFO exited: scylla (exit status 1; not expected)                                                                                                                                                                                                             │
│ Traceback (most recent call last):                                                                                                                                                                                                                                                    │
│   File "/opt/scylladb/scripts/libexec/scylla-housekeeping", line 196, in <module>                                                                                                                                                                                                     │
│     args.func(args)                                                                                                                                                                                                                                                                   │
│   File "/opt/scylladb/scripts/libexec/scylla-housekeeping", line 122, in check_version                                                                                                                                                                                                │
│     current_version = sanitize_version(get_api('/storage_service/scylla_release_version'))                                                                                                                                                                                            │
│   File "/opt/scylladb/scripts/libexec/scylla-housekeeping", line 80, in get_api                                                                                                                                                                                                       │
│     return get_json_from_url("http://" + api_address + path)                                                                                                                                                                                                                          │
│   File "/opt/scylladb/scripts/libexec/scylla-housekeeping", line 75, in get_json_from_url                                                                                                                                                                                             │
│     raise RuntimeError(f'Failed to get "{path}" due to the following error: {retval}')                                                                                                                                                                                                │
│ RuntimeError: Failed to get "http://localhost:10000/storage_service/scylla_release_version" due to the following error: <urlopen error [Errno 111] Connection refused>                                                                                                                │
│ 2023-05-03 21:57:06,812 INFO spawned: 'scylla' with pid 119                                                                                                                                                                                                                           │
│ E0503 21:57:06.964264       1 sidecar/probes.go:169] "healthz probe: can't connect to Scylla API" err="dial tcp [::1]:10000: connect: connection refused" Service="tem

We found out that on the affected pods, the Scylla API is not running at all:

root@scylla-us-west-2-us-west-2a-1:/# curl http://localhost:10000/
curl: (7) Failed to connect to localhost port 10000: Connection refused
root@scylla-us-west-2-us-west-2a-1:/# ss -nltp
State                        Recv-Q                       Send-Q                                                  Local Address:Port                                              Peer Address:Port                       Process
LISTEN                       0                            128                                                           0.0.0.0:22                                                     0.0.0.0:*                           users:(("sshd",pid=56,fd=3))
LISTEN                       0                            1024                                                        127.0.0.1:9001                                                   0.0.0.0:*                           users:(("supervisord",pid=34,fd=4))
LISTEN                       0                            128                                                              [::]:22                                                        [::]:*                           users:(("sshd",pid=56,fd=4))
LISTEN                       0                            50                                                 [::ffff:127.0.0.1]:7199                                                         *:*                           users:(("scylla-jmx",pid=39,fd=9))
LISTEN                       0                            4096                                                                *:9100                                                         *:*                           users:(("node_exporter",pid=40,fd=3))
LISTEN                       0                            4096                                                                *:8080                                                         *:*                           users:(("scylla-operator",pid=1,fd=11))
LISTEN                       0                            50                                                                  *:39473                                                        *:*                           users:(("scylla-jmx",pid=39,fd=10)

Compared with a healthy pod:

root@scylla-us-west-2-us-west-2a-0:/# ss -nltp
State                        Recv-Q                       Send-Q                                                  Local Address:Port                                              Peer Address:Port                       Process
LISTEN                       0                            128                                                           0.0.0.0:22                                                     0.0.0.0:*                           users:(("sshd",pid=55,fd=3))
LISTEN                       0                            4096                                                        127.0.0.1:5112                                                   0.0.0.0:*
LISTEN                       0                            1024                                                        127.0.0.1:9001                                                   0.0.0.0:*                           users:(("supervisord",pid=34,fd=4))
LISTEN                       0                            4096                                                                *:9100                                                         *:*                           users:(("node_exporter",pid=40,fd=3))
LISTEN                       0                            100                                                                 *:10000                                                        *:*                           users:(("scylla",pid=37,fd=24))
LISTEN                       0                            4096                                                                *:8080                                                         *:*                           users:(("scylla-operator",pid=1,fd=11))
LISTEN                       0                            4096                                                                *:10001                                                        *:*
LISTEN                       0                            50                                                                  *:32785                                                        *:*                           users:(("scylla-jmx",pid=39,fd=10))
LISTEN                       0                            100                                                                 *:9042                                                         *:*                           users:(("scylla",pid=37,fd=111))
LISTEN                       0                            128                                                              [::]:22                                                        [::]:*                           users:(("sshd",pid=55,fd=4))
LISTEN                       0                            100                                                                 *:7000                                                         *:*                           users:(("scylla",pid=37,fd=320))
LISTEN                       0                            100                                                                 *:9180                                                         *:*                           users:(("scylla",pid=37,fd=23))
LISTEN                       0                            50                                                 [::ffff:127.0.0.1]:7199                                                         *:*                           users:(("scylla-jmx",pid=39,fd=9))
LISTEN                       0                            100                                                                 *:19042                                                        *:*                           users:(("scylla",pid=37,fd=116))
LISTEN                       0                            4096                                                                *:5090                                                         *:*
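
For anyone who wants to script the same check as the curl call above, here is a minimal sketch in Python. The function name api_reachable and the 2 second timeout are my own choices for illustration; only the host and port (localhost:10000) come from the output above. It only tells you whether something accepts TCP connections on that port, not whether the API is healthy.

```python
import socket


def api_reachable(host: str = "localhost", port: int = 10000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    A refused connection, timeout, or name resolution failure all
    surface as OSError subclasses and are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Run inside the scylla container; port 10000 is the API port seen above.
    print("Scylla API reachable:", api_reachable())
```

On the affected pod this should report False, matching the "Connection refused" from curl and the missing *:10000 listener in the ss output.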

Does anybody know why this is happening? The only error log we have found that hints at a cause is:

RuntimeError: Failed to get "http://localhost:10000/storage_service/scylla_release_version" due to the following error: <urlopen error [Errno 111] Connection refused>                                                                                                             

I also see:

2023-05-03 23:52:01,772 CRIT Supervisor is running as root.  Privileges were not dropped because no user is specified in the config file.  If you intend to run as root, you can set user=root in the config file to avoid this message.
2023-05-03 23:52:01,772 INFO Included extra file "/etc/supervisord.conf.d/rsyslog.conf" during parsing
2023-05-03 23:52:01,772 INFO Included extra file "/etc/supervisord.conf.d/scylla-housekeeping.conf" during parsing
2023-05-03 23:52:01,772 INFO Included extra file "/etc/supervisord.conf.d/scylla-jmx.conf" during parsing
2023-05-03 23:52:01,772 INFO Included extra file "/etc/supervisord.conf.d/scylla-node-exporter.conf" during parsing
2023-05-03 23:52:01,772 INFO Included extra file "/etc/supervisord.conf.d/scylla-server.conf" during parsing
2023-05-03 23:52:01,772 INFO Included extra file "/etc/supervisord.conf.d/sshd-server.conf" during parsing
2023-05-03 23:52:01,775 INFO RPC interface 'supervisor' initialized
2023-05-03 23:52:01,775 CRIT Server 'inet_http_server' running without any HTTP authentication checking
2023-05-03 23:52:01,775 INFO supervisord started with pid 34
2023-05-03 23:52:02,777 INFO spawned: 'rsyslog' with pid 36
2023-05-03 23:52:02,778 INFO spawned: 'scylla' with pid 37
2023-05-03 23:52:02,779 INFO spawned: 'scylla-housekeeping' with pid 38
2023-05-03 23:52:02,781 INFO spawned: 'scylla-jmx' with pid 39
2023-05-03 23:52:02,782 INFO spawned: 'scylla-node-exporter' with pid 40
2023-05-03 23:52:02,783 INFO spawned: 'sshd' with pid 43
2023-05-03 23:52:03,011 INFO exited: scylla (exit status 1; not expected)
2023-05-03 23:52:04,316 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-05-03 23:52:04,317 INFO spawned: 'scylla' with pid 85
2023-05-03 23:52:04,318 INFO success: scylla-housekeeping entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-05-03 23:52:04,318 INFO success: scylla-jmx entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-05-03 23:52:04,318 INFO success: scylla-node-exporter entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-05-03 23:52:04,318 INFO success: sshd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-05-03 23:52:04,526 INFO exited: scylla (exit status 1; not expected)
2023-05-03 23:52:06,529 INFO spawned: 'scylla' with pid 96
2023-05-03 23:52:06,742 INFO exited: scylla (exit status 1; not expected)
2023-05-03 23:52:10,058 INFO spawned: 'scylla' with pid 119
2023-05-03 23:52:10,266 INFO exited: scylla (exit status 1; not expected)
2023-05-03 23:52:11,267 INFO gave up: scylla entered FATAL state, too many start retries too quickly

in supervisord.log
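
If it helps anyone triaging a similar restart loop, below is a rough Python sketch that summarizes a supervisord log like the one above: how many times each program exited unexpectedly and which programs ended up in the FATAL state. The regular expressions are written against the exact line shapes shown in this post, so treat them as an assumption rather than a complete description of the supervisord log format. The function name summarize is hypothetical.

```python
import re

# Patterns based only on the supervisord.log lines quoted in this thread.
EXITED = re.compile(r"INFO exited: (\S+) \(exit status (\d+); not expected\)")
GAVE_UP = re.compile(r"INFO gave up: (\S+) entered FATAL state")


def summarize(log_text: str):
    """Return (unexpected exit counts per program, set of programs in FATAL)."""
    exits = {}
    fatal = set()
    for line in log_text.splitlines():
        exited = EXITED.search(line)
        if exited:
            name = exited.group(1)
            exits[name] = exits.get(name, 0) + 1
        gave_up = GAVE_UP.search(line)
        if gave_up:
            fatal.add(gave_up.group(1))
    return exits, fatal


# The last four lines of the log excerpt above, copied verbatim.
SAMPLE = """\
2023-05-03 23:52:06,742 INFO exited: scylla (exit status 1; not expected)
2023-05-03 23:52:10,058 INFO spawned: 'scylla' with pid 119
2023-05-03 23:52:10,266 INFO exited: scylla (exit status 1; not expected)
2023-05-03 23:52:11,267 INFO gave up: scylla entered FATAL state, too many start retries too quickly
"""

if __name__ == "__main__":
    exits, fatal = summarize(SAMPLE)
    print("unexpected exits:", exits)
    print("gave up on:", sorted(fatal))
```

A program that shows up in the FATAL set is one supervisord has stopped restarting, which is why nothing ever starts listening on port 10000 afterwards.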

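We figured out the issue: we had provisioned more CPU for the Scylla pods than the underlying node could provide. This is consistent with the "Could not initialize seastar: std::runtime_error (insufficient processing units)" line in the scylla container log, where Scylla is started with --smp 6.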
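
For others hitting the same error, here is a small Python sketch of the comparison that would have caught this. It assumes the failure mode is simply that the value passed to --smp is larger than the number of CPUs the process is allowed to run on. The helper names (requested_smp, usable_cpus, smp_fits) are made up for illustration, and the command line is the one printed in the scylla log above, cut at the same flags.

```python
import os
import re


def requested_smp(command_line: str):
    """Return the integer passed to --smp, or None if the flag is absent."""
    match = re.search(r"--smp[ =](\d+)", command_line)
    return int(match.group(1)) if match else None


def usable_cpus() -> int:
    """CPUs this process may run on (affinity-aware where the OS supports it)."""
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:  # sched_getaffinity is not available on every platform
        return os.cpu_count() or 1


def smp_fits(command_line: str, available: int) -> bool:
    """True if no --smp is given or the requested shard count fits in `available`."""
    smp = requested_smp(command_line)
    return smp is None or smp <= available


# Start of the command line from the scylla container log in this thread.
SCYLLA_CMD = (
    "/usr/bin/scylla --log-to-syslog 0 --log-to-stdout 1 --default-log-level info "
    "--network-stack posix --developer-mode=1 --smp 6 --overprovisioned"
)

if __name__ == "__main__":
    cpus = usable_cpus()
    print("requested --smp:", requested_smp(SCYLLA_CMD))
    print("usable CPUs here:", cpus)
    print("fits:", smp_fits(SCYLLA_CMD, cpus))
```

Run it inside the scylla container on the affected node. If "fits" comes back False, the pod is asking for more shards than the node exposes, and either the CPU setting or the node size has to change.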


Could you kindly share the details of the CPU configuration you used? Thank you in advance!