We run a Scylla cluster in Kubernetes, deployed with helmfile.
Last Friday we upgraded our Scylla pods by increasing their CPU allocation. Since then, one of the pods has been stuck in CrashLoopBackOff with the following error in scylla-manager-agent:
{"L":"INFO","T":"2023-05-03T21:15:43.928Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp :10000: connect: connection refused","_trace_id":"TgRuXmPiTB27V
{"L":"INFO","T":"2023-05-03T21:15:45.928Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp :10000: connect: connection refused","_trace_id":"TgRuXmPiTB27V
{"L":"INFO","T":"2023-05-03T21:15:47.932Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp :10000: connect: connection refused","_trace_id":"TgRuXmPiTB27V
{"L":"INFO","T":"2023-05-03T21:15:49.935Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp :10000: connect: connection refused","_trace_id":"TgRuXmPiTB27V
{"L":"INFO","T":"2023-05-03T21:15:51.936Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp :10000: connect: connection refused","_trace_id":"TgRuXmPiTB27V
{"L":"INFO","T":"2023-05-03T21:15:53.936Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp :10000: connect: connection refused","_trace_id":"TgRuXmPiTB27V
{"L":"INFO","T":"2023-05-03T21:15:55.937Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp :10000: connect: connection refused","_trace_id":"TgRuXmPiTB27V
{"L":"INFO","T":"2023-05-03T21:15:57.937Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp :10000: connect: connection refused","_trace_id":"TgRuXmPiTB27V
{"L":"ERROR","T":"2023-05-03T21:15:59.938Z","M":"Bye","error":"server init: no connection to Scylla API, make sure that Scylla server is running and api_address and api_port are set cor
STARTUP ERROR: server init: no connection to Scylla API, make sure that Scylla server is running and api_address and api_port are set correctly in config file [/etc/scylla-manager-agent
Stream closed EOF for temporal-scylla/scylla-us-west-2-ebs-a-0 (scylla-manager-agent)
Today another of our pods went down with the same errors in the manager agent logs.
The Scylla logs also show:
E0502 16:41:16.025289 1 sidecar/probes.go:169] "healthz probe: can't connect to Scylla API" err="dial tcp [::1]:10000: connect: connection refused" Service="temporal-scylla/scylla-us-west-2-ebs-a-2"
E0502 16:41:26.025360 1 sidecar/probes.go:169] "healthz probe: can't connect to Scylla API" err="dial tcp [::1]:10000: connect: connection refused" Service="temporal-scylla/scylla-us-west-2-ebs-a-2"
E0502 16:41:36.025397 1 sidecar/probes.go:169] "healthz probe: can't connect to Scylla API" err="dial tcp [::1]:10000: connect: connection refused" Service="temporal-scylla/scylla-us-west-2-ebs-a-2"
E0502 16:41:46.024892 1 sidecar/probes.go:169] "healthz probe: can't connect to Scylla API" err="dial tcp [::1]:10000: connect: connection refused" Service="temporal-scylla/scylla-us-west-2-ebs-a-2"
E0502 16:41:56.025542 1 sidecar/probes.go:169] "healthz probe: can't connect to Scylla API" err="dial tcp [::1]:10000: connect: connection refused" Service="temporal-scylla/scylla-us-west-2-ebs-a-2"
E0502 16:41:57.913886 1 sidecar/controller.go:152] syncing key 'temporal-scylla/scylla-us-west-2-ebs-a-2' failed: can't sync the HostID annotation: can't get HostID: can't get local HostID: dial tcp [::1]:10000: connect: connection refused
E0502 16:42:06.025584 1 sidecar/probes.go:169] "healthz probe: can't connect to Scylla API" err="dial tcp [::1]:10000: connect: connection refused" Service="temporal-scylla/scylla-us-west-2-ebs-a-2"
E0502 16:42:16.025182 1 sidecar/probes.go:169] "healthz probe: can't connect to Scylla API" err="dial tcp [::1]:10000: connect: connection refused" Service="temporal-scylla/scylla-us-west-2-ebs-a-2"
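Both sets of logs come down to the same symptom: a TCP connection to port 10000 on the pod's own loopback address is refused, so nothing appears to be listening where the agent and the sidecar expect the Scylla API. For anyone trying to reproduce the check, here is a minimal Python sketch of that same probe. The helper name can_connect is ours and purely illustrative, and it assumes a Python interpreter is available wherever it is run, which may well not be the case inside the stock containers.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and address resolution errors.
        return False


if __name__ == "__main__":
    # Port 10000 is taken from the logs above. Checking both loopback
    # addresses is our assumption, since the sidecar log shows [::1].
    for host in ("127.0.0.1", "::1"):
        print(host, can_connect(host, 10000))
```

If this reports False on both addresses, the agent and sidecar errors are a consequence of the Scylla process not serving its API, rather than a problem in the agent or sidecar themselves.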
What can we do to get these pods to connect to the Scylla Server?