Error when deploying ScyllaDB to AWS EKS: scylla-operator-webhook.scylla-operator.svc context deadline exceeded

Hi, I’ve tried to deploy ScyllaDB into AWS EKS Kubernetes with both YAML manifests and Helm charts, but I’m encountering an error:

Internal error occurred: failed calling webhook "webhook.scylla.scylladb.com": failed to call webhook: Post "https://scylla-operator-webhook.scylla-operator.svc:443/validate?timeout=10s": context deadline exceeded

Deployment process with manifests:

git clone git@github.com:scylladb/scylla-operator.git
cd scylla-operator
# cert-manager was already deployed when the EKS cluster was originally created, so that step is skipped.
kubectl apply -f deploy/operator.yaml

kubectl wait --for condition=established crd/scyllaclusters.scylla.scylladb.com \
    && kubectl -n scylla-operator rollout status deployment.apps/scylla-operator

kubectl -n scylla-operator logs deployment.apps/scylla-operator
  Found 2 pods, using pod/scylla-operator-5c8cb676d9-mbfcq
  I0731 10:16:54.692443       1 operator/cmd.go:21] maxprocs: Leaving GOMAXPROCS=[4]: CPU quota undefined
  I0731 10:16:54.692905       1 operator/operator.go:202] operator version "v1.14.0-alpha.0-87-g6d65216"
  I0731 10:16:54.692928       1 flag/flags.go:64] FLAG: --burst="75"
  I0731 10:16:54.692933       1 flag/flags.go:64] FLAG: --concurrent-syncs="50"
  I0731 10:16:54.692936       1 flag/flags.go:64] FLAG: --cqls-ingress-port="0"
  I0731 10:16:54.692941       1 flag/flags.go:64] FLAG: --crypto-key-buffer-delay="200ms"
  I0731 10:16:54.692947       1 flag/flags.go:64] FLAG: --crypto-key-buffer-size-max="30"
  I0731 10:16:54.692951       1 flag/flags.go:64] FLAG: --crypto-key-buffer-size-min="10"
  I0731 10:16:54.692955       1 flag/flags.go:64] FLAG: --feature-gates=""
  I0731 10:16:54.692985       1 flag/flags.go:64] FLAG: --help="false"
  I0731 10:16:54.692997       1 flag/flags.go:64] FLAG: --image="docker.io/scylladb/scylla-operator:latest"
  I0731 10:16:54.693003       1 flag/flags.go:64] FLAG: --kubeconfig=""
  I0731 10:16:54.693007       1 flag/flags.go:64] FLAG: --leader-election-lease-duration="1m0s"
  I0731 10:16:54.693013       1 flag/flags.go:64] FLAG: --leader-election-renew-deadline="35s"
  I0731 10:16:54.693017       1 flag/flags.go:64] FLAG: --leader-election-retry-period="10s"
  I0731 10:16:54.693020       1 flag/flags.go:64] FLAG: --loglevel="2"
  I0731 10:16:54.693025       1 flag/flags.go:64] FLAG: --namespace="scylla-operator"
  I0731 10:16:54.693030       1 flag/flags.go:64] FLAG: --qps="50"
  I0731 10:16:54.693035       1 flag/flags.go:64] FLAG: --v="2"
  I0731 10:16:54.693258       1 leaderelection/leaderelection.go:100] Starting leader election
  I0731 10:16:54.693278       1 leaderelection/leaderelection.go:250] attempting to acquire leader lease scylla-operator/scylla-operator-lock...

kubectl create -f examples/generic/cluster.yaml
  Error from server (InternalError): error when creating "examples/generic/cluster.yaml": Internal error occurred: failed calling webhook "webhook.scylla.scylladb.com": failed to call webhook: Post "https://scylla-operator-webhook.scylla-operator.svc:443/validate?timeout=10s": context deadline exceeded

I’ve seen in a couple of threads that firewall port 9443 should be open from the cluster to the nodes. I’ve checked the rules with this process:

# List node groups
aws eks --region eu-central-1 list-nodegroups --cluster-name k8s01 --query nodegroups
[
    "k8s01-initial-2024062407114279650000000f"
]

# Get auto scaling group name
aws eks --region eu-central-1 describe-nodegroup --cluster-name k8s01 --nodegroup-name k8s01-initial-2024062407114279650000000f --query nodegroup.resources.autoScalingGroups[].name
[
    "eks-k8s01-initial-2024062407114279650000000f-b2c8248b-0857-f142-cf40-7a7c66219c46"
]

# Get one instance of the auto scaling group
aws autoscaling --region eu-central-1 describe-auto-scaling-groups --auto-scaling-group-names eks-k8s01-initial-2024062407114279650000000f-b2c8248b-0857-f142-cf40-7a7c66219c46 --query AutoScalingGroups[].Instances[0].InstanceId
[
    "i-000d9209e60e7365d"
]

# Get security groups of the instance
aws ec2 --region eu-central-1 describe-instances --instance-ids i-000d9209e60e7365d --query Reservations[].Instances[].SecurityGroups[].GroupId
[
    "sg-0c79a4a5106d84011"
]

# Check firewall rules of the security group
aws ec2 --region eu-central-1 describe-security-groups --group-ids sg-0c79a4a5106d84011 --query 'SecurityGroups[].IpPermissions[].{FromPort:FromPort,UserIdGroupPairs:UserIdGroupPairs}' --output yaml | sed '/UserId/d'
- FromPort: 30080
- FromPort: 6443
  - Description: Cluster API to node 6443/tcp webhook
    GroupId: sg-06a93cc5cd09cf20a
- FromPort: 30880
- FromPort: null
  - Description: Node to node ingress traffic
    GroupId: sg-0c79a4a5106d84011
- FromPort: 9443
  - Description: Cluster API to node 9443/tcp webhook
    GroupId: sg-06a93cc5cd09cf20a
- FromPort: 1025
  - Description: Node to node ingress on ephemeral ports
    GroupId: sg-0c79a4a5106d84011
- FromPort: 8443
  - Description: Cluster API to node 8443/tcp webhook
    GroupId: sg-06a93cc5cd09cf20a
- FromPort: 10250
  - Description: Cluster API to node kubelets
    GroupId: sg-06a93cc5cd09cf20a
- FromPort: 53
  - Description: Node to node CoreDNS
    GroupId: sg-0c79a4a5106d84011
- FromPort: 53
  - Description: Node to node CoreDNS UDP
    GroupId: sg-0c79a4a5106d84011
- FromPort: 443
  - Description: Cluster API to node groups
    GroupId: sg-06a93cc5cd09cf20a
- FromPort: 4443
  - Description: Cluster API to node 4443/tcp webhook
    GroupId: sg-06a93cc5cd09cf20a

# Check the cluster security groups
aws eks describe-cluster --region eu-central-1 --name k8s01 --query 'cluster.resourcesVpcConfig.{securityGroupIds:securityGroupIds, clusterSecurityGroupId:clusterSecurityGroupId}'
{
    "securityGroupIds": [
        "sg-06a93cc5cd09cf20a"
    ],
    "clusterSecurityGroupId": "sg-06071c24109602e48"
}

Cluster security group sg-06a93cc5cd09cf20a has access to the nodes via port 9443, so I think the firewall rule is fine.

- FromPort: 9443
  - Description: Cluster API to node 9443/tcp webhook
    GroupId: sg-06a93cc5cd09cf20a

Versions:

  • Platform: AWS EKS
  • Kubernetes: v1.30.0-eks-036c24b
  • cert-manager: v1.12.12
  • Helm: v3.15.2
  • Scylla-operator: 1.14.0

Any ideas what I should check?

The webhook Service listens on port 443.

You have to make sure that traffic on port 443 is allowed between the Kubernetes master nodes and the nodes where the operator webhook pods are running.
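
A quick way to see what the webhook Service actually exposes is to check its ports and endpoints. This is only a sketch; the Service name is taken from the error message above, so adjust it if your install differs:

# Show the Service port and the targetPort it forwards to on the pods
kubectl -n scylla-operator describe svc scylla-operator-webhook

# Show the pod IPs and ports that actually back the Service
kubectl -n scylla-operator get endpoints scylla-operator-webhook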

Shouldn’t this rule already allow port 443 from the cluster API (= master nodes?) to the nodes?

Traffic goes from a random source port to port 443, so maybe the rule has to specify that instead of FromPort: 443?

Thanks for advice so far!

The output of the AWS CLI is a bit confusing. It says FromPort: 443, but FromPort/ToPort describe the allowed destination port range, not the source port, so this rule does let traffic to port 443 through.
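
Including ToPort in the query makes the destination port range explicit. This is just the earlier command with an extra field (a sketch, using the same security group ID as above):

aws ec2 --region eu-central-1 describe-security-groups \
    --group-ids sg-0c79a4a5106d84011 \
    --query 'SecurityGroups[].IpPermissions[].{Protocol:IpProtocol,FromPort:FromPort,ToPort:ToPort}' \
    --output yaml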

This screenshot shows the same information as the AWS CLI returns: port 443 is open on the node security group, and the allowed source is the cluster security group.

Would there be some commands that I could run to test connections between pods or something?

Hey @iisti, did you find a solution? I have the same issue as you. Port 443 is open, and I actually have other webhooks configured on this cluster, like the one for the Prometheus Operator, and it’s working fine.
I’m out of ideas. Thanks!

Sorry, I never got it working, and I stopped working on the project. I was also able to configure other webhooks, but I didn’t get ScyllaDB working.

Alright, thanks. There must be something I’m missing, but I can’t figure out what.
Maybe someone who got it working will comment on this post!

Check whether the webhook is up and running by looking at the logs of the two webhook pods. One of them should complain about not being able to become a leader, and the second one (the leader) should mostly be silent.
If they are up and running, then you have to make sure kube-apiserver has access to those pods. They usually run on worker nodes, so you have to make sure that the master nodes where kube-apiserver is running have access to the worker nodes on port 443.
You may also temporarily allow all traffic to check whether the firewall rules are indeed the issue. A couple of commands for this are sketched below.
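
For example, something along these lines (a rough sketch: the Service name comes from the error message, the curl image is just an assumption, and a pod-to-pod test passing does not by itself prove that the kube-apiserver on the EKS control plane can reach the webhook):

# Find the pods backing the webhook Service (endpoints show pod IP:port)
kubectl -n scylla-operator get endpoints scylla-operator-webhook
kubectl -n scylla-operator get pods -o wide

# Check the webhook pods' logs (substitute the pod names from the previous command)
kubectl -n scylla-operator logs <webhook-pod-name>

# Test in-cluster connectivity to the webhook Service from a throwaway pod
kubectl run webhook-test --rm -i --restart=Never --image=curlimages/curl -- \
    curl -vk https://scylla-operator-webhook.scylla-operator.svc:443/validate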

Okay, allowing all ports on the nodes with the Kubernetes API as the source makes it work.
However, all ports need to be open on the nodes — not just port 443 — and I don’t understand how this webhook is different from the one for ingress-nginx or the Prometheus operator, which both work and also listen on port 443.

I think I’ve figured it out: port 5000 also needs to be open. I suppose the Service is only used to provide the IP and port of a pod, and the API server eventually connects directly to port 5000 of the webhook pod.
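
You can check this by looking at the Service’s targetPort, which is the port the pod itself listens on (just a sketch to confirm the value on your own install):

kubectl -n scylla-operator get svc scylla-operator-webhook \
    -o jsonpath='{.spec.ports[*].port} -> {.spec.ports[*].targetPort}{"\n"}'
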
Here’s the Terraform block to add if you’re using the Terraform EKS module — and the operator will work!

  node_security_group_additional_rules = {
    [...]
    scylla-operator = {
      description                   = "Scylla Operator"
      from_port                     = 5000
      to_port                       = 5000
      protocol                      = "TCP"
      type                          = "ingress"
      source_cluster_security_group = true
    }
    [...]
  }
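
Once the rule is in place (terraform apply on the module), retrying the command that failed at the start of the thread should get past the webhook:

kubectl create -f examples/generic/cluster.yaml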