Error when deploying ScyllaDB to AWS EKS: scylla-operator-webhook.scylla-operator.svc context deadline exceeded

Hi, I’ve tried to deploy ScyllaDB into AWS EKS Kubernetes with both YAML manifests and Helm charts, but I’m encountering an error:

Internal error occurred: failed calling webhook "webhook.scylla.scylladb.com": failed to call webhook: Post "https://scylla-operator-webhook.scylla-operator.svc:443/validate?timeout=10s": context deadline exceeded

Deployment process with manifests:

git clone git@github.com:scylladb/scylla-operator.git
cd scylla-operator
# cert-manager was already deployed when the EKS cluster was originally created, so that step is skipped.
kubectl apply -f deploy/operator.yaml

kubectl wait --for condition=established crd/scyllaclusters.scylla.scylladb.com \
    && kubectl -n scylla-operator rollout status deployment.apps/scylla-operator

kubectl -n scylla-operator logs deployment.apps/scylla-operator
  Found 2 pods, using pod/scylla-operator-5c8cb676d9-mbfcq
  I0731 10:16:54.692443       1 operator/cmd.go:21] maxprocs: Leaving GOMAXPROCS=[4]: CPU quota undefined
  I0731 10:16:54.692905       1 operator/operator.go:202] operator version "v1.14.0-alpha.0-87-g6d65216"
  I0731 10:16:54.692928       1 flag/flags.go:64] FLAG: --burst="75"
  I0731 10:16:54.692933       1 flag/flags.go:64] FLAG: --concurrent-syncs="50"
  I0731 10:16:54.692936       1 flag/flags.go:64] FLAG: --cqls-ingress-port="0"
  I0731 10:16:54.692941       1 flag/flags.go:64] FLAG: --crypto-key-buffer-delay="200ms"
  I0731 10:16:54.692947       1 flag/flags.go:64] FLAG: --crypto-key-buffer-size-max="30"
  I0731 10:16:54.692951       1 flag/flags.go:64] FLAG: --crypto-key-buffer-size-min="10"
  I0731 10:16:54.692955       1 flag/flags.go:64] FLAG: --feature-gates=""
  I0731 10:16:54.692985       1 flag/flags.go:64] FLAG: --help="false"
  I0731 10:16:54.692997       1 flag/flags.go:64] FLAG: --image="docker.io/scylladb/scylla-operator:latest"
  I0731 10:16:54.693003       1 flag/flags.go:64] FLAG: --kubeconfig=""
  I0731 10:16:54.693007       1 flag/flags.go:64] FLAG: --leader-election-lease-duration="1m0s"
  I0731 10:16:54.693013       1 flag/flags.go:64] FLAG: --leader-election-renew-deadline="35s"
  I0731 10:16:54.693017       1 flag/flags.go:64] FLAG: --leader-election-retry-period="10s"
  I0731 10:16:54.693020       1 flag/flags.go:64] FLAG: --loglevel="2"
  I0731 10:16:54.693025       1 flag/flags.go:64] FLAG: --namespace="scylla-operator"
  I0731 10:16:54.693030       1 flag/flags.go:64] FLAG: --qps="50"
  I0731 10:16:54.693035       1 flag/flags.go:64] FLAG: --v="2"
  I0731 10:16:54.693258       1 leaderelection/leaderelection.go:100] Starting leader election
  I0731 10:16:54.693278       1 leaderelection/leaderelection.go:250] attempting to acquire leader lease scylla-operator/scylla-operator-lock...

kubectl create -f examples/generic/cluster.yaml
  Error from server (InternalError): error when creating "examples/generic/cluster.yaml": Internal error occurred: failed calling webhook "webhook.scylla.scylladb.com": failed to call webhook: Post "https://scylla-operator-webhook.scylla-operator.svc:443/validate?timeout=10s": context deadline exceeded

I’ve seen in a couple of threads that firewall port 9443 should be open from the cluster to the nodes. I’ve checked the rules with this process:

# List node groups
aws eks --region eu-central-1 list-nodegroups --cluster-name k8s01 --query nodegroups
[
    "k8s01-initial-2024062407114279650000000f"
]

# Get auto scaling group name
aws eks --region eu-central-1 describe-nodegroup --cluster-name k8s01 --nodegroup-name k8s01-initial-2024062407114279650000000f --query nodegroup.resources.autoScalingGroups[].name
[
    "eks-k8s01-initial-2024062407114279650000000f-b2c8248b-0857-f142-cf40-7a7c66219c46"
]

# Get one instance of the auto scaling group
aws autoscaling --region eu-central-1 describe-auto-scaling-groups --auto-scaling-group-names eks-k8s01-initial-2024062407114279650000000f-b2c8248b-0857-f142-cf40-7a7c66219c46 --query AutoScalingGroups[].Instances[0].InstanceId
[
    "i-000d9209e60e7365d"
]

# Get security groups of the instance
aws ec2 --region eu-central-1 describe-instances --instance-ids i-000d9209e60e7365d --query Reservations[].Instances[].SecurityGroups[].GroupId
[
    "sg-0c79a4a5106d84011"
]

# Check firewall rules of the security group
aws ec2 --region eu-central-1 describe-security-groups --group-ids sg-0c79a4a5106d84011 --query 'SecurityGroups[].IpPermissions[].{FromPort:FromPort,UserIdGroupPairs:UserIdGroupPairs}' --output yaml | sed '/UserId/d'
- FromPort: 30080
- FromPort: 6443
  - Description: Cluster API to node 6443/tcp webhook
    GroupId: sg-06a93cc5cd09cf20a
- FromPort: 30880
- FromPort: null
  - Description: Node to node ingress traffic
    GroupId: sg-0c79a4a5106d84011
- FromPort: 9443
  - Description: Cluster API to node 9443/tcp webhook
    GroupId: sg-06a93cc5cd09cf20a
- FromPort: 1025
  - Description: Node to node ingress on ephemeral ports
    GroupId: sg-0c79a4a5106d84011
- FromPort: 8443
  - Description: Cluster API to node 8443/tcp webhook
    GroupId: sg-06a93cc5cd09cf20a
- FromPort: 10250
  - Description: Cluster API to node kubelets
    GroupId: sg-06a93cc5cd09cf20a
- FromPort: 53
  - Description: Node to node CoreDNS
    GroupId: sg-0c79a4a5106d84011
- FromPort: 53
  - Description: Node to node CoreDNS UDP
    GroupId: sg-0c79a4a5106d84011
- FromPort: 443
  - Description: Cluster API to node groups
    GroupId: sg-06a93cc5cd09cf20a
- FromPort: 4443
  - Description: Cluster API to node 4443/tcp webhook
    GroupId: sg-06a93cc5cd09cf20a

# Check the cluster security groups
aws eks describe-cluster --region eu-central-1 --name k8s01 --query 'cluster.resourcesVpcConfig.{securityGroupIds:securityGroupIds, clusterSecurityGroupId:clusterSecurityGroupId}'
{
    "securityGroupIds": [
        "sg-06a93cc5cd09cf20a"
    ],
    "clusterSecurityGroupId": "sg-06071c24109602e48"
}

Cluster security group sg-06a93cc5cd09cf20a has access to the nodes via port 9443, so I think the firewall rule is fine.

- FromPort: 9443
  - Description: Cluster API to node 9443/tcp webhook
    GroupId: sg-06a93cc5cd09cf20a

Versions:

  • Platform: AWS EKS
  • Kubernetes: v1.30.0-eks-036c24b
  • cert-manager: v1.12.12
  • Helm: v3.15.2
  • Scylla-operator: 1.14.0

Any ideas what I should check?

The webhook Service listens on port 443.

You have to make sure that traffic on port 443 is allowed between the Kubernetes master nodes and the nodes where the operator webhook pods are running.
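
A quick way to see what the webhook Service actually exposes is to check its ports and endpoints. This is only a sketch; the Service name is taken from the error message above, so adjust it if your install differs:

# Show the Service port and the targetPort it forwards to on the pods
kubectl -n scylla-operator describe svc scylla-operator-webhook

# Show the pod IPs and ports that actually back the Service
kubectl -n scylla-operator get endpoints scylla-operator-webhook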

Shouldn’t this rule already allow port 443 from the cluster API (= master nodes?) to the nodes?

Traffic goes from a random source port to port 443, so maybe the rule has to specify that instead of FromPort: 443?

Thanks for advice so far!

The output of the AWS CLI is a bit confusing. It says FromPort: 443, but FromPort/ToPort describe the allowed destination port range, not the source port, so this rule does let traffic to port 443 through.
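
Including ToPort in the query makes the destination port range explicit. This is just the earlier command with an extra field (a sketch, using the same security group ID as above):

aws ec2 --region eu-central-1 describe-security-groups \
    --group-ids sg-0c79a4a5106d84011 \
    --query 'SecurityGroups[].IpPermissions[].{Protocol:IpProtocol,FromPort:FromPort,ToPort:ToPort}' \
    --output yaml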

This screenshot shows the same information as the AWS CLI returns: port 443 is open on the node security group, and the allowed source is the cluster security group.

Would there be some commands that I could run to test connections between pods or something?

Hey @iisti, did you find a solution? I have the same issue as you. Port 443 is open, and I actually have other webhooks configured on this cluster, like the one for the Prometheus Operator, and it’s working fine.
I’m out of ideas. Thanks!

Sorry, I never got it working, and I stopped working on the project. I was also able to configure other webhooks, but I didn’t get ScyllaDB working.

Alright, thanks. There must be something I’m missing, but I can’t figure out what.
Maybe someone who got it working will comment on this post!

Check whether the webhook is up and running by looking at the logs of the two webhook pods. One of them should complain about not being able to become a leader, and the second one (the leader) should mostly be silent.
If they are up and running, then you have to make sure kube-apiserver has access to those pods. They usually run on worker nodes, so you have to make sure that the master nodes where kube-apiserver is running have access to the worker nodes on port 443.
You may also temporarily allow all traffic to check whether the firewall rules are indeed the issue. A couple of commands for this are sketched below.
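
For example, something along these lines (a rough sketch: the Service name comes from the error message, the curl image is just an assumption, and a pod-to-pod test passing does not by itself prove that the kube-apiserver on the EKS control plane can reach the webhook):

# Find the pods backing the webhook Service (endpoints show pod IP:port)
kubectl -n scylla-operator get endpoints scylla-operator-webhook
kubectl -n scylla-operator get pods -o wide

# Check the webhook pods' logs (substitute the pod names from the previous command)
kubectl -n scylla-operator logs <webhook-pod-name>

# Test in-cluster connectivity to the webhook Service from a throwaway pod
kubectl run webhook-test --rm -i --restart=Never --image=curlimages/curl -- \
    curl -vk https://scylla-operator-webhook.scylla-operator.svc:443/validate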

Okay, allowing all ports on the nodes with the Kubernetes API as the source makes it work.
However, all ports need to be open on the nodes — not just port 443 — and I don’t understand how this webhook is different from the one for ingress-nginx or the Prometheus operator, which both work and also listen on port 443.

I think I’ve figured it out: port 5000 also needs to be open. I suppose the Service is only used to provide the IP and port of a pod, and the API server eventually connects directly to port 5000 of the webhook pod.
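
You can check this by looking at the Service’s targetPort, which is the port the pod itself listens on (just a sketch to confirm the value on your own install):

kubectl -n scylla-operator get svc scylla-operator-webhook \
    -o jsonpath='{.spec.ports[*].port} -> {.spec.ports[*].targetPort}{"\n"}'
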
Here’s the Terraform block to add if you’re using the Terraform EKS module — and the operator will work!

  node_security_group_additional_rules = {
    [...]
    scylla-operator = {
      description                   = "Scylla Operator"
      from_port                     = 5000
      to_port                       = 5000
      protocol                      = "TCP"
      type                          = "ingress"
      source_cluster_security_group = true
    }
    [...]
  }
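
Once the rule is in place (terraform apply on the module), retrying the command that failed at the start of the thread should get past the webhook:

kubectl create -f examples/generic/cluster.yaml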