Originally from the User Slack
@Terence_Liu: I’m having a lot of trouble getting our istio service mesh to coexist with the scylla operator and scylla cluster. Is there a guide that helps set this up right?
@Terence_Liu: We’ve excluded a few ports on the operator and the ScyllaCluster, notably 7000 (node-to-node communication) and 8080 (from the scylladb-api-status-probe). The service is up, and clients can read from it. But once writes reach roughly 500~1000 reqs/s, the scylladb-api-status-probe starts to report errors, and the istio sidecar starts to emit many messages like:
Request to probe app failed: Get "http://100.122.13.158:8080/healthz": dial tcp 127.0.0.6:0->100.122.13.158:8080: bind: address already in use, original URL path = /app-health/scylla/livez
app URL path = /healthz
This causes the scylla livenessProbe to fail after 12 tries, which restarts scylla. As a result, my 3-node cluster goes into a rolling wave of restarts. When no writes happen, some reads seem totally fine.
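To see which rewritten probe paths are failing and how often, the sidecar messages can be tallied. This is a minimal sketch that assumes the message format quoted above; the function name and the regular expression are made up for illustration and are not part of istio.

```python
import re
import sys
from collections import Counter

# Matches the sidecar message quoted above. The optional < > around the URL
# only accounts for chat-client link formatting.
PROBE_FAILURE = re.compile(
    r'Request to probe app failed: Get "<?(?P<url>[^">]+)>?": '
    r'(?P<error>.*), original URL path = (?P<path>\S+)'
)


def tally_probe_failures(lines):
    """Count probe failures per (rewritten probe path, final error reason)."""
    counts = Counter()
    for line in lines:
        match = PROBE_FAILURE.search(line)
        if match:
            # Keep only the last segment of the Go error chain,
            # e.g. "address already in use".
            reason = match.group("error").rsplit(": ", 1)[-1]
            counts[(match.group("path"), reason)] += 1
    return counts


if __name__ == "__main__":
    # Reads sidecar log lines from standard input.
    for (path, reason), n in tally_probe_failures(sys.stdin).most_common():
        print(f"{n:6d}  {path}  {reason}")
```

The script reads from standard input, so the sidecar container's logs can be piped into it.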
This is the pod-level manifest. You can see from ISTIO_KUBE_APP_PROBERS that our istio remaps these health endpoints (notably on 8080) to /app-health/{container_name}/{startupz,readyz,livez} on 15020.
Better view
{'/app-health/scylla-manager-agent/readyz': {'tcpSocket': {'port': 10001},
                                             'timeoutSeconds': 1},
 '/app-health/scylla/livez': {'httpGet': {'path': '/healthz',
                                          'port': 8080,
                                          'scheme': 'HTTP'},
                              'timeoutSeconds': 10},
 '/app-health/scylla/readyz': {'httpGet': {'path': '/readyz',
                                           'port': 8080,
                                           'scheme': 'HTTP'},
                               'timeoutSeconds': 30},
 '/app-health/scylla/startupz': {'httpGet': {'path': '/healthz',
                                             'port': 8080,
                                             'scheme': 'HTTP'},
                                 'timeoutSeconds': 30},
 '/app-health/scylladb-api-status-probe/readyz': {'tcpSocket': {'port': 8080},
                                                  'timeoutSeconds': 30},
 '/app-health/scylladb-ignition/readyz': {'httpGet': {'path': '/readyz',
                                                      'port': 42081,
                                                      'scheme': 'HTTP'},
                                          'timeoutSeconds': 30}}
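A view like the one above can be produced by parsing the ISTIO_KUBE_APP_PROBERS value as JSON and pretty-printing it. This is a minimal sketch, assuming the value is the JSON string found on the sidecar container in the pod manifest. The helper names are hypothetical, and reading the value from the local environment is only a convenience for pasting it in.

```python
import json
import os
from pprint import pprint


def load_probers(raw: str) -> dict:
    """Parse the JSON value of ISTIO_KUBE_APP_PROBERS into a dict."""
    return json.loads(raw)


def probes_on_port(probers: dict, port: int) -> list:
    """Return the rewritten probe paths whose original target is `port`."""
    hits = []
    for path, spec in probers.items():
        # Each entry carries either an httpGet or a tcpSocket handler.
        handler = spec.get("httpGet") or spec.get("tcpSocket") or {}
        if handler.get("port") == port:
            hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # Assumption: the value was copied from the pod manifest into this
    # environment variable before running the script.
    raw = os.environ.get("ISTIO_KUBE_APP_PROBERS", "{}")
    probers = load_probers(raw)
    pprint(probers)
    print("probes targeting 8080:", probes_on_port(probers, 8080))
```

Listing the probes by target port makes it easy to see how many of them share 8080 in this pod.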
We figured it out: we had opened too many CQL connections from the client side, ~3000 to a 3-node Scylla cluster. After packing and trimming them, the cluster became stable to write to.
The excessive connections were a strain not only on the cluster but, more importantly, on our istio service mesh.
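One way to check how many client connections a node is actually holding is to count established TCP connections on the CQL port from inside the pod. This is a rough sketch for Linux that parses the text of /proc/net/tcp. The port 9042 is an assumption (the usual CQL default, not stated in this thread), and the function name is made up for illustration.

```python
from pathlib import Path

ESTABLISHED = "01"  # TCP state code for ESTABLISHED in /proc/net/tcp
CQL_PORT = 9042     # assumption: default CQL port, adjust to your setup


def count_established(proc_net_tcp: str, local_port: int) -> int:
    """Count ESTABLISHED connections whose local port is `local_port`."""
    total = 0
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        local, state = fields[1], fields[3]
        # Addresses look like HEXIP:HEXPORT, so the port is parsed as base 16.
        port = int(local.rsplit(":", 1)[1], 16)
        if state == ESTABLISHED and port == local_port:
            total += 1
    return total


if __name__ == "__main__":
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if path.exists():
            print(name, count_established(path.read_text(), CQL_PORT))
```

With ~3000 connections spread over 3 nodes, a count in the neighborhood of 1000 per node would be consistent with what we saw before trimming.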