Segmentation fault on shard x

Hello there!
We have a problem with our Scylla cluster. Could you help me understand what exactly is happening? Logs are at the end.

Some context:
For some time (maybe even from the very beginning, but it didn't happen as often), we have been experiencing random restarts of our Scylla pods. There is usually a warning about an oversized allocation beforehand, but that's "normal" since it happens all the time. We have no way to see which exact query caused the segmentation fault, but even if we found out, it still shouldn't happen, right? Is this a known problem? Can you deduce what went wrong from the logs?
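(Side note: if identifying the query would help, my rough idea is to enable probabilistic CQL tracing for a short while and look at system_traces around the time of a crash. This is just a sketch on my side, the probability value is a guess:)

nodetool settraceprobability 0.001   # trace roughly 0.1% of requests
# ... wait for the next restart, then on a surviving node:
cqlsh -e "SELECT session_id, started_at, parameters FROM system_traces.sessions LIMIT 20;"
nodetool settraceprobability 0       # turn tracing off again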

Setup:
3-node Scylla cluster in Azure Kubernetes Service (3x Standard_D8as_v5 with 8 vCPUs and 32 GiB RAM)
Scylla server version: 5.1.18
AKS Node image: AKSUbuntu-2204gen2containerd-202312.06.0
#################
/etc/scylla.d/io_properties.yaml

disks:
  - mountpoint: /var/lib/scylla
    read_iops: 9189
    read_bandwidth: 206683600
    write_iops: 3579
    write_bandwidth: 172520640

#################
/etc/scylla/scylla.yaml

api_address: 127.0.0.1
api_doc_dir: /opt/scylladb/api/api-doc/
api_port: 10000
api_ui_dir: /opt/scylladb/swagger-ui/dist/
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
batch_size_fail_threshold_in_kb: 1024
batch_size_warn_threshold_in_kb: 128
cas_contention_timeout_in_ms: 1000
cluster_name: default
commitlog_segment_size_in_mb: 32
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_total_space_in_mb: -1
endpoint_snitch: GossipingPropertyFileSnitch
force_schema_commit_log: true
listen_address: localhost
murmur3_partitioner_ignore_msb_bits: 12
native_shard_aware_transport_port: 19042
native_transport_port: 9042
num_tokens: 256
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
read_request_timeout_in_ms: 5000
rpc_address: 0.0.0.0
rpc_port: 9160
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
  parameters:
  - seeds: 127.0.0.1
write_request_timeout_in_ms: 2000

#################
ulimit -a

root@default-scylla-db-0-0:/# ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 128340
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Logs:
Scylla build_id: 724b167bb6412a9242e392d39188abf36a8c259e

WARN  2024-01-17 09:28:08,076 [shard 6] seastar_memory - oversized allocation: 15405056 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x50f4ffe 0x50f54f0 0x50f57f8 0x4d2a7ea 0x4d2bf1f 0x4d2efa1 0x25ac028 0x25a8fbc 0x25c2450 0x4d68f24 0x4d6a307 0x4d89355 0x4d3ccca /opt/scylladb/libreloc/libpthread.so.0+0x92a4 /opt/scylladb/libreloc/libc.so.6+0x100322
seastar::continuation<seastar::internal::promise_base_with_type<void>, service::abstract_read_executor::reconcile(db::consistency_level, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, seastar::lw_shared_ptr<query::read_command>)::{lambda(seastar::future<boost::outcome_v2::basic_result<void, utils::exception_container<exceptions::mutation_write_timeout_exception, exceptions::read_timeout_exception, exceptions::read_failure_exception, exceptions::rate_limit_exception>, utils::exception_container_throw_policy> >)#1}, seastar::future<boost::outcome_v2::basic_result<void, utils::exception_container<exceptions::mutation_write_timeout_exception, exceptions::read_timeout_exception, exceptions::read_failure_exception, exceptions::rate_limit_exception>, utils::exception_container_throw_policy> >::then_wrapped_nrvo<seastar::future<void>, service::abstract_read_executor::reconcile(db::consistency_level, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, seastar::lw_shared_ptr<query::read_command>)::{lambda(seastar::future<boost::outcome_v2::basic_result<void, utils::exception_container<exceptions::mutation_write_timeout_exception, exceptions::read_timeout_exception, exceptions::read_failure_exception, exceptions::rate_limit_exception>, utils::exception_container_throw_policy> >)#1}>(service::abstract_read_executor::reconcile(db::consistency_level, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, seastar::lw_shared_ptr<query::read_command>)::{lambda(seastar::future<boost::outcome_v2::basic_result<void, utils::exception_container<exceptions::mutation_write_timeout_exception, exceptions::read_timeout_exception, exceptions::read_failure_exception, exceptions::rate_limit_exception>, utils::exception_container_throw_policy> >)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, service::abstract_read_executor::reconcile(db::consistency_level, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, seastar::lw_shared_ptr<query::read_command>)::{lambda(seastar::future<boost::outcome_v2::basic_result<void, utils::exception_container<exceptions::mutation_write_timeout_exception, exceptions::read_timeout_exception, exceptions::read_failure_exception, exceptions::rate_limit_exception>, utils::exception_container_throw_policy> >)#1}&, seastar::future_state<boost::outcome_v2::basic_result<void, utils::exception_container<exceptions::mutation_write_timeout_exception, exceptions::read_timeout_exception, exceptions::read_failure_exception, exceptions::rate_limit_exception>, utils::exception_container_throw_policy> >&&)#1}, boost::outcome_v2::basic_result<void, utils::exception_container<exceptions::mutation_write_timeout_exception, exceptions::read_timeout_exception, exceptions::read_failure_exception, exceptions::rate_limit_exception>, utils::exception_container_throw_policy> >
Segmentation fault on shard 6.
Backtrace:
  0x4d59798
  0x4d88e16
  0x4673cb1
  0x7fd94c346a1f
  0x50f5595
  0x50f57f8
  0x4d2a7ea
  0x4d2bf1f
  0x4d2efa1
  0x2675c56
  0x4d68f24
  0x4d6a307
  0x4d89355
  0x4d3ccca
  /opt/scylladb/libreloc/libpthread.so.0+0x92a4
  /opt/scylladb/libreloc/libc.so.6+0x100322
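
(For what it's worth, I assume the raw frames above can be resolved against debug symbols from the exact same build id (724b167bb6412a9242e392d39188abf36a8c259e), roughly like below; the debuginfo path is a guess on my side:)

# on a host/container with the matching scylla debug symbols installed
addr2line -Cfpie /usr/lib/debug/opt/scylladb/libexec/scylla-5.1.18*.debug \
  0x4d59798 0x4d88e16 0x4673cb1 0x50f5595 0x50f57f8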

Opened a GitHub issue: Segmentation fault on shard x · Issue #16841 · scylladb/scylladb
I guess this post can be taken down if needed
