Segmentation fault on shard

Hi, I am facing a problem and trying to find a solution.

I have set up a 3-node Scylla cluster in GCP (machine type: n1-highmem-8, 8 vCPUs / 52 GB RAM; scylla-server version: 4.5.1-0.20211024.4c0eac049; OS: Linux 8-gcp #24~20.04.1-Ubuntu SMP Mon Sep 12 06:14:01 UTC 2022 x86_64 GNU/Linux). We have a keyspace with the following settings:

mykeyspace | True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2'}

The Scylla instance sometimes restarts, and we see this message in the logs:

Jul 25 18:14:16 db-scylla1 scylla[2190565]: Segmentation fault on shard 4.
Jul 25 18:14:16 db-scylla1 scylla[2190565]: Backtrace:
Jul 25 18:14:16 db-scylla1 scylla[2190565]: 0x3f75e08
Jul 25 18:14:16 db-scylla1 scylla[2190565]: 0x3fa7ee6
Jul 25 18:14:16 db-scylla1 scylla[2190565]: 0x7f63611c81df
Jul 25 18:14:16 db-scylla1 scylla[2190565]: 0x1c6f863
Jul 25 18:14:16 db-scylla1 scylla[2190565]: 0x217a7e2
Jul 25 18:14:16 db-scylla1 scylla[2190565]: 0x3f88c4f
Jul 25 18:14:16 db-scylla1 scylla[2190565]: 0x3f89e37
Jul 25 18:14:16 db-scylla1 scylla[2190565]: 0x3fa8488
Jul 25 18:14:16 db-scylla1 scylla[2190565]: 0x3f5438a
Jul 25 18:14:16 db-scylla1 scylla[2190565]: /opt/scylladb/libreloc/libpthread.so.0+0x93f8
Jul 25 18:14:16 db-scylla1 scylla[2190565]: /opt/scylladb/libreloc/libc.so.6+0x101902
Jul 25 18:19:34 db-scylla1 scylla[2212677]: Scylla version 4.5.1-0.20211024.4c0eac049 with build-id e0df888020bbfef43aee10264ff14c85581609f7 starting …
Jul 25 18:19:34 db-scylla1 scylla[2212677]: command used: "/usr/bin/scylla --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --reserve-memory 4G --io-properties-file=>
Jul 25 18:19:34 db-scylla1 scylla[2212677]: parsed command line options: [log-to-syslog: 1, log-to-stdout: 0, default-log-level: info, network-stack: posix, reserve-memory: 4G, io-properties-file: /e>

/etc/scylla.d/io_properties.yaml

  • mountpoint: /var/lib/scylla
    read_iops: 15004
    read_bandwidth: 252140208
    write_iops: 17435
    write_bandwidth: 1417786752
    ########################################

/etc/scylla.d/cpuset.conf

CPUSET="--cpuset 1-7 "

########################################

/etc/scylla.d/memory.conf

MEM_CONF="--lock-memory=1"
########################################
/etc/scylla/scylla.yaml

cluster_name: 'scylla'
num_tokens: 256
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32
seed_provider:
# Addresses of hosts that are deemed contact points.
# Scylla nodes use this list of hosts to find each other and learn
# the topology of the ring. You must change this if you are running
# multiple nodes!
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
# seeds is actually a comma-delimited list of addresses.
# Ex: ",,"
- seeds: "1.1.1.1,2.2.2.2,3.3.3.3"
listen_address: 1.1.1.1
native_transport_port: 9042
native_shard_aware_transport_port: 19042
read_request_timeout_in_ms: 5000
write_request_timeout_in_ms: 2000
cas_contention_timeout_in_ms: 1000
endpoint_snitch: GossipingPropertyFileSnitch
rpc_address: 1.1.1.1
rpc_port: 9160
api_port: 10000
api_address: 127.0.0.1
batch_size_warn_threshold_in_kb: 5
batch_size_fail_threshold_in_kb: 50

partitioner: org.apache.cassandra.dht.Murmur3Partitioner
commitlog_total_space_in_mb: -1
murmur3_partitioner_ignore_msb_bits: 12
api_ui_dir: /opt/scylladb/swagger-ui/dist/
api_doc_dir: /opt/scylladb/api/api-doc/

###########################################
This is the output of ulimit -a:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 208819
max locked memory (kbytes, -l) 65536
max memory size (kbytes, -m) unlimited
open files (-n) 1000000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 208819
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
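One thing worth noting in the output above: core file size is 0, so a segfault leaves no core dump behind to analyze. A sketch of raising the limit (the systemd override is an assumption about how the service is run):

```shell
# Core dumps are currently disabled ("core file size ... 0" above).
# Raise the soft limit for the current shell:
ulimit -c unlimited
ulimit -c    # now prints: unlimited

# For a systemd-managed scylla-server, the equivalent would be a unit
# override such as LimitCORE=infinity (assumption: systemd install),
# after which coredumpctl can list and extract any future dumps.
```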

###########################################
/etc/sysctl.d/11-sysctl.conf

net.ipv4.ip_forward=1
net.ipv6.conf.all.forwarding=1
net.ipv4.conf.all.accept_redirects=1
fs.file-max=2097152
net.ipv4.netfilter.ip_conntrack_max=4000000
net.netfilter.nf_conntrack_max=4000000
net.ipv4.netfilter.ip_conntrack_max=4000000
net.ipv4.tcp_window_scaling=0
net.ipv4.tcp_max_tw_buckets=10000
net.ipv4.tcp_max_syn_backlog=2048
net.core.somaxconn=128
net.core.netdev_max_backlog=1000
net.ipv4.tcp_keepalive_time=60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait=5
net.ipv4.tcp_fin_timeout=10
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_keepalive_intvl=15
net.ipv4.tcp_keepalive_probes=5
net.ipv4.tcp_sack=0
net.ipv4.tcp_timestamps=0
net.core.rmem_max=524287
net.core.wmem_max=524287
net.core.rmem_default=524287
net.core.wmem_default=524287
net.core.optmem_max=524287

###########################################

This is the command we used to set up Scylla:

scylla_setup --disks /dev/sdb --nic ens4 --io-setup 1 --no-version-check --no-rsyslog-setup

Hello @harman,

In the future, please report such issues on our GitHub so they can be tracked better: see Issues · scylladb/scylladb · GitHub

I decoded the backtrace, and this seems like a known issue: LWT update with empty clustering key range causes a crash · Issue #13129 · scylladb/scylladb · GitHub
that has been fixed, but the fix has not been backported to official releases yet.

When it is backported, please upgrade to the latest available release, as newer releases contain many bug fixes over the version you are currently using. Please note that 4.5.1 has aged out and is no longer supported, so the fix is bound to be backported to a newer release.

With that said, the root cause seems to be an invalid LWT query that results in an empty clustering key range condition. The fix merely turns the crash into a visible error, which is much better, but you should still locate the problematic query and fix it, if that is indeed the case.
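For anyone searching for this later, the problematic shape, sketched from the linked issue's title (table and column names here are invented, and the exact syntax Scylla accepted may differ), is an LWT update whose clustering restrictions can never match, i.e. an empty clustering range:

```sql
-- Hypothetical illustration: mytable, pk, ck, and val are placeholder names.
-- The two ck bounds contradict each other, so the clustering range is
-- empty; combined with IF (LWT), this is the pattern the issue describes.
UPDATE mykeyspace.mytable
SET val = 1
WHERE pk = 0 AND ck > 5 AND ck < 5
IF val = 0;
```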


The issue was due to a corrupt SSTable file. Deleting that SSTable's files seems to have resolved the issue. But would it be worth handling …
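As an aside for future readers: rather than deleting SSTable files by hand, a less destructive first step is usually nodetool scrub, which rewrites a table's SSTables and skips unreadable data. A sketch (keyspace and table names below are placeholders):

```shell
# Placeholder names: replace mykeyspace/mytable with the affected table.
# Scrub rewrites the table's SSTables, skipping over corrupt data.
nodetool scrub mykeyspace mytable

# If files were deleted instead, running a repair afterwards brings the
# node back in sync with its replicas:
nodetool repair -pr mykeyspace
```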