Installation details
ScyllaDB version: 6.0.4
Cluster size: 39 i3en.6xlarge EC2 instances
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 22.04, customized from the scylla-6.0.4-x86_64 AMI builder
We upgraded to 6.0.4 a few weeks ago. Since then, our largest cluster has had restart issues: a single node crashes and restarts, then large chunks of the cluster restart together. 10+ nodes fail and restart, then another 10 or so, then 1 or 2 more to finish off, after which we are fine for another few days.
The logs don’t give us much to go on. There are plenty of bad_alloc errors around the failure, but this error seems fairly common in general. Today the first node that failed logged the following errors, then immediately crashed (core dump) and restarted.
[shard 0: gms] storage_proxy - exception during mutation write to 10.123.101.94: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:peers (pk{00040a7b658f}) to commitlog): std::bad_alloc (std::bad_alloc)
[shard 0: gms] gossip - Gossip change listener failed: exceptions::mutation_write_failure_exception (Operation failed for system.peers - received 0 responses and 1 failures from 1 CL=ONE.), at: 0x64844be 0x6484ad0 0x6484db8 0x5f3cc6e 0x5f3ce27 0x3e88a0d 0x145345a 0x5f7d16f 0x5f7e457 0x5f7d7b8 0x5f0b767 0x5f0a92c 0x13e2da8 0x13e47f0 0x13e1379 /opt/scylladb/libreloc/libc.so.6+0x27b89 /opt/scylladb/libreloc/libc.so.6+0x27c4a 0x13de764
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>&, unsigned long, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>&, unsigned long, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) 
const::{lambda()#1}&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, false> >(gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(auto:1&&) const::{lambda()#1}, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<auto:1, auto:3>&, unsigned long, auto:2&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(auto:1)::{lambda()#1}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
Aborting on shard 0.
Backtrace:
0x5f6b538
0x5fa1dc1
/opt/scylladb/libreloc/libc.so.6+0x3dbaf
/opt/scylladb/libreloc/libc.so.6+0x8e883
/opt/scylladb/libreloc/libc.so.6+0x3dafd
/opt/scylladb/libreloc/libc.so.6+0x2687e
0x5f3ce2c
0x3e88a0d
0x145345a
0x5f7d16f
0x5f7e457
0x5f7d7b8
0x5f0b767
0x5f0a92c
0x13e2da8
0x13e47f0
0x13e1379
/opt/scylladb/libreloc/libc.so.6+0x27b89
/opt/scylladb/libreloc/libc.so.6+0x27c4a
0x13de764
Decoded backtrace:
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:68
(inlined by) seastar::backtrace_buffer::append_backtrace() at ./build/release/seastar/./seastar/src/core/reactor.cc:825
(inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:855
seastar::print_with_backtrace(char const*, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:867
(inlined by) seastar::sigabrt_action() at ./build/release/seastar/./seastar/src/core/reactor.cc:4071
(inlined by) seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at ./build/release/seastar/./seastar/src/core/reactor.cc:4047
(inlined by) seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::__invoke(int, siginfo_t*, void*) at ./build/release/seastar/./seastar/src/core/reactor.cc:4043
__GI___sigaction at :?
__pthread_kill_implementation at ??:?
__GI_raise at :?
__GI_abort at :?
seastar::on_fatal_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char> >) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:81
gms::gossiper::apply_new_states(gms::inet_address, gms::endpoint_state, gms::endpoint_state const&, utils::tagged_uuid<gms::permit_id_tag>) at ./gms/gossiper.cc:1925
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/coroutine:240
(inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:125
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2690
(inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3152
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3320
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3210
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:276
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:167
scylla_main(int, char**) at ./main.cc:700
std::function<int (int, char**)>::operator()(int, char**) const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:591
main at ./main.cc:2211
__libc_start_call_main at ??:?
__libc_start_main_alias_2 at :?
_start at ??:?
These crashes mean the cluster cannot be read from or written to for around 30 minutes until it recovers.
We’re planning to upgrade to 6.1, but I couldn’t find anything related in the issue tracker or release notes, so I’m not sure whether the upgrade will resolve this specific issue. Has this been fixed in some way, or have Raft/gossip improved enough that the upgrade should help?
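In case it helps with triage, here is a rough sketch of how the bad_alloc lines could be tallied per shard and component, and how the "Aborting on shard N." lines could be picked out, to see whether allocation failures cluster right before each abort. It assumes only the "[shard N: group] component - message" prefix seen in the excerpts above; the function names and the sample lines are made up for illustration, and real journal output may carry extra timestamp or host prefixes that this does not parse.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpts in this post:
#   [shard 0: gms] storage_proxy - <message>
LOG_RE = re.compile(r"\[shard\s+(\d+):\s*([^\]]*)\]\s+(\S+)\s+-\s+(.*)")
ABORT_RE = re.compile(r"Aborting on shard (\d+)\.")


def tally_bad_alloc(lines):
    """Count log lines mentioning std::bad_alloc, keyed by (shard, component)."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if m and "std::bad_alloc" in m.group(4):
            counts[(int(m.group(1)), m.group(3))] += 1
    return counts


def find_aborts(lines):
    """Return shard numbers from 'Aborting on shard N.' lines, in log order."""
    shards = []
    for line in lines:
        m = ABORT_RE.search(line)
        if m:
            shards.append(int(m.group(1)))
    return shards


if __name__ == "__main__":
    # Shortened, illustrative lines modelled on the log excerpt above.
    sample = [
        "[shard 0: gms] storage_proxy - exception during mutation write: std::bad_alloc (std::bad_alloc)",
        "[shard 0: gms] gossip - Gossip change listener failed: exceptions::mutation_write_failure_exception",
        "Aborting on shard 0.",
    ]
    print(tally_bad_alloc(sample))
    print(find_aborts(sample))
```

Running the same functions over the logs from every node that restarted in a wave would show whether each abort follows a burst of bad_alloc on the same shard, or whether the later nodes fail for a different reason.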