6.0.4 Major Crashes Due to Memory/Gossip Failures

GarrettPoore · February 13, 2025, 4:08am

Installation details
ScyllaDB version: 6.0.4
Cluster size: 39 i3en.6xlarge EC2 instances
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 22.04, customized from the scylla-6.0.4-x86_64 AMI builder

We upgraded to 6.0.4 a few weeks ago. Since then, our largest cluster has had restart issues: a single node crashes and restarts, then large chunks of the cluster restart together. 10+ nodes fail and restart, then another 10 or so restart, then 1 or 2 more to finish off, and then we are fine for another few days.

Checking the logs doesn’t tell us much. There are plenty of bad_alloc errors around the failure, though this error seems fairly common. Today the first node that failed logged the error below, then immediately crashed (core dump) and restarted.

 [shard  0: gms] storage_proxy - exception during mutation write to 10.123.101.94: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:peers (pk{00040a7b658f}) to commitlog): std::bad_alloc (std::bad_alloc)
 [shard  0: gms] gossip - Gossip change listener failed: exceptions::mutation_write_failure_exception (Operation failed for system.peers - received 0 responses and 1 failures from 1 CL=ONE.), at: 0x64844be 0x6484ad0 0x6484db8 0x5f3cc6e 0x5f3ce27 0x3e88a0d 0x145345a 0x5f7d16f 0x5f7e457 0x5f7d7b8 0x5f0b767 0x5f0a92c 0x13e2da8 0x13e47f0 0x13e1379 /opt/scylladb/libreloc/libc.so.6+0x27b89 /opt/scylladb/libreloc/libc.so.6+0x27c4a 0x13de764
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>&, unsigned long, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>&, unsigned long, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) 
const::{lambda()#1}&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, false> >(gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(auto:1&&) const::{lambda()#1}, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<auto:1, auto:3>&, unsigned long, auto:2&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(auto:1)::{lambda()#1}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
Aborting on shard 0.
Backtrace:
  0x5f6b538
  0x5fa1dc1
  /opt/scylladb/libreloc/libc.so.6+0x3dbaf
  /opt/scylladb/libreloc/libc.so.6+0x8e883
  /opt/scylladb/libreloc/libc.so.6+0x3dafd
  /opt/scylladb/libreloc/libc.so.6+0x2687e
  0x5f3ce2c
  0x3e88a0d
  0x145345a
  0x5f7d16f
  0x5f7e457
  0x5f7d7b8
  0x5f0b767
  0x5f0a92c
  0x13e2da8
  0x13e47f0
  0x13e1379
  /opt/scylladb/libreloc/libc.so.6+0x27b89
  /opt/scylladb/libreloc/libc.so.6+0x27c4a
  0x13de764

Decoded backtrace:

void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:68
 (inlined by) seastar::backtrace_buffer::append_backtrace() at ./build/release/seastar/./seastar/src/core/reactor.cc:825
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:855
seastar::print_with_backtrace(char const*, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:867
 (inlined by) seastar::sigabrt_action() at ./build/release/seastar/./seastar/src/core/reactor.cc:4071
 (inlined by) seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at ./build/release/seastar/./seastar/src/core/reactor.cc:4047
 (inlined by) seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::__invoke(int, siginfo_t*, void*) at ./build/release/seastar/./seastar/src/core/reactor.cc:4043
__GI___sigaction at :?
__pthread_kill_implementation at ??:?
__GI_raise at :?
__GI_abort at :?
seastar::on_fatal_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char> >) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:81
gms::gossiper::apply_new_states(gms::inet_address, gms::endpoint_state, gms::endpoint_state const&, utils::tagged_uuid<gms::permit_id_tag>) at ./gms/gossiper.cc:1925
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/coroutine:240
 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:125
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2690
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3152
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3320
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3210
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:276
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:167
scylla_main(int, char**) at ./main.cc:700
std::function<int (int, char**)>::operator()(int, char**) const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:591
main at ./main.cc:2211
__libc_start_call_main at ??:?
__libc_start_main_alias_2 at :?
_start at ??:?
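The raw backtraces in these logs are a "Backtrace:" line followed by one frame per line, either a bare hex address or a library path plus offset. Below is a minimal Python sketch, not an official tool, for pulling those frames out of a log excerpt so they can be pasted into a decoder. The function name `extract_backtraces` and the sample text are illustrative assumptions; the frame formats are the ones visible in the logs above.

```python
import re

# A frame is either a bare hex address (0x5f6b538) or a
# library-plus-offset entry (/opt/scylladb/libreloc/libc.so.6+0x3dbaf).
FRAME_RE = re.compile(r"^\s*(0x[0-9a-fA-F]+|/\S+\+0x[0-9a-fA-F]+)\s*$")


def extract_backtraces(log_text):
    """Return a list of backtraces; each backtrace is a list of frame strings."""
    backtraces = []
    current = None
    for line in log_text.splitlines():
        if line.strip() == "Backtrace:":
            # Start collecting a new block of frames.
            current = []
            backtraces.append(current)
            continue
        if current is not None:
            match = FRAME_RE.match(line)
            if match:
                current.append(match.group(1))
            else:
                # The first non-frame line ends the current block.
                current = None
    # Drop blocks that had no raw frames (e.g. already-decoded output).
    return [bt for bt in backtraces if bt]


# Illustrative excerpt, shortened from the log format shown above.
sample = """Aborting on shard 0.
Backtrace:
  0x5f6b538
  0x5fa1dc1
  /opt/scylladb/libreloc/libc.so.6+0x3dbaf
  0x13de764

Decoded backtrace:
"""

for frames in extract_backtraces(sample):
    print(" ".join(frames))
```

Joining the frames with spaces gives a single line per crash, which makes it easy to compare crashes across nodes: the gossip crashes in this thread share the same frame sequence.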

These crashes mean the cluster cannot be read from or written to for around 30 minutes until it recovers.

We’re looking to upgrade to 6.1, but I couldn’t find anything related in the issues or release notes, so I’m not sure whether this specific issue will be resolved. I’m curious whether it was fixed in some way, or whether Raft/Gossip has improved enough that the upgrade should help.

GarrettPoore · February 16, 2025, 4:19am

Another day and more failures. This time it was during compactions. Initial failure:

terminate called after throwing an instance of 'std::bad_alloc'
 [shard 10:comp] compaction - [Compact uat_objectstore.bcbsmn_object 73771ba0-ea00-11ef-945f-d9fef4b628fd] Compacting [/var/lib/scylla/data/uat_objectsto>
  what():  std::bad_alloc
Aborting on shard 19.
Backtrace:
  0x5f6b538
  0x5fa1dc1
  /opt/scylladb/libreloc/libc.so.6+0x3dbaf
  /opt/scylladb/libreloc/libc.so.6+0x8e883
  /opt/scylladb/libreloc/libc.so.6+0x3dafd
  /opt/scylladb/libreloc/libc.so.6+0x2687e
  /opt/scylladb/libreloc/libstdc++.so.6+0xa4d38
  /opt/scylladb/libreloc/libstdc++.so.6+0xb4f6b
  /opt/scylladb/libreloc/libstdc++.so.6+0xb4fd6
  0x13de83a
  0x4a5a991
  0x276f6d2
  0x28047d2
  0x145345a
  0x5f7d16f
  0x5f7e457
  0x5fa23c0
  0x5f3d87a
  /opt/scylladb/libreloc/libc.so.6+0x8c946
  /opt/scylladb/libreloc/libc.so.6+0x11296f

Decoded:

Backtrace:
[Backtrace #0]
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:68
 (inlined by) seastar::backtrace_buffer::append_backtrace() at ./build/release/seastar/./seastar/src/core/reactor.cc:825
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:855
seastar::print_with_backtrace(char const*, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:867
 (inlined by) seastar::sigabrt_action() at ./build/release/seastar/./seastar/src/core/reactor.cc:4071
 (inlined by) seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at ./build/release/seastar/./seastar/src/core/reactor.cc:4047
 (inlined by) seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::__invoke(int, siginfo_t*, void*) at ./build/release/seastar/./seastar/src/core/reactor.cc:4043
/data/scylla-s3-reloc.cache/by-build-id/84cd4ae97ea9d9820a56ede82cc51ff18e019400/extracted/scylla/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=9148cab1b932d44ef70e306e9c02ee38d06cad51, for GNU/Linux 3.2.0, not stripped

__GI___sigaction at :?
__pthread_kill_implementation at ??:?
__GI_raise at :?
__GI_abort at :?
/data/scylla-s3-reloc.cache/by-build-id/84cd4ae97ea9d9820a56ede82cc51ff18e019400/extracted/scylla/libreloc/libstdc++.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, BuildID[sha1]=f47a734d459c3340632cd52c18936fbde51abba4, stripped

__cxa_throw_bad_array_new_length at ??:?
std::rethrow_exception(std::__exception_ptr::exception_ptr) at ??:?
std::terminate() at ??:?
__clang_call_terminate at main.cc:?
seastar::shared_future<>::shared_state::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >) at ././seastar/include/seastar/core/shared_future.hh:190
 (inlined by) seastar::shared_future<>::get_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >) const at ././seastar/include/seastar/core/shared_future.hh:270
 (inlined by) seastar::shared_promise<>::get_shared_future(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >) const at ././seastar/include/seastar/core/shared_future.hh:342
 (inlined by) tasks::task_manager::task::done() const at ./tasks/task_manager.cc:344
seastar::future<std::optional<sstables::compaction_stats> > compaction_manager::perform_compaction<compaction::regular_compaction_task_executor, compaction::table_state&>(seastar::bool_class<compaction::throw_if_stopping_tag>, std::optional<tasks::task_info>, compaction::table_state&) at ./compaction/compaction_manager.cc:589
compaction_manager::submit(compaction::table_state&) at ./compaction/compaction_manager.cc:1294
 (inlined by) compaction_manager::postponed_compactions_reevaluation() at ./compaction/compaction_manager.cc:1016
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/coroutine:240
 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:125
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2690
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3152
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3320
seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0::operator()() const at ./build/release/seastar/./seastar/src/core/reactor.cc:4563
 (inlined by) void std::__invoke_impl<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(std::__invoke_other, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:61
 (inlined by) std::enable_if<is_invocable_r_v<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>, void>::type std::__invoke_r<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:111
 (inlined by) std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:290
std::function<void ()>::operator()() const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:591
 (inlined by) seastar::posix_thread::start_routine(void*) at ./build/release/seastar/./seastar/src/core/posix.cc:90
start_thread at ??:?
__clone3 at :?

Secondary error, right when the first node is coming back up and there are log messages about it rejoining the token ring:

[shard  0: gms] gossip - Node 10.123.104.78 has restarted, now UP, status = NORMAL
[shard  0: gms] storage_service - Node 10.123.104.78 is in normal state, tokens: {...}
[shard  0: gms] storage_service - handle_state_normal: node 10.123.104.78/a9b9fa2b-2983-4a4b-be42-99c035f51208 was already a normal token owner
[shard  0: gms] storage_service - Host ID a9b9fa2b-2983-4a4b-be42-99c035f51208 continues to be owned by 10.123.104.78
{... listing all of the token ranges ...}
[shard  0: gms] rpc - client 10.123.104.78:7001: client connection dropped: sendmsg: Broken pipe
[shard  0: gms] gossip - Fail to send EchoMessage to 10.123.104.78: seastar::rpc::closed_error (connection is closed)
[shard  0: gms] gossip - InetAddress a9b9fa2b-2983-4a4b-be42-99c035f51208/10.123.104.78 is now UP, status = NORMAL
[shard  0: gms] storage_service - Node 10.123.104.78 is in normal state, tokens: {...}
[shard  0: gms] storage_service - handle_state_normal: node 10.123.104.78/a9b9fa2b-2983-4a4b-be42-99c035f51208 was already a normal token owner
[shard  0: gms] storage_service - Host ID a9b9fa2b-2983-4a4b-be42-99c035f51208 continues to be owned by 10.123.104.78
{... listing all of the token ranges ...}
Aborting on shard 0.
Backtrace:
  0x5f6b538
  0x5fa1dc1
  /opt/scylladb/libreloc/libc.so.6+0x3dbaf
  /opt/scylladb/libreloc/libc.so.6+0x8e883
  /opt/scylladb/libreloc/libc.so.6+0x3dafd
  /opt/scylladb/libreloc/libc.so.6+0x2687e
  0x5f3ce2c
  0x3e88a0d
  0x145345a
  0x5f7d16f
  0x5f7e457
  0x5f7d7b8
  0x5f0b767
  0x5f0a92c
  0x13e2da8
  0x13e47f0
  0x13e1379
  /opt/scylladb/libreloc/libc.so.6+0x27b89
  /opt/scylladb/libreloc/libc.so.6+0x27c4a
  0x13de764
[shard  0: gms] gossip - Gossip change listener failed: std::bad_alloc (std::bad_alloc), at: 0x64844be 0x6484ad0 0x6484db8 0x5f3cc6e 0x5f3ce27 0x3e88a0d 0x145345a 0x5f7d16f 0x5f7e457 0x5f7d7b8 0x5f0b767 0x5f0a92c 0x13e2da8 0x13e47f0 0x13e1379 /opt/scylladb/libreloc/libc.so.6+0x27b89 /opt/scylladb/libreloc/libc.so.6+0x27c4a 0x13de764
                                                                                      --------
                                                                                      seastar::internal::coroutine_traits_base<void>::promise_type
                                                                                      --------
                                                                                      seastar::internal::coroutine_traits_base<void>::promise_type
                                                                                      --------
                                                                                      seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>&, unsigned long, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>&, unsigned long, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > 
>)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>)::{lambda()#1}, false> >(gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(gms::inet_address&) const::{lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, gms::gossiper::apply_state_locally(std::map<gms::inet_address, gms::endpoint_state, std::less<gms::inet_address>, std::allocator<std::pair<gms::inet_address const, gms::endpoint_state> > >)::$_0::operator()<gms::inet_address&>(auto:1&&) const::{lambda()#1}, std::chrono::_V2::steady_clock>(seastar::basic_semaphore<auto:1, auto:3>&, unsigned long, auto:2&&)::{lambda(auto:1)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock> >(auto:1)::{lambda()#1}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>

Decoded:

Backtrace:
[Backtrace #0]
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:68
 (inlined by) seastar::backtrace_buffer::append_backtrace() at ./build/release/seastar/./seastar/src/core/reactor.cc:825
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:855
seastar::print_with_backtrace(char const*, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:867
 (inlined by) seastar::sigabrt_action() at ./build/release/seastar/./seastar/src/core/reactor.cc:4071
 (inlined by) seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at ./build/release/seastar/./seastar/src/core/reactor.cc:4047
 (inlined by) seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::__invoke(int, siginfo_t*, void*) at ./build/release/seastar/./seastar/src/core/reactor.cc:4043
/data/scylla-s3-reloc.cache/by-build-id/84cd4ae97ea9d9820a56ede82cc51ff18e019400/extracted/scylla/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=9148cab1b932d44ef70e306e9c02ee38d06cad51, for GNU/Linux 3.2.0, not stripped

__GI___sigaction at :?
__pthread_kill_implementation at ??:?
__GI_raise at :?
__GI_abort at :?
seastar::on_fatal_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char> >) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:81
gms::gossiper::apply_new_states(gms::inet_address, gms::endpoint_state, gms::endpoint_state const&, utils::tagged_uuid<gms::permit_id_tag>) at ./gms/gossiper.cc:1925
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/coroutine:240
 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:125
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2690
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3152
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3320
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3210
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:276
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:167
scylla_main(int, char**) at ./main.cc:700
std::function<int (int, char**)>::operator()(int, char**) const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:591
main at ./main.cc:2211
__libc_start_call_main at ??:?
__libc_start_main_alias_2 at :?
_start at ??:?

horschi · February 27, 2025, 5:07pm

Hi @GarrettPoore ,

we are also having memory issues since 6.0.

Could you check the scylla_memory_regular_dirty_bytes metric? Perhaps you are having the same problem as we have: We noticed that for us scylla_memory_regular_dirty_bytes goes up over time, at some point not going down with flushes any more.
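For anyone who wants to look at this metric outside a dashboard, here is a minimal Python sketch that reads scylla_memory_regular_dirty_bytes per shard from Prometheus-format text. It assumes the text was already fetched from the node's metrics endpoint and that each sample carries a `shard` label; the sample values and the helper name `metric_by_shard` are made up for illustration.

```python
import re

# One sample line in the Prometheus text format: name{labels} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def metric_by_shard(exposition_text, metric_name):
    """Map shard label -> value for one metric in Prometheus text output."""
    result = {}
    for line in exposition_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric_name:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        # "?" is a placeholder when a sample has no shard label (assumption).
        result[labels.get("shard", "?")] = float(match.group("value"))
    return result


# Hypothetical scrape output; the numbers are not from a real node.
scrape = """# TYPE scylla_memory_regular_dirty_bytes gauge
scylla_memory_regular_dirty_bytes{shard="0"} 1048576
scylla_memory_regular_dirty_bytes{shard="1"} 2097152
some_other_metric{shard="0"} 5
"""

per_shard = metric_by_shard(scrape, "scylla_memory_regular_dirty_bytes")
print(per_shard)
print("total dirty bytes:", sum(per_shard.values()))
```

Sampling this periodically, and once more right after a flush, would show whether the value keeps climbing or falls back as expected.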

Disclaimer: I am just a Scylla user, not a dev. But I am interested in understanding allocation problems better.

regards,
Christian

GarrettPoore · February 27, 2025, 8:58pm

@horschi Thanks for the input. I checked scylla_memory_regular_dirty_bytes and don’t really see any large spikes before the restarts. It builds up over an hour or so after restarts then stays within ~20% of where it peaks. Are you suggesting it should drop down fairly often?

We have constant activity on the cluster, so we never really have downtime unless we plan for it (or the crashes).

Botond_Denes · February 28, 2025, 5:00am

If the nodes are under memory pressure, the problems can manifest in many different places. The code causing the crash might be completely innocent: it was just the one that happened to attempt the allocation that failed.

Any time there is memory pressure, the metric to look at is non-LSA memory usage in the “Detailed” dashboard. Almost always, you will see this metric elevated on the sick node. The next step is to find something that correlates with it. Lately we have often found the memory used by bloom filters to be the culprit, but it can be something else as well.
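One rough way to look for "something that correlates" is to export the time series from monitoring and rank candidate metrics by their Pearson correlation with non-LSA memory. The Python sketch below illustrates that idea only; the series names and numbers are hypothetical, and a high correlation is a hint about where to look next, not proof of a cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_candidates(target, candidates):
    """Sort candidate series by absolute correlation with the target."""
    scores = {name: pearson(target, series) for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical samples taken at the same timestamps (arbitrary units).
non_lsa = [10, 12, 15, 19, 24]
candidates = {
    "bloom_filter_memory": [1.0, 1.2, 1.5, 1.9, 2.4],  # assumed series name
    "request_rate": [5, 3, 6, 4, 5],                   # assumed series name
}

for name, score in rank_candidates(non_lsa, candidates):
    print(f"{name}: {score:+.2f}")
```

In this made-up data the bloom filter series tracks non-LSA memory almost exactly, so it would be the first thing to inspect on the affected node.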

GarrettPoore · February 28, 2025, 3:21pm

Not in our case, @Botond_Denes. We do have 1 node that is always higher on non-LSA, but I think it just has more “load” than the others: it reports more in nodetool status than the others and is our oldest node, so I assume that has something to do with it.

Here’s an example of one of our 4 failures this week. There isn’t any change in non-LSA before the crash (even when I isolate that node, it shows no change).

horschi · March 2, 2025, 4:29am

No, I think what you describe sounds like normal behaviour. For us it keeps building up over time, at some point getting permanently stuck at 100%. How does it behave when you manually call nodetool flush? If it’s a properly functioning node, then the metric goes close to 0.

Another question: are you doing a lot of IN queries on larger partitions? It feels like Scylla does not like them very much.

I find your post quite interesting, as your traces look similar to what we are seeing.

Guy · March 3, 2025, 4:36am

@horschi, @GarrettPoore , thanks for reporting this.
Note that version 6.0.x is end-of-life.
Please open an issue for it, with as many details as possible, so that it is easier to investigate.

GarrettPoore · March 3, 2025, 5:29pm

@horschi Hm, I haven’t manually flushed any of these nodes yet, but will give it a try tonight.

As for what queries we use, I’m not sure, I’ll have to check with app developers.

@Guy We found this issue on GitHub last week and have been talking there a bit as well, though we have had trouble analyzing the coredumps on our end. We are unable to send the coredumps due to compliance constraints.

GarrettPoore · March 3, 2025, 5:34pm

We also upgraded this specific cluster to 6.1.5 over the weekend, hoping it helps. Since the above issue is still open, though, we’re not convinced we’re safe yet.

GarrettPoore · March 4, 2025, 5:08pm

@horschi I tried a nodetool flush and that metric does drop very low, so I think we’re seeing different issues. Or at least different ways to get to the same problem.
