[RELEASE] ScyllaDB 2025.3.7

The ScyllaDB team announces ScyllaDB 2025.3.7, a production-ready bug-fix patch release for the ScyllaDB 2025.3 Feature Release.

Note that there is a new Short-Term Support (STS) Feature Release 2025.4. You are welcome to upgrade to it for the latest and greatest features.

More information on ScyllaDB’s Support Policy is available here.

Related Links

Bug Fixes

API

  • API Improvements and Keyspace Operations: The force_keyspace_cleanup_async API handler was missing a check for non-vnode keyspaces. This was fixed as part of a larger effort to unify synchronous and asynchronous compaction and upgrade APIs, enhancing the robustness and consistency of keyspace maintenance and administrative tasks.
    scylladb#26715, scylladb#26886

Commitlog

  • Commitlog Race Condition: A race condition could occur between a large memory allocation for the commitlog and the termination of a segment. The fix removes this race, improving the durability and reliability of commitlog operations.
    scylladb#27992

Materialized Views

  • Materialized View Inconsistency on Restart: ScyllaDB did not correctly discover staging SSTables on restart, which could cause prolonged base-view inconsistencies after a node restart. The fix ensures the view update generator discovers all staging SSTables at startup.
    scylladb#27956, scylladb#28091

Networking

  • Networking Stack Overflow: The posix_server_socket_impl::accept() method was found to be recursive in a way that could potentially overflow the stack. The fix eliminates this recursive behavior, improving the stability of the networking stack and preventing node failures under high connection load.
    scylladb#28166
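
To illustrate the general pattern behind the accept() fix, here is a minimal Python sketch of turning a self-recursive retry into a loop. It is not the Seastar code: the pending queue, the (usable, conn_id) tuples, and both function names are hypothetical, and the reason a connection is skipped is assumed purely for illustration.

```python
from collections import deque

# Hypothetical model of a listening socket's pending connections.
# Each entry is (usable, conn_id); usable == False stands for a connection
# that has to be skipped before a usable one is returned (an assumption
# made for this sketch, not a description of the real accept() logic).


def accept_recursive(pending):
    """Recursive form: every skipped connection adds one stack frame."""
    usable, conn_id = pending.popleft()
    if usable:
        return conn_id
    return accept_recursive(pending)  # depth grows with each skipped entry


def accept_iterative(pending):
    """Loop form: stack depth stays constant no matter how many are skipped."""
    while True:
        usable, conn_id = pending.popleft()
        if usable:
            return conn_id
```

With a long run of skipped entries, the recursive form is bounded by the available stack, while the loop form does the same work in a single frame.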

Performance

  • Load Sketch Accuracy: The internal load monitoring mechanism, load_sketch, can now be populated with the normalized current load. This provides more accurate monitoring data for auto-scaling and internal load balancing decisions.
    scylladb#28106

  • System Keyspace Query Performance: Fetching the history of group 0 schema versions could stall when a cluster has a large number of tablets due to synchronous unfreeze operations. The fix adjusts the process to use an asynchronous unfreeze, which prevents the query from blocking and improves the responsiveness of schema-related queries.
    scylladb#27872, scylladb#27908

Raft

  • Raft Node IP-to-ID Mapping: In Raft topology clusters, the IP-to-ID mapping of a replacing node was not preserved across restarts. The fix ensures this critical mapping is persisted correctly, which is vital for consistent cluster recovery and operation.
    scylladb#28057, scylladb#28098

  • Topology State Cleanup: A fix was applied to ensure that an inactive node in the Raft topology does not incorrectly retain a “leave request” in its state. This correction ensures the Raft topology accurately reflects the current cluster membership.
    scylladb#27990

  • Excluded Node Generation Fix: The gossiper was fixed to correctly handle excluded nodes by making their generations negative. This improves the accuracy of the cluster state maintained by the gossiper and enhances overall stability.
    scylladb#28098

Repair

  • Repair Failure Reporting: An issue was fixed in the repair process to ensure that an exception is thrown if flushing data fails during the get_flush_time process. This prevents silent failures in repair operations, providing timely and accurate error reporting for improved operational safety.
    scylladb#26794

  • Repair Token Range Format: A repair command could fail because the token range was provided in a mismatched format of (minimum token, maximum token) instead of the required (start, end]. The fix adjusts the repair process to use the correct token range format, allowing full-range repairs to execute without errors.
    scylladb#27220

  • Tombstone Garbage Collection Data Resurrection: A critical bug allowed failed batches during repair and batchlog replay to resurrect data already marked for deletion by tombstone garbage collection. The comprehensive fix prevents the repair_time from being updated if batchlog replay fails and ensures that tablet repair operations fail if any batch was not sent successfully, protecting against data resurrection and improving data integrity.
    scylladb#24415, scylladb#26762
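
The Repair Token Range Format item above contrasts a (minimum token, maximum token) range with the required (start, end] form. As a generic illustration of interval notation only, and not of the repair code itself, the sketch below shows how an open range and a half-open range differ at the upper bound. The MIN_TOKEN and MAX_TOKEN values assume a signed 64-bit token space, and the function names are hypothetical.

```python
# Assumed bounds of a signed 64-bit token space (an assumption for this sketch).
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def in_open_range(token, start, end):
    """Membership in (start, end): both bounds excluded."""
    return start < token < end


def in_half_open_range(token, start, end):
    """Membership in (start, end]: start excluded, end included."""
    return start < token <= end
```

Under these assumptions, MAX_TOKEN falls outside (MIN_TOKEN, MAX_TOKEN) but inside (MIN_TOKEN, MAX_TOKEN], which is why the two notations are not interchangeable when a range is meant to cover the full ring.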

Schema

  • Table Truncation Consistency: The truncate_table_on_all_shards operation was only checking the can_flush condition on a single shard, which could lead to inconsistent truncation state. The fix ensures this operation correctly considers the flush status on all shards before proceeding, guaranteeing a consistent state after truncation.
    scylladb#27639, scylladb#28071
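
One possible reading of the truncation fix, sketched in Python under stated assumptions: each shard has its own can_flush flag, and the decision to flush should come from that shard's flag rather than from a single shard's. The list-of-booleans model and both function names are hypothetical and do not reflect the actual ScyllaDB implementation.

```python
def shards_to_flush_single_check(can_flush_flags):
    """Problematic pattern: shard 0's flag decides for every shard."""
    if can_flush_flags[0]:
        return list(range(len(can_flush_flags)))
    return []


def shards_to_flush_per_shard_check(can_flush_flags):
    """Each shard is flushed only if its own flag allows it."""
    return [shard for shard, ok in enumerate(can_flush_flags) if ok]
```

For flags such as [True, False, True], the single-check version selects all three shards, while the per-shard version selects only shards 0 and 2.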

Service Levels

  • Internal Service Level Consistency: An issue was fixed to ensure that the storage service correctly checks if service levels have already been upgraded and reliably updates the internal service levels cache after an upgrade to v2. This maintains system consistency and prevents internal state errors during an upgrade process.
    scylladb#28072

Stability

  • Concurrent Querying Segmentation Fault: Executing a standard SELECT query concurrently with a SELECT ... FROM MUTATION_FRAGMENTS(...) query against the same partition could result in a segmentation fault (nullptr dereference). The bug fix resolves the underlying race condition, stabilizing the system under concurrent read workloads.
    scylladb#26847

  • LSA Resource Allocation Failure: A failure could occur when concurrently populating a large number of tables (e.g., 5000), resulting in an “Aborting on shard” error because the Log-Structured Allocator (LSA) failed to refill its emergency reserve. This fix stabilizes the resource allocation logic to handle highly concurrent table creation workloads.
    scylladb#27620

  • Maintenance Mode Assertion Failure: An assertion failure could occur when a node was running in maintenance mode. The fix resolves the issue, preventing unexpected crashes and improving the stability of nodes during maintenance.
    scylladb#27988

  • Reader Concurrency Resource Leak: Protection has been added to the reader concurrency semaphore to prevent resource leaks that could result in its internal count becoming negative. This ensures the semaphore operates correctly and maintains system resource integrity.
    scylladb#28002

  • Streaming Resource Leak and Use-After-Free: Two critical issues in streaming were fixed: a streaming semaphore resource leak detected when new nodes started streaming, and a use-after-free bug in the streaming_task_impl::run function. These fixes improve node stability during join, replace, and other streaming operations.
    scylladb#28083, scylladb#28200