[RELEASE] ScyllaDB 2025.4.3

The ScyllaDB team is pleased to announce the release of ScyllaDB 2025.4.3, a production-ready patch release for the ScyllaDB 2025.4 Feature Release.

Bug Fixes

The following issues are fixed in this release.

Native backup on AWS S3

  • The AWS error handling logic did not correctly process all restartable nested exceptions, which could cause operations to fail during transient cloud service disruptions. Nested exception handling was fixed, and all restartable nested exception types are now handled explicitly. This improves resilience for AWS-based operations by ensuring automatic retries on the appropriate service errors.
    scylladb#28243, scylladb#28344
  • The S3 client depends on the seastar submodule, which required an update. The submodule was updated to pick up assorted S3 client fixes, improving the stability and functionality of S3-related operations.
    scylladb#28482
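
The retry behavior described in the first item can be illustrated with a minimal Python sketch. It is not the ScyllaDB implementation: the RestartableError class, the choice of which built-in errors count as restartable, and the helper names are all assumptions made for illustration. The point it shows is that a retry decision should inspect the whole chain of nested exceptions, not only the outermost one.

```python
import time


class RestartableError(Exception):
    """Hypothetical marker for errors that are safe to retry."""


def is_restartable(exc):
    """Return True if any exception in the nested chain is restartable."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        # Assumption: timeouts and connection errors are treated as transient.
        if isinstance(exc, (RestartableError, TimeoutError, ConnectionError)):
            return True
        seen.add(id(exc))
        # Follow the explicit cause first, then the implicit context.
        exc = exc.__cause__ or exc.__context__
    return False


def call_with_retries(op, attempts=3, delay=0.0):
    """Run op(), retrying only when a restartable error is found in the chain."""
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except Exception as exc:
            if attempt == attempts or not is_restartable(exc):
                raise
            time.sleep(delay)
```

A wrapper error such as RuntimeError raised "from" a ConnectionError is retried here, while an unrelated ValueError is re-raised on the first attempt.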

Commitlog

  • A race condition or corruption in the commitlog could cause a node to fail at startup when it encountered a file with a corrupt (non-zero) file header. Commitlog replay now treats such a file as data loss rather than as a fatal startup failure, which makes node startup and recovery from commitlog issues more robust.
    scylladb#27682

Native backup Connection & DNS

  • The connection factory needed improvements in how it handles network instability and DNS resolution. It now uses a TTL timer, retries on failures, and uses all resolved DNS addresses; the code was also cleaned up and refactored. This significantly improves connection reliability and fault tolerance, especially in dynamic environments with frequent DNS updates or transient failures.
    scylladb#28404
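
As a rough illustration of the three ideas named above (a TTL on resolved addresses, retry on failure, and trying every resolved address), here is a minimal Python sketch. The ConnectionFactory class, its parameters, and the injected resolve/connect callables are hypothetical and are not taken from the ScyllaDB or Seastar code.

```python
import time


class ConnectionFactory:
    """Hypothetical factory: caches DNS results for `ttl` seconds and tries every address."""

    def __init__(self, resolve, connect, ttl=30.0, clock=time.monotonic):
        self._resolve = resolve    # host -> iterable of addresses
        self._connect = connect    # address -> connection, raises OSError on failure
        self._ttl = ttl
        self._clock = clock
        self._addresses = []
        self._expires_at = 0.0

    def _refresh(self, host, force=False):
        """Re-resolve the host when forced, when nothing is cached, or when the TTL expired."""
        now = self._clock()
        if force or not self._addresses or now >= self._expires_at:
            self._addresses = list(self._resolve(host))
            self._expires_at = now + self._ttl
        return self._addresses

    def make(self, host):
        """Try each known address; if all fail, resolve again and retry once."""
        last_error = None
        for force in (False, True):
            for address in self._refresh(host, force=force):
                try:
                    return self._connect(address)
                except OSError as exc:
                    last_error = exc
        raise ConnectionError(f"no address for {host} accepted a connection") from last_error
```

Injecting the resolver, connector, and clock keeps the sketch testable without network access; a real client would plug in its own DNS and socket layers.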

Vector Search - Data Modeling & Querying

  • When passing a null vector to an Approximate Nearest Neighbor (ANN) query, the system would fail with a non-informative error. The CQL interface was updated to fail with a better, more explicit error when a null vector is passed to an ANN query, which improves the user experience and debugging process by providing clearer error messages for vector search queries.
    scylladb#28052

  • The default compression change for CQL to LZ4WithDictsCompressor was not applied consistently to all table types, specifically in Alternator and Materialized Views. The schema initialization process was updated to apply sstable_compression_user_table_options to CQL auxiliary and Alternator tables, which ensures consistent performance and space usage across CQL, Alternator, and Materialized View table types.
    scylladb#26914

Database & Internal

  • The system_replicated_keys keyspace was not correctly marked as a system keyspace, leading to incorrect internal management, and the replicated_key_provider required the KSNAME to be made public. The system_replicated_keys keyspace is now correctly marked and handled as a system keyspace, and the KSNAME was made public. This ensures correct internal management and behavior of system keyspaces and allows for proper external referencing.
    scylladb#27903, scylladb#28237

  • The service layer was not correctly propagating the topology guard to the Repair-Based Node Operations (RBNO) service. The service was updated to pass the topology guard to RBNO, which prevents assertion failures and ensures cluster stability during topology changes.
    scylladb#28298

Raft & Topology

  • A node could sometimes remain in the Raft topology with a pending leave request, creating an inaccurate cluster state. The topology coordinator now completes pending operations for a replaced node, which ensures a cleaner and more accurate cluster topology state, particularly after node replacement operations.
    scylladb#27990
  • Disabling tablet balancing via the REST API (/storage_service/tablets/balancing) did not go through the internal topology request system and failed to interrupt the tablet scheduler immediately. Disabling balancing via REST now goes through a topology request, and the RPC that disables balancing preempts tablet transitions so that the tablet scheduler is interrupted. As a result, load balancing is disabled consistently and balancing activity stops promptly.
    scylladb#27647, scylladb#27210
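
For readers who script this operation, the following Python sketch builds the corresponding REST request. Only the /storage_service/tablets/balancing path comes from these notes; the POST method, the `enabled` query parameter, and the 127.0.0.1:10000 address are assumptions that should be checked against the REST API reference for your version.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def balancing_request(enabled, host="127.0.0.1", port=10000):
    """Build the request; method, query parameter, and port are assumed, not confirmed here."""
    query = urlencode({"enabled": "true" if enabled else "false"})
    url = f"http://{host}:{port}/storage_service/tablets/balancing?{query}"
    return Request(url, method="POST")


def set_tablet_balancing(enabled, **kwargs):
    """Send the request to a node and return the HTTP status code."""
    with urlopen(balancing_request(enabled, **kwargs), timeout=10) as resp:
        return resp.status
```

Calling set_tablet_balancing(False) against a reachable node would issue the request; balancing_request(False) alone can be used to inspect the URL without contacting a cluster.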

Repair

  • The repair service lacked session support for the rebuild_with_repair operation. Session support has been added to repair_service::rebuild_with_repair, which enables more complex and stateful repair operations.
    scylladb#27759
  • Incorrect values were reported for progress_total and progress_completed for tablet repair tasks. The reporting logic was corrected, and progress reporting support was added to the tablet repair task. This provides accurate, detailed visibility into the progress and status of tablet repair operations.
    scylladb#26896, scylladb#22564
  • Memory corruption was detected when running a specific repair test with disjoint rows and a different shard count. The logic for sstable_list_to_mark_as_repaired was fixed to work correctly with a multishard writer, which eliminates a memory safety issue and improves repair operation reliability.
    scylladb#27666, scylladb#28064

Storage & SSTables

  • Running SELECT ... FROM MUTATION_FRAGMENTS(...) queries concurrently with a regular SELECT on the same partition could lead to a segmentation fault (nullptr dereference). The row cache reader logic was updated to pass a cache tracker to the snapshot in make_nonpopulating_reader(), which prevents the nullptr dereference and improves system reliability under concurrent query load.
    scylladb#26847, scylladb#28279
  • A refresh of load statistics (load_stats) could fail, throwing a no_such_column_family error and incorrectly handling dropped tables. The refresh logic was fixed to correctly handle dropped tables and prevent the error. This ensures accurate and reliable load statistics reporting, improving monitoring capabilities.
    scylladb#28470
  • An assertion failure could occur when a node was in maintenance mode, because validation logic ran while the topology was not properly set up. The node now skips validate_read_replica and properly sets up the topology in maintenance mode. This ensures node stability and successful completion of maintenance operations.
    scylladb#27988, scylladb#28498
  • A non-UTF8 character error could occur during a database snapshot test. A fix was applied to the serialization of the partition key, which prevents crashes and ensures data integrity during database snapshot and key serialization.
    scylladb#28195

Streaming

  • A resource leak was detected in the streaming semaphore after new nodes started streaming. The handling of base resources in the reader_concurrency_semaphore was improved, which prevents resource exhaustion and improves long-term stability during node operations like adding or replacing nodes.
    scylladb#28083, scylladb#28245
  • A use-after-free memory bug was present in the streaming_task_impl::run function of the node operations service. Coroutine lambda wrappers were removed in node_ops, which eliminates a critical memory safety bug and improves the reliability of node operations.
    scylladb#28200
  • The streaming process did not consistently use a session variable. The streaming service was updated to use a session variable for streaming, which improves correctness and consistency for streaming operations.
    scylladb#28298

Vector Search - Permissions

  • Vector search permissions lacked the necessary scope to cover CDC streams and timestamps. The fix adds CDC streams and timestamps to vector search permissions, which ensures proper access control and security when using these features with vector search.
    scylladb#28537