[RELEASE] ScyllaDB 2025.4.1

The ScyllaDB team is pleased to announce the release of ScyllaDB 2025.4.1, a production-ready patch release for the ScyllaDB 2025.4 Feature Release.

Bug Fixes

The following issues are fixed in this release.

Memory Management and Stability

  • Prevent crashes when memory kill thresholds are exceeded
    Creating a reader could crash ScyllaDB when memory kill thresholds were exceeded.
    The noexcept specifier was removed so that the exception can propagate, allowing graceful handling under memory pressure.
    scylladb#27475

Alternator

  • Fix use-after-free during internal table initialization
    Under rare conditions, Alternator could trigger a use-after-free while preparing ScyllaDB internal tables, potentially leading to crashes.
    Paging state encoding was fixed to ensure safe memory handling during Alternator operations.
    scylladb#27125

Batchlog and Repair

  • Prevent data resurrection when batchlog replay fails
    Failed batchlog replays during repair could incorrectly update repair timestamps, allowing previously deleted data to reappear.
    Repair now fails explicitly if batch replay or flush fails, and repair timestamps are updated only after successful completion.
    scylladb#24415

  • Harden batchlog replay logic and validation
    Batchlog replay could incorrectly skip batches with unknown or invalid versions, masking replay errors.
    Replay logic is now coroutine-based, validates batch versions, and fails explicitly when replay errors occur.
    scylladb#26766

  • Fail tablet repair on partial batch delivery
    Tablet repair previously continued even if some batches failed to send, risking an inconsistent data state.
    Tablet repair now fails if any batch is not successfully delivered, ensuring stronger consistency guarantees.
    scylladb#26766
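
As an illustration of the data resurrection fix in this group, the sketch below shows the general ordering: the repair timestamp is recorded only after batchlog replay and flush have both succeeded, and any failure aborts the repair. It is not ScyllaDB's implementation; run_repair, RepairError, and the state dictionary are hypothetical names chosen for the example.

```python
import time
from typing import Callable, Dict, Optional


class RepairError(Exception):
    """Raised when a repair step fails (hypothetical name for this sketch)."""


def run_repair(replay_batchlog: Callable[[], None],
               flush: Callable[[], None],
               state: Dict[str, Optional[float]]) -> None:
    """Run replay and flush, then record the repair time only on success."""
    try:
        replay_batchlog()
        flush()
    except Exception as exc:
        # Fail explicitly and leave the stored timestamp untouched, so that
        # tombstones are not treated as safely repaired.
        raise RepairError(f"repair aborted: {exc}") from exc
    # Reached only when both steps completed without error.
    state["repair_time"] = time.time()
```

Because the timestamp assignment is the last statement, a failed replay can never advance it.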

Cloud Storage and Streaming

  • Classify additional S3 network errors as retryable
    Some transient S3 network failures were treated as fatal, causing unnecessary operation failures.
    Additional transient error types are now classified as retryable, improving the robustness of S3-backed workflows.
    scylladb#27349
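
The general pattern behind this fix, classifying an error as transient before deciding whether to retry, might look like the sketch below. The set of errno values, the function names, and the backoff parameters are assumptions made for illustration; these notes do not list which error types ScyllaDB now treats as retryable.

```python
import errno
import time

# Hypothetical set of transient network errors for this sketch only.
RETRYABLE_ERRNOS = {
    errno.ECONNRESET,
    errno.ECONNABORTED,
    errno.ETIMEDOUT,
    errno.EPIPE,
}


def is_retryable(exc: BaseException) -> bool:
    """Return True if the error looks like a transient network failure."""
    return isinstance(exc, OSError) and exc.errno in RETRYABLE_ERRNOS


def with_retries(op, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call op(), retrying with exponential backoff on retryable errors."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception as exc:
            # Non-retryable errors, and the final attempt, propagate as-is.
            if not is_retryable(exc) or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Widening RETRYABLE_ERRNOS is the analogue of the change described above: errors that previously fell through as fatal are retried instead.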

Failure Detection and Raft

  • Fix race conditions and flakiness in failure detector tests
    Raft failure detector tests suffered from race conditions and intermittent failures.
    Timeout handling and scheduling were improved, and abort-related exceptions are now handled consistently.
    scylladb#27136

  • Improve direct failure detector efficiency and scheduling
    Direct failure detector pings incurred unnecessary overhead and suboptimal scheduling behavior.
    The detector now runs in the gossiper scheduling group, avoids redundant cross-core calls, and passes timeouts explicitly.
    scylladb#27483

  • Ensure node removal notifications are emitted only once
    Raft topology coordination could emit duplicate node removal notifications.
    Notifications are now deduplicated to ensure consistent cluster state propagation.
    scylladb#27913
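
For the node removal fix in this group, one common way to guarantee a notification fires only once is to remember which nodes have already been announced. The sketch below shows that idea under stated assumptions: RemovalNotifier and its method names are hypothetical, and the notes do not describe how ScyllaDB performs the deduplication internally.

```python
from typing import Callable, Set


class RemovalNotifier:
    """Forwards each node removal to the listener at most once."""

    def __init__(self, listener: Callable[[str], None]) -> None:
        self._listener = listener
        self._notified: Set[str] = set()

    def node_removed(self, host_id: str) -> bool:
        """Notify the listener; return False if host_id was already announced."""
        if host_id in self._notified:
            return False  # duplicate, suppress
        self._notified.add(host_id)
        self._listener(host_id)
        return True
```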

Indexes and Query Processing

  • Prevent invalid vector index configurations
    Vector indexes could be created with invalid option values or without tablets enabled.
    ScyllaDB now rejects invalid configurations and requires tablets for vector indexes.
    scylladb#27234

  • Fix query failures with low tombstone page limits
    Low query_tombstone_page_limit values could result in misleading “Key column not found” errors.
    Query execution was corrected to handle tombstone pagination safely.
    scylladb#27001

Monitoring and Tooling

  • Extend scylla-node-exporter with ethtool support
    Network interface statistics exposed by scylla-node-exporter were incomplete.
    The node exporter now includes ethtool, enabling richer network monitoring metrics.
    scylladb#27508

SSTables and Storage

  • Fix boot failures caused by missing SSTable TOC files
    ScyllaDB could fail to boot if SSTable TOC files were missing or modified concurrently.
    A new semaphore serializes SSTable linking and component rewrites, preventing corruption and startup failures.
    scylladb#25919
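
The serialization idea can be sketched with a single-permit semaphore shared by both operations, so a link never observes a half-rewritten component. This is a simplified illustration using Python threads; SSTableDir and its methods are hypothetical names and do not reflect ScyllaDB's actual (C++) code.

```python
import threading
from typing import List, Tuple


class SSTableDir:
    """Serializes linking and component rewrites with one shared semaphore."""

    def __init__(self) -> None:
        self._sem = threading.Semaphore(1)  # one permit: one operation at a time
        self.log: List[Tuple[str, str]] = []

    def link(self, name: str) -> None:
        with self._sem:
            self.log.append(("link-start", name))
            # ... hard-link the SSTable components here ...
            self.log.append(("link-end", name))

    def rewrite_component(self, name: str) -> None:
        with self._sem:
            self.log.append(("rewrite-start", name))
            # ... rewrite the component (for example the TOC) here ...
            self.log.append(("rewrite-end", name))
```

Since both methods acquire the same semaphore, every start entry in the log is immediately followed by its matching end entry, regardless of how the calls are scheduled.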

Tablets and Topology

  • Prevent conflicting tablet migrations
    The tablet scheduler could emit conflicting inter-node and intra-node migrations, leading to incorrect reads.
    Scheduler logic was corrected to avoid conflicting migrations during colocation merges.
    scylladb#27304

  • Improve topology coordinator robustness during streaming and repair
    Certain topology coordinator operations were skipped or executed at incorrect times.
    Session IDs are now set correctly for streaming, and repair compaction control is always executed.
    scylladb#27867

Vector Search

  • Reject invalid primary key restrictions in vector search queries
    Vector search allowed queries with unsupported primary key restrictions.
    Such queries now fail fast with a clear validation error.
    scylladb#27668