[RELEASE] ScyllaDB 2025.3.5

The ScyllaDB team is pleased to announce the release of ScyllaDB 2025.3.5, a production-ready patch release for the ScyllaDB 2025.3 Feature Release.

Bug Fixes

The following issues are fixed in this release.

Change Data Capture (CDC)

  • Critical errors due to a malformed SSTable exception

    • Issue: Critical errors (sstables::malformed_sstable_exception) occurred because a column was reported as missing in the current schema for the cdc_log table. This could happen when a column was recreated too soon.

    • Fix: A check was added to prevent recreating a column too soon, and the column drop timestamp is now set in the future to prevent the schema mismatch. scylladb#26340, scylladb#27036

  • Notification about expiring ERM held for too long was broken

    • Issue: The system failed to properly notify when an Effective Replication Map (ERM) token was held for too long after its expiry.

    • Fix: The notification logic for an expiring ERM that is held for too long was corrected. scylladb#27141, scylladb#27275

Cloud/Connectivity

  • EC2 metadata querying should use back-off for “service unavailable”

    • Issue: When querying EC2 metadata (used by AWS KMS), “service unavailable” responses (e.g., HTTP 503 errors) were not handled with a retry mechanism.

    • Fix: The KMS host was updated to include the HTTP error code in KMS errors, and an exponential backoff-retry mechanism was added specifically for 503 errors. scylladb#27062, scylladb#27063

  • S3 client error handling for transient network errors

    • Issue: The S3 client was not classifying all transient network errors as retryable, leading to unnecessary failures.

    • Fix: Error handling for the S3 client was extended to correctly classify additional transient network errors as retryable. scylladb#27349, scylladb#27390
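The two fixes above share one pattern: decide whether an error is retryable, then retry with growing delays. The following is a minimal Python sketch of that pattern, not ScyllaDB's actual implementation. The names HttpError, is_retryable, backoff_delays, and call_with_backoff, as well as the delay values and the choice of ConnectionError and TimeoutError as "transient" errors, are assumptions made for illustration.

```python
import time

RETRYABLE_STATUS = {503}  # assumption: only "service unavailable" is retried


class HttpError(Exception):
    """Hypothetical error type that carries the HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retryable(exc):
    """Classify an error: HTTP 503 and transient network errors are retryable."""
    if isinstance(exc, HttpError):
        return exc.status in RETRYABLE_STATUS
    # Assumption: connection and timeout errors stand in for "transient".
    return isinstance(exc, (ConnectionError, TimeoutError))


def backoff_delays(base=0.1, factor=2.0, cap=5.0, retries=5):
    """Yield the wait before each retry: base, base*factor, ..., capped at cap."""
    delay = base
    for _ in range(retries):
        yield min(delay, cap)
        delay *= factor


def call_with_backoff(fetch, sleep=time.sleep, **backoff_options):
    """Call fetch(); on a retryable error, wait and try again."""
    last_error = None
    # The first attempt runs immediately; each later attempt waits first.
    for delay in [0.0, *backoff_delays(**backoff_options)]:
        if delay:
            sleep(delay)
        try:
            return fetch()
        except Exception as exc:
            if not is_retryable(exc):
                raise  # non-retryable errors surface immediately
            last_error = exc
    raise last_error  # retries exhausted
```

A production version would typically add random jitter to each delay so that many clients do not retry in lockstep; it is omitted here to keep the sketch deterministic.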

Operations/Management

  • Automatic cleanup improvements

    • Issue: Automatic cleanup logic was limited and lacked user-facing controls.

    • Fix: Automatic cleanup was improved to allow a node to opt out of automatic cleanup. This update also introduced a RESTful API to reset the cleanup needed flag, and a nodetool cluster cleanup command to run cleanup on all dirty nodes. scylladb#26866, scylladb#27093

  • Maintenance mode functionality was broken

    • Issue: Maintenance mode was non-functional, and the related test (test_maintenance_mode) did not perform as expected.

    • Fix: The service QoS was updated to fall back to the default scheduling group when using the maintenance socket, restoring maintenance mode functionality. scylladb#26816, scylladb#27039

  • More logging for load_new_sstables/download_new_sstables

    • Issue: The logging output for load_new_sstables and download_new_sstables was insufficient and did not include all option values.

    • Fix: The functions were updated to log all option values used during execution, and additional logging was added to streaming operations. scylladb#27299, scylladb#27341

  • Node locator missing _excluded field in operations

    • Issue: The node locator did not preserve the _excluded field in clone() and omitted it from the verbose formatter.

    • Fix: The locator logic was updated to preserve and include the _excluded field in all necessary places. scylladb#27290

Stability/Reliability

  • Conflicting tablet migrations in the scheduler

    • Issue: The tablet scheduler could emit conflicting migrations for the same tablet in different DCs or conflicting inter-node and intra-node migrations, resulting in incorrect reads.

    • Fix: The scheduler logic was updated to prevent emitting conflicting migrations in the plan and during merge colocation. scylladb#26038, scylladb#27304, scylladb#26048, scylladb#27312, scylladb#27330

  • Load-and-stream with tablets failing with “Unable to load SSTable”

    • Issue: Load-and-stream operations with tablets would sometimes fail with an “Unable to load SSTable” error.

    • Fix: The sstables_loader was updated so that it no longer bypasses synchronization when the topology is busy. scylladb#22707, scylladb#26730

  • Multiple oversized memory allocation errors with Vnodes

    • Issue: Creating thousands of tables with Vnodes could lead to multiple seastar_memory - oversized allocation errors.

    • Fix: The issue was resolved by changing the internal type of a table metadata variable. scylladb#26787, scylladb#27198

  • Node coredumped after tablet cleanup log line

    • Issue: A node could coredump after logging that tasks were stopped for compactions due to tablet cleanup.

    • Fix: The replica logic was updated to fail a timed-out single-key read on a cleaned-up tablet replica. scylladb#26229, scylladb#27155

  • Premature break causes SSTables to be skipped during streaming

    • Issue: A premature loop break in the tablet_sstable_streamer::stream function was causing SSTables to be unexpectedly skipped.

    • Fix: The loop break condition in tablet_sstable_streamer::stream was fixed. scylladb#26979, scylladb#27153

  • Race condition between tablet split and load-and-stream

    • Issue: A race condition could occur between the tablet split process and the load-and-stream operation.

    • Fix: Tablet split and load-and-stream are now synchronized with each other. scylladb#26455, scylladb#26648
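To illustrate the conflicting tablet migrations item in this section, here is a simplified Python sketch of one way a planner could avoid emitting two migrations for the same tablet. It is an illustration under stated assumptions, not the scheduler's actual logic: the Migration type, its fields, and the first-candidate-wins policy in drop_conflicts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Migration:
    """Hypothetical record of one planned tablet move."""
    tablet: str
    src: str
    dst: str


def drop_conflicts(candidates):
    """Keep at most one migration per tablet, preferring the earliest candidate."""
    seen = set()
    plan = []
    for migration in candidates:
        if migration.tablet in seen:
            continue  # a migration for this tablet is already in the plan
        seen.add(migration.tablet)
        plan.append(migration)
    return plan


candidates = [
    Migration("t1", "dc1/node1", "dc1/node2"),       # inter-node move
    Migration("t1", "dc2/node5", "dc2/node6"),       # same tablet, other DC
    Migration("t2", "dc1/node1:s0", "dc1/node1:s3"),  # intra-node move
]
plan = drop_conflicts(candidates)
```

Here plan keeps the first t1 migration and the t2 migration, and the second t1 candidate is dropped.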