[RELEASE] ScyllaDB 2025.1 - part 2

Additional Improvements

Procedures

With Tablets, the Replication Factor (RF) cannot be updated to a value higher than the number of nodes per Data Center (DC). This feature protects the Admin from setting an impossible-to-support RF. This affects the following operations:

Node Decommission / Remove

Starting from 2024.2, you cannot decommission or remove a node if the resulting number of nodes would be smaller than the largest non-zero replication factor (for any keyspace) in this DC.

For example, in a cluster with 1 DC, 5 nodes, and a keyspace (KS) with RF=5, a decommission request will fail.

The Replication Factor (RF) of Keyspaces must be less than or equal to the number of available nodes per Data Center (DC)

Once a tablets-enabled Keyspace has tables, you cannot ALTER its Replication Factor to be greater than the number of available nodes per DC.

If you create such a Keyspace, you won’t be able to create Tables until you fix the RF or add more nodes.
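
For illustration, a minimal sketch of this restriction, assuming a single DC named dc1 with three nodes (the keyspace and table names are hypothetical):

    CREATE KEYSPACE ks
        WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
        AND tablets = {'enabled': true};

    CREATE TABLE ks.t (pk int PRIMARY KEY, v int);

    -- Rejected while dc1 has only three nodes:
    ALTER KEYSPACE ks
        WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 5};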

Monitor Tablets

To monitor Tablets in real time, upgrade the ScyllaDB Monitoring Stack to release 4.7 or later and use the new dynamic Tablet panels.

Driver Support

The following driver versions (and newer) support Tablets:

  • Java driver 4.x, from 4.18.0.2
  • Java driver 3.x, from 3.11.5.4
  • Python driver, from 3.28.1
  • Gocql driver, from v1.14.5
  • Rust driver, from 0.13.0

Legacy ScyllaDB and Apache Cassandra drivers will continue to work with ScyllaDB but will be less efficient when working with tablet-based Keyspaces.

File-based streaming for Tablets

File-based streaming is an optimization of tablet migration performance.

Previously, migrating tablets was performed by streaming mutation fragments, which involves deserializing SSTable files into mutation fragments and re-serializing them back into SSTables on the other node. In ScyllaDB 2025.1, migrating tablets is performed by streaming entire SSTables, which does not require (de)serializing or processing mutation fragments. As a result, less data is streamed over the network, and less CPU is consumed, especially for data models that contain small cells.

File-based streaming is used for tablet migration in all keyspaces created with tablets enabled.

More in Docs.

Arbiter and Zero-token Node

There is now support for zero-token nodes. Such nodes do not replicate any data, but can participate in query coordination, and in Raft quorum voting.

One can use this to create an Arbiter: a tiebreaker node with no data that helps maintain quorum in a symmetrical two-datacenter cluster. If one of the data centers fails, the Arbiter, deployed in a third datacenter, keeps the cluster quorum alive. Since the Arbiter has zero tokens, it does not replicate user data and does not add network and storage costs. #15360

You can use nodetool status to list zero-token nodes.

Strongly Consistent Topology Updates

With Raft-managed topology enabled, all topology operations are internally sequenced consistently. A centralized coordination process ensures that topology metadata is synchronized across the nodes on each step of a topology change procedure.

This makes topology updates fast and safe, as the cluster administrator can trigger many topology operations concurrently, and the coordination process will safely drive all of them to completion. For example, multiple nodes can be bootstrapped concurrently, which couldn’t be done with the previous gossip-based topology.

Strongly Consistent Topology Updates are now the default for new clusters, and should be enabled after upgrade for existing clusters.

Strongly Consistent Auth Updates

System-auth-2 is a reimplementation of the Authentication and Authorization systems in a strongly consistent way on top of the Raft sub-system.

This means that Role-Based Access Control (RBAC) commands like CREATE ROLE or GRANT are safe to run in parallel, without the risk of getting out of sync with each other or with other metadata operations, like schema changes.

As a result, there is no need to update the system_auth RF or run repair when adding a Data Center.
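
For example, RBAC statements like the following (the role, password, and keyspace names are illustrative) can now be run in parallel from different nodes without diverging:

    CREATE ROLE analyst WITH PASSWORD = 'Secret123' AND LOGIN = true;
    GRANT SELECT ON KEYSPACE ks TO analyst;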

Strongly Consistent Service Levels

Service Levels allow you to define attributes like timeout per workload.

Service levels are now strongly consistent using Raft, like Schema, Topology and Auth.

#17926
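
A minimal sketch of defining and attaching a service level (the names and timeout value are illustrative; see the Service Level docs for the full syntax):

    CREATE SERVICE LEVEL IF NOT EXISTS interactive WITH timeout = 500ms;
    ATTACH SERVICE LEVEL interactive TO analyst;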

Improved network compression for inter-node RPC

This release adds new RPC compression improvements for node-to-node communication:

  • Using zstd instead of lz4
  • Using a shared dictionary, re-trained periodically on the traffic, instead of message-by-message compression.

Below is a comparison of compression algorithms on different types of data.

Note that dictionary based compression can be used with either lz4 or zstd.

Actual compression is very much workload-dependent and can vary between use cases.

Describe Schema with Internals

Until this release, CQL DESCRIBE SCHEMA was not sufficient to do a full schema restore from backup; for example, it lacked information about dropped columns.

In 6.0, the DESC SCHEMA WITH INTERNALS command provides more information, streamlining the restore process.

#16482
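
For example, a schema dump that includes this extra information can be taken with:

    DESC SCHEMA WITH INTERNALS;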

Native Nodetool

The nodetool utility provides simple command-line interface operations and attributes.

ScyllaDB inherited the Java based nodetool from Apache Cassandra. In this release, the Java implementation was replaced with a backward-compatible native nodetool.

The native nodetool works much faster. Unlike the Java version, the native nodetool is part of the ScyllaDB repo, which allows easier and faster updates.

Removing the JMX Server

With the Native Nodetool (above), the JMX server has become redundant and will no longer be part of the default ScyllaDB Installation or image.

If you are using the JMX server directly, not via nodetool, note that it is no longer installed by default.

Related issues: #15588 #18566 #18472

As part of moving to native tooling and away from Java tools, we deprecated SSTableloader.

You can use Load and Stream to upload SSTables directly to ScyllaDB, either from Apache Cassandra or other ScyllaDB clusters. We also deprecated the Java version of nodetool, which was replaced by a compatible native version (see above).

Maintenance Mode

Maintenance mode is a new mode in which the node does not communicate with clients or other nodes and only listens to the local maintenance socket and the REST API. It can be used to fix damaged nodes – for example, by using nodetool compact or nodetool scrub. In maintenance mode, ScyllaDB skips loading tablet metadata if it is corrupted to allow an administrator to fix it.

#5489

Maintenance Socket

The Maintenance Socket provides a new way to interact with ScyllaDB from within the node it runs on. It is mainly for debugging. You can use CQLSh with the Maintenance Socket as described in the Maintenance Socket docs. #16172

Deployment

  • Ubuntu 24.04 is now supported.
  • RHEL / CentOS 7 support is deprecated.
  • Amazon Linux 2 is deprecated and replaced with Amazon Linux 2023
  • Debian 10 support is deprecated.
  • The setup utility now works with disks that do not have UUIDs, such as those in some virtualized environments #13803
  • The scylladb-kernel-conf package tunes the Linux kernel scheduler via sysfs to improve latency. These tunings were lost in Linux 5.13+ due to kernel changes. They are now restored. #16077
  • Docker: fixed an issue where you could not connect to Scylla 5.4 with cqlsh without providing the host IP. #16329
  • Updated Rust packages:
    • “Rustix”
    • “chrono” #15772
  • On Ubuntu, the installer now handles conflicts between a system process updating apt metadata and the installer itself. #16537

Service Level Per Query

Enhancement: you can now override the service level per query, using USING SERVICE LEVEL = name.

It is now possible to reroute an individual statement to a different service level; previously, a different login session was required. This is useful for drivers, as it reduces the load generated by login queries.

In OSS, this only affects the statement’s timeout.

#18471

DESCRIBE SCHEMA enhance

The DESCRIBE SCHEMA statement is now extended with statements to re-create roles and grants. This can be used to re-create not only the schema, but also the user and permission structure when restoring from backup.

#18750 #18751 #20711

Alternator RBAC

Authorization: Alternator now supports Role-Based Access Control (RBAC) via CQL commands. #5047
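
A hedged sketch of what this enables, assuming an Alternator table named orders (Alternator keeps each table in a keyspace named alternator_<table>); the role name and password are illustrative:

    CREATE ROLE app_writer WITH PASSWORD = 'Secret123' AND LOGIN = true;
    GRANT MODIFY ON alternator_orders.orders TO app_writer;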

CQL

  • CQL3: NOT IN is now implemented. #21992

    Example:

    SELECT * FROM TBL WHERE v NOT IN (5, 7) ALLOW FILTERING;

  • CQL3: Allow selecting map values and set elements, compatible with Cassandra 4.0.

    Examples:

    SELECT map['key'] FROM table

    SELECT map['key1']['key2'] FROM table

    #7751

  • The CREATE MATERIALIZED VIEW statement now supports the undocumented WITH ID clause, improving compatibility with Cassandra. #20616

  • The memtable_flush_period_in_ms option is now implemented; see the example after this list. #20270

  • The CREATE ROLE USING SALTED HASH statement was renamed to CREATE ROLE USING HASHED PASSWORD for improved compatibility with Apache Cassandra. #21350. See the Grant Authorization CQL Reference in the ScyllaDB Docs.

  • CQL DESCRIBE statements for Change Data Capture (CDC) log tables have been improved. #21235

  • The DESC TABLE statement will now reject materialized views. #21026

  • The CQL PER PARTITION LIMIT clause is now respected for aggregating queries, fixing an issue when combining PER PARTITION LIMIT and GROUP BY. #5363

  • New COMPACT STORAGE tables can no longer be created. They have been deprecated for a long while. #16403

  • Fixed a failure to create a table with sstable_compression=ZstdCompressor. #22444
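
As referenced above, a minimal sketch of setting memtable_flush_period_in_ms per table (the table name and value are illustrative):

    ALTER TABLE ks.events WITH memtable_flush_period_in_ms = 60000;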

Deployments and packaging

  • The bundled node_exporter Prometheus metrics exporter was updated to version 1.8.2. #18493
  • The tarball distribution, for example for Air-gapped Server Installation, no longer packages the Java-based tools, just like rpm and deb. #20739
  • The container image no longer contains the rsyslog package. #21953

Stability

  • A node that is being replaced is now marked earlier so that it does not receive unexpected traffic; previously, this could result in a coredump during bootstrap when replacing a dead node. #20629
  • A rare case where an sstable promoted index lookup would return incorrect information was fixed.
  • A crash during shutdown while draining active writes has been fixed. #20874
  • The CQL server will now wait for the superuser to be created when authentication is enabled, avoiding a race condition #10481
  • Materialized Views perform a read operation on the base table before writing the view table. This read operation now has better concurrency control: the amount of memory consumed by reads is limited, and when the CPU is the bottleneck, we avoid issuing new reads to avoid flooding the system with competing operations. #8873 #15805
  • Data definition language (DDL) statements are automatically retried in case of an internal race accessing Raft. A crash during this retry, for ALTER KEYSPACE statements, was fixed. #21102
  • node-exporter: Disable the hwmon temperature collector, as it causes bad performance on Azure. #21163
  • Compaction manager stop operations now ignore errors in the compactions it manages, as those errors can affect the caller of the stop operation (such as the shutdown process). #21159
  • Sstable checksums and digests are now checked during compaction, improving overall integrity. Note that checksums for compressed sstables were already checked before. #21153
  • A request to stop all repairs could miss some ongoing repair operations; this is now fixed. #21612
  • ScyllaDB computes a schema version in order to see if it needs to perform a transparent upgrade during a query. For system tables, it now uses a hash-based algorithm, which is more robust than the manual annotation applied by developers that was used earlier. pull#21602
  • The gossip code will now clean up nodes that died before they could join Raft. #20082
  • The TRUNCATE operation has been promoted to a topology level operation. This allows the topology change coordinator to drive the operation to completion in the face of node failures and concurrent tablet migrations. #16411
  • Commitlog replay fixes a corner case where a shard that received no mutations before crashing had its commitlog interpreted incorrectly. #21719
  • The read-repair code is more careful to avoid stalls when reconciling reads with many deleted rows. #21818
  • Materialized view updates are now more careful to avoid stalls when calculating affected clustering keys during a view update. #21843
  • Fixed a case where a tablet merge caused a segmentation fault and a coredump. #21867
  • Fixed a malformed_sstable_exception caused by reclaim and reload of bloom filters from unlinked sstables. #21887
  • Fixed an issue where system.clients showed connections as AUTHENTICATING, not READY, even after they passed authentication. #12640
  • The commitlog hard limit, introduced in ScyllaDB 4.6, is now mandatory. The hard limit prevents the commit log from expanding and instead restricts the write rate.
  • When writing to a table that has a materialized view, the coordinator checks the backlog of the participating replicas in order to apply back-pressure to the client. Due to a bug, some replicas were not considered in this calculation. #21672
  • Fixed an issue where test_topology_recovery_basic failed when validation queries were done with all_pages=True. #19101
  • A bug where materialized views lost track of the base table schema during reverse queries was fixed. #21354
  • Transport layer security (TLS) uses Linux inotify to watch for certificates changing on disk. It will now consume fewer inotify resources and so have a lesser chance of failing due to that. #22425
  • A bug in reading encrypted sstables when their size slightly crosses over the buffer alignment has been fixed. #22236
  • An invalid speculative_retry value is now rejected with a validation error instead of causing a crash. #21825
  • A rare case of materialized view builds never completing was fixed. #21829
  • The reader concurrency semaphore controls concurrent reads at the replica level. It could lose track of reads in some circumstances, resulting in those zombie reads leaking memory. This is now fixed. #22620 #22588
  • Authentication now ensures the default superuser password is set before serving CQL, reducing authentication problems. #20566

Performance

  • The index page cache will now generate fewer disk IOPS if an index read is partially cached. #20935

  • Raft-managed tables used for system metadata now have more eager garbage collection of tombstones, reducing performance problems with many schema or topology changes. #15607

  • The system.peers table was continuously updated even if no change was happening, stressing the disk. This is now fixed. #20991

  • The sstable reader will now consult data in memtable before purging tombstones. This prevents data resurrection in scenarios involving very low write activity, which can lead to data staying in memtables for longer than a repair cycle. #20916

  • The efficiency of sstable reads of rows within medium or large partitions, when column_index_size_in_kb has been reduced, is now improved. Such reads will generate less I/O. #10030

  • ScyllaDB tracks whether read requests are waiting for CPU or I/O. In one case, a disk read from the primary index was considered to be waiting on CPU, which reduced concurrency. This is now fixed. #21325

  • Materialized View building (initiated by CREATE MATERIALIZED VIEW or CREATE INDEX) is now performance-isolated from normal reads and writes. #21232

  • Repair flushes hints and batchlog in order to reduce the amount of work it has to do, but such flushes also generate work, so these flushes are now batched. #20259

  • Some performance bugs leading to extra I/O when reading the primary index for a large partition are fixed. #20897

  • Repair performance in mixed-shard configurations (where different nodes have different shard counts) has been improved. #21113

  • The sstable reader now frees memory more quickly, reducing memory requirements. #21160

  • During ordinary sstable compaction, we do not purge tombstones if they potentially delete data in commitlog, to avoid data resurrection on restart. However, this is unnecessary for the row cache, so row cache now ignores commitlog when purging tombstones. #16781

  • The materialized view update process updates views when the base tables are updated by an UPDATE or INSERT query. It is now able to avoid unnecessary updates when a view's primary key includes a regular column of the base table. #21652

  • Bootstrap and decommission now enable the small-table repair optimization. This speeds up bootstrap in large clusters when small or empty system tables have to be migrated to other nodes. #19131

  • ScyllaDB no longer takes snapshots of materialized views, since they are regenerated from the base table at restore time. #21339 #20760

  • Large allocations and stalls while comparing Decimal values have been fixed. #21716

  • The data plane coordination code (“storage_proxy”) now uses host UUIDs to track hosts rather than network addresses. This simplifies the code and brings a nice performance improvement. Part of #6403

  • Node rebuilds that use repair-based node operations now apply the small-table optimization when beneficial. #21951

  • We now create XFS filesystems with reduced metadata overhead. #22028

  • Materialized views pair each view replica with a base replica. This pairing is now rack-aware: the database will prefer to pair a base replica and a view replica on the same rack. This reduces cross-rack traffic, which can be expensive on public clouds and generally has lower bandwidth and higher latency. #17147

  • ScyllaDB breaks long query results into pages to reduce transient memory consumption and latency. When it does so, it caches the query running on the replica and resumes it on the next page. This resuming broke when a paging decision was made due to a large number of tombstones, requiring the query to be restarted on the next page instead of resumed. This is now fixed. #22620

  • The reader concurrency semaphore tries to restrict the number of reads competing for CPU, since the competition delays all those reads. We now allow up to two reads to compete for the CPU in the default configuration. This allows common fast reads to compete and bypass rare long reads, reducing head-of-line blocking. #22450

  • The reader concurrency semaphore could accidentally use the query timeout for evicting cached queries, resulting in reduced performance. This is now fixed. #22629

Alternator

  • Some bugs were fixed in Alternator role-based access control, for example, Alternator authorization failing when 'alternator_enforce_authorization: false'. #20619
  • The Alternator /localnodes REST API returns nodes local to the current datacenter. It is now enhanced to be able to return other datacenters, and to restrict the returned nodes to a specific rack. Once support is added in drivers, this can help reduce networking costs. #12147
  • The Alternator /localnodes endpoint will not return nodes that are temporarily down. #21538
  • The missing IndexStatus field is now returned by DescribeTable. #11471
  • Global secondary indexes can now be added or deleted after the table is created. #11567

Tablets

  • A race between tablet split and repair could cause an sstable not to be split among two new tablets, which in turn could prevent the sstable from being loaded on restart. #20626
  • After a tablet is migrated away from a node, a cleanup process takes place to remove the tablet’s footprint from the replica. During this process, a race could occur if the tablet’s compaction strategy was changed concurrently; as a result, a schema ALTER (add column) in the middle of an ongoing tablet migration caused an internal error: “Compaction state for table … not found”. This is now fixed. #20699
  • When a tablet is split into two, tombstone garbage collection becomes more complicated, since sstables for the tablet exist in both pre-split and post-split state. A bug that could cause data resurrection during tablet split preparation was fixed. #20044
  • The sstable Scylla.db component now has a copy of the sstable uuid. This is needed for backup deduplication, since the sstable uuid in the file name may change when a tablet is migrated to a different shard. #20459
  • The SPLIT compaction type, used to divide tablets into smaller tablets, now uses a better estimate for the partition count of new sstables, leading to better sized bloom filters. #20253
  • The tablet load balancer is now able to schedule repair operations. This is not yet integrated into nodetool repair or automatic load balancing. #21088
  • When decommissioning a node in a cluster that has different shard counts per node, we are now more careful to preserve tablet balance among the remaining nodes. #21783
  • Tablet migrations are now surfaced as tasks that can be observed and controlled by the nodetool tasks command family. #21365
  • A use-after-free bug due to a race between tablet split and cleanup has been fixed. #21867
  • Some edge cases related to tablet draining were fixed. The cluster will now reject decommissioning a node if that results in failing to satisfy the specified replication factor. #21826
  • Tablet split and merge operations are now reflected via the task manager. #21366 #21367
  • The repair time for a tablet is now recorded in the system.tablets table. This is more reliable than storing repair time in local tables. #17507

API

  • The storage backup API now supports a prefix parameter, allowing more fine-grained control of where the backup is stored. #20335
  • The storage backup API now supports single-table granularity. #20636
  • The system.snapshots virtual table now lists snapshots of dropped tables. #18313
  • Restore operations (nodetool restore) can now be aborted. #20717
  • Time Window Compaction Strategy will now return fresher values for its pending compaction count estimate. #20892

Tooling

  • The replace and remove node operations now deprecate using IP addresses to specify nodes; use host IDs instead. #19218
  • The sstable scrub/validate commands will now check the digest of the entire sstable data file, in addition to the already-checked per-block checksums, further enhancing corruption detection. #20720
  • nodetool compactionhistory will now report statistics about rows merged during compaction. #666
  • nodetool status now shows nodes with no tokens.
  • The systemd integration now uses KillMode=control-group, as there were reports of systemctl stop not completing.
  • Task TTL: Internal tasks are now kept for an hour after completion so their status can be queried with nodetool. #21499. Two new configuration parameters are introduced to support this:
    • task_ttl_in_seconds: time for which information about finished tasks started internally stays in memory. Default: 0 (deleted immediately after completion).
    • user_task_ttl_seconds: time for which information about finished tasks started by the user stays in memory. Default: 3600 seconds.
  • The ScyllaDB process now handles the SIGQUIT signal by dumping memory diagnostics into the log. #7400
  • An sstable scrub rewrites sstables, but previously it did not remove the original sstable from metadata, causing a later reshape compaction to fail. This is now fixed. #20030
  • Tablet repair operations are now tracked via the Task Manager. #21368
  • The Scylla sstable dump-summary command now displays the tokens of the first and last keys. This helps associate an sstable with a node or a tablet. #21735
  • The nodetool removenode force command was removed. Forcing a removal leaves the cluster in a worse state than before. #15833

Monitoring

ScyllaDB Monitoring Stack release 4.9 and later supports ScyllaDB 2025.1.

  • The bundled Prometheus node_exporter metrics collector now collects fewer metrics by default. #21419
  • A new disk space monitor was added to ScyllaDB, with the following configuration parameters:
    • disk_space_monitor_normal_polling_interval_in_seconds (default 10s)
    • disk_space_monitor_high_polling_interval_in_seconds (default 1s)
    • disk_space_monitor_polling_interval_threshold (default 0.9). The polling interval is switched from normal to high (above) when disk utilization is greater than or equal to this threshold.

Security

  • Internode encryption now supports a transitional mode, allowing a cluster to switch to internode encryption without downtime. #18903
  • Audit configuration can now be updated without restarting the server. #22449

Configuration

New config parameters:

  • Until this release, the materialized view flow-control algorithm used a constant (delay_limit_us) hard-coded to one second: when the view-update backlog reached its maximum (10% of memory), every request was delayed by an additional second, while smaller backlogs resulted in smaller delays. This release replaces the hard-coded value with a live-updatable configuration parameter, view_flow_control_delay_limit_in_ms, which defaults to 1000 ms as before. #18187

  • view_flow_control_delay_limit_in_ms: the maximal amount of time that materialized-view update flow control may delay responses to slow down the client and prevent a buildup of unfinished view updates. To be effective, this maximal delay should be larger than the typical latencies. Setting view_flow_control_delay_limit_in_ms to 0 disables view-update flow control (see the sketch after this list). Default: 1000. #18187

  • The small-table optimization for repair-based node operations is now enabled by default. This speeds up bootstrap and decommission operations for clusters with small amounts of data. #21861
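
As referenced above, since view_flow_control_delay_limit_in_ms is live-updatable, one way to change it without a restart is through the system.config virtual table (a sketch; the value is illustrative):

    UPDATE system.config SET value = '500' WHERE name = 'view_flow_control_delay_limit_in_ms';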