[RELEASE] ScyllaDB 6.0

The ScyllaDB team is pleased to announce ScyllaDB Open Source 6.0, a production-ready major release.

ScyllaDB 6.0 introduces two major features which change the way ScyllaDB works:

  • Tablets, a dynamic way to distribute data across nodes that significantly improves scalability
  • Strongly consistent topology, Auth, and Service Level updates

In addition, ScyllaDB 6.0 includes many other improvements in functionality, stability, UX and performance.

Only the latest two minor releases of the ScyllaDB Open Source project are supported. With this release, only ScyllaDB Open Source 6.0 and 5.4 are supported. Users running earlier releases are encouraged to upgrade to one of these two releases.

Related Links

Related Links

New features

Tablets

In this release, ScyllaDB enabled Tablets, a new data distribution algorithm to replace the legacy vNodes approach inherited from Apache Cassandra.

While the vNodes approach statically distributes all tables across all nodes and shards based on the token ring, the Tablets approach dynamically distributes each table to a subset of nodes and shards based on its size. In the future, distribution will use CPU, OPS, and other information to further optimize the distribution.

In particular, Tablets provide the following:

  • Faster scaling and topology changes. New nodes can start serving reads and writes as soon as the first Tablet is migrated. Together with Strongly Consistent Topology Updates (below), this also allows users to add multiple nodes simultaneously.
  • Automatic support for mixed clusters with different core counts. Manual vNode updates are not required.
  • More efficient operations on small tables., since such tables are placed on a small subset of nodes and shards.

Read more about Tablets here.

Using Tablets

Tablets are enabled by default for new clusters.

  • Create a new KEYSPACE. As with Cluster level, Tablets are enabled by default for new Keyspaces.

You can set the initial number of Tablets per table using the “initial” parameter:

CREATE KEYSPACE … WITH TABLETS = { ‘initial’: 1 }

Or disable Tablets for the keyspace with the “‘enabled’ parameter:

CREATE KEYSPACE … WITH TABLETS = { ‘enabled’: false }

(See CREATE KEYSPACE docs)

All tables created in this Keyspace will use Tablets by default.

In this release, you can not use the following features with Tablets:

  • CDC
  • LWT
  • Counters

If you are planning to use one of these features, disable Tablets when creating the Keyspace.

Note: you can not ALTER an existing Keyspace to switch between Tablets and vNode based table and back.

We will remove these restrictions in upcoming patches and minor releases.

Procedures

With Tablets, the Replication Factor (RF) cannot be updated to a value higher than the number of nodes per Data Center (DC). This feature protects the Admin from setting an impossible-to-support RF. This affects the following operations:

Node Decommission / Remove

Starting from 6.0, you cannot decommission or remove a node if the resulting number of nodes would be smaller than the largest non-zero replication factor (for any keyspace) in this DC

For example:

1 DC, 5 nodes, a KS with RF=5

The decommission request will fail

CREATE and ALTER a Keyspace

You can create a KS with an RF that is greater than the number of nodes, but you cannot create a Table in this KS until you add nodes to match the RF.

You can not alter the RF of a KS with Tablets, to be less than the available nodes.

Monitor Tablets

To Monitor Tablets in real time, upgrade ScyllaDB Monitoring Stack to release 4.7, and use the new dynamic Tablet panels, below.

Driver Support

The Following Drivers support Tablets

  • Java driver 4.x, from 4.18.0.2 (to be release soon)
  • Java Driver 3.x, from 3.11.5.2
  • Python driver, from 3.26.6
  • Gocql driver, from 1.13.0
  • Rust Driver from 0.13.0

Legacy ScyllaDB and Apache Cassandra drivers will continue to work with ScyllaDB but will be less efficient when working with tablet-based Keyspaces.

Strongly Consistent Topology Updates

With Raft-managed topology enabled, all topology operations are internally sequenced consistently. A centralized coordination process ensures that topology metadata is synchronized across the nodes on each step of a topology change procedure.

This makes topology updates fast and safe, as the cluster administrator can trigger many topology operations concurrently, and the coordination process will safely drive all of them to completion. For example, multiple nodes can be bootstrapped concurrently, which couldn’t be done with the previous gossip-based topology.

Strongly Consistent Topology Updates is now the default for new clusters, and should be enabled after upgrade for existing clusters.

Strongly Consistent Auth Updates

System-auth-2 is a reimplementation of the Authentication and Authorization systems in a strongly consistent way on top of the Raft sub-system.

This means that Role-Based Access Control (RBAC) commands like create role or grant permission are safe to run in parallel without a risk of getting out of sync with themselves and other metadata operations, like schema changes.

As a result, there is no need to update system_auth RF or run repair when adding a DataCenter.

Strongly Consistent Service Levels

Service Levels allow you to define attributes like timeout per workload.

Service levels are now strongly consistent using Raft, like Schema, Topology and Auth.

#17926

Describe Schema with Internals

Until this release, CQL DESCRIBE SCHEMA was not sufficient to do a full schema restore from backup. For example, it lacks information about dropped columns.

In 6.0, the DESC SCHEMA WITH INTERNALS command provides more information, streamlining the restore process.

#16482

Native Nodetool

The nodetool utility provides simple command-line interface operations and attributes.

ScyllaDB inherited the Java based nodetool from Apache Cassandra. In this release, the Java implementation was replaced with a backward-compatible native nodetool.

The native nodetool works much faster. Unlike the Java version ,the native nodetool is part of the ScyllaDB repo, and allows easier and faster updates.

Removing the JMX Server

With the Native Nodetool (above), the JMX server has become redundant and will no longer be part of the default ScyllaDB Installation or image.

If you are using the JMX server directly (not via nodetool):

  • We advise you to move to work directly with the ScyllaDB REST API
  • You will need to install the JMX server yourself.

Related issues: #15588 #18566 #18472 #18566

Maintenance Mode

Maintenance mode is a new mode in which the node does not communicate with clients or other nodes and only listens to the local maintenance socket and the REST API. It can be used to fix damaged nodes – for example, by using nodetool compact or nodetool scrub. In maintenance mode, ScyllaDB skips loading tablet metadata if it is corrupted to allow an administrator to fix it. #5489

Maintenance Socket

The Maintenance Socket provides a new way to interact with ScyllaDB from within the node it runs on. It is mainly for debugging. You can use CQLSh with the Maintenance Socket as described in the Maintenance Socket docs. #16172

Deployment

  • Ubuntu 2024.04 is now supported.

  • RHEL / CentOS 7 support is deprecated.

  • The setup utility now works with disks that do not have UUIDs, such as those in some virtualized environments #13803

  • The scylladb-kernel-conf package tunes the Linux kernel scheduler via sysfs to improve latency. These tunings were lost in Linux 5.13+ due to kernel changes. They are now restored. #16077

  • Docker: can not connect to Scylla 5.4 with CQLSh without providing host IP #16329

  • Update rust packages

    • “Rustix”
    • “chrono”
      #15772
  • On Ubuntu, the installer now handles conflicts between a system process updating apt metadata and the installer itself.#16537

Improvements

The following is a list of improvements and bug fixes included in the release, grouped by domain.

Bloom Filters

Bloom filters are used to determine which SStables do not contain a partition key, speeding up reads when SStables can be filtered out.

Since the Bloom filters are held in memory, and their size depends on the data (small partitions require larger Bloom filters), there is a tradeoff between allocated memory, and the risk of OOM and the filter efficiency. The following improvements were made to Bloom filters in this release:

  • To generate efficient Bloom filters, we estimate the number of partitions in the SStable we will produce. The estimation has been improved for data models where the partition keys dominate the on-disk size.#15726
  • When writing an SStable, we estimate how many partitions it will have in order to size the bloom filter correctly. A few bugs were corrected for this estimation with the Time Window Compaction Strategy. #15704.
  • For stability, we drop Bloom filters from memory if their total memory consumption is above the configured limit. #17747
  • Also for stability, a reclaimed Bloom filter is left on disk when the SSTable is deleted #18398
  • ScyllaDB drops Bloom filters when they use up too much space. It will now reload them when space is available again (for example due to compaction). #18186
  • ScyllaDB estimates a compaction’s partition count in order to correctly size the Bloom filter. It will now improve the estimate for garbage collection SStables. #18283

Stability and performance

Compaction Related

  • Compaction will now avoid garbage-collecting tombstones that potentially delete data in commitlog. This prevents data resurrection in the event that a node crashes and replays commitlog. This is rare since generally commitlog data is relatively fresh and tombstones that delete such data would not be garbage collected for other reasons. #14870
  • Major compaction will now merge also sstables that were streamed in by e.g. repair, repair-based node operations. Those sstables were wrongly ignored by major compaction, leading to #11915
  • Stability: save memory by making regular compaction tasks internal #16735
  • A possible crash in the REST API when stopping compaction on a keyspace was fixed. #16975
  • During shutdown, we now wait for compactions to complete when shutting down system tables to avoid updates to system.compaction_history from racing with its shutdown. #15721
  • A potential data resurrection problem when cleanup was performed as a side-effect of regular compaction was fixed. #17501 #17452.
  • During compaction tombstone garbage collection, we now consider the memtable only if it contains the key . This prevents old data in memtable from preventing tombstone garbage collection. #17599

Commitlog Related

  • Each node contains a local system.truncated table containing truncation records for other tables on the node. This table is now cleared after commitlog replay to avoid it being re-interpreted incorrectly in the case of two consecutive commitlog replays. #15354
  • Stability: tombstone might not be garbage-collected due to conflicts with data in commitlog. #15777
  • Commitlog: During commitlog replay, we skip over corrupted sections. However if the corrupted section also has corrupt size, it can lead to a crash. This is now fixed. #15269
  • Commitlog will now avoid going over its configured disk space size. To do so, it will flush memtables earlier. #10974
  • Commitlog format has changed to individually checksum each disk sector. This allows discrimination between three types of sectors: those containing valid data, those corrupted, and those that contain data from a previous use of the commitlog segment file. This should reduce false-positive warnings about commitlog corruptions during replay. Note the format is not compatible with previous versions, so upgrades should flush memtables first. #15494
  • ScyllaDB can store changes to its own metadata using a separate commitlog, to prevent it waiting for user data to commit. This separate commitlog is now mandatory. #16254
  • When recovering after a crash, we skip commitlog replay for data that we know was captured in sstables. However, we must ignore sstables that were not created on this node, as the commitlog positions they refer to are invalid on this node. #10080
  • When replaying commitlog mutations after a crash, we now ignore mutations to tablets that were migrated away from this node, preventing data resurrection. #16752

Cluster Operation Related

Topology changes, Repairs, etc

  • To fix an edge case, when a node joins the raft group, it now first waits for the first node to enter the NORMAL state #15807
  • Topology: An edge case where a joining node is rejected from the cluster, but the rejection message times out, has been fixed. #15865
  • A cleanup operation will now flush memtables, to prevent data that managed to stay in memtables from not being cleaned up. #4734
  • A protocol change that caused streaming in a mixed version cluster to fail has been rectified. #17175
  • A bug in repair that could have caused problems with mixed-version clusters has been fixed. #16941
  • Streaming now has additional protection for the case where the streamed table is dropped during the process. #15370
  • CDC: Each time topology changes, a new generation is created with a new set of CDC streams for the application to listen to. We now allow writes to be written to the previous generation’s streams, to accommodate clients with unsynchronized clocks, or with lightweight transactions (LWT) which can have the same effect as unsynchronized clocks. This also works with consistent topology. #7251 #15260
  • A hang when a decommissioned node is restarted was fixed; the operation now correctly fails. #17282
  • Repair failures if a table is dropped during repair was fixed. #17028 #15370 #15598
  • A deadlock in repair has been fixed #17591
  • A use-after-free error that manifested during repair with very large partition keys is now fixed. #17625
  • Stability: repair: memory accounting does not include keys #16710
  • service levels on raft: upgrade might get stuck if nodes were removed recently #18198

Materialized Views Related

  • In some cases a materialized view table’s schema can be constructed when the base table’s schema is not yet known. We now avoid this illegal state.#14011
  • ScyllaDB invalidates materialized view prepared statements when the schema of the base table changes. This prevents a stale prepared statement from returning incorrect results. #16392
  • Materialized view memory accounting is now more accurate. This reduces the risk of running out of memory in materialized view intensive workloads. #17364
  • A rare failure when a materialized view update happened to be empty was fixed. #15228 #17549
  • MV: Tracking of memory within materialized view updates was improved. #17854
  • MV: A crash when a view was created while a TRUNCATE operation was in progress was fixed . #17543
  • MV: Materialized views track the amount of work in progress in order to limit it. Due to a bug, if the base replica had to update both a local view replica and a remote view replica, then only the work to update the local view replica was tracked. This could lead to running out of memory, and is now fixed. #17890
  • We now fail base table writes rather than dropping materialized view updates, to reduce base/view inconsistencies. #17795
  • MV: Delete all partitions from the table with few MV failed with “bad_alloc” and cluster shutted down #12379
  • MV: ScyllaDB disseminates the materialized view update backlog in order to control update rates. A bug that prevented this in some circumstances was fixed. #18462

Performance Related

  • The automatic creation of internal keyspaces and tables has been streamlined, resulting in faster start-up. #15437
  • The start stop native transport API (used by nodetool enablebinary) mistakenly launched the listener in the streaming scheduling group, causing subsequent queries to run with reduced priority compared to maintenance operations such as repair and bootstrap. The listener is now correctly launched in the statement scheduling group. #15485
  • Node startup will now recalculate the schema digest fewer times during restart, resulting in faster startups. #16112
  • Repair has gained a new mode for small tables that makes repair significantly faster #16011
  • ScyllaDB detects internal stalls using a stall detector. It can do so using a timer, or using a more hardware performance counter, which can also track kernel stalls. It now sets up permissions for itself to use the performance counter #15743
  • The gossip-based schema propagation code uses a hash of the schema metadata to check if nodes have the same schema, since changes to the schema can happen in different nodes independently. This schema calculation can be time consuming in a cluster with thousands of tables. In consistent schema mode, therefore, we no longer compute a hash of the entire schema but instead generate a schema version using a timeuuid. This speeds up operations on clusters with many tables. #7620
  • The rewritesstables command will now execute in the streaming/maintenance group, reducing its impact on the rest of the system. #16699
  • The system will now scan sstable files during startup in a way that prevents fragmentation of the kernel inode and dentry caches. This helps reduce memory pressure on systems with many sstables. #14506
  • In tables with many small partitions (or many partition tombstones), sstable index pages can contain many entries. They are now destroyed gently to avoid stalls. #17605
  • Performance: A reactor stall when reading the materializing very large schemas has been fixed. #17841
  • scylla start very slowly, Spend a lot of time for Loading repair history #16774 #17993

All kinds of edge cases

  • A bug in which a read retry after an allocation failure caused a crash was fixed. #15578
  • Streaming now checks more closely if a table was dropped while it was streamed. #15370
  • ScyllaDB uses two separate memory reservation systems for memtables: user, used for user writes, and system, used for ScyllaDB’s own writes. This ensures that a heavy write workload does not impact ScyllaDB’s internal housekeeping. The raft table was moved to the system memory reservation #15622
  • A number of rare bugs in the row cache were fixed. #15483
  • Stability: A regression in IPv6 address formatting, which caused nodetool problems, like breaking when there is an Alternator GSI in the database #16153, or cause a node to be stuck with “?U” status and Host ID is "null #16039
  • Scylla crashes when table is dropped concurrently with streaming #16181
  • The fencing mechanism prevents a read or write from accessing an outdated replica. The mechanism is now disabled on local tables, as these can never be out of date. #16423
  • Internal updates to system.peers ensure the host_id column is filled in correctly; this is required for Raft to operate correctly. Note all released versions already fill in this column. #16376
  • Queries to a local table will now no longer be automatically parallelized to avoid shutdown problems. Local tables typically have little data and don’t benefit from parallelization. #16570
  • The hint manager is now started later in the boot process, until we have better information about other nodes in the cluster. #11870
  • A recent regression in calculating whether we reached the in-flight hint limit was fixed . #17636
  • The hinted handoff code now uses host IDs instead of IP addresses. This conforms with consistent topology, which uses host IDs throughout. #12278
  • Change Data Capture (CDC) maintains a history of the cluster topology to allow clients to fetch older change records. This history is now pruned earlier to reclaim space. #18497
  • Large data handler does not take range tombstones into account #13968
  • Reader-concurrency-semaphore: saved readers can refer to dangling tablets if tablet is migrated away #18110

Guardrails

Guardrails is a framework to protect ScyllaDB users and admins from common mistakes and pitfalls. In this release the following Guardrails are added:

  • Replication strategies Guardrails can be used to deny SimpleStratagy strategy, for example, which is not recommended for production.

Two new configurations

replication_strategy_warn_list

replication_strategy_fail_list

Replace the old restrict_replication_simplestrategy and give more granularity for DB Admin to warn, or block non production strategies.

CQL

  • Remove dclocal_read_repair_chance and read_repair_chance options. Both was removed from Apache Cassandra 3.11 and 4.x, see [#CASSANDRA-13910] .
  • ScyllaDB now rejects adding fields of type duration to a user-defined type ((UDT), if that UDT is used as a clustering key component. #12913
  • Correctness: a rare combination of reconciliation (read repair) with reverse queries and range tombstones, could cause incorrect data to be returned from queries #10598
  • CQL: toJson() produces invalid JSON for columns with “time” type #7988
  • CQL: a 5.4.0 regression in GROUP BY behavior #16531
  • CQL tracing now records two additional fields: the user who initiated the query, and the reader concurrency semaphore that is in charge of admitting the query on the replica.
  • There is a new CQL statement, LIST EFFECTIVE SERVICE LEVEL, to query what service level attributes apply to a role. A role’s service level attributes may be a mix of different SERVICE LEVEL configurations. #15604
  • CQL multi-column comparisons now have better NULL checks. #13217
  • DROP TYPE IF EXISTS statement now works if the type’s keyspace doesn’t exist. #9082
  • Correctness: a bug in the row cache could cause a query to return incomplete data if, under moderate load, the query was preempted. This is now fixed. Note that QUORUM reads would usually see the missing data completed from the other replica, so the bug is mostly visible with consistency level ONE or similar. #16759
  • Correctness: A bug was fixed where a deletion in a base table that affected a large number of materialized view rows to be updated could cause some updates to be missed. #17117
  • The CQL “date” type is now more resistant to overflow. #17066
  • The totimestamp() CQL function is now protected against undefined behavior when applied to extreme values. #17516
  • INSERT JSON now understands more data types when used as keys: blob,date,inet,time,timestamp,timeuuid,uuid #18477
  • An edge case in converting JSON numbers close to the maximum integer has been fixed. #13077
  • A crash in the mintimeuuid() function has been fixed. #17035
  • Unpaged queries will no longer fail when they reach the tombstone limit. The tombstone limit ends a page in an unpaged query, but unpaged queries are allowed to scan tombstones until they time out of reach the memory limit. It is strongly recommended to use paged queries.#17241

Alternator

  • Alternator now responds correctly to the DeleteTable API. #14132
  • An ABBA-style deadlock was fixed for this scenario: if an Alternator table used strong write isolation and had a global secondary index, then index updates could wait on the base table update, and vice versa. This also happens with materialized views on a table updated using lightweight transactions (LWT). #15844
  • Alternator now supports the ReturnValuesOnConditionCheckFailure feature. This makes handling contention more efficient. #14481
  • A correctness problem when using the Alternator Global Secondary Index that has two key columns has been fixed. #17119

Images and packaging

  • The container image now avoids installing “suggested” packages, to reduce its size. #15579
  • Prometheus node_exporter, used to expose operating system metrics, was updated to version 1.7.0 to address a security vulnerability. #16085
  • Remove Java from the docker image #18566

Tooling and REST API

  • Scylla REST API is now documented! (beta)

  • The sstable validation tools, scylla sstable validate-checksum and scylla sstable validate, now returns output in json format.

  • Admin API: a new API for asynchronous compaction:

    /tasks/compaction/keyspace_compaction/{keyspace}

    Similar to the existing synchronos storage_service API. #15092

  • Bundled tools now use the Seastar epoll reactor backend rather than linux-aio; this reduces the risk of startup failures.

  • The bundled cqlsh has been updated , with a fix for a COPY TO STDOUT regression. #17451

  • The REST API for reporting cache statistics is now more accurate. #9418

  • Some false-positives were eliminated from the scrub command. #16326

  • The iotune utility is used to measure a disk’s performance when installing ScyllaDB. It now works when executed on machines with a very large core count (208 cores )#18546

Configuration

  • There is now a mechanism to deprecate configuration options. #15887
  • The me sstable format is now mandatory. The system can still read older formats. #16551
  • The service level mechanism works by polling the internal tables used to represent it. The polling interval can now be configured. This is useful to speed up tests.
  • The scylla sstable tool now supports loading the schema of materialized views and indexes. #16492
  • The unchecked_tombstone_compaction compaction option has been implemented. Setting this option will ignore tombstone_threshold. #1487
  • Failure detector’s ping timeout is configurable, with ping_timeout, and with higher default (600ms) #16607

Tracing

  • The large partition system tables now count range tombstones as regular rows, so that large partitions with mainly range tombstones are reported. #18346
  • Tracing: Large partition/row/cell warnings should not print keys contents #18041

Monitoring

Scylla Monitoring Stack released 4.7 and later supports ScyllaDB 6.0.

See metrics update between 5.4 and 6.0 here, as well as the new, beta, metrics reference here.

More monitoring related updates:

  • Internode remote procedure call metrics are now reported per (datacenter, verb) combination. This makes the latency metrics more meaningful. #15785

Deprecated and removed features

  • Remove probabilistic read repair #3502 from the code. Probabilistic Read repairs were deprecated for a long time. This is a mirror for CASSANDRA-13910.

Bug fixes and stability

See git log for a full (long) list of fixed issues.

2 Likes