[RELEASE] Scylla 5.2 RC1

The ScyllaDB team is pleased to announce ScyllaDB Open Source 5.2 RC1, the first Release Candidate for the ScyllaDB Open Source 5.2 minor release.

ScyllaDB 5.2 introduces Raft-based Strongly Consistent Schema Management, Alternator TTL, and many more improvements and bug fixes.

Find the ScyllaDB Open Source 5.2 repository for your Linux distribution here. ScyllaDB 5.2 RC1 Docker is also available.

We encourage you to run ScyllaDB 5.2 release candidates on your test environments; this will help ensure that an upgrade to ScyllaDB 5.2 General Availability will proceed smoothly with your workload. Use the release candidate with caution; RC1 is not production-ready yet. You can help stabilize ScyllaDB Open Source 5.2 by reporting bugs here.

Only the last two minor releases of the ScyllaDB Open Source project are supported. Once ScyllaDB Open Source 5.2 is officially released, only ScyllaDB Open Source 5.2 and ScyllaDB 5.1 will be supported, and ScyllaDB 5.0 will be retired.

Moving to Raft based cluster management

Consistent Schema Management is the first Raft based feature in ScyllaDB, and ScyllaDB 5.2 is the first release to enable Raft by default.

#12572

Starting from ScyllaDB 5.2, all new databases will be created with Raft enabled by default. Upgrading from 5.1 will only use Raft if you explicitly enable it (see upgrade to 5.2 docs). As soon as all nodes in the cluster opt-in to using Raft, the cluster will automatically migrate those subsystems to using Raft, and you should validate it is the case.

Once Raft is enabled, every cluster-level operation, like updating schema, adding and removing nodes, and adding and removing data centers - requires a quorum to be executed.

For example, in the following use cases, the cluster does not have a quorum and will not allow updating the schema:

  • A cluster with 1 out of 3 nodes available
  • A cluster with 2 out of 4 nodes available
  • A cluster with two data centers (DCs), each with 3 nodes, where one of the DCs is not available

This is different from the behavior of a ScyllaDB cluster with Raft disabled.

Nodes might be unavailable due to network issues, node issues, or other reasons.

To reduce the chance of quorum loss, it is recommended to have 3 or more nodes per DC, and 3 or more DCs, for a multi-DCs cluster.

To recover from a quorum loss, the best is to revive the failed nodes or fix the network partitioning. If this is impossible, see Raft manual recovery procedure.

More on handling failures in Raft here.

New Features

Strongly Consistent Schema Management

Schema management operations are DDL operations that modify the schema, like CREATE, ALTER, or DROP for KEYSPACE, TABLE, INDEX, UDT, MV, etc.

Unstable schema management has been a problem in all Apache Cassandra and ScyllaDB versions. The root cause is the unsafe propagation of schema updates over gossip, as concurrent schema updates can lead to schema collisions.

Once Raft is enabled, all schema management operations are serialized by the Raft consensus algorithm.

Additional Raft related updates:

  • The node replace procedure, used to replace a dead node, will now work with Raft (when enabled). #12172
  • The schema of the system.raft_config table, used for storing Raft cluster membership data, has been streamlined.
  • Raft group 0 (responsible for managing topology and schema) now has improved availability during removenode and decommission. #11723.
  • Commitlog has a limit on the largest mutation it can write to disk. Raft now knows about this limit and rejects a too-large command if it doesn’t fit within the limit, rather than crashing.
  • Raft broadcast tables, an experimental replication strategy that uses Raft, now supports bind variables in statements.
  • The Raft protocol implementation now supports changing a node’s IP address without changing its identity.
  • Raft failure detection now uses a separate RPC verb from gossip failure detection.
  • Raft’s error handling when an ex-cluster-member tries to modify configuration was improved; it is now rejected with a permanent error.
  • The node replace procedure, used to replace a dead node, will now work through Raft (when enabled).
  • Building on the uniqueness of host IDs, Raft will now use the host ID rather than its own ID. This simplifies administration as there is now just one host ID type to track.
  • A crash in the task manager was fixed. #12204
  • The server will warn if a peer’s address cannot be found when pinging it. #12156
  • Raft group 0 verbs are used for metadata management, and so always run on shard 0. It was assumed that registering those RPC verbs only on shard 0 would therefore be sufficient, but that’s not the case on all configurations. This was fixed by registering the verbs on all shards. #12252

Alternator TTL

In ScyllaDB 5.0 we introduced Time To Live (TTL) to DynamoDB compatible API (Alternator) as an experimental feature. In ScyllaDB 5.2 we promote it to production ready. #12037 #11737

Like in DynamoDB, Alternator items that are set to expire at a specific time will not disappear precisely at that time but only after some delay. DynamoDB guarantees that the expiration delay will be less than 48 hours (though for small tables, the delay is often much shorter). In Alternator, the expiration delay is configurable - it defaults to 24 hours but can be set with the --alternator-ttl-period-in-seconds configuration option.

Large Collection Detection

ScyllaDB records large partitions, large rows, and large cells in system tables so that the primary key can be used to deal with them. It additionally records collections with large numbers of elements, since these can cause degraded performance.

The warning threshold is configurable: compaction_collection_elements_count_warning_threshold - how many elements are considered a “large” collection (default is 10,000 elements).

The information about large collections is stored in the large_cells table, with a new collection_elements column that contains the number of elements of the large collection.

Large_cells table retention is 30 days. #11449

Example of a large collection below:

(cqlsh) SELECT * from system.large_cells;

keyspace_name | table_name | sstable_name | cell_size | partition_key | clustering_key | column_name | collection_elements | compaction_time

---------------+------------+-------------------+-----------+---------------+----------------+-------------+---------------------+---------------------------------

ks | maps | me-10-big-Data.db | 10929 | user | | properties | 1001 | 2023-02-06 13:07:23.203000+0000

(1 rows)

Automating away the gc_grace_seconds parameter

There is now optional automatic management of tombstone garbage collection, replacing gc_grace_seconds. This drops tombstones more frequently if repairs are made on time, and prevents data resurrection if repairs are delayed beyond gc_grace_seconds. Tombstones older than the most recent repair will be eligible for purging, and newer ones will be kept. The feature is disabled by default and needs to be enabled via ALTER TABLE.

For example:

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {‘mode’:‘repair’};

Materialized Views: Synchronous Mode

There is now a synchronous mode for materialized views. In ordinary, asynchronous materialized views the operation returns before the view is updated. In synchronous materialized view the operation does not return until the view is updated. This enhances consistency but reduces availability as in some situations all nodes might be required to be functional.

Example:

CREATE MATERIALIZED VIEW main.mv

AS SELECT * FROM main.t

wHERE v IS NOT NULL

PRIMARY KEY (v, id)

WITH synchronous_updates = true;

ALTER MATERIALIZED VIEW main.mv WITH synchronous_updates = true;

Synchronous Mode reference in Scylla Docs

Empty replica pages

Before this release, the paging code requires that pages have at least one row before filtering. This can cause an unbounded amount of work if there is a long sequence of tombstones in a partition or token range, leading to timeouts. ScyllaDB will now send empty pages to the client, allowing progress to be made before a timeout. This prevents analytics workloads from failing when processing long sequences of tombstones. #7689, #3914, #7933

Secondary index on collection columns

Secondary indexes can now index collection columns. Individual keys and values within maps, sets, and lists can be indexed. Fixes #2962, #8745, #10707

Examples:

CREATE TABLE test(int id, somemap map<int, int>, somelist<int>, someset<int>, PRIMARY KEY(id));
CREATE INDEX ON test(keys(somemap));
CREATE INDEX ON test(values(somemap));
CREATE INDEX ON test(entries(somemap));
CREATE INDEX ON test(values(somelist));
CREATE INDEX ON test(values(someset));
CREATE INDEX IF NOT EXISTS ON test(somelist);
CREATE INDEX IF NOT EXISTS ON test(someset);
CREATE INDEX IF NOT EXISTS ON test(somemap);
SELECT * FROM test WHERE someset CONTAINS 7;
SELECT * FROM test WHERE somelist CONTAINS 7;
SELECT * FROM test WHERE somemap CONTAINS KEY 7;
SELECT * FROM test WHERE somemap CONTAINS 7;
SELECT * FROM test WHERE somemap[7] = 7;

Improvements

CQL API

  • ScyllaDB now supports server-side DESCRIBE. This is required for the latest cqlsh, and reduces the need to update cqlsh as server features are added. However, the version number bump needed to inform cqlsh about this change was reverted as other changes related to the version number are not ready.
  • Secondary indexes on static columns are now supported.#2963
  • The CQL binary protocol versions 1 and 2 are no longer supported. Version 3 and above have been supported for 9 years, so it’s unlikely to be in real use. You can check for version 1 and 2 in the system.clients virtual table. #10607
  • There is now documentation about how NULL is treated in ScyllaDB. #12494
  • USING TIMESTAMP allows setting the mutation timestamp on a CQL statement level. It has a sanity check that prevents setting timestamps in the future, as these can be hard to delete, but sometimes one wishes to do so anyway. There is now a configuration option that allows disabling the feature. #12527
  • TRUNCATE statements are usually much slower than other statements. TRUNCATE statements now support the WITH TIMEOUT clause to help deal with that. #11408
  • The CQL CONTAINS and CONTAINS KEY operators are used to check if a collection contains an element. CONTAINS and CONTAINS KEY have been changed to return false if it is asked whether a collection contains NULL, since no collection can contain a NULL. This is a behavior change, but is not expected to affect users since the query is not useful. #10359
  • In the expression element IN list, we now allow “list” to be NULL (and the expression evaluates to false). Previously, this failed with an error.
  • A regression where CQL ignored some WHERE clause components when a multi-column restriction ((col_a, col_b) < (0, 0)) was present was fixed. #6200 #12014
  • In SELECT JSON statements, column names can be given aliases (just as with traditional SELECT). However, ScyllaDB ignored those aliases. It will now honor them. #8078
  • Evaluation of Boolean binary operators (e.g. “=”) has been refactored to use the same expression evaluation code as other expressions, paving the way for relaxation of the CQL grammar to be more similar to SQL.
  • The experimental User-defined aggregates (UDAs) have a state type that can be initialized using the INITCOND clause. The initializer can now be a collection type.
  • CQL: scylla: types: is_tuple(): doesn’t handle reverse types. For example, a schema with reversed clustering key component; this component will be incorrectly represented in the schema CQL dump: the UDT will lose the frozen attribute. When attempting to recreate this schema based on the dump, it will fail as the only frozen UDTs are allowed in primary key components.
    #12576 (RC1)

Amazon DynamoDB Compatible API (Alternator)

Alternator, ScyllaDB’s implementation of the DynamoDB API, now provides the following improvements:

  • Alternator will now associate service level information configured via the CQL ATTACH SERVICE LEVEL statement with the authenticated alternator user.
  • Bug fix: UpdateItem setting the same value again for GSI range key may drop the row #11801
  • Alternator has better validation of malformed base64 encoded values. #6487
  • Alternator now sets the Projection field in response to the DescribeTable request.
  • A bug in alternator WHERE condition for global secondary index range key, that does not appear to have any user visible impact, has been fixed.
  • Alternator implements Time-To-Live (TTL) by scanning data. Some shutdown-related hangs in the scanning process were fixed.
  • Alternator, uses the rapidjson library to parse queries presented as JSON. Due to a compile-time misconfiguration, aggressive inlining was not enabled. This is now fixed, yielding a substantial 5 performance improvement.
  • A Seastar update fixed an ambiguity in the http parser, which caused Alternator requests to be unnecessarily slow. 12468
  • Alternator Streaming API: unexpected ARN values by list streams paged responses. This issue may only affect users with many, more than 100, tables with streams #12601 (RC1)

Correctness

  • ScyllaDB represents reads using mutation fragment streams. Several minor violations of fragment stream integrity were fixed. These could result in incorrect reads during range scans. #12735
  • The view builder is responsible for building materialized views when a view is created and after repair. To control its memory consumption, it reads base table sstables in chunks. Due to a bug, it could forget the partition tombstone or an open range tombstone when moving between chunks, leading to a discrepancy between the base table data and the view data. This is now fixed. #11668
  • When a table with a materialized view had a large partition, and that large partition was deleted with the USING TIMESTAMP clause, the materialized view might be only partially updated. This is now fixed. #12297.
  • The row cache could have missed reading a row if a row is evicted concurrently while being read, plus other conditions. This is now fixed. #11239

Performance and stability

  • TRUNCATE statements are usually much slower than other statements. TRUNCATE statements now support the WITH TIMEOUT clause to help deal with that. #11408

  • ScyllaDB uses an interval map data structure from the Boost library to quickly locate sstables needed to service a read. Due to the way we interface with the library, updating the interval map was unnecessarily slow. This is now fixed. #11669

  • Materialized view building will now ignore partitions no longer owned by the node after a topology change.

  • When a materialized view processes updates to the base table, it locks the partition and clustering key. In some rare cases involving one of the locks timing out but the other not, this can cause a crash. This was fixed by acquiring the locks sequentially. #12168

  • The experimental WASM user defined function (UDF) implementation has been switched to Rust (UDFs can be written in any language WASM supports, not just Rust). The new implementation avoids stalls for long-running UDFs and shares memory with the rest of the database. #11351

  • The experimental WASM user defined function (UDF) has been re-enabled for aarch64 (ARM).

  • User defined aggregates (UDAs) are now correctly persisted and survive a cold start. UDAs are an experimental feature. #11309

  • A bug that caused user-defined aggregates not to be persisted correctly was fixed. #11327

  • A very rare bug involving reads from memtables and multi-version concurrency control has been fixed.

  • Automatically parallelized query aggregation, introduced in 5.1, used a node-local clock for timeouts, rather than the standard clock. This meant that automatically parallelized queries would fail, unless the nodes were started at the same time. This is now fixed. #12458

  • During startup, ScyllaDB makes sstables conform to the compaction strategy in a process called reshaping, so that future reads will perform well. It is now more careful when reshaping Leveled Compaction Strategy tables, to avoid doing unnecessary work. #12495.

  • Lightweight transactions are now more robust when the schema is changed during a transaction. #10770

  • A very rare bug involving reads from memtables and multi-version concurrency control has been fixed.

  • Single-partition reads could, in conditions involving multi-page queries, a completely empty page, and a partition or range tombstone covering the first row, terminate paging prematurely, leading to incorrect results. This is now fixed. #12361

  • Everywhere replication strategy exhibited quadratic behavior since each vNode has every node in its replica set. Since the number of vNodes is proportional to the number of nodes, the amount of work grows quadratically, slowing down topology operations on large clusters. We now special-case this path to avoid the problem. #10337

  • When using the EverywhereReplicationStrategy, vnodes have no impact on data distribution, and so we can avoid splitting the query across vnodes. This improves performance on tables with a small amount of data.

  • A small race between DROP TABLE and inactive reads (readers that have just completed sending a page to the coordinator and are awaiting the next page fetch) was fixed. #11264

  • The internal representation of a CREATE TABLE statement now includes the equivalent of a DROP TABLE. This is normally unnecessary, since CREATE TABLE checks if the table already exists and refuses to proceed, but in the case of two CREATE TABLE statements executing concurrently on two different nodes (but creating the same table), each node’s check misses the other’s concurrently created new table. If the table definitions are different, this can lead to a cluster crash due to the mixture between the two definitions being illegal. The change ensures that the two table definitions are not intermixed and just one survives. Users are still urged not to create tables with the same name concurrently until the Raft schema management transition is complete. #11396

  • Vestigial support for compaction groups has been merged. Compaction groups are token ranges that have dedicated sstables and memtables, and so can be compacted and moved independently of each other. Currently, exactly one compaction group per table is supported, so this doesn’t change anything, but it will be expanded in the future.

  • The database now performs sanity tests around TimeWindowCompactionStrategy tables, to limit the number of time windows. Too many time windows can cause the system to run out of memory or open file handles. #9029

  • Per partition rate limit error error reporting has been corrected to report the new error code when an updated driver is available.#11517
    Support for the new error code has been merged to:

    • Scylla Rust Driver support since version 0.6.0
    • Gocq - merged to master but it’s not a part of any release.
  • Off-strategy compaction is used when sstables from an external source (such as repair) needs to be reshaped before being handed off to the table’s compaction strategy. If off-strategy compaction is stopped, ScyllaDB used to just leave those sstables in their unreshaped form without compacting them again. It will now hand off such sstables directly to the table’s compaction strategy. #11543.

  • In compaction strategies such as LeveledCompactionStrategy and ScyllaDB Enterprise’s IncrementalCompactionStrategy, sstables are sealed when they reach a certain size (160MB and 1GB respectively). They are not split in the middle of a partition, because we were not able to recover the ordering of such split sstables. This is now possible (though not yet integrated into the compaction strategies).

  • Performance: Long-term index caching in the global cache, as introduced in 4.6, hurts the performance for workloads where accesses to the index are sparse. To mitigate this, a new configuration parameter cache_index_pages (default true) is introduced to control index caching. Setting the flag to false causes all index reads to behave like they would in BYPASS CACHE queries. Consider using false if you notice performance problems due to lowered cache hit ratio in 4.6 or 5.0. The config API can update the parameter live (without restart). #11202

  • Scylla will now reject a too-low bloom_fulter_fp_chance when creating (or altering) the table, rather than crash while flushing memtables. #11524.

  • ScyllaDB represents reads using mutation fragment streams. Several minor violations of fragment stream integrity were fixed. These could result in incorrect reads during range scans.

  • The log-structured allocator is used to manage cache and memtable memory. When memory runs out, the allocator tries to reclaim memory by evicting cache items and by defragmenting memory. If this takes too long, the allocator logs a stall report. Due to a bug, if the report threshold was set too low then the report is generated even if a stall did not happen, slowing down the system and flooding the logs. This is now fixed. #10981

  • The compaction manager now ignores out-of-disk-space (ENOSPC) exceptions when shutting down, so the server doesn’t crash in these scenarios.

  • An inaccuracy in the per_partition_rate_limit read metric was corrected. Tables with the per_partition_rate_limit property can throttle read and write activity on a per-partition basis. #11651

  • A large schema (with thousands of tables) could cause stalls when propagated from node to node. This is now fixed. #11574

  • A recently introduced regression caused a crash when speculative retry was enabled. This is now fixed. #11825

  • segmentation fault in cases where the base table schema change while MV schema is cached #10026, #11542

  • A crash when the compaction manager was asked to stop multiple times (for different reasons) was fixed.

  • A problem with RPC connections being needlessly dropped was fixed. #11780

  • When a query completes a page, ScyllaDB caches the query activity as an inactive read. When the client requests the next page, ScyllaDB re-activates the read and continues where it left off. A bug in this mechanism that could cause crashes has been fixed. #11923

  • A crash was fixed during an illegal lightweight transaction INSERT with NULL clustering key

  • ScyllaDB caches rows and (since 4.6) index entries in a single unified cache. It was observed that in some small-partition workloads index caching causes a performance regression, so index caching is now disabled by default. It can still be enabled for workloads that benefit from it. We plan to re-enable it when the regression is fixed. #11889

  • The topology management code is more relaxed about unknown endpoints to prevent crashes in tests that check for edge cases. This fixes a recent regression. #11870

  • Hinted handoff now checks that a node exists in topology before doing anything; this helps with a recent regression due to topology refactoring.

  • A crash while fetching repaid ids from the repair history table was fixed. #11966

  • Usually repair can compare and update the same shard in different nodes, for example shard 3 in one node is compared against shard 3 in another. When the number of shards in nodes is dissimilar, this doesn’t work and each shard compares against data from multiple shards in other nodes. This is now made more efficient by reducing sstable reader thrashing for this dissimilar shard count case. #12157

  • The algorithm for removing nodes from the token ring was corrected and made more efficient. It’s not known that this had any user impact. #12082

  • The CQL server will now only run requests that benefit from concurrency (e.g. QUERY and EXECUTE) in parallel. Configuration and authentication related requests will be serialized, reducing the chance for errors in those code paths.

  • A rare bug involving an allocation failure while updating cached rows was fixed. #12068

  • The system.truncated table holds information about truncation times of user tables. A recent regression caused it to be unreadable by cqlsh. It is now fixed. 12239

  • COMPACT STORAGE tables allow the user to only specify a prefix of a compound clustering key. Bugs relating to such partial keys and reversed rows were fixed.Note that compact storage is deprecated (see section). #12180

  • ScyllaDB now supports multiple compaction groups 1. This is not a user-visible feature for now.

  • Some copies of the lists of ranges to stream were eliminated from the decommission path, reducing latency spikes. #12332

  • When the global index cache is disabled, a local (per query) cache was used instead. When that cache was destroyed, a stall could result, generating a latency spike. This is now fixed. #12271

  • Compaction manager generally reacts to events to initiate compactions, but also has an hourly timer in case an event was missed (and for tombstone compaction, which isn’t triggered by an event). This timer is now less susceptible to stalls. #12390.

  • Repair tried to trigger off-strategy compaction even for a table that was dropped during repair, failing the entire repair. It ignores the dropped table now.

  • Off-strategy compaction is now enabled for all streaming topology operations (adding and removing nodes). Previously it was enabled only for repair-based node operations. Off-strategy compaction takes advantage of the fact that incoming sstables are non-overlapping to perform more efficient compaction that the one performed by the regular compaction strategy.

  • ScyllaDB sometimes reads ahead of the user request, in order to hide latency. In one case a read-ahead request which timed out caused errors to be emitted, even though this did not affect the query. The errors are now silenced. #12435

  • ScyllaDB tracks transient memory used by queries. However, it did not track decompressed memory buffers, which could lead to running out of memory in some complicated queries. This is now fixed.

  • “Unset” values are an obscure prepared statement feature that allows only some columns in an UPDATE or INSERT statement to be modified. It was a source of minor bugs and inconvenience in code. The feature has been refactored so it has less impact on the code and is more robust.

  • Fix a crash in Materialized View update row locking, caused by a race condition #12632 (RC1)

  • Fix a crash when reporting error on invalid CQL query involving field selection from a user-defined type #12739 (RC1)

Operations

  • Task Manager - a generic job tracker for long-running jobs. The Task Manager is a new per-node API that makes longlive internal tasks (such as repair or compaction) observable and controllable. Task Manager updates in 5.2 are:
    • The task manager will no longer run tasks that were aborted before being started.
    • The task manager is now aware of repair-related tasks.
    • The task manager now controls repair tasks with shard granularity.
    • A crash in the task manager related to repair tasks has been fixed.
    • Repair tasks can now be aborted via the task manager.
  • Materialized Views: When auto-snapshot is enabled and a snapshot of a base table is taken, its views are now also snapshotted. #11612
  • The removenode operation will reject streaming data from the leaving node. #11704
  • It is now possible to replace a node by mentioning its host id, rather than its IP address. This is useful in container environments, where IP addresses are transient.
  • A node checks if it was removed from the cluster before rejoining. #11355
  • In the system.local system table, we now advertise broadcast_rpc_address rather than rpc_address, similar to system.peers and Cassandra 4.1. #11201
  • Decommissioning a node is now rejected if some node is down, so that it’s possible to inform all nodes about the topology change. Similarly, bootstrap is rejected if there is a node in an unknown state. #11302
  • Gossip no longer updates topology with the IP address of a removed node, during a replacenode operation, preventing a dead node from being resurrected. #11925
  • The replacenode operation is used to replace a dead node. It has now changed to generate a new unique host ID rather than taking over the ID of the node it replaces. This makes the cluster more robust against cases where a node previously thought dead comes back alive.

Deployment and install

  • ScyllaDB Image is now available for Azure.
  • If you are running with a multi data center deployment, check the Raft section above.
  • Scylla Images (AMI, Docker, GCP IMage) are now using ubuntu-minimal instead of full Ubuntu.
  • The container (docker) image now uses the C locale 1 to reduce image size. Part of the ubuntu-minimal update.
  • There are now two tar packages: the standard package now does not contain debug information, while a new package with the -unstripped suffix contains the debug information. This reduces download time.
  • The .tar.gz package contents are now under a directory, similar to existing practice in most tarball packages. #8349
  • The bundled Prometheus node_exporter package was upgraded to version 1.4.0. #11400
  • The Azure snitch, used to derive datacenter and rack information from instance metadata, now handles regions which have a single availability zone. #12274
  • The container (Docker) image is now based on Ubuntu 22.04.
  • The “snitch” is used to derive the node’s location (data center and rack). The AWS EC2 snitch now uses Instance Metadata Service V2 (IMDSv2) to access instance metadata, per AWS recommendations. #10490
  • The installer no longer checks that systemd is installed if --without-systemd is provided. #11898
  • The bundled Java driver was updated to version 3.11.2.4.
  • The various cloud snitches, which determine the datacenter and rack association of a node, improved error checking of their communication with the cloud provider.
  • The docker image is now more robust against different network conditions, which could cause it to fail to launch. #12011

Tools

Configuration

  • The Cassandra tombstone_warn_threshold (default 1000) configuration for the maximum number of tombstones a query can scan before a warning item is now respected, producing a warning if a query takes too long.
  • Messaging will now prevent 0.0.0.0 and its IPv6 equivalent from being used as a node IP address.
  • More new config parameters:
    • Restrict_future_timestamp Controls whether to detect and forbid unreasonable USING TIMESTAMP, more than 3 days into the future. See Sanity check for USING TIMESTAMP above.
    • replace_node_first_boot - The Host ID of a dead node to replace. And alternative to the old replace_address_first_boot which uses the old node address. See replace node docs.
    • WASM (experimental feature) related configs:
      • wasm_cache_memory_fraction
      • wasm_cache_timeout_in_ms
      • wasm_cache_instance_size_limit
      • wasm_udf_yield_fuel
      • wasm_udf_total_fuel
      • wasm_udf_memory_limit
    • consistent_cluster_management - replace the Raft experimental flag (see Raft above)
    • x_log2_compaction_groups - new config for setting static number of compaction groups
    • unspooled_dirty_soft_limit - replace the old virtual_dirty_soft_limit.
    • compaction_collection_elements_count_warning_threshold - see large collection above.
    • cache_index_pages - Keep SSTable index pages in the global cache after a SSTable read
    • restrict_twcs_without_default_ttl - Controls whether to prevent creating TimeWindowCompactionStrategy tables without a default TTL. Can be true, false, or warn (default)
    • Twcs_max_window_count - The maximum number of compaction windows allowed when making use of TimeWindowCompactionStrategy (default: 50)
    • task_ttl_seconds - Time for which information about finished tasks stays in memory (default 10s)
    • broadcast-tables - new experimental Raft feature for internal testing
    • query_tombstone_page_limit - The number of tombstones after which a query cuts a page, even if not full or even empty (default 10000)

Deprecated and removed features

  • The CQL binary protocol versions 1 and 2 are no longer supported. Version 3 and above have been supported for 9 years, so it’s unlikely to be in real use. You can check for version 1 and 2 in the system.clients virtual table. #10607
  • New DateTieredCompactionStrategy tables are now rejected by default. Users should switch to TimeWindowCompactionStrategy. Existing DateTieredCompactionStrategy tables are still supported, and it is still possible to configure the database to allow new DateTieredCompactionStrategy tables.
  • Thrift API - legacy ScyllaDB (and Apache Cassandra) API is deprecated and will be removed in followup release. Thrift has been disabled by default.
  • Compact Storage - a file format used by Thrift and deprecated from Apache Cassandra, is deprecated and will be removed in followup release.

Build

  • ScyllaDB is now built using a Fedora 37 based toolchain, including clang 15.
  • A build time misconfiguration caused the CQL query parser to be compiled with a low optimization level. This is not important by itself, since generally prepared statements hide the parsing overhead, but the C++ compilation model can cause some functions in unrelated areas to also be compiled with the lower optimization level, reducing performance. This is now fixed. 12463

Monitoring and tracing

Scylla Monitoring Stack release 4.3 and later will support ScyllaDB 5.2.

metrics related updates below:

  • Shard Latencies are now reported as summaries. This is part of an effort to reduce the total number of generated metrics. In addition, empty histograms and summaries will not be reported. The overall result is a 5x reduction in the number of metrics #11173.
    This is how a summary looks like:

    • scylla_storage_proxy_coordinator_read_latency_summary_count{scheduling_group_name=“statement”,shard=“1”} 2

    • scylla_storage_proxy_coordinator_read_latency_summary{quantile=“0.990000”,scheduling_group_name=“statement”,shard=“1”} 640

  • There is now a metric that allows observation of update progress of materialized views from staging sstables.

  • There are now completion percentage metrics for node operations using streaming; previously the completion metrics were only available when using repair-based node operations. #11600

  • The sstable row_reads metric for m-format sstables is now properly incremented, instead of showing zeroes. #12406

  • The replica-side read metrics, which have been incorrect for some time, have been revamped. #10065