[RELEASE] Scylla 5.2.0 Release - part 1

The ScyllaDB team is pleased to announce the release of ScyllaDB Open Source 5.2.0, a production-ready release of our open-source NoSQL database.

ScyllaDB 5.2 introduces Raft-based Strongly Consistent Schema Management, Alternator TTL, and many more improvements and bug fixes.

Only the last two minor releases of the ScyllaDB Open Source project are supported. Once ScyllaDB Open Source 5.2 is officially released, only ScyllaDB Open Source 5.2 and ScyllaDB 5.1 will be supported, and ScyllaDB 5.0 will be retired.


Moving to Raft-based cluster management

Consistent Schema Management is the first Raft-based feature in ScyllaDB, and ScyllaDB 5.2 is the first release to enable Raft by default. #12572

Starting from ScyllaDB 5.2, all new clusters are created with Raft enabled by default. Clusters upgraded from 5.1 will only use Raft if you explicitly enable it (see the upgrade to 5.2 docs). Once all nodes in the cluster opt in to using Raft, the cluster automatically migrates schema management to Raft, and you should validate that this migration has completed.

Once Raft is enabled, schema updates can only be executed when a quorum of nodes is available.

For example, in the following use cases, the cluster does not have a quorum and will not allow updating the schema:

  • A cluster with 1 out of 3 nodes available
  • A cluster with 2 out of 4 nodes available
  • A cluster with two data centers (DCs), each with 3 nodes, where one of the DCs is not available
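For reference, a Raft quorum is a strict majority of the voting members: quorum = ⌊N/2⌋ + 1 nodes. For N = 3 that is 2 nodes, for N = 4 it is 3, and for the 6-node, two-DC cluster above it is 4 - which is why losing an entire 3-node DC also loses the quorum.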

This is different from the behavior of a ScyllaDB cluster with Raft disabled.

Nodes might be unavailable due to network issues, node issues, or other reasons.

To reduce the chance of quorum loss, it is recommended to have 3 or more nodes per DC and, for a multi-DC cluster, 3 or more DCs.

To recover from a quorum loss, the best approach is to revive the failed nodes or fix the network partition. If this is not possible, see the Raft manual recovery procedure.

More on handling failures in Raft here.

New Features

Strongly Consistent Schema Management

Schema management operations are DDL operations that modify the schema, like CREATE, ALTER, or DROP for KEYSPACE, TABLE, INDEX, UDT, MV, etc.

Unstable schema management has been a problem in past ScyllaDB releases. The root cause is the unsafe propagation of schema updates over gossip, as concurrent schema updates can lead to schema collisions.

Once Raft is enabled, all schema management operations are serialized by the Raft consensus algorithm.
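For example, each of the following DDL statements is now committed through Raft before it takes effect (the keyspace, table, and column names below are only illustrative):

CREATE KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3};
CREATE TABLE ks.users (id uuid PRIMARY KEY, name text);
ALTER TABLE ks.users ADD email text;
DROP TABLE ks.users;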

Additional Raft related updates:

  • The node replace procedure, used to replace a dead node, will now work with Raft (when enabled). #12172
  • The Raft protocol implementation now supports changing a node’s IP address without changing its identity. If the IP addresses of one or a few nodes change, you can simply restart those nodes with their new addresses; there is no need to update configuration files (e.g. seed lists).
  • The schema of the system.raft_config table, used for storing Raft cluster membership data, has been streamlined.
  • Commitlog has a limit on the largest mutation it can write to disk. Raft now knows about this limit and rejects a too-large command if it doesn’t fit within the limit, rather than crashing.
  • Broadcast tables, an experimental replication strategy built on Raft, now support bind variables in statements.
  • Raft failure detection now uses a separate RPC verb from gossip failure detection.
  • Raft’s error handling when an ex-cluster-member tries to modify configuration was improved; it is now rejected with a permanent error.
  • Building on the uniqueness of host IDs, Raft will now use the host ID rather than its own ID. This simplifies administration as there is now just one host ID type to track.
  • A crash in the task manager was fixed. #12204
  • The server will warn if a peer’s address cannot be found when pinging it. #12156
  • Raft group 0 verbs are used for metadata management, and so always run on shard 0. It was assumed that registering those RPC verbs only on shard 0 would therefore be sufficient, but that’s not the case on all configurations. This was fixed by registering the verbs on all shards. #12252

Alternator TTL

In ScyllaDB 5.0 we introduced Time To Live (TTL) to the DynamoDB-compatible API (Alternator) as an experimental feature. In ScyllaDB 5.2 it is promoted to production ready. #12037 #11737

Like in DynamoDB, Alternator items that are set to expire at a specific time will not disappear precisely at that time but only after some delay. DynamoDB guarantees that the expiration delay will be less than 48 hours (though for small tables, the delay is often much shorter). In Alternator, the expiration delay is configurable - it defaults to 24 hours but can be set with the --alternator-ttl-period-in-seconds configuration option.

Large Collection Detection

ScyllaDB records large partitions, large rows, and large cells in system tables so that the primary key can be used to deal with them. It additionally records collections with large numbers of elements, since these can cause degraded performance.

The warning threshold is configurable via compaction_collection_elements_count_warning_threshold, which sets how many elements make a collection “large” (default: 10,000 elements).

The information about large collections is stored in the system.large_cells table, with a new collection_elements column that contains the number of elements in the large collection.

The system.large_cells table retention is 30 days. #11449

Example of a large collection below:

cqlsh> SELECT * FROM system.large_cells;

 keyspace_name | table_name | sstable_name      | cell_size | partition_key | clustering_key | column_name | collection_elements | compaction_time
---------------+------------+-------------------+-----------+---------------+----------------+-------------+---------------------+---------------------------------
            ks |       maps | me-10-big-Data.db |     10929 |          user |                |  properties |                1001 | 2023-02-06 13:07:23.203000+0000

(1 rows)
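Since the same table also records large cells that are not collections, one way to list only the collection entries is to filter on the new column - a minimal sketch (as with any filter on a non-key column, ALLOW FILTERING is required):

cqlsh> SELECT * FROM system.large_cells WHERE collection_elements > 0 ALLOW FILTERING;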

Automating away the gc_grace_seconds parameter

There is now optional automatic management of tombstone garbage collection, replacing gc_grace_seconds. This allows tombstones to be purged sooner when repairs run on time, and prevents data resurrection when repairs are delayed beyond gc_grace_seconds. Tombstones older than the most recent repair are eligible for purging; newer ones are kept. The feature is disabled by default and needs to be enabled via ALTER TABLE.

For example:

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode': 'repair'};
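The option can also be set when a table is created - a minimal sketch with an illustrative schema (the default mode, 'timeout', keeps the classic gc_grace_seconds behavior):

cqlsh> CREATE TABLE ks.cf2 (pk int PRIMARY KEY, v text) WITH tombstone_gc = {'mode': 'repair'};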

Materialized Views: Synchronous Mode

There is now a synchronous mode for materialized views. With ordinary, asynchronous materialized views, a write returns before the view is updated. With a synchronous materialized view, the write does not return until the view is updated - each base replica waits for its paired view replica. This enhances consistency but reduces availability since, in some situations, all nodes might need to be functional.

Example:

CREATE MATERIALIZED VIEW main.mv
AS SELECT * FROM main.t
WHERE v IS NOT NULL AND id IS NOT NULL
PRIMARY KEY (v, id)
WITH synchronous_updates = true;

ALTER MATERIALIZED VIEW main.mv WITH synchronous_updates = true;

Synchronous Mode reference in Scylla Docs

Empty replica pages

Before this release, the paging code required that pages have at least one row before filtering. This could cause an unbounded amount of work when there was a long sequence of tombstones in a partition or token range, leading to timeouts. ScyllaDB will now send empty pages to the client, allowing progress to be made before a timeout. This prevents analytics workloads from failing when processing long sequences of tombstones. #7689, #3914, #7933

Secondary index on collection columns

Secondary indexes can now index collection columns. Individual keys and values within maps, sets, and lists can be indexed. Fixes #2962, #8745, #10707

Examples:

CREATE TABLE test(id int, somemap map<int, int>, somelist list<int>, someset set<int>, PRIMARY KEY(id));

INSERT INTO test (id, somemap, somelist, someset) VALUES (7, {7: 7}, [1,3,7], {7,8,9});

CREATE INDEX ON test(keys(somemap));
CREATE INDEX ON test(values(somemap));
CREATE INDEX ON test(entries(somemap));
CREATE INDEX ON test(values(somelist));
CREATE INDEX ON test(values(someset));

CREATE INDEX IF NOT EXISTS ON test(somelist);
CREATE INDEX IF NOT EXISTS ON test(someset);
CREATE INDEX IF NOT EXISTS ON test(somemap);

SELECT * FROM test WHERE someset CONTAINS 7;
SELECT * FROM test WHERE somelist CONTAINS 7;
SELECT * FROM test WHERE somemap CONTAINS KEY 7;
SELECT * FROM test WHERE somemap CONTAINS 7;
SELECT * FROM test WHERE somemap[7] = 7;
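For reference, following the usual CQL secondary-index semantics, each kind of map index serves a different predicate - a short annotated recap of the map queries above:

-- served by the keys(somemap) index
SELECT * FROM test WHERE somemap CONTAINS KEY 7;
-- served by the values(somemap) index
SELECT * FROM test WHERE somemap CONTAINS 7;
-- served by the entries(somemap) index
SELECT * FROM test WHERE somemap[7] = 7;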

Part 2 of the release notes