ScyllaDB Enterprise Release 2023.1.0 - part 1

The ScyllaDB team is pleased to announce the release of ScyllaDB Enterprise 2023.1.0 LTS, a production-ready ScyllaDB Enterprise Long Term Support Major Release. With more than 5,000 commits, we’re excited to move forward with ScyllaDB Enterprise 2023.1.

With 2023.1 LTS out, support for ScyllaDB Enterprise 2021.1 will come to an end.

More information on ScyllaDB Long Term Support (LTS) policy is available here.

The ScyllaDB Enterprise 2023.1 release is based on ScyllaDB Open Source 5.2 and introduces Raft-based Strongly Consistent Schema Management, Alternator TTL, partition-level rate limiting, distributed SELECT COUNT, and many more improvements and bug fixes.

Related Links

ScyllaDB Enterprise customers are encouraged to upgrade to ScyllaDB Enterprise 2023.1, and are welcome to contact our Support Team with questions.

Moving to Raft-based cluster management

Consistent Schema Management is the first Raft-based feature in ScyllaDB, and 2023.1 is the first Enterprise release to enable Raft by default for new deployments. #12572

Starting from ScyllaDB Enterprise 2023.1, all new clusters are created with Raft enabled by default. Clusters upgraded from 2022.x will only use Raft if you explicitly enable it (see the upgrade to 2023.1 docs). As soon as all nodes in the cluster opt in to using Raft, the cluster automatically migrates the relevant subsystems to Raft; you should validate that the migration has completed.
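For clusters upgraded from 2022.x, opting in is a per-node configuration change followed by a rolling restart. A minimal sketch, assuming the scylla.yaml option is named consistent_cluster_management as in ScyllaDB Open Source 5.2; confirm the exact option name against the upgrade documentation:

# scylla.yaml (sketch): opt this node in to Raft-based cluster management.
# The cluster only migrates once every node has this enabled and has restarted.
consistent_cluster_management: true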

Once Raft is enabled, a schema update requires a quorum of the cluster's nodes in order to be executed.

For example, in the following use cases, the cluster does not have a quorum and will not allow updating the schema:

  • A cluster with 1 out of 3 nodes available
  • A cluster with 2 out of 4 nodes available
  • A cluster with two data centers (DCs), each with 3 nodes, where one of the DCs is not available
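A Raft quorum is a strict majority of the voting members: floor(N/2) + 1 out of N nodes. In the two-DC example above, N = 6, so 4 nodes are needed; losing a whole 3-node DC leaves only 3 nodes, which is not enough, and schema updates are blocked until enough nodes return.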

This is different from the behavior of a ScyllaDB cluster with Raft disabled.

Nodes might be unavailable due to network issues, node issues, or other reasons.

To reduce the chance of quorum loss, it is recommended to have 3 or more nodes per DC and, for a multi-DC cluster, 3 or more DCs.

To recover from a quorum loss, it is best to revive the failed nodes or fix the network partition. If this is not feasible, see the Raft manual recovery procedure. More on handling failures in Raft here.

New Features

Strongly Consistent Schema Management

Schema management operations are DDL operations that modify the schema, like CREATE, ALTER, or DROP for KEYSPACE, TABLE, INDEX, UDT, MV, etc.

Unstable schema management has been a problem in past ScyllaDB releases. The root cause is the unsafe propagation of schema updates over gossip, as concurrent schema updates can lead to schema collisions.

Once Raft is enabled, all schema management operations are serialized by the Raft consensus algorithm.
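As a minimal illustration (keyspace and table names are arbitrary), these are the kinds of statements that are now linearized through Raft:

CREATE KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};
CREATE TABLE ks.t (id int PRIMARY KEY, v text);
ALTER TABLE ks.t ADD v2 int;
DROP TABLE ks.t;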

Additional Raft-related updates, with references to open source issues where available:

  • The node replace procedure, used to replace a dead node, will now work with Raft (when enabled). #12172
  • The Raft protocol implementation now supports changing a node’s IP address without changing its identity. If the IP addresses of one or a few of your nodes change, you can simply restart those nodes with their new addresses and everything will keep working; there is no need to change configuration files (e.g. seed lists).
  • The schema of the system.raft_config table, used for storing Raft cluster membership data, has been streamlined.
  • Commitlog has a limit on the largest mutation it can write to disk. Raft now knows about this limit and rejects a too-large command if it doesn’t fit within the limit, rather than crashing.
  • Raft broadcast tables, an experimental replication strategy that uses Raft, now support bind variables in statements.
  • Raft failure detection now uses a separate RPC verb from gossip failure detection.
  • Raft’s error handling when an ex-cluster-member tries to modify configuration was improved; it is now rejected with a permanent error.
  • Building on the uniqueness of host IDs, Raft will now use the host ID rather than its own ID. This simplifies administration as there is now just one host ID type to track.
  • A crash in the task manager was fixed. #12204
  • The server will warn if a peer’s address cannot be found when pinging it. #12156
  • Raft group 0 verbs are used for metadata management, and so always run on shard 0. It was assumed that registering those RPC verbs only on shard 0 would therefore be sufficient, but that’s not the case on all configurations. This was fixed by registering the verbs on all shards. #12252
  • A Data Definition Language (DDL) statement will fail under Raft-managed schema if a concurrent change is applied. ScyllaDB will now automatically retry executing the statement, and if there was no conflict between the two concurrently executing statements (for example, if they apply to different tables) it will succeed. This reduces disruption to user workloads.
  • ScyllaDB will retry a DDL statement if a transient Raft failure has occurred, e.g. due to a leader change.
  • When Raft is enabled, nodes join as non-voting members until bootstrap is complete.
  • When node A sends a read or write request to node B, it will also send the schema version of the relevant table. If B is using a different version, it will ask A for the schema corresponding to that version. If A doesn’t have it, the request will fail. Consequently the cache expiration for schema versions has been increased to reduce the probability of such events.
  • Raft now persists its peer nodes in a local table.
  • The system could deadlock if a schema-modifying statement was executed in Raft mode during shutdown. This is now fixed.
  • A new failure detector was merged for Raft.
  • There is a new test suite for testing topology changes under Raft.

Distributed Aggregations

ScyllaDB will now automatically run aggregation statements, like SELECT COUNT(*), on all nodes and all shards in parallel, which brings a considerable speedup, as much as 20X in larger clusters.

Distributed Aggregations supports all types of aggregations.

This feature is limited to queries that do not use GROUP BY or filtering.

The implementation includes a new level of coordination.

A Super-Coordinator node splits an aggregation query into sub-queries, distributes them across a group of coordinators, and merges the results. Like a regular coordinator, the Super-Coordinator is a per-operation role, not a fixed node.
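Conceptually (this is an illustration, not the literal internal form of the sub-queries), a full-table count such as the scylla_bench.test example below is split into sub-queries over disjoint token ranges, each handled by a different coordinator, and the Super-Coordinator sums the partial counts; pk is assumed here to be the partition key column:

SELECT COUNT(*) FROM scylla_bench.test WHERE token(pk) >= -9223372036854775808 AND token(pk) <= -3074457345618258603;
SELECT COUNT(*) FROM scylla_bench.test WHERE token(pk) > -3074457345618258603 AND token(pk) <= 3074457345618258602;
SELECT COUNT(*) FROM scylla_bench.test WHERE token(pk) > 3074457345618258602;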

Example results:

A 3-node cluster set up on powerful desktops (3x32 vCPU).

The cluster was filled with ~2 * 10^8 rows using scylla-bench, and the following query was run:

time cqlsh --request-timeout=3600 -e "select count(*) from scylla_bench.test using timeout 1h;"

Before Distributed Select: 68s

After Distributed Select: 2s

You can disable this feature by setting the enable_parallelized_aggregation configuration parameter to false.
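A scylla.yaml sketch; the parameter name is taken from these notes:

# scylla.yaml: turn off parallelized (distributed) aggregation
enable_parallelized_aggregation: false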

#1385 #10131

Large Collection Detection

ScyllaDB records large partitions, large rows, and large cells in system tables so that the primary key can be used to deal with them. It additionally records collections with large numbers of elements, since these can cause degraded performance.

The warning threshold is configurable: compaction_collection_elements_count_warning_threshold - how many elements are considered a “large” collection (default is 10,000 elements).
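A sketch of adjusting the threshold in scylla.yaml; the parameter name and default come from these notes, and 50,000 is just an arbitrary example value:

# scylla.yaml: only report collections with more than 50,000 elements
# (the default threshold is 10,000)
compaction_collection_elements_count_warning_threshold: 50000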

The information about large collections is stored in the large_cells table, with a new collection_elements column that contains the number of elements of the large collection.

Entries in the large_cells table are retained for 30 days. #11449

Example of a large collection below:

cqlsh> SELECT * from system.large_cells;

 keyspace_name | table_name | sstable_name      | cell_size | partition_key | clustering_key | column_name | collection_elements | compaction_time
---------------+------------+-------------------+-----------+---------------+----------------+-------------+---------------------+---------------------------------
            ks |       maps | me-10-big-Data.db |     10929 |          user |                |  properties |                1001 | 2023-02-06 13:07:23.203000+0000

(1 rows)

Automating away the gc_grace_seconds parameter

There is now optional automatic management of tombstone garbage collection, replacing gc_grace_seconds. This drops tombstones more frequently if repairs are made on time, and prevents data resurrection if repairs are delayed beyond gc_grace_seconds. Tombstones older than the most recent repair will be eligible for purging, and newer ones will be kept. The feature is disabled by default and needs to be enabled via ALTER TABLE.

For example:

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

Materialized Views: Synchronous Mode

There is now a synchronous mode for materialized views. With ordinary, asynchronous materialized views, the write operation returns before the view is updated. With synchronous materialized views, the operation does not return until the view is updated: each base replica waits for its view replica. This enhances consistency but reduces availability since, in some situations, all nodes might be required to be functional.

Example:

CREATE MATERIALIZED VIEW main.mv
AS SELECT * FROM main.t
WHERE v IS NOT NULL
PRIMARY KEY (v, id)
WITH synchronous_updates = true;

ALTER MATERIALIZED VIEW main.mv WITH synchronous_updates = true;

Synchronous Mode reference in Scylla Docs

Empty replica pages

Before this release, the paging code required that each page contain at least one row before filtering. This could cause an unbounded amount of work if there was a long sequence of tombstones in a partition or token range, leading to timeouts. ScyllaDB will now send empty pages to the client, allowing progress to be made before a timeout. This prevents analytics workloads from failing when processing long sequences of tombstones. #7689, #3914, #7933

Secondary index on collection columns

Secondary indexes can now index collection columns. Individual keys and values within maps, sets, and lists can be indexed. Fixes #2962, #8745, #10707

Examples:

CREATE TABLE test(id int, somemap map<int, int>, somelist list<int>, someset set<int>, PRIMARY KEY(id));
INSERT INTO test (id, somemap, somelist, someset) VALUES (7, {7: 7}, [1,3,7], {7,8,9});

CREATE INDEX ON test(keys(somemap));
CREATE INDEX ON test(values(somemap));
CREATE INDEX ON test(entries(somemap));
CREATE INDEX ON test(values(somelist));
CREATE INDEX ON test(values(someset));
CREATE INDEX IF NOT EXISTS ON test(somelist);
CREATE INDEX IF NOT EXISTS ON test(someset);
CREATE INDEX IF NOT EXISTS ON test(somemap);

SELECT * FROM test WHERE someset CONTAINS 7;
SELECT * FROM test WHERE somelist CONTAINS 7;
SELECT * FROM test WHERE somemap CONTAINS KEY 7;
SELECT * FROM test WHERE somemap CONTAINS 7;
SELECT * FROM test WHERE somemap[7] = 7;

Azure Support

  • ScyllaDB Enterprise Image is now available for Azure.
  • ScyllaDB now has an Azure snitch for inferring rack/datacenter information from the instance metadata (a configuration sketch follows this list).
  • The Azure snitch now handles regions which have a single availability zone. #12274
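A minimal scylla.yaml sketch for selecting the snitch; the class name AzureSnitch is an assumption here, so verify it against the snitch documentation:

# scylla.yaml: derive datacenter/rack from the Azure instance metadata service
endpoint_snitch: AzureSnitch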

Alternator TTL (introduced in 2022.2)

Like in DynamoDB, Alternator items that are set to expire at a specific time will not disappear precisely at that time but only after some delay. DynamoDB guarantees that the expiration delay will be less than 48 hours (though for small tables, the delay is often much shorter). In Alternator, the expiration delay is configurable - it defaults to 24 hours but can be set with the --alternator-ttl-period-in-seconds configuration option.
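On the API side, TTL on an Alternator table is enabled through the DynamoDB-compatible UpdateTimeToLive call. A hedged AWS CLI sketch; the table name, attribute name, node address, and Alternator port are placeholders for your own deployment:

aws dynamodb update-time-to-live \
  --table-name mytable \
  --time-to-live-specification "Enabled=true, AttributeName=expiration" \
  --endpoint-url http://scylla-node:8000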

#12037 #11737

Limit partition access rate (introduced in 2022.2)

It is now possible to limit read and write rates into a partition with a new WITH per_partition_rate_limit clause for the CREATE TABLE and ALTER TABLE statements. This is useful to prevent hot-partition problems when high-rate reads or writes are bogus (for example, arriving from spam bots). #4703

Examples:

Limits are configured separately for reads and writes. Limit both reads and writes:

ALTER TABLE t WITH per_partition_rate_limit = {
    'max_reads_per_second': 100,
    'max_writes_per_second': 200
};

Limit reads only, no limit for writes:

ALTER TABLE t WITH per_partition_rate_limit = {
    'max_reads_per_second': 200
};
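The same option can be set when creating the table. A minimal sketch (the schema is made up for illustration):

CREATE TABLE t2 (pk int, ck int, v text, PRIMARY KEY (pk, ck))
WITH per_partition_rate_limit = {'max_writes_per_second': 500};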

Learn More

Load and stream (introduced in 2022.2)

This feature extends nodetool refresh to allow loading arbitrary sstables that do not belong to a particular node into the cluster. It loads the sstables from disk, calculates the data’s owning nodes, and automatically streams the data to the owning nodes. In particular this is useful when restoring a cluster from backup.

For example, say the old cluster has 6 nodes and the new cluster has 3 nodes.

One can copy the sstables from the old cluster to the new nodes and trigger the load and stream process.
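A hedged sketch of this flow on one new node, assuming the default data directory layout (/var/lib/scylla/data) and placeholder keyspace/table names; adjust the paths and the table UUID directory to your deployment:

# 1. Copy backed-up sstables (taken from any node of the old cluster) into
#    the target table's upload directory on the new node:
cp /backup/ks/t/* /var/lib/scylla/data/ks/t-<table-uuid>/upload/
# 2. Load them and stream each partition to the nodes that now own it:
nodetool refresh --load-and-stream ks t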

This can make restores and migrations much easier:

  • You can place sstables from any node of the old cluster on any node in the new cluster
  • No need to run nodetool cleanup to remove data that does not belong to the local node after refresh

The load-and-stream option also updates the relevant Materialized Views. #9205

Load and stream is used as part of the new ScyllaDB Manager 3.1 Restore functionality.

Usage examples:

REST admin API:

curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true"

nodetool refresh --load-and-stream

Docs: nodetool refresh

Materialized Views: Prune (introduced in 2022.2)

A new CQL extension, the PRUNE MATERIALIZED VIEW statement, can now be used to remove inconsistent rows, known as ghost rows, from materialized views.

A ghost row is an inconsistency that manifests itself as a row in a materialized view which does not correspond to any base table row. Such inconsistencies should be prevented altogether, and ScyllaDB strives to avoid them, but if they happen, this statement can be used to restore a materialized view to a fully consistent state without rebuilding it from scratch.

Example usages:

PRUNE MATERIALIZED VIEW my_view;

PRUNE MATERIALIZED VIEW my_view WHERE token(v) > 7 AND token(v) < 1535250;

PRUNE MATERIALIZED VIEW my_view WHERE v = 19;

Performance: Eliminate exceptions from the read and write path (introduced in 2022.2)

When a coordinator times out, it generates an exception which is then caught in a higher layer and converted to a protocol message. Since exceptions are slow, this can make a node that experiences timeouts become even slower. To prevent that, the coordinator write path and read path have been converted not to use exceptions for timeout cases, treating them as another kind of result value instead. Further work on the read path and on the replica reduces the cost of timeouts, so that goodput is preserved while a node is overloaded.

Part 2
