[RELEASE] Scylla 5.4 RC1 - part 2

More Improvements

CQL API

  • CQL table columns that have the list data type aren’t allowed to contain NULLs, but in certain situations list values in CQL literals or bind variables are allowed to contain NULLs (for example, in LWT IF conditions that use the IN operator). The type system was relaxed to accept NULLs where this is allowed. Previously, these cases were handled by hard-to-maintain workarounds.

  • The CQL USING TTL clause allows one to specify an INSERT or UPDATE’s time-to-live property, after which the cells are automatically deleted. TTL 0 was misinterpreted as the default TTL (which is usually unlimited) rather than an explicitly unlimited TTL. This is now fixed (see the example after this list). #6447

  • The C-style cast syntax ((type) expression) can now be applied to bind variables ((type) ? or (type) :var) to explicitly specify the type of a bind variable. Example: blob_column = (blob)(int)12323

  • Error messages for incorrect usage of the CQL TOKEN() function have been improved. #13468

  • The check for altering permissions of functions in the system keyspace has been tightened.

  • Error messages involving CQL expressions are now printed in a more user-friendly way. Previously, they contained some debug information.

  • Change Data Capture (CDC) exports updates to the database as a table containing changes. One option is to capture not only the change, but also the state of the row before it was changed. In some cases, in a lightweight transaction (LWT) change, the preimage could return the state of the row after the change instead of before the change. This is now fixed. #12098

  • The NetworkTopologyStrategy replication strategy will now reject an empty value for the replication factor. #13986

  • Materialized views require the “IS NOT NULL” qualifier on primary key elements, but also accepted (and ignored) the qualifier on regular columns. The qualifier is now rejected when applied to regular columns; a configuration variable lets you choose whether to warn about the rejected clause, emit an error and fail the request, or ignore it (see the example after this list). #10365

  • The count(column) function is supposed to count only cells where the column is not NULL. A regression caused count(column) to behave like count(*) for collection, tuple, and user-defined column types. This is now fixed. #14198

  • When performing the last-write-wins rule comparison, if the timestamp of the two versions being compared was equal, ScyllaDB first compared the cell value and then the expiration time (TTL). This is compatible with earlier versions of Cassandra. However, this could cause a NULL value to appear if the cell was overwritten with the same timestamp but a different TTL. The algorithm was changed to compare the cell value last, and check all the other metadata first, resulting in fewer surprising results. It is also compatible with current Cassandra versions. #14182

  • A GROUP BY query ought to return one row per group, except when all rows of a group are filtered out. However, ScyllaDB returned a row even for fully-filtered groups. This is now fixed, and ScyllaDB will not emit rows for filtered groups. #12477

  • In older versions of ScyllaDB, different clauses of CQL statements were processed using different code bases. ScyllaDB is gradually moving towards a single code base for processing expressions. It is now the SELECT clause’s turn, moving us closer to the goal of a unified expression syntax. As this is an internal refactoring, there are no user visible changes, apart from some names of fields in SELECT JSON statements changing (specifically, if those fields are function evaluations).

  • A recent regression when using GROUP BY together with the ttl() and writetime() pseudo-functions was fixed. #14715

  • There is a new SELECT MUTATION_FRAGMENTS statement that allows seeing where the data that composes a selection comes from. Normally, cache, sstable, and memtable data are merged before output, but with this variant one can see the original source of the data. This is intended for forensics and is not a stable API (see the example after this list). #11130

  • The CQL grammar incorrectly accepted nonsensical empty limit clauses such as SELECT * FROM tab LIMIT;. Previously, the errors were only discovered later in processing, with unhelpful error messages; such statements are now rejected up front. #14705

  • The CQL grammar incorrectly accepted nonsensical INSERT JSON statements such as INSERT INTO tab JSON;, causing a crash. This is now fixed. #14709

  • A mistake in function type inference that could cause CQL statements to report ambiguity where none exists was fixed.

  • The format of the timestamp data type is now compatible with Cassandra. #14518

  • In CQL, a few functions for dealing with counter types were added. #14501

  • A SELECT statement that has the DISTINCT keyword and also GROUP BY on clustering keys is now rejected. DISTINCT implies selecting only the partition key and static columns, so grouping on clustering keys is nonsensical (see the example after this list). #12479

  • When ALTERing a table, the compaction strategy options are now validated. #2336

  • A bug in the fromJson() CQL function when operating on NULL operands has been fixed. #7912

  • The DESCRIBE statement now includes user-defined types and functions. #14170

  • The column names for SELECT CAST(b AS int) and similar expressions have been adjusted to match Cassandra. #14508

  • In some cases where a bind variable was used both for the partition key and to match a non-key column, ScyllaDB would not generate correct partition key routing for the driver. This is now fixed. #15374

  • A map<ascii, something> value, when parsed from its JSON representation, did not parse the key correctly. This is now fixed. #7949

  • SSTable compression can be configured with a chunk size, with larger chunks trading less efficient I/O and higher latency for higher compression ratios. The chunk size is now capped at 128 kB to avoid running out of memory (see the example after this list). #9933
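
The TTL fix above is easiest to see on a table with a non-default time-to-live. A minimal sketch, using a hypothetical table t:

    CREATE TABLE t (pk int PRIMARY KEY, v int) WITH default_time_to_live = 3600;
    INSERT INTO t (pk, v) VALUES (1, 10);             -- expires after an hour (table default)
    INSERT INTO t (pk, v) VALUES (2, 20) USING TTL 0; -- explicitly unlimited: never expires

Previously, TTL 0 in the second insert was treated like an unspecified TTL, so the row would incorrectly pick up the table default and expire after an hour.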
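
A sketch of the materialized view change above, on a hypothetical base table t with partition key pk, clustering key ck, and regular column v:

    CREATE MATERIALIZED VIEW mv AS
        SELECT * FROM t
        WHERE pk IS NOT NULL AND ck IS NOT NULL  -- required on primary key elements
        PRIMARY KEY (ck, pk);

Adding AND v IS NOT NULL to the WHERE clause (v being a regular column) is the form that is now rejected, warned about, or ignored, depending on the configuration variable.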
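
A minimal sketch of the new MUTATION_FRAGMENTS statement; since this is not a stable API, the exact output columns may change between versions:

    SELECT * FROM MUTATION_FRAGMENTS(ks.tab);

Instead of merging all sources into one view, each returned fragment is annotated with where it came from (memtable, cache, or a specific sstable).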
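
For the DISTINCT and GROUP BY change above, assuming a hypothetical table t with partition key pk and clustering key ck:

    SELECT DISTINCT pk FROM t GROUP BY pk;      -- still accepted
    SELECT DISTINCT pk FROM t GROUP BY pk, ck;  -- now rejected: DISTINCT never reads clustering rows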
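
The compression chunk size mentioned above is set per table through the compression options, for example on a hypothetical table t (chunk_length_in_kb is the standard option name):

    ALTER TABLE t WITH compression = {
        'sstable_compression': 'LZ4Compressor',
        'chunk_length_in_kb': 128  -- the maximum now accepted
    };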

Amazon DynamoDB Compatible API (Alternator)

Alternator is ScyllaDB’s implementation of the DynamoDB API.

  • A bug was fixed that could cause error handling while streaming responses to the client to crash the server. #14453
  • It’s now possible to disable the DescribeEndpoints API. This makes it possible to run the dynamodb shell against ScyllaDB. #14410
  • Alternator now limits embedded expression length and nesting. #14473
  • Table name validation has been optimized.
  • A bug in concurrent modification of table tags has been fixed. #6389
  • Validation of decimal numbers has been improved. #6794
  • The timeout configuration value can now be hot-updated without restarting the node.
  • Alternator now returns the full table description as a response to the DeleteTable API request. #11472
  • Alternator now avoids latency spikes for unrelated requests while building large responses for batch_get_item. #13689
  • Alternator now validates the table name on ordinary read/write requests only if the table lookup fails, a small optimization. #12538
  • Alternator implemented the error path of the size() function incorrectly. This is now fixed. #14592

Strongly Consistent Schema Management with Raft

Strongly Consistent Schema Management with Raft became the default for new clusters in ScyllaDB 5.2. In this release it is enabled by default when upgrading existing clusters.

If you do not want to enable Raft, you should explicitly disable it in scylla.yaml of each node before the upgrade. #13980
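
A minimal scylla.yaml sketch, assuming the consistent_cluster_management option is what controls Raft-based schema management (verify against the upgrade documentation for your version):

    # scylla.yaml, on every node, before the upgrade:
    consistent_cluster_management: false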

Below are additional related fixes and updates:

  • When Raft-based schema and topology management is in use, it will also manage the Change Data Capture (CDC) generation table. This increases the reliability of this operation.
  • Raft remote procedure call (RPC) verbs now check that the call arrived at its intended recipient and not somewhere else.
  • When a node synchronizes the schema from another node, if Raft is in use, it will issue a read barrier first to make sure it’s not missing any keyspaces.
  • Schema pulls happen when a node receives a read or write request (as a replica) with an unknown schema; it will then ask the requesting node for an updated schema. These are now disabled when the schema is managed using Raft; instead the system will rely solely on Raft for schema distribution. #12870
  • When a node is decommissioned or forcibly removed, Raft will now ban it from communicating with the cluster, to prevent the removed node from affecting the cluster.
  • ScyllaDB uses Raft to coordinate changes to the schema and topology. It now attempts to merge adjacent changes to reduce overhead.
  • When using Raft for topology and schema changes, ScyllaDB will force the schema and topology to be transferred to new nodes. #14066
  • Raft cluster management still uses gossip to translate host IDs to IP addresses. It is now more careful not to let old IP address mappings overwrite new mappings. #14257
  • A subtle bug leading to incorrect merging is now fixed. #14600
  • ScyllaDB uses feature flags to coordinate rolling upgrades; a feature isn’t enabled until all nodes report that they support it. Over time, some older feature flags become “always on” and are no longer negotiated. A problem with non-negotiated features being stored in Raft group 0 would have prevented upgrades; it has been fixed.
  • The system.group0_history table now has descriptions for events. #13370
  • Data definition language (DDL) statements are used to modify the schema. They are covered by a Raft transaction to ensure atomicity, and the scope of the transaction has been extended to also cover access checking and validation, preventing check/use races (this change was committed in the past but reverted due to performance regressions). #13942
  • The Raft leadership monitor is now started during normal node start, not only bootstrap. #15166
  • Raft snapshot update and commit log truncation are now atomic, removing a failure case. #9603

Strongly Consistent Topology Updates with Raft - Experimental

This release includes experimental Strongly Consistent Topology Updates. To enable them, use the new consistent-topology-changes flag, as shown below.
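
A minimal scylla.yaml sketch, assuming the flag is enabled through the experimental_features list like other experimental features:

    # scylla.yaml
    experimental_features:
        - consistent-topology-changes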

Below are additional related fixes and updates:

  • The experimental flag used to enable consistent topology changes has been renamed from “raft” to “consistent-topology-changes”. #14145
  • Raft topology now verifies that the gossip view of the token ring matches the Raft view. #12153
  • A bug in topology management with Raft, when starting up a node, has been fixed. #13495
  • The old gossip-based failure detector has been removed. We now use the direct failure detector exclusively.
  • A bootstrapping node will now wait for schema agreement before joining the cluster. This prevents conflict between the new node’s system distributed tables and the cluster’s tables. The conflict is eventually resolved, but while it exists, the cluster is under heavier load. #14065 #14073
  • A race condition between the startup of Raft group 0 and its RPC listener was fixed.
  • When using the experimental Raft-managed topology, the cluster is able to verify that all reachable nodes are using current topology, and is able to block requests that use old topology. This lays the ground for faster and safer topology changes (addition and removal of nodes).
  • Bugs preventing a node from starting when using the new Raft-based topology mechanism have been fixed. #14303
  • Fencing is the mechanism by which requests sent using an outdated view of the cluster topology are rejected, in order to avoid reading outdated data or resurrecting old data. It now also applies to hints, a mechanism used to heal the cluster after a short node downtime.
  • Fencing is a way to prevent a coordinator from interacting with replicas when it has an outdated view of cluster topology. This now applies to counter updates too.
  • When a node is decommissioned or removed, and Raft topology management is active, the node stops being a voter earlier in the process in order to improve availability. #13959
  • A crash during rebuild operations with experimental consistent cluster topology was fixed. #14958
  • ScyllaDB now supports the --ignore-dead-nodes option family when experimental consistent cluster topology is enabled. #15025
  • Gossip SYN messages now carry the Raft cluster ID. This is used to prevent nodes from different clusters from communicating. This can happen if incorrect seed configuration was used when bootstrapping the cluster. #14448
  • Consistent cluster topology using Raft now supports --ignore-dead-nodes with IP addresses in the nodetool removenode operation; using IP addresses is deprecated in favor of host IDs (see the sketch after this list).
  • In consistent topology mode, the leader now prevents the previous leader from affecting the cluster before starting its own changes.
  • Cluster features are ScyllaDB’s way of making rolling upgrades seamless - a feature isn’t enabled until all nodes support it. We now propagate cluster features via Raft rather than gossip for improved reliability. #15152
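
A sketch of the nodetool removenode invocation discussed above; treat the exact argument order as illustrative and prefer host IDs over the deprecated IP addresses:

    # Remove a dead node while ignoring two other unreachable nodes:
    nodetool removenode --ignore-dead-nodes <dead-host-id-1>,<dead-host-id-2> <host-id-to-remove>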