[RELEASE] Scylla 5.2.0 Release - part 2

For part one

Other Improvements

CQL API

  • ScyllaDB now supports server-side DESCRIBE. This is required for the latest cqlsh, and reduces the need to update cqlsh as server features are added. However, the version number bump needed to inform cqlsh about this change was reverted as other changes related to the version number are not ready.
  • Secondary indexes on static columns are now supported.#2963
  • The CQL binary protocol versions 1 and 2 are no longer supported. Version 3 and above have been supported for 9 years, so it’s unlikely to be in real use. You can check for version 1 and 2 in the system.clients virtual table. #10607
  • There is now documentation about how NULL is treated in ScyllaDB. #12494
  • USING TIMESTAMP allows setting the mutation timestamp on a CQL statement level. It has a sanity check that prevents setting timestamps in the future, as these can be hard to delete, but sometimes one wishes to do so anyway. There is now a configuration option that allows disabling the feature. #12527
  • TRUNCATE statements are usually much slower than other statements. TRUNCATE statements now support the WITH TIMEOUT clause to help deal with that. #11408
  • The CQL CONTAINS and CONTAINS KEY operators are used to check if a collection contains an element. CONTAINS and CONTAINS KEY have been changed to return false if it is asked whether a collection contains NULL, since no collection can contain a NULL. This is a behavior change, but is not expected to affect users since the query is not useful. #10359
  • In the expression element IN list, we now allow “list” to be NULL (and the expression evaluates to false). Previously, this failed with an error.
  • A regression where CQL ignored some WHERE clause components when a multi-column restriction ((col_a, col_b) < (0, 0)) was present was fixed. #6200 #12014
  • In SELECT JSON statements, column names can be given aliases (just as with traditional SELECT). However, ScyllaDB ignored those aliases. It will now honor them. #8078
  • Evaluation of Boolean binary operators (e.g. “=”) has been refactored to use the same expression evaluation code as other expressions, paving the way for relaxation of the CQL grammar to be more similar to SQL.
  • The experimental User-defined aggregates (UDAs) have a state type that can be initialized using the INITCOND clause. The initializer can now be a collection type.
  • CQL: scylla: types: is_tuple(): doesn’t handle reverse types. For example, a schema with reversed clustering key component; this component will be incorrectly represented in the schema CQL dump: the UDT will lose the frozen attribute. When attempting to recreate this schema based on the dump, it will fail as the only frozen UDTs are allowed in primary key components.

#12576 (RC1)

Amazon DynamoDB Compatible API (Alternator)

Alternator, ScyllaDB’s implementation of the DynamoDB API, now provides the following improvements:

  • Alternator will now associate service level information configured via the CQL ATTACH SERVICE LEVEL statement with the authenticated alternator user.
  • Bug fix: UpdateItem setting the same value again for GSI range key may drop the row #11801
  • Alternator has better validation of malformed base64 encoded values. #6487
  • Alternator now sets the Projection field in response to the DescribeTable request.
  • A bug in alternator WHERE condition for global secondary index range key, that does not appear to have any user visible impact, has been fixed.
  • Alternator implements Time-To-Live (TTL) by scanning data. Some shutdown-related hangs in the scanning process were fixed.
  • Alternator, uses the rapidjson library to parse queries presented as JSON. Due to a compile-time misconfiguration, aggressive inlining was not enabled. This is now fixed, yielding a substantial 5 performance improvement.
  • A Seastar update fixed an ambiguity in the http parser, which caused Alternator requests to be unnecessarily slow. 12468
  • Alternator Streaming API: unexpected ARN values by list streams paged responses. This issue may only affect users with many, more than 100, tables with streams #12601 (RC1)

Correctness

  • ScyllaDB represents reads using mutation fragment streams. Several minor violations of fragment stream integrity were fixed. These could result in incorrect reads during range scans. #12735
  • The view builder is responsible for building materialized views when a view is created and after repair. To control its memory consumption, it reads base table sstables in chunks. Due to a bug, it could forget the partition tombstone or an open range tombstone when moving between chunks, leading to a discrepancy between the base table data and the view data. This is now fixed. #11668
  • When a table with a materialized view had a large partition, and that large partition was deleted with the USING TIMESTAMP clause, the materialized view might be only partially updated. This is now fixed. #12297.
  • The row cache could have missed reading a row if a row is evicted concurrently while being read, plus other conditions. This is now fixed. #11239

Performance and stability

  • TRUNCATE statements are usually much slower than other statements. TRUNCATE statements now support the WITH TIMEOUT clause to help deal with that. #11408
  • ScyllaDB uses an interval map data structure from the Boost library to quickly locate sstables needed to service a read. Due to the way we interface with the library, updating the interval map was unnecessarily slow. This is now fixed. #11669
  • Materialized view building will now ignore partitions no longer owned by the node after a topology change.
  • When a materialized view processes updates to the base table, it locks the partition and clustering key. In some rare cases involving one of the locks timing out but the other not, this can cause a crash. This was fixed by acquiring the locks sequentially. #12168
  • The experimental WASM user defined function (UDF) implementation has been switched to Rust (UDFs can be written in any language WASM supports, not just Rust). The new implementation avoids stalls for long-running UDFs and shares memory with the rest of the database. #11351
  • The experimental WASM user defined function (UDF) has been re-enabled for aarch64 (ARM).
  • User defined aggregates (UDAs) are now correctly persisted and survive a cold start. UDAs are an experimental feature. #11309
  • A bug that caused user-defined aggregates not to be persisted correctly was fixed. #11327
  • A very rare bug involving reads from memtables and multi-version concurrency control has been fixed.
  • Automatically parallelized query aggregation, introduced in 5.1, used a node-local clock for timeouts, rather than the standard clock. This meant that automatically parallelized queries would fail, unless the nodes were started at the same time. This is now fixed. #12458
  • During startup, ScyllaDB makes sstables conform to the compaction strategy in a process called reshaping, so that future reads will perform well. It is now more careful when reshaping Leveled Compaction Strategy tables, to avoid doing unnecessary work. #12495.
  • Lightweight transactions are now more robust when the schema is changed during a transaction. #10770
  • A very rare bug involving reads from memtables and multi-version concurrency control has been fixed.
  • Single-partition reads could, in conditions involving multi-page queries, a completely empty page, and a partition or range tombstone covering the first row, terminate paging prematurely, leading to incorrect results. This is now fixed. #12361
  • Everywhere replication strategy exhibited quadratic behavior since each vNode has every node in its replica set. Since the number of vNodes is proportional to the number of nodes, the amount of work grows quadratically, slowing down topology operations on large clusters. We now special-case this path to avoid the problem. #10337
  • When using the EverywhereReplicationStrategy, vnodes have no impact on data distribution, and so we can avoid splitting the query across vnodes. This improves performance on tables with a small amount of data.
  • A small race between DROP TABLE and inactive reads (readers that have just completed sending a page to the coordinator and are awaiting the next page fetch) was fixed. #11264
  • The internal representation of a CREATE TABLE statement now includes the equivalent of a DROP TABLE. This is normally unnecessary, since CREATE TABLE checks if the table already exists and refuses to proceed, but in the case of two CREATE TABLE statements executing concurrently on two different nodes (but creating the same table), each node’s check misses the other’s concurrently created new table. If the table definitions are different, this can lead to a cluster crash due to the mixture between the two definitions being illegal. The change ensures that the two table definitions are not intermixed and just one survives. Users are still urged not to create tables with the same name concurrently until the Raft schema management transition is complete. #11396
  • Vestigial support for compaction groups has been merged. Compaction groups are token ranges that have dedicated sstables and memtables, and so can be compacted and moved independently of each other. Currently, exactly one compaction group per table is supported, so this doesn’t change anything, but it will be expanded in the future.
  • The database now performs sanity tests around TimeWindowCompactionStrategy tables, to limit the number of time windows. Too many time windows can cause the system to run out of memory or open file handles. #9029
  • Per partition rate limit error error reporting has been corrected to report the new error code when an updated driver is available.#11517

Support for the new error code has been merged to:

  • Scylla Rust Driver support since version 0.6.0

  • Gocq - merged to master but it’s not a part of any release.

  • Off-strategy compaction is used when sstables from an external source (such as repair) needs to be reshaped before being handed off to the table’s compaction strategy. If off-strategy compaction is stopped, ScyllaDB used to just leave those sstables in their unreshaped form without compacting them again. It will now hand off such sstables directly to the table’s compaction strategy. #11543.

  • In compaction strategies such as LeveledCompactionStrategy and ScyllaDB Enterprise’s IncrementalCompactionStrategy, sstables are sealed when they reach a certain size (160MB and 1GB respectively). They are not split in the middle of a partition, because we were not able to recover the ordering of such split sstables. This is now possible (though not yet integrated into the compaction strategies).

  • Performance: Long-term index caching in the global cache, as introduced in 4.6, hurts the performance for workloads where accesses to the index are sparse. To mitigate this, a new configuration parameter cache_index_pages (default true) is introduced to control index caching. Setting the flag to false causes all index reads to behave like they would in BYPASS CACHE queries. Consider using false if you notice performance problems due to lowered cache hit ratio in 4.6 or 5.0. The config API can update the parameter live (without restart). #11202

  • Scylla will now reject a too-low bloom_fulter_fp_chance when creating (or altering) the table, rather than crash while flushing memtables. #11524.

  • ScyllaDB represents reads using mutation fragment streams. Several minor violations of fragment stream integrity were fixed. These could result in incorrect reads during range scans.

  • The log-structured allocator is used to manage cache and memtable memory. When memory runs out, the allocator tries to reclaim memory by evicting cache items and by defragmenting memory. If this takes too long, the allocator logs a stall report. Due to a bug, if the report threshold was set too low then the report is generated even if a stall did not happen, slowing down the system and flooding the logs. This is now fixed. #10981

  • The compaction manager now ignores out-of-disk-space (ENOSPC) exceptions when shutting down, so the server doesn’t crash in these scenarios.

  • An inaccuracy in the per_partition_rate_limit read metric was corrected. Tables with the per_partition_rate_limit property can throttle read and write activity on a per-partition basis. #11651

  • A large schema (with thousands of tables) could cause stalls when propagated from node to node. This is now fixed. #11574

  • A recently introduced regression caused a crash when speculative retry was enabled. This is now fixed. #11825

  • segmentation fault in cases where the base table schema change while MV schema is cached #10026, #11542

  • A crash when the compaction manager was asked to stop multiple times (for different reasons) was fixed.

  • A problem with RPC connections being needlessly dropped was fixed. #11780

  • When a query completes a page, ScyllaDB caches the query activity as an inactive read. When the client requests the next page, ScyllaDB re-activates the read and continues where it left off. A bug in this mechanism that could cause crashes has been fixed. #11923

  • A crash was fixed during an illegal lightweight transaction INSERT with NULL clustering key

  • ScyllaDB caches rows and (since 4.6) index entries in a single unified cache. It was observed that in some small-partition workloads index caching causes a performance regression, so index caching is now disabled by default. It can still be enabled for workloads that benefit from it. We plan to re-enable it when the regression is fixed. #11889

  • The topology management code is more relaxed about unknown endpoints to prevent crashes in tests that check for edge cases. This fixes a recent regression. #11870

  • Hinted handoff now checks that a node exists in topology before doing anything; this helps with a recent regression due to topology refactoring.

  • A crash while fetching repaid ids from the repair history table was fixed. #11966

  • Usually repair can compare and update the same shard in different nodes, for example shard 3 in one node is compared against shard 3 in another. When the number of shards in nodes is dissimilar, this doesn’t work and each shard compares against data from multiple shards in other nodes. This is now made more efficient by reducing sstable reader thrashing for this dissimilar shard count case. #12157

  • The algorithm for removing nodes from the token ring was corrected and made more efficient. It’s not known that this had any user impact. #12082

  • The CQL server will now only run requests that benefit from concurrency (e.g. QUERY and EXECUTE) in parallel. Configuration and authentication related requests will be serialized, reducing the chance for errors in those code paths.

  • A rare bug involving an allocation failure while updating cached rows was fixed. #12068

  • The system.truncated table holds information about truncation times of user tables. A recent regression caused it to be unreadable by cqlsh. It is now fixed. 12239

  • COMPACT STORAGE tables allow the user to only specify a prefix of a compound clustering key. Bugs relating to such partial keys and reversed rows were fixed.Note that compact storage is deprecated (see section). #12180

  • ScyllaDB now supports multiple compaction groups 1. This is not a user-visible feature for now.

  • Some copies of the lists of ranges to stream were eliminated from the decommission path, reducing latency spikes. #12332

  • When the global index cache is disabled, a local (per query) cache was used instead. When that cache was destroyed, a stall could result, generating a latency spike. This is now fixed. #12271

  • Compaction manager generally reacts to events to initiate compactions, but also has an hourly timer in case an event was missed (and for tombstone compaction, which isn’t triggered by an event). This timer is now less susceptible to stalls. #12390.

  • Repair tried to trigger off-strategy compaction even for a table that was dropped during repair, failing the entire repair. It ignores the dropped table now.

  • Off-strategy compaction is now enabled for all streaming topology operations (adding and removing nodes). Previously it was enabled only for repair-based node operations. Off-strategy compaction takes advantage of the fact that incoming sstables are non-overlapping to perform more efficient compaction that the one performed by the regular compaction strategy.

  • ScyllaDB sometimes reads ahead of the user request, in order to hide latency. In one case a read-ahead request which timed out caused errors to be emitted, even though this did not affect the query. The errors are now silenced. #12435

  • ScyllaDB tracks transient memory used by queries. However, it did not track decompressed memory buffers, which could lead to running out of memory in some complicated queries. This is now fixed.

  • “Unset” values are an obscure prepared statement feature that allows only some columns in an UPDATE or INSERT statement to be modified. It was a source of minor bugs and inconvenience in code. The feature has been refactored so it has less impact on the code and is more robust.

  • Fix a crash in Materialized View update row locking, caused by a race condition #12632 (RC1)

  • Fix a crash when reporting error on invalid CQL query involving field selection from a user-defined type #12739 (RC1)

Operations

  • Task Manager - a generic job tracker for long-running jobs. The Task Manager is a new per-node API that makes longlive internal tasks (such as repair or compaction) observable and controllable. Task Manager updates in 5.2 are:
    • The task manager will no longer run tasks that were aborted before being started.
    • The task manager is now aware of repair-related tasks.
    • The task manager now controls repair tasks with shard granularity.
    • A crash in the task manager related to repair tasks has been fixed.
    • Repair tasks can now be aborted via the task manager.
  • Materialized Views: When auto-snapshot is enabled and a snapshot of a base table is taken, its views are now also snapshotted. #11612
  • The removenode operation will reject streaming data from the leaving node. #11704
  • It is now possible to replace a node by mentioning its host id, rather than its IP address. This is useful in container environments, where IP addresses are transient.
  • A node checks if it was removed from the cluster before rejoining. #11355
  • In the system.local system table, we now advertise broadcast_rpc_address rather than rpc_address, similar to system.peers and Cassandra 4.1. #11201
  • Decommissioning a node is now rejected if some node is down, so that it’s possible to inform all nodes about the topology change. Similarly, bootstrap is rejected if there is a node in an unknown state. #11302
  • Gossip no longer updates topology with the IP address of a removed node, during a replacenode operation, preventing a dead node from being resurrected. #11925
  • The replacenode operation is used to replace a dead node. It has now changed to generate a new unique host ID rather than taking over the ID of the node it replaces. This makes the cluster more robust against cases where a node previously thought dead comes back alive.

Deployment and install

  • ScyllaDB Image is now available for Azure.
  • If you are running with a multi data center deployment, check the Raft section above.
  • Scylla Images (AMI, Docker, GCP IMage) are now using ubuntu-minimal, based on Ubuntu 22.04, instead of full Ubuntu.
  • Ubuntu 18.04 EOL is April 2023. As such, ScyllaDB 5.2 does not support Ubuntu 18 install. If you are still using Ubuntu 18.04 and can not upgrade, please let us know.
  • Some unneeded services are now disabled to improve security and reduce startup time ([Images]: remove services which we don't need during image creation · Issue #408 · scylladb/scylla-machine-image · GitHub )
  • The container (docker) image now uses the C locale 1 to reduce image size. Part of the ubuntu-minimal update.
  • There are now two tar packages: the standard package now does not contain debug information, while a new package with the -unstripped suffix contains the debug information. This reduces download time.
  • The .tar.gz package contents are now under a directory, similar to existing practice in most tarball packages. #8349
  • The bundled Prometheus node_exporter package was upgraded to version 1.4.0. #11400
  • The Azure snitch, used to derive datacenter and rack information from instance metadata, now handles regions which have a single availability zone. #12274
  • The “snitch” is used to derive the node’s location (data center and rack). The AWS EC2 snitch now uses Instance Metadata Service V2 (IMDSv2) to access instance metadata, per AWS recommendations. #10490
  • The installer no longer checks that systemd is installed if --without-systemd is provided. #11898
  • The bundled Java driver was updated to version 3.11.2.4.
  • The various cloud snitches, which determine the datacenter and rack association of a node, improved error checking of their communication with the cloud provider.
  • The docker image is now more robust against different network conditions, which could cause it to fail to launch. #12011

Tools

Configuration

  • The Cassandra tombstone_warn_threshold (default 1000) configuration for the maximum number of tombstones a query can scan before a warning item is now respected, producing a warning if a query takes too long.
  • Messaging will now prevent 0.0.0.0 and its IPv6 equivalent from being used as a node IP address.
  • More new config parameters:
    • Restrict_future_timestamp Controls whether to detect and forbid unreasonable USING TIMESTAMP, more than 3 days into the future. See Sanity check for USING TIMESTAMP above.
    • replace_node_first_boot - The Host ID of a dead node to replace. And alternative to the old replace_address_first_boot which uses the old node address. See replace node docs.
    • WASM (experimental feature) related configs:
      • wasm_cache_memory_fraction
      • wasm_cache_timeout_in_ms
      • wasm_cache_instance_size_limit
      • wasm_udf_yield_fuel
      • wasm_udf_total_fuel
      • wasm_udf_memory_limit
    • consistent_cluster_management - replace the Raft experimental flag (see Raft above)
    • x_log2_compaction_groups - new config for setting static number of compaction groups
    • unspooled_dirty_soft_limit - replace the old virtual_dirty_soft_limit.
    • compaction_collection_elements_count_warning_threshold - see large collection above.
    • cache_index_pages - Keep SSTable index pages in the global cache after a SSTable read
    • restrict_twcs_without_default_ttl - Controls whether to prevent creating TimeWindowCompactionStrategy tables without a default TTL. Can be true, false, or warn (default)
    • Twcs_max_window_count - The maximum number of compaction windows allowed when making use of TimeWindowCompactionStrategy (default: 50)
    • task_ttl_seconds - Time for which information about finished tasks stays in memory (default 10s)
    • broadcast-tables - new experimental Raft feature for internal testing
    • query_tombstone_page_limit - The number of tombstones after which a query cuts a page, even if not full or even empty (default 10000)

Deprecated and removed features

  • The CQL binary protocol versions 1 and 2 are no longer supported. Version 3 and above have been supported for 9 years, so it’s unlikely to be in real use. You can check for version 1 and 2 in the system.clients virtual table. #10607
  • New DateTieredCompactionStrategy tables are now rejected by default. Users should switch to TimeWindowCompactionStrategy. Existing DateTieredCompactionStrategy tables are still supported, and it is still possible to configure the database to allow new DateTieredCompactionStrategy tables.
  • Thrift API - legacy ScyllaDB (and Apache Cassandra) API is deprecated and will be removed in followup release. Thrift has been disabled by default.
  • Compact Storage - a file format used by Thrift and deprecated from Apache Cassandra, is deprecated and will be removed in followup release.

Build

  • ScyllaDB is now built using a Fedora 37 based toolchain, including clang 15.
  • A build time misconfiguration caused the CQL query parser to be compiled with a low optimization level. This is not important by itself, since generally prepared statements hide the parsing overhead, but the C++ compilation model can cause some functions in unrelated areas to also be compiled with the lower optimization level, reducing performance. This is now fixed. 12463

Monitoring and tracing

Scylla Monitoring Stack release 4.3 and later will support ScyllaDB 5.2.

metrics related updates below:

  • Shard Latencies are now reported as summaries. This is part of an effort to reduce the total number of generated metrics. In addition, empty histograms and summaries will not be reported. The overall result is a 5x reduction in the number of metrics #11173. This is how a summary looks like:
scylla_storage_proxy_coordinator_read_latency_summary_count{scheduling_group_name="statement",shard="1"} 2

scylla_storage_proxy_coordinator_read_latency_summary{quantile="0.990000",scheduling_group_name="statement",shard="1"} 640
  • There is now a metric that allows observation of update progress of materialized views from staging sstables.
  • There are now completion percentage metrics for node operations using streaming; previously the completion metrics were only available when using repair-based node operations. #11600
  • The sstable row_reads metric for m-format sstables is now properly incremented, instead of showing zeroes. #12406
  • The replica-side read metrics, which have been incorrect for some time, have been revamped. #10065