ScyllaDB Enterprise Release 2024.1.0 - Deployment and more improvements

See 2024.1 release notes

Deployment and install

  • ScyllaDB Enterprise 2024.1 is officially supported on Rocky / RHEL 9.
  • RHEL / CentOS 7 support is deprecated and will not be supported in future releases.
  • ScyllaDB installation now tunes the OS core dump service to allow more time to dump cores. This is necessary because ScyllaDB allocates all available memory, so dumping core takes longer when an error is encountered. #5430
  • The installer now wipes filesystem signatures from the individual disks making up a RAID array, preventing problems with reuse of disks. #13737
  • We now tune the Linux kernel’s caching of inodes (in-memory structure representing file metadata) to favor evicting inodes quickly. This aims to reduce kernel memory fragmentation when there are large numbers of sstables, as most files comprising an sstable aren’t accessed after the process starts.
  • Prefault memory when --lock-memory 1 is specified, preventing (very rare) stalls from having the kernel defragment memory when using transparent hugepages.
  • The bundled Prometheus node_exporter has been updated to version 1.7.
  • GCP e2-micro is now supported out of the box.
  • Bugs in Azure images have been fixed. #428 #420 #431
  • Build: Docker images now upgrade all 3rd-party packages on creation. #16222

More Improvements

Streaming: Add stream_plan_ranges_fraction

This option allows the user to change the number of ranges streamed in a batch per stream plan, as a fraction of the total ranges. The default value is the same as before: 10% of the total ranges. #14191
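
For illustration, a minimal scylla.yaml sketch; this assumes the option is set like other streaming options, and a value of 0.1 reproduces the default of 10%:

    # scylla.yaml
    # Stream 25% of the total ranges per batch in each stream plan,
    # instead of the default 10%.
    stream_plan_ranges_fraction: 0.25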

CQL API

  • Correctness: a rare combination of reconciliation (read repair), reverse queries, and range tombstones could cause incorrect data to be returned from queries. #10598
  • Correctness: Row cache updates did not provide strong exception safety guarantees. In a rare case, when using CL=1, the cache might return a stale value. #15576
  • Correctness: Inserted data could become available only after a restart. The root cause was a very rare bug in the cache, which is hidden (in most cases) by replication and reconciliation. #16759
  • Correctness: a very rare bug in the row cache might cause a wrong value to be returned. #15483
  • CQL table columns that have the list data type aren’t allowed to contain NULLs, but in certain situations list values in CQL literals or bind variables are allowed to contain NULLs (for example, in LWT IF conditions that use the IN operator). The type system was relaxed to accept NULLs where this is allowed. Previously, these cases were handled by hard-to-maintain workarounds.
  • The CQL USING TTL clause allows one to specify an INSERT or UPDATE’s time-to-live property, after which the cells are automatically deleted. TTL 0 was misinterpreted as the default TTL (which happens to be unlimited, usually) rather than an explicitly unlimited TTL. This is now fixed (see the example after this list). #6447
  • The C-style cast syntax ((type) expression) can now be applied to bind variables ((type) ? or (type) :var) to explicitly specify the type of bind variables. Example: blob_column = (blob)(int)12323
  • Error messages for incorrect usage of the CQL TOKEN() function have been improved. #13468
  • The check for altering permissions of functions in the system keyspace has been tightened.
  • Error messages involving CQL expressions are now printed in a more user-friendly way. Previously they contained some debug information.
  • Change Data Capture (CDC) exports updates to the database as a table containing changes. One option is to capture not only the change, but also the state of the row before it was changed. In some cases, in a lightweight transaction (LWT) change, the preimage could return the state of the row after the change instead of before the change. This is now fixed. #12098
  • The NetworkTopologyStrategy replication strategy will now reject an empty value for the replication factor. #13986
  • Materialized views require the “IS NOT NULL” qualifier on primary key elements, but also accepted (and ignored) the qualifier on regular columns. The qualifier is now rejected when applied to regular columns (see the example after this list). A configuration variable allows you to warn about the rejected clause, emit an error and fail the request, or ignore it. #10365
  • The count(column) function is supposed to only count cells where the column is not NULL. A regression caused count(column) to behave like count(*) for collection, tuple, and user-defined column types. This is now fixed. #14198.
  • When performing the last-write-wins rule comparison, if the timestamp of the two versions being compared was equal, ScyllaDB first compared the cell value and then the expiration time (TTL). This is compatible with earlier versions of Cassandra. However, this could cause a NULL value to appear if the cell was overwritten with the same timestamp but a different TTL. The algorithm was changed to compare the cell value last, and check all the other metadata first, resulting in fewer surprising results. It is also compatible with current Cassandra versions. #14182
  • A GROUP BY query ought to return one row per group, except when all rows of a group are filtered out. However, ScyllaDB returned a row even for fully-filtered groups. This is now fixed, and ScyllaDB will not emit rows for filtered groups. #12477
  • In older versions of ScyllaDB, different clauses of CQL statements were processed using different code bases. ScyllaDB is gradually moving towards a single code base for processing expressions. It is now the SELECT clause’s turn, moving us closer to the goal of a unified expression syntax. As this is an internal refactoring, there are no user visible changes, apart from some names of fields in SELECT JSON statements changing (specifically, if those fields are function evaluations).
  • A recent regression when using GROUP BY together with the ttl() and writetime() pseudo-functions was fixed. #14715
  • There is a new SELECT MUTATION_FRAGMENTS statement that allows seeing where the data that composes a selection comes from. Normally, cache, sstable, and memtable data are merged before output, but with this variant one can see the original source of the data. This is intended for forensics and is not a stable API (see the example after this list). #11130
  • The CQL grammar incorrectly accepted nonsensical empty limit clauses such as SELECT * FROM tab LIMIT;. The errors were discovered later in processing, but with unhelpful error messages. They are now rejected. #14705.
  • The CQL grammar incorrectly accepted nonsensical INSERT JSON statements such as INSERT INTO tab JSON;, causing a crash. This is now fixed. #14709
  • A mistake in function type inference, which could lead CQL statements to claim there is ambiguity when in fact there is none, was fixed.
  • The format of the timestamp data type is now compatible with Cassandra. #14518
  • In CQL, a few functions for dealing with counter types were added. #14501
  • A SELECT statement that has the DISTINCT keyword and also GROUP BY on clustering keys is now rejected. DISTINCT implies only selecting the partition key and static rows, so grouping on the clustering keys is nonsensical. #12479
  • When ALTERing a table, the compaction strategy options are now validated. #2336
  • A bug in the fromJson() CQL function when operating on NULL operands has been fixed #7912
  • The DESCRIBE statement now includes user defined types and functions #14170
  • The column names for SELECT CAST(b AS int) and similar expressions have been adjusted to match Cassandra. #14508
  • In some cases where a bind variable was used both for the partition key and to match a non-key column, ScyllaDB would not generate correct partition key routing for the driver. This is now fixed. #15374
  • A map<ascii, something> value, when parsed from its JSON representation, did not parse the key correctly. This is now fixed. #7949
  • SSTable compression can be configured with a chunk size, with larger chunks trading less efficient I/O and higher latency for higher compression ratios. The chunk size is now capped at 128 kB, to avoid running out of memory (see the example after this list). #9933
  • CQL: Adding a UDT field of type duration (using ALTER TYPE) did not check whether it is allowed. For example, values of the duration type are not allowed for clustering columns. #12913
  • CQL: Changes to tombstone GC settings (mode, GC period, etc.) did not take full effect until restart (or until all sstables were recompacted). This is relevant when the user switches from timeout to repair mode. #15643
  • Setup: dist: an ‘ordering cycle’ on var-lib-scylla.mount might be wrong, resulting in a potentially incorrect init order of dependencies, like the file system and the scylla service. #8761
  • CQL: toJson() produces invalid JSON for columns with “time” type #7988
  • Stability: a USE statement might throw an exception instead of returning an exceptional future, for example if the keyspace doesn’t exist. #14449
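
Example for the USING TTL 0 fix above — a minimal CQL sketch, with hypothetical keyspace/table names:

    -- Explicitly store the row without expiration, even if the table
    -- defines a default_time_to_live. Before the fix, TTL 0 was
    -- misread as “use the default TTL”.
    INSERT INTO ks.t (pk, v) VALUES (1, 42) USING TTL 0;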
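
Example for the materialized view IS NOT NULL change above — a minimal CQL sketch, with hypothetical names, where r is a regular column:

    CREATE MATERIALIZED VIEW ks.t_by_v AS
        SELECT * FROM ks.t
        WHERE v IS NOT NULL AND pk IS NOT NULL  -- required on the view’s primary key elements
        PRIMARY KEY (v, pk);
    -- Adding “AND r IS NOT NULL” for a regular column r is now rejected
    -- (or warned about, depending on the configuration variable).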
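
Example for SELECT MUTATION_FRAGMENTS above — a minimal CQL sketch with hypothetical names; as noted, this is a diagnostic facility, not a stable API:

    -- Shows each fragment of the selected data together with its source
    -- (memtable, row cache, or a specific sstable), instead of the merged result.
    SELECT * FROM MUTATION_FRAGMENTS(ks.t);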
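
Example for the compression chunk size cap above — a minimal CQL sketch, with hypothetical names:

    -- 128 kB is now the maximum accepted chunk length.
    ALTER TABLE ks.t WITH compression = {
        'sstable_compression': 'LZ4Compressor',
        'chunk_length_in_kb': 128
    };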

Amazon DynamoDB Compatible API (Alternator)

  • A bug was fixed that could cause error handling while streaming responses to the client to crash the server. #14453
  • It’s now possible to disable the DescribeEndpoints API. This makes it possible to run the dynamodb shell against ScyllaDB. #14410
  • Alternator now limits embedded expression length and nesting. #14473
  • Table name validation has been optimized.
  • In alternator (ScyllaDB’s implementation of the DynamoDB API), a bug in concurrent modification of table tags has been fixed. #6389
  • Validation of decimal numbers has been improved. #6794
  • Timeout configuration value can be hot-updated without restarting the node.
  • Alternator now returns the full table description as a response to the DeleteTable API request. #11472
  • Alternator now avoids latency spikes for unrelated requests while building large responses for batch_get_item. #13689
  • Alternator validation of the table name on ordinary read/write requests is done only if the table lookup fails. This provides a small optimization. #12538
  • Alternator implemented the error path of the size() function incorrectly. This is now fixed. #14592
  • Fixed an issue where some large sstables were left behind after TTL expiration, gc-grace-period, and major compaction (tombstones were not deleted). #1191

Performance and stability

  • The row cache will now purge expired tombstones before populating the cache, removing the performance impact of scanning tombstones. Note that non-expired tombstones are still loaded.
  • ScyllaDB carefully measures the memory consumed by queries, and tries to ensure it will not exceed available memory. However, a query’s memory can grow after it has already started. If this happens to all concurrently running queries, we may run out of memory. To prevent this, two new safeguards were added: first, when one memory threshold is passed, we pause all queries except one, with the intent of completing that one query and releasing memory. If this doesn’t help and memory grows even further, we fail all other queries with the intent of letting one succeed, with the rest retried later.
  • The compaction manager reloads sstables during schema change, but it did so with quadratic complexity, causing stalls for tables that had many sstables. This is now fixed. #12499
  • Locking race in the materialized view update path was fixed. #12632
  • A size-on-disk accounting bug in commitlog was fixed. This could lead to segment recycling being stopped indefinitely, with a large reduction in write performance. #12645
  • A bug which could cause crashes while reporting errors in invalid CQL statements involving field selection from a user-defined type was fixed.
  • The compaction backlog tracker computes the amount of work remaining for compaction. It is updated when inserting sstables into the table. The efficiency of this process, for leveled compaction strategy tables, was improved.
  • ScyllaDB is now more careful when dropping user-defined types that are used by a user-defined function.
  • Out-of-memory management can halt allocations from all but one query, in order to get that query to complete and release memory. However, if that query was paused, it could deadlock the system. This is now fixed #12603
  • The row cache and memtables now hold rows and range tombstones in a unified data structure, rather than in separate data structures. This solves performance problems (throughput and latency) when a large partition has many range tombstones. Fixes #2578 #3288 #10587
  • A recently-introduced bug (in 5.2), triggered when a prepared statement with named bind variables was executed without providing a value for one of the variables, was fixed. #12727
  • A crash during cql3 aggregation, for the case where the query returned no results, has been fixed.
  • ScyllaDB will now treat running out of disk quota (EDQUOT) in the same way it treats running out of disk space (ENOSPC). #12626
  • An edge case where a vnode token boundary coincided with a range scan boundary, but inclusiveness/exclusiveness of the token (< vs <=) did not agree, has been fixed. #12566
  • Some minor bugs in the cql transport server error handling have been corrected.
  • Repair will now ignore local keyspaces.
  • Lightweight transaction IF evaluation has been refactored to have a common code base with the rest of the system. In a few places, semantics were slightly modified. This is not expected to have any impact on production code.
  • Merging schema changes received from other nodes is now faster, when there is a large number of tables.
  • The load-and-stream feature, called by nodetool refresh, reads user-supplied sstables and copies the contents to the correct nodes across the cluster. It is now faster. Load and stream is used by the Scylla Manager restore operation.
  • The load-and-stream operation reads user-supplied sstables and streams them to the cluster. It now avoids loading the bloom filter, saving memory.
  • A bug where varint or bool columns could be deserialized incorrectly in rare cases is now fixed. #12821 #12823 #12708
  • A few memory leaks, exposed by the out-of-memory query killer, have been fixed #12767
  • The code for computing proximity (whether nodes are on the same or different rack and datacenter) was optimized.
  • An edge case in repair-based node operations abort process was tightened. #12989
  • The reader concurrency semaphore is responsible for managing concurrency for read queries, balancing memory and CPU use with enough concurrency to keep the disks busy. One case that was not handled well is if a read was blocked due to the system running low on memory, and subsequently made idle (as no one is waiting for its results). This edge case has been fixed #12700
  • Reader concurrency semaphore now has more tracepoints, useful with CQL tracing. #12781
  • Materialized view updates are performed asynchronously relative to updating the main table, so failures there are not visible as an UPDATE or INSERT failure. Instead, errors are logged. Those errors are rate-limited now to avoid flooding the logs.
  • A scan on a disjoint token range (e.g. (1…100), (200…300)) could have resulted in incorrect results. It’s not possible for a user query to specify a disjoint token range; we are checking if internal sources could generate such queries. The bug itself is fixed. #12916
  • A large query is broken up into separate pages. When tracing, each page gets its own trace session. An optimization in ScyllaDB means the following page can reuse state from the previous page. The trace will now link to the previous session when this happens, improving visibility.
  • ScyllaDB uses a separate commitlog for tables holding the schema, so that schema changes do not suffer high latency under heavy write loads. This separate commitlog will now be used for all raft-managed system tables to guarantee atomicity. #12642
  • The separate commitlog for schema is now stored in a separate directory. #11867
  • The in-memory footprint of sstable summaries has been reduced. This is especially noticeable with very small partitions.
  • Running major compaction will no longer increase the compaction scheduling group shares to 200. This is not necessary since major compaction runs in the maintenance/streaming group. #13487
  • The sstable parser is now able to detect more types of corruption involving premature end-of-file with compressed sstables. #13599
  • ScyllaDB avoids allocating large contiguous memory buffers, as these stress the memory allocator. Instead, ScyllaDB uses fragmented buffers which are easier to allocate. However, most compression libraries do not work with fragmented buffers, so large linear buffers are sometimes necessary. ScyllaDB had a mechanism in place to reuse such buffers in order to avoid allocating them for every request, and here it is tightened so that reallocations are even less common. In addition its usage is corrected in sstables. #2694
  • ScyllaDB sometimes caches a running query in order to resume it later. In order to do that, it must also store the position at which the query is at, since the cached query might get purged before it is resumed. To do that, it must scan over range tombstones to get to an unambiguous primary key position. A bug caused this scan to not terminate, resulting in the query running out of memory. This is now fixed. #13491
  • A crash during shutdown due to incorrect service ordering was fixed. #13522
  • A bug in the sstable validator’s range tombstone validation was fixed.
  • The sstable validation facility now validates the sstable index file (-Index.db). #9611
  • A corner case in replica-side concurrency control related to failed requests was fixed. #13539
  • The selector path (expressions in the SELECT clause) now uses non-contiguous memory. This reduces latency when selecting large blobs, as non-contiguous memory doesn’t suffer from fragmentation.
  • Immediate mode tombstone garbage collection is a schema feature that requests tombstones to be garbage-collected immediately (without waiting for gc_grace_seconds), but it ended up expiring TTLed data too early. This is now fixed. #13572
  • An edge case when converting range tombstones to the internal format used by sstables has been corrected. #12462
  • Very large compactions, involving hundreds of sstables, could cause stalls due to quadratic complexity when the log message announcing the compaction was printed. This is now fixed.
  • A performance regression, introduced in Scylla 5.0, in compactions that process a lot of tombstones has been fixed.
  • When the schema changes, the rows in the row cache have to be upgraded to the new schema. This happens on-demand as rows are hit in the cache. Until now, this happened with partition granularity - all of a partition’s rows that happened to be in cache were upgraded at the same time, causing reactor stalls and high latency when large partitions were cached. This has now been fixed, and the cache is upgraded using row granularity. #2577
  • Commitlog has gained its own scheduling group, to complement the already existing commitlog I/O priority class. This is in preparation for unification of CPU scheduling and I/O scheduling.
  • During shutdown, the system will cancel pending hint writes rather than wait for their 5-minute timeout. This can prevent delays in stopping a server. #8079
  • In internode communications, we now avoid copies of certain heavyweight objects. #12504
  • A cleanup compaction is used to get rid of token ranges that are no longer owned by a node. A bug that delayed deletion of sstables being cleaned up, thus increasing the risk of running out of space, was fixed. #14035
  • Some queries for the internal authentication table used infinite timeouts, leading to shutdown problems. This is now fixed. #13545.
  • The internal data dictionary could lose user defined types on ALTER KEYSPACE statements, resulting in a crash. This is now fixed. #14139
  • The nodetool refresh command loads foreign sstables into ScyllaDB and reshapes them for the current shard distribution. A bug could cause the clean-up after the reshape to crash. It is now fixed. #14001.
  • Repair will now use a more accurate estimate of the partition count to create bloom filters for its sstables.
  • The “forward” service is responsible for execution of automatically parallelized aggregation queries. It is now more careful to stop query retries if a shutdown is requested. #12604
  • SSTable generation numbers are integers used to give SSTables unique names. Generation numbers can now also be UUIDs, which enables placing SSTables on shared storage. #10459
  • Recently, schema changes to data in the row cache changed the upgrade granularity from partition to row, to prevent stalls when large partitions are cached. One place could use an outdated schema, which could cause a crash. This is now fixed. #14110
  • An edge case where querying a data center that has a replication factor equal to zero could lead to a crash has been fixed. #14284
  • ScyllaDB can automatically parallelize certain aggregation queries. The mechanism however had a bug when aggregating columns that had case-sensitive names. This is now fixed #14307
  • A crash when DESCRIBE FUNCTION or DESCRIBE AGGREGATE were used on the wrong function type was fixed. #14360
  • The row cache holds frequently-read rows. When a row is written with the TTL (time-to-live) option, it is set to be automatically deleted after a certain time. If it’s in the cache, however, it will continue to occupy memory, reducing cache utilization. This is now improved, as the cache will detect and remove expired rows when they are read. Infrequently read rows will be removed from the cache using the least-recently-used mechanism.
  • After repair or data movement due to node additions or removals, materialized views need to be updated. This process involves reading from all sstables except those that have been streamed or repaired. This was slow, and is now optimized, speeding up repair and data movement on clusters that have materialized views. #14244.
  • A race condition between cleanup and regular compaction has been fixed.
  • ScyllaDB uses evictable readers in certain places to allow the system to cancel ongoing reads to reclaim memory, with the ability to resume those reads later. #14296
  • While using a lightweight transaction, if inconsistent constraints were given on the clustering key, ScyllaDB would crash. This is now fixed. #13129
  • A rare stack overflow in some repair scenarios has been fixed. #14415
  • The failure detector detects failed nodes by pinging them. Now it does not attempt to ping itself. #14388
  • When topology changes, CDC streams also change.
    The metadata describing these streams is now committed in parts, to avoid overloading the system.
  • When a base table of a materialized view is updated, the affected rows are also changed in the materialized view. For DELETE statements, many rows can be affected, and so the view update code splits the work into batches. However, this split was not performed correctly when range tombstones were involved. This is now fixed. #14503
  • A bug involving incorrect cross-shard access while performing a nodetool scrub command was fixed. #14515.
  • Usually, repair reconciles a shard’s data on one node with the data on the same shard in other nodes. When the number of shards on different nodes doesn’t match, repair has to pick small ranges from all shards on the remote nodes. This adds significant overhead which is most pronounced when there is little or no data in the table. This is common in tests and slows them down, so we now have an optimization for the little-data case. #14093
  • A complication in bootstrap’s handling of very recent topology changes was fixed. #14468 #14487
  • A recent change extending the scope of CQL Data Definition Language (DDL) transactions caused a significant performance regression, so it was reverted. #14590
  • A production installation of ScyllaDB locks all memory so we don’t experience high latency due to page faults. However, this only applies from the first time the memory is accessed; the first access can still experience stalls, made larger by using transparent huge pages. To fix this, a Seastar update adds a prefault thread that attempts to access all memory ahead of the database, taking the latency hit on this new thread rather than user queries. This will be visible as increased CPU consumption during the first few seconds (up to a minute on large machines) during process start. #8828.
  • The messaging service, responsible for inter-node communication, now initializes transport-layer security (TLS) earlier, to account for the failure detector pinging its own node. #14299
  • Resharding is a process where an sstable is split into several sstables, each wholly-owned by a single shard. A recent change to integrate resharding into the task manager was found to crash the system, so it was reverted #14475 #14618
  • The system uses a reader_concurrency_semaphore to limit the number of concurrent reads, as each read can consume large amounts of memory when merging sstables. Repair has its own allocation of concurrent reads. We now limit the scope of a read more carefully, to allow new reads to issue more quickly. #14676
  • A recent regression involving a crash in decommission was fixed. #14184
  • A Seastar update will reserve more memory for the operating system in situations that previously led to out-of-memory errors. These situations are ARM machines with 64kB pages (rather than the usual 4kB), and transparent hugepages enabled. As a side effect ScyllaDB will run with less memory.
  • The compaction manager sometimes generates sstables composed of only tombstones, in order to safeguard against a crash causing data resurrection. If there isn’t a crash, these sstables can be safely deleted. However, they are sometimes picked up for compaction before they are deleted, wasting CPU cycles. They are now excluded from compaction. #14560
  • ScyllaDB uses objects called reader_concurrency_semaphores to limit query concurrency and to isolate different service levels. We now check if the service level changed during a query and avoid erroring out in this case.
  • Recently the mechanism to update materialized views after repair was optimized. A latent use-after-free bug was discovered in the optimization, and fixed. #14812
  • A deadlock during shutdown in internode communication was fixed. #14624
  • When updating a materialized view after repair, we chunk the base table data and process each chunk individually. Chunking is based on memory consumption. However, empty partitions were not accounted for, so long runs of empty partitions could create large chunks and run the node out of memory. This is now fixed by accounting for empty partitions. #14819
  • ScyllaDB caches pages from the sstable primary index in order to reduce I/O. In certain cases it reads index pages ahead of the actual need to use them to reduce latency. In rare cases this caused an internal invariant to be violated, crashing the node. This is now fixed. #14814
  • ScyllaDB computes the version of the schema by hashing the mutations that describe the schema in the schema tables. This can lead to an inconsistency between nodes if tombstones are expired at different times. This is now fixed by ignoring empty partitions, making the tombstone expiration time irrelevant. #4485
  • Streaming and repair will now compact data before streaming it, reducing bandwidth usage if the sstables being streamed happen to contain data and tombstones that cover that data. #3561
  • A bug in the Seastar coroutine code, which could lead to unexpected crashes, has been fixed.
  • A source of high latency in multi-partition scans was eliminated. #14008
  • Change Data Capture exposes multiple streams to reflect the cluster topology. It is now more careful to avoid closing and creating new streams unnecessarily. #14055
  • ScyllaDB verifies ownership and permissions for its own files. It now avoids doing this for snapshots, as they might be concurrently being deleted by an administrator or scylla-manager #12010
  • When deleting multiple sstables at once (such as at the end of a compaction), we now avoid flushing the directory unnecessarily as we can rely on the deletion log file instead.
  • We now update the list of sstables requiring cleanup after compaction completes. This avoids a race between decommission and compaction, in which off-strategy compaction could cause compacted sstables not to be cleaned. #14304
  • Off-strategy compaction is run after repair or bootstrap on newly received sstables to reduce their count. This compaction now includes a cleanup, to prevent non-owned token ranges from sneaking into the main sstable set via off-strategy compaction. #15041
  • A bug that caused nodes to fail to start if a tablet was migrated concurrently with its table being dropped was fixed. #15061
  • A recent regression causing a crash on table drop was fixed. #15097
  • ScyllaDB contains two classes of tables, system and user, and uses separate memory pools for their memtables. This avoids a deadlock when a user memtable is being flushed, and needs to allocate memtable space for a system table as part of the flush process. We now automatically designate all system tables as using the system memtable pool. #14529
  • Latency during repair of large numbers of small rows was improved. #14537
  • Previously, the cache was enhanced to remove expired tombstones on read. It will now remove expired range tombstones on read as well. This prevents tombstone accumulation in cache. #6033
  • Read concurrency on replicas is managed by reader_concurrency_semaphore. A deadlock while stopping it has been fixed. #15198
  • The index cache caches the Index.db components. It was previously disabled by default due to regressions on small-partition workloads. It is now enabled by default, with its memory usage capped at 20% of cache memory. This should improve out-of-the-box large partition performance. #15118
  • A crash if the chunk_len table parameter was set to 0 was fixed. #15265
  • Change Data Capture (CDC) updates its view of topology from time to time. It now does so in the background, to avoid slowing down topology changes. #15194
  • The cdc_generations_v3 table stores internal information about Change Data Capture streams, when consistent cluster topology is enabled. Its schema has been changed to allow for efficiently trimming older and unneeded topology information. #15163
  • Some internal tables were moved from the general commitlog to the private schema commitlog. As a result their memtables are flushed less often, reducing latency for topology changes. #15133
  • Off-strategy compaction now uses incremental compaction for run-based compaction strategies reducing temporary storage requirements. #14992.
  • Compaction strategy options are now validated earlier. #14710.
  • A rare crash when a SERVICE LEVEL is dropped has been fixed #15534
  • When checking the bloom filter for a partition key, we now hash the key once, rather than for every sstable being checked.
  • The log-structured allocator (LSA) will evict cache if a query fails because it needs more memory, and retry it. On the other hand, the reader concurrency semaphore will simulate an allocation failure to a query if it detects the system is under severe memory pressure. The two mechanisms work against each other, as we’ll simulate an allocation failure in order to terminate a query, but LSA will respond by retrying it. To avoid this, LSA will now detect the simulated allocation failure and let the query be terminated. #15278
  • If a QUORUM (or higher) read detects a mismatch between data from different replicas, it starts a process of reconciliation to bring all replicas to the same state. Previously, this did not work well when at least one replica had a large prefix of tombstones, as we would read all of them into memory. Now, we are able to incrementally process sections of the data, even if they are all tombstones. #9111
  • Startup performance was improved by compiling the regular expressions used to validate information during process startup.
  • Stability: failure detector apis need to call gossiper on shard 0 #15816
  • Stability: migration_manager: schema version correctness depends on order of feature enabling #16004
  • Stability: nodetool enablebinary starts the CQL server in the streaming group, instead of statement group #15485
  • Stability: Overloading scylla with materialized view writes can lead to deadlock #15844
  • Stability: raft topology: don’t register topology-on-raft RPCs in non-topology-on-raft mode. After this change, topology on raft RPCs are registered only if the experimental topology on raft mode is enabled #15862
  • Stability: raft: large delays between io_fiber iterations in schema change test. ScyllaDB uses two separate memory reservation systems for memtables: user, used for user writes, and system, used for ScyllaDB’s own writes. The root cause was that Raft did not use the system reservation. #15622

  • Stability: read load failing after one node upgrade [bad_enum_set_mask (Bit mask contains invalid enumeration indices.)] #15795

  • Performance: Repairing a cluster after a restore causes severe reactor stalls throughout the cluster (due to expensive logging within do_repair_ranges() without yield) #14330

  • Install: scylla_post_install.sh: “[ $RHEL ]” does not work for RHEL, it only detects CentOS #16040

  • Stability: test_interrupt_build_process dtest failed with schema_registry - Tried to build a global schema for view ks.t_by_v2 with an uninitialized base info #14011

  • Stability: tests.topology_experimental_raft.test_raft_cluster_features.debug test is flaky. The root cause was error handling in the Raft coordinator. #15747 #15728

  • Stability: The mutation compactor now validates its input stream rather than the output stream.

  • [IPv6 configuration] A node is stuck with “?U” status and Host ID is “null”, unclear reason #16039

  • Stability: assigning position_in_partition is not exception safe, can lead to incorrect data during memory stress #15822

  • Stability: Scylla cluster nodes utilize 100% of CPU even with no load #12774, #13377, #7753

  • Performance: Major compaction will now merge any sstables streamed in due to decommission or repair before starting compaction. This generates compacted sstables more in line with expectations. #11915.

  • Performance: To generate efficient bloom filters, we estimate the number of partitions in the sstable we will produce. The estimation has been improved for data models where the partition keys dominate the on-disk size. #15726.

  • Stability: repair should handle abort_requested_exception gracefully #15710

  • Stability: row_cache::row_cache() isn’t exception-safe #15632

  • Stability: Recently, we changed the schema version algorithm not to hash the entire schema as this causes slow performance with large numbers of tables. This has been reverted due to a regression. #15530.

  • Stability: Compaction will now avoid garbage-collecting tombstones that potentially delete data in commitlog. This prevents data resurrection in the event that a node crashes and replays commitlog. This is rare since generally commitlog data is relatively fresh and tombstones that delete such data would not be garbage collected for other reasons. #14870

  • Performance: Bloom filter efficiency can be reduced after node operation. When writing an sstable, ScyllaDB estimates how many partitions it will have in order to size the bloom filter correctly. In some cases, the estimation was suboptimal for TWCS. #15704

  • Stability: commitlog replay can cause abort due to over-extended skip. During commitlog replay, ScyllaDB skips over corrupted sections. However if the corrupted section also has corrupt size, it can lead to a crash. #15269

  • Stability: compaction_manager::perform_cleanup does not handle condition_variable_timed_out, which may cause nodetool cleanup to fail with exit status 2. #15669

  • Stability: tasks: dangling reference to task’s child pointer #16380

  • Stability: ICS incorrectly calculated gc_before on its own, without taking into account the GC mode and other factors. This might lead ICS to wrongly assume data can be GCed, with compaction down the road realizing that the data cannot be GCed.

  • Stability: ICS does not respect the staleness condition for sstable runs, possibly shadowing data in memtables

  • Stability: ICS cross-tier tombstone compaction can be delayed indefinitely and doesn’t respect ‘tombstone_compaction_interval’

  • Stability: ICS is not honoring tombstone GC mode when scheduling compaction jobs

  • Stability: A race condition in LDAP setup was fixed.

  • Stability: A regression in IPv6 address formatting caused nodetool problems, like breaking when there is an Alternator GSI in the database (#16153), or a node being stuck with “?U” status and a “null” Host ID (#16039).

  • Stability: in a rare case after a node restart, there is a short window during which workload prioritizations are not yet propagated to the node; during this window the node can assume wrong priorities and crash with OOM.

  • Stability: Adding a column to the base table should invalidate prepared statements for views #16392

  • Stability: nodes crashing during repair operations (due to no reader-closing on unexpected exception) #16606

  • Stability: tombstone might not be garbage-collected due to conflicts with data in commitlog. #15777

Operations

  • After a replace-node operation, if the new node had the same IP address as the node it replaced, the IP address was not moved from pending state to normal state. This is now fixed.
  • Logging of node failures during repair has been improved, in order to help diagnose repair failures.
  • ScyllaDB will now wait for all nodes to be healthy before attempting to bootstrap a new node. #12972
  • Cleanup is a process where an sstable is rewritten to discard all partitions that no longer belong to the node (for example, after bootstrap). It has gained an optimization where we skip over the unnecessary partitions rather than reading and discarding them. #12998 #14317.
  • Among its tasks, gossiper disseminates cluster state updates within the node. Node removal notifications were processed in the background, which could cause them to be reordered with other notifications, causing problems. This is fixed by moving processing to the foreground. #14646
  • There is a new option to specify the number of token ranges to repair in parallel (see the sketch after this list). #4847
  • Since compaction tasks started to be managed by the task manager, their lifetime could be extended even after the compaction is complete. This caused the compaction input sstables to be kept on disk even after they should have been removed. They are now removed as soon as compaction is done. #14966 #15030
  • We now abort running repairs on nodetool drain commands.
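
For illustration, a hedged sketch of how the repair parallelism option might be passed through ScyllaDB’s REST API (served on port 10000); the parameter name ranges_parallelism is an assumption here and may differ in your version:

    # Hypothetical: repair keyspace ks, limiting repair to 8 token ranges in parallel.
    curl -X POST "http://127.0.0.1:10000/storage_service/repair_async/ks?ranges_parallelism=8"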