[RELEASE] Scylla 5.4 RC1 - part 3

Performance and stability

  • The row cache will now purge expired tombstones before populating the cache, removing the performance impact of scanning tombstones. Note that non-expired tombstones are still loaded.
  • ScyllaDB carefully measures the memory consumed by queries, and tries to ensure it will not exceed available memory. However, a query’s memory can grow after it has already started. If this happens to all concurrently running queries, we may run out of memory. To prevent this, two new safeguards were added. First, when one memory threshold is passed, we pause all queries except one, with the intent of completing that one query and releasing its memory. Second, if this doesn’t help and memory grows even further, we fail all other queries with the intent of letting one succeed, with the rest retried later.
  • The compaction manager reloads sstables during schema change, but it did so with quadratic complexity, causing stalls for tables that had many sstables. This is now fixed. #12499
  • Locking race in the materialized view update path was fixed. #12632
  • A size-on-disk accounting bug in commitlog was fixed. This could lead to segment recycling being stopped indefinitely, with a large reduction in write performance. #12645
  • A bug which could cause crashes while reporting errors in invalid CQL statements involving field selection from a user-defined type was fixed.
  • The compaction backlog tracker computes the amount of work remaining for compaction. It is updated when inserting sstables into the table. The efficiency of this process, for leveled compaction strategy tables, was improved.
  • io_uring support was disabled, since it appears to cause regressions.
  • ScyllaDB is now more careful when dropping user-defined types that are used by a user-defined function.
  • Out-of-memory management can halt allocations from all but one query, in order to get that query to complete and release memory. However, if that query was paused, it could deadlock the system. This is now fixed #12603
  • The row cache and memtables now hold rows and range tombstones in a unified data structure, rather than in separate data structures. This solves performance problems (throughput and latency) when a large partition has many range tombstones. Fixes #2578 #3288 #10587
  • A bug recently introduced in 5.2, triggered when a prepared statement with named bind variables was executed without providing a value for one of the variables, was fixed. #12727
  • A crash during cql3 aggregation, for the case the query returned no results, has been fixed.
  • ScyllaDB will now treat running out of disk quota (EDQUOT) in the same way it treats running out of disk space (ENOSPC). #12626
  • An edge case where a vnode token boundary coincided with a range scan boundary, but inclusiveness/exclusiveness of the token (< vs <=) did not agree, has been fixed. #12566
  • Some minor bugs in the cql transport server error handling have been corrected.
  • Repair will now ignore local keyspaces.
  • Lightweight transaction IF evaluation has been refactored to have a common code base with the rest of the system. In a few places, semantics were slightly modified. This is not expected to have any impact on production code.
  • Merging schema changes received from other nodes is now faster, when there is a large number of tables.
  • The load-and-stream feature, called by nodetool refresh, reads user-supplied sstables and copies the contents to the correct nodes across the cluster. It is now faster. Load and stream is used by Scylla Manager restore operation.
  • The load-and-stream operation reads user-supplied sstables and streams them to the cluster. It now avoids loading the bloom filter, saving memory.
  • A bug where varint or bool columns could be deserialized incorrectly in rare cases is now fixed. #12821 #12823 #12708
  • A few memory leaks, exposed by the out-of-memory query killer, have been fixed #12767
  • The code for computing proximity (whether nodes are on the same or different rack and datacenter) was optimized.
  • An edge case in repair-based node operations abort process was tightened. #12989
  • The reader concurrency semaphore is responsible for managing concurrency for read queries, balancing memory and CPU use with enough concurrency to keep the disks busy. One case that was not handled well is if a read was blocked due to the system running low on memory, and subsequently made idle (as no one is waiting for its results). This edge case has been fixed #12700
  • Reader concurrency semaphore now has more tracepoints, useful with CQL tracing. #12781
  • Materialized view updates are performed asynchronously relative to updating the main table, so failures there are not visible as an UPDATE or INSERT failure. Instead, errors are logged. Those errors are rate-limited now to avoid flooding the logs.
  • A scan on a disjoint token range (e.g. (1…100), (200…300)) could have resulted in incorrect results. It’s not possible for a user query to specify a disjoint token range; we are checking if internal sources could generate such queries. The bug itself is fixed. #12916
  • A large query is broken up into separate pages. When tracing, each page gets its own trace session. An optimization in ScyllaDB means the following page can reuse state from the previous page. The trace will now link to the previous session when this happens, improving visibility.
  • ScyllaDB uses a separate commitlog for tables holding the schema, so that schema changes do not suffer high latency under heavy write loads. This separate commitlog will now be used for all raft-managed system tables to guarantee atomicity. #12642
  • The separate commitlog for schema is now stored in a separate directory. #11867
  • The in-memory footprint of sstable summaries has been reduced. This is especially noticeable with very small partitions.
  • Running major compaction will no longer increase the compaction scheduling group shares to 200. This is not necessary since major compaction runs in the maintenance/streaming group. #13487
  • The sstable parser is now able to detect more types of corruption involving premature end-of-file with compressed sstables. #13599
  • ScyllaDB avoids allocating large contiguous memory buffers, as these stress the memory allocator. Instead, ScyllaDB uses fragmented buffers which are easier to allocate. However, most compression libraries do not work with fragmented buffers, so large linear buffers are sometimes necessary. ScyllaDB had a mechanism in place to reuse such buffers in order to avoid allocating them for every request, and here it is tightened so that reallocations are even less common. In addition its usage is corrected in sstables. #2694
  • ScyllaDB sometimes caches a running query in order to resume it later. In order to do that, it must also store the position the query has reached, since the cached query might get purged before it is resumed. To do that, it must scan over range tombstones to get to a non-ambiguous primary key position. A bug caused this scan to not terminate, resulting in the query running out of memory. This is now fixed. #13491
  • A crash during shutdown due to incorrect service ordering was fixed. #13522
  • The sstable validator had a bug in range tombstone validation fixed.
  • The sstable validation facility now validates the sstable index file (-Index.db). #9611
  • A corner case in replica-side concurrency control related to failed requests was fixed. #13539
  • The selector path (expressions in the SELECT clause) now uses non-contiguous memory. This reduces latency when selecting large blobs, as non-contiguous memory doesn’t suffer from fragmentation.
  • Immediate mode tombstone garbage collection is a schema feature that requests tombstones to be garbage-collected immediately (without waiting for gc_grace_seconds), but it ended up expiring TTLed data too early. This is now fixed. #13572
  • An edge case when converting range tombstones to the internal format used by sstables has been corrected. #12462
  • Very large compactions, involving hundreds of sstables, could cause stalls when the log message announcing the compaction is printed due to quadratic complexity. This is now fixed.
  • A performance regression, introduced in Scylla 5.0, in compactions that process a lot of tombstones has been fixed.
  • When the schema changes, the rows in the row cache have to be upgraded to the new schema. This happens on-demand as rows are hit in the cache. Until now, this happened with partition granularity - all of a partition’s rows that happened to be in cache were upgraded at the same time, causing reactor stalls and high latency when large partitions were cached. This has now been fixed, and the cache is upgraded using row granularity. #2577
  • Commitlog has gained its own scheduling group, to complement the already existing commitlog I/O priority class. This is in preparation for unification of CPU scheduling and I/O scheduling.
  • During shutdown, the system will cancel pending hint writes rather than wait for their 5-minute timeout. This can prevent delays in stopping a server. #8079
  • In internode communications, we now avoid copies of certain heavyweight objects. #12504
  • A cleanup compaction is used to get rid of token ranges that are no longer owned by a node. A bug that delayed deletion of sstables being cleaned up, thus increasing the risk of running out of space, was fixed. #14035
  • Some queries for the internal authentication table used infinite timeouts, leading to shutdown problems. This is now fixed. #13545.
  • The internal data dictionary could lose user defined types on ALTER KEYSPACE statements, resulting in a crash. This is now fixed. #14139
  • The nodetool refresh command loads foreign sstables into ScyllaDB and reshapes them for the current shard distribution. A bug could cause the clean-up after the reshape to crash. It is now fixed. #14001.
  • Repair will now use a more accurate estimate of the partition count to create bloom filters for its sstables.
  • The “forward” service is responsible for execution of automatically parallelized aggregation queries. It is now more careful to stop query retries if a shutdown is requested. #12604
  • SSTable generation numbers are integers used to give SSTables unique names. Generation numbers can now also be UUIDs, which enables placing SSTables on shared storage. #10459
  • The schema upgrade of data in the row cache recently changed granularity from partition to row, to prevent stalls when large partitions are cached. One code path could use an outdated schema, which could cause a crash. This is now fixed. #14110
  • An edge case where querying a data center that has a replication factor equal to zero could lead to a crash has been fixed. #14284
  • ScyllaDB can automatically parallelize certain aggregation queries. The mechanism however had a bug when aggregating columns that had case-sensitive names. This is now fixed #14307
  • A crash when DESCRIBE FUNCTION or DESCRIBE AGGREGATE were used on the wrong function type was fixed. #14360
  • The row cache holds frequently-read rows. When a row is written with the TTL (time-to-live) option, it is set to be automatically deleted after a certain time. If it’s in the cache, however, it will continue to occupy memory, reducing cache utilization. This is now improved, as the cache will detect and remove expired rows when they are read. Infrequently read rows will be removed from the cache using the least-recently-used mechanism.
  • After repair or data movement due to node additions or removals, materialized views need to be updated. This process involves reading from all sstables except those that have been streamed or repaired. This was slow, and is now optimized, speeding up repair and data movement on clusters that have materialized views. #14244.
  • A race condition between cleanup and regular compaction has been fixed.
  • ScyllaDB uses evictable readers in certain places to allow the system to cancel ongoing reads to reclaim memory, with the ability to resume those reads later. #14296
  • While using a lightweight transaction, if inconsistent constraints were given on the clustering key, ScyllaDB would crash. This is now fixed. #13129
  • A rare stack overflow in some repair scenarios has been fixed. #14415
  • The failure detector detects failed nodes by pinging them. Now it does not attempt to ping itself. #14388
  • When topology changes, CDC streams also change. The metadata describing these streams is now committed in parts, to avoid overloading the system.
  • When a base table of a materialized view is updated, the affected rows are also changed in the materialized view. For DELETE statements, many rows can be affected, and so the view update code splits the work into batches. However, this split was not performed correctly when range tombstones were involved. This is now fixed. #14503
  • A bug involving incorrect cross-shard access while performing a nodetool scrub command was fixed. #14515.
  • Usually, repair reconciles a shard’s data on one node with the data on the same shard in other nodes. When the number of shards on different nodes doesn’t match, repair has to pick small ranges from all shards on the remote nodes. This adds significant overhead which is most pronounced when there is little or no data in the table. This is common in tests and slows them down, so we now have an optimization for the little-data case. #14093
  • A complication that caused problems when bootstrap handled very recent topology changes was fixed. #14468 #14487
  • A recent change extending the scope of CQL Data Definition Language (DDL) transactions caused a significant performance regression, so it was reverted. #14590
  • A production installation of ScyllaDB locks all memory so we don’t experience high latency due to page faults. However, this only applies from the first time the memory is accessed; the first access can still experience stalls, made larger by using transparent huge pages. To fix this, a Seastar update adds a prefault thread that attempts to access all memory ahead of the database, taking the latency hit on this new thread rather than user queries. This will be visible as increased CPU consumption during the first few seconds (up to a minute on large machines) during process start. #8828.
  • The messaging service, responsible for inter-node communication, now initializes transport-layer security (TLS) earlier, to account for the failure detector pinging its own node. #14299
  • Resharding is a process where an sstable is split into several sstables, each wholly-owned by a single shard. A recent change to integrate resharding into the task manager was found to crash the system, so it was reverted #14475 #14618
  • The system uses a reader_concurrency_semaphore to limit the number of concurrent reads, as each read can consume large amounts of memory when merging sstables. Repair has its own allocation of concurrent reads. We now limit the scope of a read more carefully, to allow new reads to issue more quickly. #14676
  • A recent regression involving a crash in decommission was fixed. #14184
  • A Seastar update will reserve more memory for the operating system in situations that previously led to out-of-memory errors. These situations are ARM machines with 64kB pages (rather than the usual 4kB), and transparent hugepages enabled. As a side effect ScyllaDB will run with less memory.
  • The compaction manager sometimes generates sstables composed of only tombstones, in order to safeguard against a crash causing data resurrection. If there isn’t a crash, these sstables can be safely deleted. However, they are sometimes picked up for compaction before they are deleted, wasting CPU cycles. They are now excluded from compaction. #14560
  • ScyllaDB uses objects called reader_concurrency_semaphores to limit query concurrency and to isolate different service levels. We now check if the service level changed during a query and avoid erroring out in this case.
  • Recently the mechanism to update materialized views after repair was optimized. A latent use-after-free bug was discovered in the optimization, and fixed. #14812
  • A deadlock during shutdown in internode communication was fixed. #14624
  • When updating a materialized view after repair, we chunk the base table data and process each chunk individually. Chunking is based on memory consumption. However, empty partitions were not accounted for, so long runs of empty partitions could create large chunks and run the node out of memory. This is now fixed by accounting for empty partitions. #14819
  • ScyllaDB caches pages from the sstable primary index in order to reduce I/O. In certain cases it reads index pages ahead of the actual need to use them to reduce latency. In rare cases this caused an internal invariant to be violated, crashing the node. This is now fixed. #14814
  • ScyllaDB computes the version of the schema by hashing the mutations that describe the schema in the schema tables. This can lead to an inconsistency between nodes if tombstones are expired at different times. This is now fixed by ignoring empty partitions, making the tombstone expiration time irrelevant. #4485
  • Streaming and repair will now compact data before streaming it, reducing bandwidth usage if the sstables being streamed happen to contain data and tombstones that cover that data. #3561
  • A bug in the Seastar coroutine code, which could lead to unexpected crashes, has been fixed.
  • A source of high latency in multi-partition scans was eliminated. #14008
  • Change Data Capture exposes multiple streams to reflect the cluster topology. It is now more careful to avoid closing and creating new streams unnecessarily. #14055
  • ScyllaDB verifies ownership and permissions for its own files. It now avoids doing this for snapshots, as they might be concurrently being deleted by an administrator or scylla-manager #12010
  • When deleting multiple sstables at once (such as at the end of a compaction), we now avoid flushing the directory unnecessarily as we can rely on the deletion log file instead.
  • We now update the list of sstables requiring cleanup after compaction completion. This avoids a race between decommission and compaction, involving offstrategy compaction that could cause such compacted sstables not to be cleaned. #14304
  • Off-strategy compaction is run after repair or bootstrap on newly received sstables to reduce their count. This compaction now includes a cleanup, to avoid non-owned token ranges from sneaking into the main sstable set via this offstrategy compaction. #15041
  • A bug that caused nodes to fail to start if a tablet was migrated concurrently with its table being dropped was fixed. #15061
  • A recent regression causing a crash on table drop was fixed. #15097
  • ScyllaDB contains two classes of tables, system and user, and uses separate memory pools for their memtables. This avoids a deadlock when a user memtable is being flushed, and needs to allocate memtable space for a system table as part of the flush process. We now automatically designate all system tables as using the system memtable pool. #14529
  • Latency during repair of large numbers of small rows was improved. #14537
  • Previously, the cache was enhanced to remove expired tombstones on read. It will now remove expired range tombstones on read as well. This prevents tombstone accumulation in cache. #6033
  • Read concurrency on replicas is managed by reader_concurrency_semaphore. A deadlock while stopping it has been fixed. #15198
  • The index cache caches the Index.db components. It was previously disabled by default due to regressions on small-partition workloads. It is now enabled by default, with its memory usage capped at 20% of cache memory. This should improve out-of-the-box large partition performance. #15118
  • A crash if the chunk_len table parameter was set to 0 was fixed. #15265
  • Change Data Capture (CDC) updates its view of topology from time to time. It now does so in the background, to avoid slowing down topology changes. #15194
  • The cdc_generations_v3 table stores internal information about Change Data Capture streams, when consistent cluster topology is enabled. Its schema has been changed to allow for efficiently trimming older and unneeded topology information. #15163
  • Some internal tables were moved from the general commitlog to the private schema commitlog. As a result their memtables are flushed less often, reducing latency for topology changes. #15133
  • Off-strategy compaction now uses incremental compaction for run-based compaction strategies reducing temporary storage requirements. #14992.
  • Compaction strategy options are now validated earlier. #14710.
  • A rare crash when a SERVICE LEVEL is dropped has been fixed #15534
  • When checking the bloom filter for a partition key, we now hash the key once, rather than for every sstable being checked.
  • The log-structured allocator (LSA) will evict cache if a query fails because it needs more memory, and retry it. On the other hand, the reader concurrency semaphore will simulate an allocation failure to a query if it detects the system is under severe memory pressure. The two mechanisms work against each other, as we’ll simulate an allocation failure in order to terminate a query, but LSA will respond by retrying it. To avoid this, LSA will now detect the simulated allocation failure and let the query be terminated. #15278
  • If a QUORUM (or higher) read detects a mismatch between data from different replicas, it starts a process of reconciliation to bring all replicas to the same state. Previously, this did not work well when at least one replica had a large prefix of tombstones, as we would read all of them into memory. Now, we are able to incrementally process sections of the data, even if they are all tombstones. #9111
  • Performance was improved by compiling regular expressions used to validate information during process startup.
  • The mutation compactor now validates its input stream rather than the output stream.
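
The bloom filter change in the list above (hashing the partition key once rather than once per sstable) can be illustrated with a minimal sketch. This is not ScyllaDB's implementation: the BloomFilter class, the key_hashes helper, the SHA-256 based double hashing and the NUM_HASHES constant are all assumptions made for illustration.

```python
import hashlib
from typing import List

NUM_HASHES = 3  # hypothetical number of probes per key, shared by all filters


def key_hashes(key: bytes) -> List[int]:
    """Hash the key once and derive NUM_HASHES probe values from that digest."""
    digest = hashlib.sha256(key).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, so probes differ
    return [h1 + i * h2 for i in range(NUM_HASHES)]


class BloomFilter:
    """Toy bloom filter; one byte per bit for readability, not for efficiency."""

    def __init__(self, num_bits: int):
        self.num_bits = num_bits
        self.bits = bytearray(num_bits)

    def add(self, key: bytes) -> None:
        for h in key_hashes(key):
            self.bits[h % self.num_bits] = 1

    def may_contain(self, hashes: List[int]) -> bool:
        # Takes precomputed hashes, so the caller decides when hashing happens.
        return all(self.bits[h % self.num_bits] for h in hashes)


def sstables_to_read(key: bytes, filters: List[BloomFilter]) -> List[int]:
    """Return indexes of filters that may contain the key, hashing the key once."""
    hashes = key_hashes(key)  # computed once, reused for every filter
    return [i for i, f in enumerate(filters) if f.may_contain(hashes)]
```

Because each filter only reduces the precomputed hashes modulo its own size, filters of different sizes can share the same hash values, which is what makes the single hashing step possible in this sketch.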

Operations

  • After a replacenode operation, if the new node had the same IP address as the node it was replacing, the IP address was not moved from pending state to normal state. This is now fixed.
  • Logging of node failures during repair has been improved, in order to help diagnose repair failures.
  • ScyllaDB will now wait for all nodes to be healthy before attempting to bootstrap a new node. #12972
  • Cleanup is a process where an sstable is rewritten to discard all partitions that no longer belong to the node (for example, after bootstrap). It has gained an optimization where we skip over the unnecessary partitions rather than reading and discarding them. #12998 #14317.
  • Among its tasks, gossiper disseminates cluster state updates within the node. Node removal notifications were processed in the background, which could cause them to be reordered with other notifications, causing problems. This is fixed by moving processing to the foreground. #14646
  • There is a new option to specify the number of token ranges to repair in parallel. #4847
  • Since compaction tasks started to be managed by the task manager, their lifetime could be extended even after the compaction is complete. This caused the compaction input sstables to be kept on disk even after they should have been removed. They are now removed as soon as compaction is done. #14966 #15030
  • We now abort running repairs on nodetool drain commands.
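
The cleanup optimization described above (skipping over partitions the node no longer owns instead of reading and discarding them) can be sketched with a binary search over a sorted token index. The flat list of integer tokens and the inclusive (low, high) ranges are assumptions for illustration and do not reflect the sstable format.

```python
from bisect import bisect_left, bisect_right


def owned_slices(sorted_tokens, owned_ranges):
    """Return (start, end) index pairs covering only the owned tokens.

    sorted_tokens: ascending list of partition tokens (hypothetical index).
    owned_ranges: sorted, non-overlapping list of inclusive (low, high) pairs.
    """
    slices = []
    for low, high in owned_ranges:
        start = bisect_left(sorted_tokens, low)    # first token >= low
        end = bisect_right(sorted_tokens, high)    # one past last token <= high
        if start < end:
            slices.append((start, end))
    return slices


def cleanup(sorted_tokens, owned_ranges):
    """Keep only owned tokens, jumping straight to each owned range."""
    kept = []
    for start, end in owned_slices(sorted_tokens, owned_ranges):
        kept.extend(sorted_tokens[start:end])
    return kept
```

With tokens [1, 5, 9, 14, 20, 31, 40] and owned ranges [(5, 14), (30, 35)], only the index slices (1, 4) and (5, 6) are visited; tokens 1, 20 and 40 are never touched.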

Deployment and install

  • ScyllaDB installation will now tune the OS core dump service to allow a longer time to dump cores. This is necessary since ScyllaDB allocates all memory and therefore takes a longer time to dump core if an error is encountered. #5430
  • The installer now wipes filesystem signatures from the individual disks making up a RAID array, preventing problems with reuse of disks. #13737
  • We now tune the Linux kernel’s caching of inodes (in-memory structure representing file metadata) to favor evicting inodes quickly. This aims to reduce kernel memory fragmentation when there are large numbers of sstables, as most files comprising an sstable aren’t accessed after the process starts.
  • The bundled Prometheus node_exporter has been updated to version 1.6.1.

Tools

  • The CQL shell, cqlsh, has been separated into its own repository. As part of that change, cqlsh is now compatible with Python 3. cqlsh is now available as a Docker image and on PyPI, allowing you to easily use it when you do not need the entire ScyllaDB server, for example with Scylla Cloud.

  • The port option in SSTableLoader was fixed.

  • The cassandra-stress benchmarking tool’s -log hdrfile=… option now works with Java 11

  • Scylla process --list-tools option now correctly lists all tools invocable via the scylla binary.

  • The JMX support application, used to support nodetool, now runs under Java 11.

  • The scylla sstable tool now has more ways to obtain the schema. #10126. See Scylla SSTable docs for more info.

  • A bug in the nodetool command to disable auto compaction has been fixed #13553

  • The nodetool checkAndRepairCdcStreams command is used to align CDC streams with the cluster topology. It now works when topology is under Raft control.

  • The nodetool refresh command gained the --primary-replica-only option.

  • The scylla sstable tool now supports the scrub operation, enabling offline (and off-node) scrubbing of sstables. #14203

  • The cassandra-stress tool now supports the Java driver’s rack-aware policy. This can reduce cloud inter availability zone networking costs, with the downside of less even load balancing if care isn’t taken to balance the application.

  • The setup utility supported an --online-discard switch to enable/disable online discard, but it did not actually work. This is now fixed. #14963

  • The nodetool stop RESHAPE command is supposed to stop the reshape operation, but in fact only aborted running reshape compactions, which were promptly restarted. It now aborts the entire operation as expected. #15058

Configuration Updates

The scylla.yaml configuration items are now documented in the documentation website.

Additional update

New and updated configuration options:

  • It is now possible to disable configuration changes via the system.config virtual table using a configuration parameter. Use this option to prevent runtime configuration changes via CQL. #14355

  • task_ttl_in_seconds - Task Manager option: time for which information about finished tasks stays in memory.

  • RF Guardrail config values (see above)

    • minimum_replication_factor_fail_threshold
    • minimum_replication_factor_warn_threshold
    • maximum_replication_factor_warn_threshold
    • maximum_replication_factor_fail_threshold
  • stream_plan_ranges_percentage is renamed to stream_plan_ranges_fraction.

  • cache_index_pages is now enabled by default, with an index_cache_fraction value of 0.2.

    index_cache_fraction is the maximum fraction of cache memory permitted for use by the index cache. It is clamped to the [0.0; 1.0] range. It must be small enough to not deprive the row cache of memory, but should be big enough to fit a large fraction of the index. The default value of 0.2 means that at least 80% of cache memory is reserved for the row cache, while at most 20% is usable by the index cache.

  • The x_log2_compaction_groups option, which controlled the static number of compaction groups per table per shard, is removed.

  • live_updatable_config_params_changeable_via_cql - If set to true, configuration parameters defined with the LiveUpdate option can be updated at runtime with CQL (more above).

  • enable_node_aggregated_table_metrics - Enable aggregated per node, per keyspace and per table metrics reporting, applicable if enable_keyspace_column_family_metrics is false. Default: True.

  • enable_compacting_data_for_streaming_and_repair - Enable the compacting reader, which compacts the data for streaming and repair (load and stream included) before sending it to, or synchronizing it with, peers. Can reduce the amount of data to be processed by removing dead data, but adds CPU overhead. Default: True.

  • table_digest_insensitive_to_expiry - When enabled, per-table schema digest calculation ignores empty partitions. Default: True.

  • schema_commitlog_segment_size_in_mb - ScyllaDB uses a separate commitlog, called the schema commitlog, for schema changes and topology operations in order to reduce the latency of these operations. The segment size of the schema commitlog has been raised from 32MB to 128MB in order to avoid problems with large numbers of tables, as the entire schema must fit in a single segment.

  • stream_plan_ranges_percentage - Specify the percentage of ranges to stream in a single stream plan. Value is between 0 and 1. Default: 0.1. #14191

  • alternator_describe_endpoints - Overrides the behavior of Alternator’s DescribeEndpoints operation. An empty value (the default) means DescribeEndpoints will return the same endpoint used in the request. The string ‘disabled’ disables the DescribeEndpoints operation. Any other string is the fixed value that will be returned by DescribeEndpoints operations. This was required to work around an AWS SDK issue, “When DynamoDB DescribeEndpoints is used, wrong scheme may be tacked on the result” (aws/aws-sdk-cpp issue #2554).

  • auth_certificate_role_queries - Regular expression used by CertificateAuthenticator to extract role name from an accepted transport authentication certificate subject info. See more in the Security section.

  • auth_superuser_name - Initial authentication super username. Ignored if authentication tables already contain a super user.

  • auth_superuser_salted_password - Initial authentication super user salted password. Create using mkpassword or similar. The hashing algorithm used must be available on the node host. Ignored if authentication tables already contain a super user password.

  • strict_is_not_null_in_views - In materialized views, restrictions are allowed only on the view’s primary key columns. In old versions Scylla mistakenly allowed IS NOT NULL restrictions on columns which were not part of the view’s primary key. These invalid restrictions were ignored. This option controls the behavior when someone tries to create a view with such invalid IS NOT NULL restrictions. Can be true, false, or warn. Default: True.

  • object_storage_config_file - part of the new experimental object store feature (above). Optionally, read object-storage endpoints config from file.

  • “tablets” - new experimental flag.

  • relabel_config_file - optionally, read relabel config from file.

  • schema_commitlog_directory - The directory where the schema commit log is stored. This is a special commitlog instance used for schema and system tables. For optimal write performance, it is recommended the commit log be on a separate disk partition (ideally, a separate physical device) from the data file directories.

  • nodeops_watchdog_timeout_seconds - Time in seconds after which node operations abort when not hearing from the coordinator. Default: 120.

  • nodeops_heartbeat_interval_seconds - Period of heartbeat ticks in node operations, in seconds. Default: 10.

  • Query timeouts in configuration (e.g. read_request_timeout_in_ms) can now be hot-reloaded using SIGHUP. #12232

  • ScyllaDB has an error injection facility, used by QA to test error paths. It can now be enabled via configuration. Use with caution!

  • The experimental flag used to enable consistent topology changes has been renamed from “raft” to “consistent-topology-changes”. #14145

  • The schema commitlog size was accidentally set to 10TB; it is now set to a reasonable size.

  • The --max-io-requests init option, which has been obsolete for quite some time, was removed.
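
The index_cache_fraction arithmetic described above can be checked with a small sketch. The function name split_cache_memory and the idea of passing the total cache size in bytes are assumptions for illustration; only the clamping to [0.0, 1.0] and the 0.2 default come from the option description.

```python
def split_cache_memory(cache_bytes: int, index_cache_fraction: float = 0.2):
    """Return (max bytes usable by the index cache, min bytes left for the row cache).

    Hypothetical helper: mirrors the documented clamping of the fraction to
    [0.0, 1.0], not ScyllaDB's actual accounting.
    """
    fraction = min(1.0, max(0.0, index_cache_fraction))
    index_max = int(cache_bytes * fraction)
    row_min = cache_bytes - index_max
    return index_max, row_min


# With the default fraction, at most 20% goes to the index cache and
# at least 80% stays reserved for the row cache.
index_max, row_min = split_cache_memory(1000)
```

Out-of-range values are clamped rather than rejected in this sketch, so a setting of 1.5 behaves like 1.0 and a negative setting behaves like 0.0.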

Admin REST API

  • It’s now possible to disable and enable tombstone compaction on a per-node basis using a REST API endpoint. This is useful if the user knows that all DELETEs were performed with CL=ALL and so there is no risk of data resurrection.

  • The REST API that accepts sstable generation numbers now uses a string value, in preparation for using UUID generations.

  • The type of the “generation” field of “sstable” in the return value of RESTful API entry point at “/storage_service/sstable_info” is changed from “long” to “string”.

  • The API for performing sstable cleanup, used by nodetool cleanup, will now wait for staging sstables to be cleaned up too.

  • The hints synchronization point API allows an external user to wait for hints to replay. Misuse of the API cookie could lead to unbounded memory usage; the cookie is now protected with a checksum. #9405

  • The --experimental flag was removed. It was replaced some time ago with --experimental-features, which provides fine-grained control over which experimental features are enabled.

  • There is a new REST API call to recalculate schema digests. It can be useful to heal some schema disagreement problems. #15380
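
Since the “generation” field changes from “long” to “string”, client code that reads it may want to accept both forms during an upgrade. The sketch below is one way to do that. The record layout (a dictionary with a "generation" key) and both helper names are hypothetical; the exact response shape of /storage_service/sstable_info and the textual form of UUID generations are not specified here.

```python
def normalize_generation(value) -> str:
    """Return an sstable generation as a string, whether it arrived as a
    number (older API) or as a string (newer API)."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError("generation must be an integer or a string")
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        return value
    raise TypeError("generation must be an integer or a string")


def index_by_generation(sstables):
    """Index hypothetical sstable records by their normalized generation."""
    return {normalize_generation(s["generation"]): s for s in sstables}
```

Treating the generation as an opaque string, rather than parsing it back into a number, keeps such client code working once UUID generations appear.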

Build

  • The ScyllaDB source base contains several performance microbenchmarks. These are now integrated into the main Scylla binary as subcommands, so they can be run on any machine where ScyllaDB is installed e.g. scylla perf_simple_query. #12484
  • In developer and debug mode, ScyllaDB will now configure Seastar in shared library mode. This only affects developers of ScyllaDB itself, as releases still use static libraries.
  • The code base was migrated away from the standard library’s regular expression implementation to the one provided by boost. The standard library implementation was proven several times to be slow (causing stalls) and to consume too much stack space, especially on ARM.
  • The build toolchain has been updated to Fedora 38 with clang 16.0.6.

Monitoring, tracing and logging

Metrics updates below:

  • There is a new metric for prepared statement cache eviction rates. #10463
  • CQL transport metrics were refined, and new metrics were added so one can measure request and response bandwidth, for each opcode type.
  • The CQL transport server (port 9042) recently gained per-opcode bandwidth statistics. They are now measured per service level as well.
  • ScyllaDB can now relabel metrics according to user-provided configuration. This can be used together with Prometheus to reduce the number of metrics reported.
  • We now drop per-table metrics early during teardown of a table. Previously, if a table was dropped and re-created quickly, the metrics from the old and new tables could clash, resulting in an error.
  • The column name reported when writetime() is given a primary key column (which is illegal) is now human readable, even for humans that don’t remember the ASCII table.
  • If the startup sequence was aborted by an interrupt (ctrl-C or systemd shutdown), an exception error message was shown. The exception is now ignored by the system and the message is not displayed. #12898
  • When compaction completes, it reports the throughput it achieved. We now base it on the input bytes read rather than output bytes, as the latter gives incorrect results for overwrite or expiring workloads. #14533
  • There is now a REST API for configuring Prometheus metrics label rewriting
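
The compaction throughput change above comes down to which byte count is divided by the elapsed time. The sketch below uses made-up numbers for an overwrite-heavy compaction, where most input data is superseded and never written out; the function name and the figures are illustrative assumptions, not measurements.

```python
def compaction_throughput(bytes_processed: int, duration_seconds: float) -> float:
    """Return throughput in bytes per second for a finished compaction."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return bytes_processed / duration_seconds


# Illustrative numbers only: 1 GB read, 100 MB written, 10 seconds elapsed.
input_bytes = 1_000_000_000
output_bytes = 100_000_000
duration = 10.0

by_input = compaction_throughput(input_bytes, duration)    # work actually done
by_output = compaction_throughput(output_bytes, duration)  # understates the work
```

In this example the output-based figure is ten times lower than the input-based one, even though the compaction read and merged the full gigabyte.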