ScyllaDB Enterprise Release 2022.2.0

The ScyllaDB team is pleased to announce the release of ScyllaDB Enterprise 2022.2, a production-ready ScyllaDB Enterprise Feature release.

ScyllaDB Enterprise 2022.1 is the latest Long-Term Support (LTS) release.

More information on the new ScyllaDB Long-Term Support (LTS) policy is available here.

The ScyllaDB Enterprise 2022.2 release is based on ScyllaDB Open Source 5.1 and introduces partition-level rate limits, Alternator TTL, and many more functional, performance, and stability improvements.

ScyllaDB Enterprise customers are encouraged to upgrade to ScyllaDB Enterprise 2022.2, and are welcome to contact our Support Team with questions.

New Features

Alternator TTL

In Scylla 5.0 we introduced Time To Live (TTL) in the Amazon DynamoDB compatible API (Alternator) as an experimental feature. In ScyllaDB Enterprise 2022.2 it is promoted to production ready.

As in DynamoDB, Alternator items that are set to expire at a specific time will not disappear precisely at that time but only after some delay. DynamoDB guarantees that the expiration delay will be less than 48 hours (though for small tables, the delay is often much shorter). In Alternator, the expiration delay is configurable - it defaults to 24 hours but can be set with the --alternator-ttl-period-in-seconds configuration option.
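Below is a minimal sketch of enabling TTL on an Alternator table with Python and boto3. The endpoint address, port, table name, and attribute name are illustrative assumptions; Alternator accepts the same UpdateTimeToLive request as DynamoDB.

import boto3

# Point a regular DynamoDB client at an Alternator node (hypothetical address and port).
dynamodb = boto3.client(
    "dynamodb",
    endpoint_url="http://scylla-node-1:8000",
    region_name="us-east-1",            # required by boto3, not meaningful for Alternator
    aws_access_key_id="alternator",     # placeholder credentials
    aws_secret_access_key="secret",
)

# Expire items based on the numeric 'expiration' attribute (a Unix timestamp in seconds).
dynamodb.update_time_to_live(
    TableName="events",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expiration"},
)

Items whose expiration attribute is in the past are then removed in the background, within the expiration delay described above.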

Additional updates to Alternator, ScyllaDB’s implementation of the DynamoDB API, in this release:

  • Improved BatchGetItem performance by grouping requests to the same partition. #10757

Limit partition access rate

It is now possible to limit read and write rates to a partition with a new WITH per_partition_rate_limit clause for the CREATE TABLE and ALTER TABLE statements. This is useful to prevent hot-partition problems when high-rate reads or writes are bogus (for example, arriving from spam bots). #4703

Limits are configured separately for reads and writes. Some examples:

ALTER TABLE t WITH per_partition_rate_limit = {
    'max_reads_per_second': 100,
    'max_writes_per_second': 200
};

Limit reads only, no limit for writes:

ALTER TABLE t WITH per_partition_rate_limit = {
    'max_reads_per_second': 200
};

Load and stream

This feature extends nodetool refresh to allow loading arbitrary sstables that are not owned by a particular node into the cluster. It loads the sstables from the disk, calculates the data’s owning nodes, and automatically streams the data to the owning nodes. In particular, this is useful when restoring a cluster from a backup to a new cluster with a different number of nodes.

One can copy the sstables from the old cluster to the new nodes and trigger the load and stream process.

This can make restores and migrations much easier:

  • You can place sstables from any node of the old cluster on any node of the new cluster
  • No need to run nodetool cleanup to remove data that does not belong to the local node after refresh

The load_and_stream option also updates the relevant Materialized Views. #9205

curl -X POST "http://{local-ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true"

Note that there is an open bug, #282, for the nodetool refresh --load-and-stream operation. Until it is fixed, use the REST API above.

Materialized Views: Prune

A new CQL extension, the PRUNE MATERIALIZED VIEW statement, can now be used to remove inconsistent rows, known as ghost rows, from materialized views.

A ghost row is an inconsistency that manifests itself as a row in a materialized view that does not correspond to any base table row. Such inconsistencies should be prevented altogether, and ScyllaDB strives to avoid them, but if they do happen, this statement can be used to restore a materialized view to a fully consistent state without rebuilding it from scratch.

Example usages:

PRUNE MATERIALIZED VIEW my_view;

PRUNE MATERIALIZED VIEW my_view WHERE token(v) > 7 AND token(v) < 1535250;

PRUNE MATERIALIZED VIEW my_view WHERE v = 19;

Performance: Eliminate exceptions from the read and write path

When a coordinator times out, it generates an exception which is then caught in a higher layer and converted to a protocol message. Since exceptions are slow, this can make a node that experiences timeouts even slower. To prevent that, the coordinator write and read paths have been converted not to use exceptions for timeout cases, treating them as another kind of result value instead. Further work on the read path and on the replicas reduces the timeout cost, so goodput is preserved while a node is overloaded.

Deprecated Features

The following features are deprecated and will not be available in future Enterprise releases:

  • In-Memory Tables - an enterprise-only feature
  • Thrift API - legacy ScyllaDB (and Apache Cassandra) API. Thrift has been disabled by default since Scylla Enterprise 2022.1.0
  • Compact Storage - a legacy table format used by Thrift, deprecated in Apache Cassandra.

Please contact Scylla Support for advice if you use any of these features.

Updates in this Release

Deployment and Packaging

  • ScyllaDB now has an Azure snitch for inferring rack/datacenter from the instance metadata.
  • The docker image has received some fixes for regressions: the locale is now set correctly, the supervisord service name has been restored for compatibility with scylla-operator, and the service runs as root to avoid interfering with permissions outside the container. #10310 #10269 #10261

CQL API updates

A list of CQL bug fixes and extensions:

  • #10178

  • The LIKE operator on descending order clustering keys now works. #10183

  • ScyllaDB would incorrectly use an index with some IN queries, leading to incorrect results. This is now fixed.

  • In CREATE AGGREGATE statements, the INITCOND and FINALFUNC clauses are now optional (defaulting to NULL and the identity function respectively).

  • CREATE KEYSPACE now has a WITH STORAGE clause, allowing customization of where data is stored. For now, this is only a placeholder for future extensions.

  • When talking to drivers using the older v3 protocol, ScyllaDB did not serialize timeout exceptions correctly, resulting in the driver complaining about protocol violations. This is now fixed. #5610

  • The CQL grammar was relaxed to allow bind markers in collection literals, e.g. UPDATE tab SET my_set = { ?, 'foobar', :variable }.

  • ScyllaDB now validates collections for NULLs more carefully. #10580

After this change, a query like

INSERT INTO ks.t (list_column) VALUES (?);

with the driver binding a list that contains a NULL, for example [1, 2, null, 4], results in an invalid_request_exception instead of an obscure marshaling error.
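A minimal sketch of this behavior, using the Python cassandra-driver; the contact point and the schema of ks.t (assumed here to be pk int PRIMARY KEY, list_column list<int>) are illustrative assumptions:

from cassandra.cluster import Cluster
from cassandra import InvalidRequest

session = Cluster(["127.0.0.1"]).connect()
insert = session.prepare("INSERT INTO ks.t (pk, list_column) VALUES (?, ?)")
try:
    # A NULL element inside the bound collection is now rejected up front...
    session.execute(insert, (1, [1, 2, None, 4]))
except InvalidRequest as exc:
    # ...with a clear invalid_request_exception rather than a marshaling error
    # (some drivers may also reject the NULL element client-side before sending).
    print("rejected:", exc)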

  • Support for lists containing NULLs in IN relations (“WHERE x IN (1, 2, NULL, 3)”) has been removed. This is useless (since NULL doesn’t match anything) and conflicts with how lists work elsewhere.
  • WHERE clause processing has been relaxed to allow filtering on a list element: “WHERE my_list[:index] = :value ALLOW FILTERING”. Previously this was only allowed on maps.
  • Lightweight transaction behavior with respect to static rows was adjusted to conform to Cassandra’s behavior.
  • If a named bind variable appeared twice in a statement (“WHERE a = :my_var AND b = :my_var”), ScyllaDB treated this as two separate variables. This is now fixed. #10810
  • The token() built-in function will now correctly return NULL if given NULL inputs, instead of failing the request. #10594

Stability and Performance Improvements

  • If a file page was inserted into cache, but the insertion required a memory allocation, it could cause memory corruption. This is now fixed. File caching is part of the index caching feature introduced in Scylla 4.6. #9915
  • A recent commitlog regression has been fixed: we might have created two commitlog segment files in parallel, confusing the queue that holds them. The problem was only present in Scylla 4.6. #9896
  • When shutting down, compaction tasks that are sleeping while awaiting a retry are now aborted, so the shutdown is not delayed. #10112
  • The compaction manager’s definition of compaction backlog for size-tiered compaction strategy has changed, reducing write amplification. The compaction backlog is used to determine how much resources will be devoted for compaction. See the results graphed here.
  • An accounting bug in sstable index caching that could lead to running out of memory has been fixed. #10056
  • ScyllaDB will shut down gracefully (without a core dump) if it encounters a permission or space problem on startup. #9573
  • sstables are created in a temporary directory to avoid serializing on an XFS lock. If the sstable writing failed, this directory would be left behind. It is now deleted. #9522
  • When using the spark migrator, ScyllaDB might see TTL’ed cells which it thought needed repair, but could not actually find a difference in, leading to repair not resolving the difference and detecting it again on the next run. This has been fixed. #10156
  • Prepared batch statements are now correctly invalidated when a referenced table changes. #10129
  • A race in the prepared statement cache could cause invalidations (due to a schema change) to be ignored. This is now fixed. #10117
  • When populating the row cache with data from sstables, we will first compact the data in order to avoid populating the cache with data that will be later ignored. #3568
  • When reading, we now notice if data exists in sstables/cache but not in memtables, or vice-versa. In either case there is no need to perform a merge, yielding substantial performance improvements.
  • Truncate now disables compaction temporarily on the table and its materialized views, in order to complete faster.
  • Cleanup compactions, used after bootstrapping a new node to reclaim space in the original nodes, now have reduced write amplification.
  • SSTable index file reads now use read-ahead. This improves performance in workloads that frequently skip over small parts of the partition (e.g. full scans with clustering key restrictions).
  • When ScyllaDB generates a name for a secondary index, it will avoid using special characters (if they were present in the table name). #3403
  • Cleanup compactions, used to discard data that was moved to a new node, now have reduced write amplification on Time Window Compaction Strategy tables
  • Reads from cache are now upgraded to use the new range tombstone representation. This completes the conversion of the read pipeline, and nets a nice performance improvement as detailed in the commit message.
  • A concurrent DROP TABLE while streaming data to another node is now tolerated. Previously, streaming would fail, requiring a restart of the node add or decommission operation. #10395
  • A crash where a map subscript was NULL in certain CQL expressions was fixed. Fixes #10361 #10399 #10401
  • A regression that prevented partially-written sstables from being deleted was fixed.
  • A race condition that allowed queries to be processed after a table was dropped (causing a crash) was fixed. #10450
  • The prepared statement cache was recently split into two sections, one for reused statements and one for single-use statements. This was done for flood protection - so that a bunch of single-use statements won’t evict reused statements from the cache. However, this created a regression when the size of the single-use section was shrunk so that it was too small for statements to promote into the reused section. This is now fixed by maintaining a minimum size for each section. #10440
  • Level selection for Leveled Compaction Strategy was improved, reducing write amplification.
  • Reconciliation is the process that happens when two replicas return non-identical results for a query. Some reactor stalls were removed, reducing latency effects on concurrent queries. #2361 #10038
  • A crash in some cases where an sstable index cursor was at the end of the file was fixed. #10403
  • A compaction job that is waiting in queue is now aborted immediately, rather than waiting to start and then getting aborted.
  • Repair-based node operations use repair to move data between nodes for bootstrap/decommission and similar operations (currently enabled by default only for replacenode). The iteration order has been changed from an outer iteration on vnodes and an inner iteration on tables to an outer iteration on tables and an inner iteration on vnodes, allowing tables to be completed earlier. This in turn allows compaction to reduce the number of sstables earlier, reducing the risk of having too many sstables open.
  • A “promoted index” is the intra-partition index that allows seeking within a partition using the clustering key. Due to a quirk in the sstable index file format, this has to be held in memory when it is being created. As a result, huge partitions risk running out of memory. ScyllaDB will now automatically downscale the promoted index to protect itself from running out of memory.
  • Change Data Capture (CDC) tables are no longer removed when CDC is disabled, to allow the still-queued data to be drained. #10489
  • Until now, a deletion (tombstone) that hit data in memtable or cache did not remove the data from memtable or cache; instead both the data and the tombstone coexisted (with the data getting removed during reads or memtable flush). This was changed to eagerly apply tombstones to memtable/cache data. This reduces write amplification for delete-intensive workloads (including the internal Raft log). #652
  • Recently, repair was changed to complete one table before starting the next (instead of repairing by vnodes first). We now perform off-strategy compaction immediately after a table was completed. This reshapes the sstables received from different nodes and reduces the number of sstables in the node.
  • The Leveled Compaction Strategy was made less aggressive. #10583
  • Compaction now updates the compaction history table in the background, so if the compaction history table is slow, compaction throughput is not affected.
  • Memtable flushes will now be blocked if flushes generate sstables faster than compaction can clear them. This prevents running out of memory during reads. This reduces problems with frequent schema updates, as schema updates cause memtable flushes for the schema tables. #4116
  • A recent regression involving a lightweight transaction conditional on a list element was fixed. #10821
  • A race condition between the failure detector and gossip startup was fixed.
  • Node startup for large clusters was sped up in the common case where there are no nodes in the process of leaving the cluster.
  • An unnecessary copy was removed from the memtable/cache read path, leading to a nice speedup.
  • A Seastar update reduces the node-to-node RPC latency.
  • Internal materialized view reads incorrectly read static columns, confusing the following code and causing it to crash. This is now fixed. #10851
  • Staging sstables are used when materialized views need to be built after token range is moved to a different node, or after repair. They are now compacted regularly, helping control read and space amplification.
  • Dropping a keyspace while a repair is running used to kill the repair; now this is handled more gracefully.
  • A large amount of range tombstones in the cache could cause a reactor stall; this is now fixed.
  • Adding range tombstones to the cache or a memtable could cause quadratic complexity; this is also fixed.
  • Under certain rare and complicated conditions, memtables could miss a range tombstone. This could result in temporary data resurrection. This is now fixed. Fixes #10913 #10830
  • Previously, a schema update caused a flush of all memtables containing schema information (i.e. in the system_schema keyspace). This made schema updates (e.g. ALTER TABLE) quite slow. This was because we could not ensure that commitlog replay of the schema update would come before the commitlog replay of changes that depend on it. Now, however, we have a separate commitlog domain that can be replayed before regular data mutations, and so we no longer flush schema update mutations, speeding up schema updates considerably. #10897
  • Gossip convergence time in large clusters has been improved by disregarding frequently changing state that is not important to cluster topology - cache hit rate and view backlog statistics.
  • Reads and writes no longer use C++ exceptions to communicate timeouts. Since C++ exceptions are very slow, this reduces CPU consumption and allows an overloaded node to retain throughput for requests that do succeed (“goodput”).
  • Major compaction will now happen using the maintenance scheduling group (so its CPU and I/O consumption will be limited), and regular compaction of sstables created since it was started will be allowed. This will reduce read amplification while a major compaction is ongoing. #10961
  • The Seastar I/O scheduler was adjusted to allow higher latency on slower disks, in order to avoid a crash on startup or just slow throughput. #10927
  • ScyllaDB will now clean up the table directory skeleton when a table is dropped. #10896
  • The repair watchdog interval has been increased, to reduce false failures on large clusters.
  • ScyllaDB propagates cache hit rate information through gossip, to allow coordinators to send less traffic to newly started nodes. It will now spend less effort to do so on large clusters. #5971
  • Improvements to token calculation mean that large cluster bootstrap is much faster.
  • An automatically parallelized aggregation query now limits the number of token ranges it sends to a remote node in a single request, in order to reduce large allocations. #10725
  • Accidentally quadratic behavior when a large number of range tombstones is present in a partition has been fixed. #11211
  • Fixed an issue where the row cache would miss a row if the upper bound of a population range was evicted and had an adjacent dummy row. #11239

Tooling

  • scylla-api-cli is a lightweight command-line tool for interfacing with the ScyllaDB REST API.

The tool can be used to list the different API functions and their parameters, and to print detailed help for each function.

Then, when invoking any function, scylla-api-cli performs basic validation on the function arguments and prints the result to the standard output. Note that JSON results may be pretty-printed using commonly available command line utilities. It is recommended to use scylla-api-cli for interactive usage of the REST API, rather than plain HTTP tools such as curl, to prevent human errors.

  • The sstable utilities now emit JSON output. See example output here.
  • There are two new sstable tools, validate-checksums and decompress, allowing more offline inspection options for sstables.
    • scylla-sstable validate-checksums: helps identify whether an sstable is intact by checking the digest and the per-chunk checksums against the data on disk.
    • scylla-sstable decompress: helps when one wants to manually examine the content of a compressed sstable.
  • The SSTableLoader code base has been updated to support “me” format sstables.
  • The sstable parsing tools usually need the schema to interpret an sstable’s data. For the special case of system tables, the tools can now use well-known schemas.
  • Nodetool was updated to fix IPv6 related errors (even when IPv4 is used) with updated JVMs. #10442
  • Cassandra-derived tooling such as cqlsh and cassandra-stress was synchronized with Cassandra 3.11.3.
  • The bundled Prometheus node_exporter, used to report OS-level metrics to the ScyllaDB Monitoring Stack, was upgraded to version 1.3.1.
  • Repairs that were in their preparation stage previously could not be aborted. This is now fixed.
  • ScyllaDB documentation has been moved from the scylla-docs.git repository to scylla.git. This will allow us to provide versioned documentation.
  • The sstable tools gained a write operation that can convert a JSON dump of an sstable back into an sstable.

Storage

  • “me” format sstables are now supported and are the default format.
  • ScyllaDB will now store the ScyllaDB version and build-id used to generate an sstable. This is helpful in tracking down bugs that altered persisted data.

Configuration

It is now possible to limit, and control in real time, the bandwidth of streaming and compaction.

These and more configuration updates are listed below:

  • It is now possible to limit I/O for repair and streaming to a user-defined bandwidth limit, using the new stream_io_throughput_mb_per_sec config value. The value throttles streaming I/O to the specified total throughput (in MiB/s) across the entire system. Streaming I/O includes that performed by repair and by both RBNO and legacy topology operations such as adding or removing a node. Setting the value to 0 disables stream throttling (the default). The value can be updated in real time via the config virtual table or via configuration file hot-reload (see the sketch after this list). It is recommended not to change this configuration from its default value, which dynamically determines the best bandwidth to use.
  • compaction_throughput_mb_per_sec: Throttles compaction to the specified total throughput across the entire system. The faster you insert data, the faster you need to compact in order to keep the sstable count down. The recommended value is 16 to 32 times the rate of write throughput (in MB/s). Setting the value to 0 disables compaction throttling. It is recommended not to change this configuration from its default value, which dynamically determines the best bandwidth to use.
  • It is now possible to disable updates to node configuration via the configuration virtual table. This is aimed at ScyllaDB Cloud, where users have access to CQL but not the node configuration. #9976
  • EC2MultiRegionSnitch will now honor broadcast_rpc_address if set in the configuration file. #10236
  • The permissions cache configuration is now live-updatable (via SIGHUP); and there is now an API to clear the authorization cache.
  • The compaction_static_shares and memtable_flush_static_shares configuration items, used to override the controllers, can now be updated without restarting the server.
  • column_index_auto_scale_threshold_in_kb was added to the configuration (defaults to 10 MB). When the promoted index (serialized) size reaches this threshold, it is halved by merging each two adjacent blocks into one and doubling the desired_block_size.
  • commitlog_flush_threshold_in_mb: Threshold for commitlog disk usage. When used disk space goes above this value, ScyllaDB initiates flushes of memtables to disk for the oldest commitlog segments, removing those log segments. Adjusting this affects disk usage vs. write latency.
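As mentioned above, live-updatable options such as stream_io_throughput_mb_per_sec can be changed at runtime through the configuration virtual table. A minimal sketch using the Python cassandra-driver, assuming the virtual table is system.config and the node is reachable at the placeholder address below:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Cap streaming/repair I/O at 100 MiB/s on this node; setting the value
# back to '0' restores the default (no throttling).
session.execute(
    "UPDATE system.config SET value = '100' "
    "WHERE name = 'stream_io_throughput_mb_per_sec'"
)

The same change can be made persistent by setting the option in scylla.yaml and using configuration file hot-reload.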

Monitoring and Tracing

Scylla Monitoring Stack 4.1 and later include a dashboard for ScyllaDB Enterprise 2022.2.

Below is a list of monitoring and tracing related work in this release:

  • Slow query tracing only considered local time (the time from when a request first hit the replica) to determine if a request needs to be traced. This could cause some parts of slow query tracing to be missed. To fix that, slow queries on the replicas are now determined using the start time on the coordinator.
  • The system.large_partitions and similar system tables will now hold only the base name of the sstable, not the full path. This avoids confusion if the large partition is reported while the sstable is in one directory but later moved to another, for example from staging to the main directory after view building is done, or into the quarantine subdirectory if scrub finds it to be inconsistent.
  • #10075
  • There are now metrics showing each node’s idea of how many live nodes and how many unreachable nodes there are. This aids understanding problems where failure detection is not symmetric. #10102
  • The system.clients table has been virtualized. This is a refactoring with no UX impact.
  • Aggregated queries that use an index are now properly traced.
  • The number of per-table metrics has been reduced by sending metric summaries instead of histograms and not sending unused metrics.

Bug Fixes

For a full list of fixed issues, see the git log and the ScyllaDB Open Source 5.1 release candidate notes.