ScyllaDB Enterprise Release 2024.2.0 - More Improvements

See ScyllaDB Enterprise Release 2024.2.0 for the main release features.

Improvements

The following is a list of improvements and bug fixes included in the release, grouped by domain.

Bloom Filters

Bloom filters are used to determine which SSTables do not contain a partition key, speeding up reads when SSTables can be filtered out.

Since Bloom filters are held in memory, and their size depends on the data (the smaller the partitions, the more partitions per byte of data, and the larger the Bloom filter), there is a tradeoff between allocated memory, with its attendant risk of running out of memory, and filter efficiency. The following improvements were made to Bloom filters in this release:

  • To generate efficient Bloom filters, we estimate the number of partitions in the SSTable we are about to produce. The estimation has been improved for data models where the partition keys dominate the on-disk size. #15726
  • When writing an SSTable, we estimate how many partitions it will have in order to size the Bloom filter correctly. A few bugs in this estimation were corrected for the Time Window Compaction Strategy. #15704
  • For stability, we drop Bloom filters from memory if their total memory consumption is above the configured limit. #17747
  • Also for stability, a reclaimed Bloom filter is no longer left behind on disk after its SSTable is deleted. #18398
  • ScyllaDB drops Bloom filters when they use up too much memory. It will now reload them when memory becomes available again (for example, due to compaction). #18186
  • ScyllaDB estimates a compaction’s partition count in order to correctly size the Bloom filter. The estimate has been improved for SSTables produced by tombstone garbage collection. #18283
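
For background, these estimates matter because a Bloom filter is sized from the expected number of partitions. By the textbook sizing formula (ScyllaDB's exact tuning may differ), for n partitions and a target false-positive probability p, the optimal filter size in bits m and number of hash functions k are:

    m = -\frac{n \ln p}{(\ln 2)^2}, \qquad k = \frac{m}{n} \ln 2

Underestimating n yields a filter with a higher false-positive rate (wasted reads of SSTables that do not contain the key), while overestimating n wastes memory.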

Stability and performance

Compaction Related

  • Compaction will now avoid garbage-collecting tombstones that potentially delete data in commitlog. This prevents data resurrection in the event that a node crashes and replays commitlog. This is rare since generally commitlog data is relatively fresh and tombstones that delete such data would not be garbage collected for other reasons. #14870
  • Major compaction will now also merge SSTables that were streamed in, e.g. by repair or repair-based node operations. Such SSTables were previously, and wrongly, ignored by major compaction. #11915
  • Stability: memory is saved by making regular compaction tasks internal. #16735
  • A possible crash in the REST API when stopping compaction on a keyspace was fixed. #16975
  • During shutdown, we now wait for compactions to complete before shutting down system tables, to avoid updates to system.compaction_history racing with the shutdown. #15721
  • A potential data resurrection problem when cleanup was performed as a side effect of regular compaction was fixed. #17501 #17452
  • During compaction tombstone garbage collection, we now consider the memtable only if it contains the key. This prevents old data in the memtable from blocking tombstone garbage collection. #17599

Commitlog Related

  • Each node contains a local system.truncated table containing truncation records for other tables on the node. This table is now cleared after commitlog replay to avoid it being re-interpreted incorrectly in the case of two consecutive commitlog replays. #15354
  • Stability: a tombstone might not have been garbage collected due to conflicts with data in the commitlog. This is now fixed. #15777
  • Commitlog: During commitlog replay, we skip over corrupted sections. However, if a corrupted section also has a corrupt size, it could lead to a crash. This is now fixed. #15269
  • Commitlog will now avoid exceeding its configured disk space limit. To do so, it will flush memtables earlier. #10974
  • Commitlog format has changed to individually checksum each disk sector. This allows discrimination between three types of sectors: those containing valid data, those corrupted, and those that contain data from a previous use of the commitlog segment file. This should reduce false-positive warnings about commitlog corruptions during replay. Note the format is not compatible with previous versions, so upgrades should flush memtables first. #15494
  • ScyllaDB can store changes to its own metadata using a separate commitlog, to prevent it from waiting for user data to commit. This separate commitlog is now mandatory. #16254
  • When recovering after a crash, we skip commitlog replay for data that we know was captured in sstables. However, we must ignore sstables that were not created on this node, as the commitlog positions they refer to are invalid on this node. #10080
  • When replaying commitlog mutations after a crash, we now ignore mutations to tablets that were migrated away from this node, preventing data resurrection. #16752

Cluster Operation Related

Topology changes, Repairs, etc

  • To fix an edge case, a node joining the Raft group now first waits for the first node to enter the NORMAL state. #15807
  • Topology: An edge case where a joining node is rejected from the cluster, but the rejection message times out, has been fixed. #15865
  • A cleanup operation will now flush memtables first, so that data remaining in memtables does not escape cleanup. #4734
  • A protocol change that caused streaming in a mixed-version cluster to fail has been rectified. #17175
  • A bug in repair that could have caused problems with mixed-version clusters has been fixed. #16941
  • Streaming now has additional protection for the case where the streamed table is dropped during the process. #15370
  • CDC: Each time the topology changes, a new generation is created with a new set of CDC streams for applications to listen to. We now allow writes to the previous generation’s streams, to accommodate clients with unsynchronized clocks, or with lightweight transactions (LWT), which can have the same effect as unsynchronized clocks. This also works with consistent topology. #7251 #15260
  • A hang when a decommissioned node is restarted was fixed; the operation now correctly fails. #17282
  • Repair failures when a table is dropped during repair were fixed. #17028 #15370 #15598
  • A deadlock in repair has been fixed. #17591
  • A use-after-free error that manifested during repair with very large partition keys is now fixed. #17625
  • Stability: repair memory accounting now includes keys. #16710
  • Service levels on Raft: an upgrade might get stuck if nodes were removed recently. This is now fixed. #18198

Materialized Views Related

  • In some cases, a materialized view table’s schema could be constructed when the base table’s schema was not yet known. We now avoid this illegal state. #14011
  • ScyllaDB invalidates materialized view prepared statements when the schema of the base table changes. This prevents a stale prepared statement from returning incorrect results. #16392
  • Materialized view memory accounting is now more accurate. This reduces the risk of running out of memory in materialized view intensive workloads. #17364
  • A rare failure when a materialized view update happened to be empty was fixed. #15228 #17549
  • MV: Tracking of memory within materialized view updates was improved. #17854
  • MV: A crash when a view was created while a TRUNCATE operation was in progress was fixed. #17543
  • MV: Materialized views track the amount of work in progress in order to limit it. Due to a bug, if the base replica had to update both a local view replica and a remote view replica, then only the work to update the local view replica was tracked. This could lead to running out of memory, and is now fixed. #17890
  • We now fail base table writes rather than dropping materialized view updates, to reduce base/view inconsistencies. #17795
  • MV: Deleting all partitions from a table with a few materialized views could fail with “bad_alloc” and shut down the cluster. This is now fixed. #12379
  • MV: ScyllaDB disseminates the materialized view update backlog in order to control update rates. A bug that prevented this in some circumstances was fixed. #18462

Performance Related

  • The automatic creation of internal keyspaces and tables has been streamlined, resulting in faster start-up. #15437
  • The start stop native transport API (used by nodetool enablebinary) mistakenly launched the listener in the streaming scheduling group, causing subsequent queries to run with reduced priority compared to maintenance operations such as repair and bootstrap. The listener is now correctly launched in the statement scheduling group. #15485
  • Node startup will now recalculate the schema digest fewer times during restart, resulting in faster startups. #16112
  • Repair has gained a new mode for small tables that makes repair significantly faster. #16011
  • ScyllaDB detects internal stalls using a stall detector. It can do so using a timer, or using a hardware performance counter, which can also track kernel stalls. It now sets up permissions for itself to use the performance counter. #15743
  • The gossip-based schema propagation code uses a hash of the schema metadata to check if nodes have the same schema, since changes to the schema can happen in different nodes independently. This schema calculation can be time consuming in a cluster with thousands of tables. In consistent schema mode, therefore, we no longer compute a hash of the entire schema but instead generate a schema version using a timeuuid. This speeds up operations on clusters with many tables. #7620
  • The rewritesstables command will now execute in the streaming/maintenance group, reducing its impact on the rest of the system. #16699
  • The system will now scan sstable files during startup in a way that prevents fragmentation of the kernel inode and dentry caches. This helps reduce memory pressure on systems with many sstables. #14506
  • In tables with many small partitions (or many partition tombstones), sstable index pages can contain many entries. They are now destroyed gently to avoid stalls. #17605
  • Performance: A reactor stall when materializing very large schemas has been fixed. #17841
  • ScyllaDB could start very slowly, spending a long time loading repair history. This is now fixed. #16774 #17993

Edge cases

  • A bug in which a read retry after an allocation failure caused a crash was fixed. #15578
  • Streaming now checks more closely if a table was dropped while it was streamed. #15370
  • ScyllaDB uses two separate memory reservation systems for memtables: user, used for user writes, and system, used for ScyllaDB’s own writes. This ensures that a heavy write workload does not impact ScyllaDB’s internal housekeeping. The Raft table was moved to the system memory reservation. #15622
  • A number of rare bugs in the row cache were fixed. #15483
  • Stability: A regression in IPv6 address formatting caused nodetool problems, such as breaking when there is an Alternator GSI in the database (#16153), or causing a node to be stuck with “?U” status and a null Host ID (#16039). This is now fixed.
  • A crash when a table is dropped concurrently with streaming was fixed. #16181
  • The fencing mechanism prevents a read or write from accessing an outdated replica. The mechanism is now disabled on local tables, as these can never be out of date. #16423
  • Internal updates to system.peers ensure the host_id column is filled in correctly; this is required for Raft to operate correctly. Note all released versions already fill in this column. #16376
  • Queries to a local table will no longer be automatically parallelized, to avoid shutdown problems. Local tables typically have little data and don’t benefit from parallelization. #16570
  • The hint manager is now started later in the boot process, once we have better information about other nodes in the cluster. #11870
  • A recent regression in calculating whether we reached the in-flight hint limit was fixed. #17636
  • The hinted handoff code now uses host IDs instead of IP addresses. This conforms with consistent topology, which uses host IDs throughout. #12278
  • Change Data Capture (CDC) maintains a history of the cluster topology to allow clients to fetch older change records. This history is now pruned earlier to reclaim space. #18497
  • The large data handler now takes range tombstones into account. #13968
  • Reader concurrency semaphore: saved readers could refer to dangling tablets if a tablet was migrated away. This is now fixed. #18110

Guardrails

Guardrails is a framework to protect ScyllaDB users and admins from common mistakes and pitfalls. In this release the following Guardrails are added:

  • Replication strategy guardrails can be used, for example, to deny the SimpleStrategy replication strategy, which is not recommended for production.

Two new configuration options, replication_strategy_warn_list and replication_strategy_fail_list, replace the old restrict_replication_simplestrategy option and give the DB admin more granularity to either warn about, or block, non-production replication strategies.
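
As an illustration, a minimal sketch (assuming SimpleStrategy is included in the relevant list in scylla.yaml; demo_ks and dc1 are placeholder names, and exact messages may differ):

    -- With SimpleStrategy in replication_strategy_fail_list, this DDL is
    -- rejected; with it in replication_strategy_warn_list, it succeeds
    -- with a client warning instead.
    CREATE KEYSPACE demo_ks
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

    -- A production-grade alternative that passes the guardrail
    -- ('dc1' is a placeholder datacenter name):
    CREATE KEYSPACE demo_ks
        WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};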

CQL

  • The dclocal_read_repair_chance and read_repair_chance options have been removed. Both were removed from Apache Cassandra 3.11 and 4.x; see CASSANDRA-13910.
  • ScyllaDB now rejects adding fields of type duration to a user-defined type (UDT) if that UDT is used as a clustering key component. #12913
  • Correctness: a rare combination of reconciliation (read repair), reverse queries, and range tombstones could cause incorrect data to be returned from queries. This is now fixed. #10598
  • CQL: toJson() no longer produces invalid JSON for columns of the “time” type. #7988
  • CQL: a 5.4.0 regression in GROUP BY behavior has been fixed. #16531
  • CQL tracing now records two additional fields: the user who initiated the query, and the reader concurrency semaphore that is in charge of admitting the query on the replica.
  • There is a new CQL statement, LIST EFFECTIVE SERVICE LEVEL, to query which service level attributes apply to a role. A role’s service level attributes may be a mix of different SERVICE LEVEL configurations (see the example after this list). #15604
  • CQL multi-column comparisons now have better NULL checks. #13217
  • DROP TYPE IF EXISTS statement now works if the type’s keyspace doesn’t exist. #9082
  • Correctness: a bug in the row cache could cause a query to return incomplete data if, under moderate load, the query was preempted. This is now fixed. Note that QUORUM reads would usually see the missing data completed from the other replica, so the bug is mostly visible with consistency level ONE or similar. #16759
  • Correctness: A bug was fixed where a deletion in a base table that required a large number of materialized view rows to be updated could cause some of those updates to be missed. #17117
  • The CQL “date” type is now more resistant to overflow. #17066
  • The totimestamp() CQL function is now protected against undefined behavior when applied to extreme values. #17516
  • INSERT JSON now understands more data types when used as keys: blob, date, inet, time, timestamp, timeuuid, uuid. #18477
  • An edge case in converting JSON numbers close to the maximum integer has been fixed. #13077
  • A crash in the mintimeuuid() function has been fixed. #17035
  • Unpaged queries will no longer fail when they reach the tombstone limit. The tombstone limit ends a page in a paged query, but unpaged queries are allowed to scan tombstones until they time out or reach the memory limit. It is strongly recommended to use paged queries. #17241
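
As an example of the new LIST EFFECTIVE SERVICE LEVEL statement, a minimal sketch (sl_interactive and my_role are placeholder names, and the attributes shown depend on the configured service levels):

    -- Create a service level and attach it to a role:
    CREATE SERVICE LEVEL sl_interactive WITH timeout = 500ms;
    ATTACH SERVICE LEVEL sl_interactive TO my_role;

    -- Show the service level attributes that effectively apply to my_role,
    -- merged across all service levels attached to it and its granted roles:
    LIST EFFECTIVE SERVICE LEVEL OF my_role;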

Alternator

  • Alternator now responds correctly to the DeleteTable API. #14132
  • An ABBA-style deadlock was fixed for this scenario: if an Alternator table used strong write isolation and had a global secondary index, then index updates could wait on the base table update, and vice versa. The same deadlock could occur with materialized views on a table updated using lightweight transactions (LWT). #15844
  • Alternator now supports the ReturnValuesOnConditionCheckFailure feature. This makes handling contention more efficient. #14481
  • A correctness problem when using an Alternator Global Secondary Index with two key columns has been fixed. #17119

Images and packaging

  • The container image now avoids installing “suggested” packages, to reduce its size. #15579
  • Prometheus node_exporter, used to expose operating system metrics, was updated to version 1.7.0 to address a security vulnerability. #16085
  • Java has been removed from the Docker image. #18566

Tooling and REST API

  • A new asynchronous REST API endpoint for keyspace compaction was added: /tasks/compaction/keyspace_compaction/{keyspace}. It is similar to the existing synchronous storage_service API. #15092

  • Bundled tools now use the Seastar epoll reactor backend rather than linux-aio; this reduces the risk of startup failures.
  • The bundled cqlsh has been updated, with a fix for a COPY TO STDOUT regression. #17451
  • The REST API for reporting cache statistics is now more accurate. #9418
  • Some false-positives were eliminated from the scrub command. #16326
  • The iotune utility is used to measure a disk’s performance when installing ScyllaDB. It now works when executed on machines with a very large core count (208 cores). #18546

Configuration

  • There is now a mechanism to deprecate configuration options. #15887
  • The me sstable format is now mandatory. The system can still read older formats. #16551
  • The service level mechanism works by polling the internal tables used to represent it. The polling interval can now be configured. This is useful to speed up tests.
  • The scylla sstable tool now supports loading the schema of materialized views and indexes. #16492
  • The unchecked_tombstone_compaction compaction option has been implemented. Setting this option ignores tombstone_threshold (see the example after this list). #1487
  • The failure detector’s ping timeout is now configurable via ping_timeout, and has a higher default (600ms). #16607
  • Two new Encryption at Rest configuration parameters:
    • key_cache_expiry - the key cache expiry period, after which keys are re-requested from the server. Default: 600s.
    • key_cache_refresh - the key cache refresh period, i.e., how often the cache is checked for expired entries. Default: 1200s.
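
For example, the unchecked_tombstone_compaction option above can be set per table through the compaction options map, a minimal sketch (ks.tbl is a placeholder table; the option is shown here with Size Tiered Compaction Strategy but is not specific to it):

    -- Allow tombstone compaction to run without the tombstone_threshold check:
    ALTER TABLE ks.tbl WITH compaction = {
        'class': 'SizeTieredCompactionStrategy',
        'unchecked_tombstone_compaction': 'true'
    };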

Tracing and Auditing

  • Audit: The replication factor of the audit table was changed from 1 to 3 for better availability.
  • The large partition system tables now count range tombstones as regular rows, so that large partitions with mainly range tombstones are reported. #18346
  • Tracing: Large partition/row/cell warnings no longer print key contents. #18041

Monitoring

ScyllaDB Monitoring Stack 4.8.1 and later supports ScyllaDB 2024.2.

Deprecated and removed features

  • Probabilistic read repair has been removed (#3502). It had been deprecated for a long time. This mirrors CASSANDRA-13910.