ScyllaDB Enterprise Release 2024.2.0 - More Improvements

See ScyllaDB Enterprise Release 2024.2.0 for the main release features.

Improvements

The following is a list of improvements and bug fixes included in the release, grouped by domain.

Bloom Filters

Bloom filters are used to determine which SSTables do not contain a partition key, speeding up reads when SSTables can be filtered out.

Since Bloom filters are held in memory, and their size depends on the data (the smaller the partitions, the more partitions per byte of data, and the larger the Bloom filter), there is a tradeoff between allocated memory, with its attendant risk of running out of memory, and filter efficiency. The following improvements were made to Bloom filters in this release:

  • To generate efficient Bloom filters, we estimate the number of partitions in the SSTable we are about to produce. The estimation has been improved for data models where the partition keys dominate the on-disk size. #15726
  • When writing an SSTable, we estimate how many partitions it will have in order to size the Bloom filter correctly. A few bugs in this estimation were corrected for the Time Window Compaction Strategy. #15704
  • For stability, we drop Bloom filters from memory if their total memory consumption is above the configured limit. #17747
  • Also for stability, a reclaimed Bloom filter is no longer left behind on disk after its SSTable is deleted. #18398
  • ScyllaDB drops Bloom filters when they use up too much memory. It will now reload them when memory becomes available again (for example, due to compaction). #18186
  • ScyllaDB estimates a compaction’s partition count in order to correctly size the Bloom filter. The estimate has been improved for SSTables produced by tombstone garbage collection. #18283
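
For background, these estimates matter because a Bloom filter is sized from the expected number of partitions. By the textbook sizing formula (ScyllaDB's exact tuning may differ), for n partitions and a target false-positive probability p, the optimal filter size in bits m and number of hash functions k are:

    m = -\frac{n \ln p}{(\ln 2)^2}, \qquad k = \frac{m}{n} \ln 2

Underestimating n yields a filter with a higher false-positive rate (wasted reads of SSTables that do not contain the key), while overestimating n wastes memory.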

Stability and performance

Compaction Related

  • Compaction will now avoid garbage-collecting tombstones that potentially delete data in commitlog. This prevents data resurrection in the event that a node crashes and replays commitlog. This is rare since generally commitlog data is relatively fresh and tombstones that delete such data would not be garbage collected for other reasons. #14870
  • Major compaction will now also merge SSTables that were streamed in, e.g. by repair or repair-based node operations. Such SSTables were previously, and wrongly, ignored by major compaction. #11915
  • Stability: memory is saved by making regular compaction tasks internal. #16735
  • A possible crash in the REST API when stopping compaction on a keyspace was fixed. #16975
  • During shutdown, we now wait for compactions to complete before shutting down system tables, to avoid updates to system.compaction_history racing with the shutdown. #15721
  • A potential data resurrection problem when cleanup was performed as a side effect of regular compaction was fixed. #17501 #17452
  • During compaction tombstone garbage collection, we now consider the memtable only if it contains the key. This prevents old data in the memtable from blocking tombstone garbage collection. #17599

Commitlog Related

  • Each node contains a local system.truncated table containing truncation records for other tables on the node. This table is now cleared after commitlog replay to avoid it being re-interpreted incorrectly in the case of two consecutive commitlog replays. #15354
  • Stability: a tombstone might not have been garbage collected due to conflicts with data in the commitlog. This is now fixed. #15777
  • Commitlog: During commitlog replay, we skip over corrupted sections. However, if a corrupted section also has a corrupt size, it could lead to a crash. This is now fixed. #15269
  • Commitlog will now avoid exceeding its configured disk space limit. To do so, it will flush memtables earlier. #10974
  • Commitlog format has changed to individually checksum each disk sector. This allows discrimination between three types of sectors: those containing valid data, those corrupted, and those that contain data from a previous use of the commitlog segment file. This should reduce false-positive warnings about commitlog corruptions during replay. Note the format is not compatible with previous versions, so upgrades should flush memtables first. #15494
  • ScyllaDB can store changes to its own metadata using a separate commitlog, to prevent it from waiting for user data to commit. This separate commitlog is now mandatory. #16254
  • When recovering after a crash, we skip commitlog replay for data that we know was captured in sstables. However, we must ignore sstables that were not created on this node, as the commitlog positions they refer to are invalid on this node. #10080
  • When replaying commitlog mutations after a crash, we now ignore mutations to tablets that were migrated away from this node, preventing data resurrection. #16752

Cluster Operation Related

Topology changes, Repairs, etc

  • To fix an edge case, a node joining the Raft group now first waits for the first node to enter the NORMAL state. #15807
  • Topology: An edge case where a joining node is rejected from the cluster, but the rejection message times out, has been fixed. #15865
  • A cleanup operation will now flush memtables first, so that data remaining in memtables does not escape cleanup. #4734
  • A protocol change that caused streaming in a mixed-version cluster to fail has been rectified. #17175
  • A bug in repair that could have caused problems with mixed-version clusters has been fixed. #16941
  • Streaming now has additional protection for the case where the streamed table is dropped during the process. #15370
  • CDC: Each time the topology changes, a new generation is created with a new set of CDC streams for applications to listen to. We now allow writes to the previous generation’s streams, to accommodate clients with unsynchronized clocks, or with lightweight transactions (LWT), which can have the same effect as unsynchronized clocks. This also works with consistent topology. #7251 #15260
  • A hang when a decommissioned node is restarted was fixed; the operation now correctly fails. #17282
  • Repair failures when a table is dropped during repair were fixed. #17028 #15370 #15598
  • A deadlock in repair has been fixed. #17591
  • A use-after-free error that manifested during repair with very large partition keys is now fixed. #17625
  • Stability: repair memory accounting now includes keys. #16710
  • Service levels on Raft: an upgrade might get stuck if nodes were removed recently. This is now fixed. #18198

Materialized Views Related

  • In some cases, a materialized view table’s schema could be constructed when the base table’s schema was not yet known. We now avoid this illegal state. #14011
  • ScyllaDB invalidates materialized view prepared statements when the schema of the base table changes. This prevents a stale prepared statement from returning incorrect results. #16392
  • Materialized view memory accounting is now more accurate. This reduces the risk of running out of memory in materialized view intensive workloads. #17364
  • A rare failure when a materialized view update happened to be empty was fixed. #15228 #17549
  • MV: Tracking of memory within materialized view updates was improved. #17854
  • MV: A crash when a view was created while a TRUNCATE operation was in progress was fixed. #17543
  • MV: Materialized views track the amount of work in progress in order to limit it. Due to a bug, if the base replica had to update both a local view replica and a remote view replica, then only the work to update the local view replica was tracked. This could lead to running out of memory, and is now fixed. #17890
  • We now fail base table writes rather than dropping materialized view updates, to reduce base/view inconsistencies. #17795
  • MV: Deleting all partitions from a table with a few materialized views could fail with “bad_alloc” and shut down the cluster. This is now fixed. #12379
  • MV: ScyllaDB disseminates the materialized view update backlog in order to control update rates. A bug that prevented this in some circumstances was fixed. #18462

Performance Related

  • The automatic creation of internal keyspaces and tables has been streamlined, resulting in faster start-up. #15437
  • The start stop native transport API (used by nodetool enablebinary) mistakenly launched the listener in the streaming scheduling group, causing subsequent queries to run with reduced priority compared to maintenance operations such as repair and bootstrap. The listener is now correctly launched in the statement scheduling group. #15485
  • Node startup will now recalculate the schema digest fewer times during restart, resulting in faster startups. #16112
  • Repair has gained a new mode for small tables that makes repair significantly faster. #16011
  • ScyllaDB detects internal stalls using a stall detector. It can do so using a timer, or using a hardware performance counter, which can also track kernel stalls. It now sets up permissions for itself to use the performance counter. #15743
  • The gossip-based schema propagation code uses a hash of the schema metadata to check if nodes have the same schema, since changes to the schema can happen in different nodes independently. This schema calculation can be time consuming in a cluster with thousands of tables. In consistent schema mode, therefore, we no longer compute a hash of the entire schema but instead generate a schema version using a timeuuid. This speeds up operations on clusters with many tables. #7620
  • The rewritesstables command will now execute in the streaming/maintenance group, reducing its impact on the rest of the system. #16699
  • The system will now scan sstable files during startup in a way that prevents fragmentation of the kernel inode and dentry caches. This helps reduce memory pressure on systems with many sstables. #14506
  • In tables with many small partitions (or many partition tombstones), sstable index pages can contain many entries. They are now destroyed gently to avoid stalls. #17605
  • Performance: A reactor stall when materializing very large schemas has been fixed. #17841
  • ScyllaDB could start very slowly, spending a long time loading repair history. This is now fixed. #16774 #17993

Edge cases

  • A bug in which a read retry after an allocation failure caused a crash was fixed. #15578
  • Streaming now checks more closely if a table was dropped while it was streamed. #15370
  • ScyllaDB uses two separate memory reservation systems for memtables: user, used for user writes, and system, used for ScyllaDB’s own writes. This ensures that a heavy write workload does not impact ScyllaDB’s internal housekeeping. The Raft table was moved to the system memory reservation. #15622
  • A number of rare bugs in the row cache were fixed. #15483
  • Stability: A regression in IPv6 address formatting caused nodetool problems, such as breaking when there is an Alternator GSI in the database (#16153), or causing a node to be stuck with “?U” status and a null Host ID (#16039). This is now fixed.
  • A crash when a table is dropped concurrently with streaming was fixed. #16181
  • The fencing mechanism prevents a read or write from accessing an outdated replica. The mechanism is now disabled on local tables, as these can never be out of date. #16423
  • Internal updates to system.peers ensure the host_id column is filled in correctly; this is required for Raft to operate correctly. Note all released versions already fill in this column. #16376
  • Queries to a local table will no longer be automatically parallelized, to avoid shutdown problems. Local tables typically have little data and don’t benefit from parallelization. #16570
  • The hint manager is now started later in the boot process, once we have better information about other nodes in the cluster. #11870
  • A recent regression in calculating whether we reached the in-flight hint limit was fixed. #17636
  • The hinted handoff code now uses host IDs instead of IP addresses. This conforms with consistent topology, which uses host IDs throughout. #12278
  • Change Data Capture (CDC) maintains a history of the cluster topology to allow clients to fetch older change records. This history is now pruned earlier to reclaim space. #18497
  • The large data handler now takes range tombstones into account. #13968
  • Reader concurrency semaphore: saved readers could refer to dangling tablets if a tablet was migrated away. This is now fixed. #18110

Guardrails

Guardrails is a framework to protect ScyllaDB users and admins from common mistakes and pitfalls. In this release the following Guardrails are added:

  • Replication strategy guardrails can be used, for example, to deny the SimpleStrategy replication strategy, which is not recommended for production.

Two new configuration options, replication_strategy_warn_list and replication_strategy_fail_list, replace the old restrict_replication_simplestrategy option and give the DB admin more granularity to either warn about, or block, non-production replication strategies.
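
As an illustration, a minimal sketch (assuming SimpleStrategy is included in the relevant list in scylla.yaml; demo_ks and dc1 are placeholder names, and exact messages may differ):

    -- With SimpleStrategy in replication_strategy_fail_list, this DDL is
    -- rejected; with it in replication_strategy_warn_list, it succeeds
    -- with a client warning instead.
    CREATE KEYSPACE demo_ks
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

    -- A production-grade alternative that passes the guardrail
    -- ('dc1' is a placeholder datacenter name):
    CREATE KEYSPACE demo_ks
        WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};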

CQL

  • The dclocal_read_repair_chance and read_repair_chance options have been removed. Both were removed from Apache Cassandra 3.11 and 4.x; see CASSANDRA-13910.
  • ScyllaDB now rejects adding fields of type duration to a user-defined type (UDT) if that UDT is used as a clustering key component. #12913
  • Correctness: a rare combination of reconciliation (read repair), reverse queries, and range tombstones could cause incorrect data to be returned from queries. This is now fixed. #10598
  • CQL: toJson() no longer produces invalid JSON for columns of the “time” type. #7988
  • CQL: a 5.4.0 regression in GROUP BY behavior has been fixed. #16531
  • CQL tracing now records two additional fields: the user who initiated the query, and the reader concurrency semaphore that is in charge of admitting the query on the replica.
  • There is a new CQL statement, LIST EFFECTIVE SERVICE LEVEL, to query which service level attributes apply to a role. A role’s service level attributes may be a mix of different SERVICE LEVEL configurations (see the example after this list). #15604
  • CQL multi-column comparisons now have better NULL checks. #13217
  • DROP TYPE IF EXISTS statement now works if the type’s keyspace doesn’t exist. #9082
  • Correctness: a bug in the row cache could cause a query to return incomplete data if, under moderate load, the query was preempted. This is now fixed. Note that QUORUM reads would usually see the missing data completed from the other replica, so the bug is mostly visible with consistency level ONE or similar. #16759
  • Correctness: A bug was fixed where a deletion in a base table that required a large number of materialized view rows to be updated could cause some of those updates to be missed. #17117
  • The CQL “date” type is now more resistant to overflow. #17066
  • The totimestamp() CQL function is now protected against undefined behavior when applied to extreme values. #17516
  • INSERT JSON now understands more data types when used as keys: blob, date, inet, time, timestamp, timeuuid, uuid. #18477
  • An edge case in converting JSON numbers close to the maximum integer has been fixed. #13077
  • A crash in the mintimeuuid() function has been fixed. #17035
  • Unpaged queries will no longer fail when they reach the tombstone limit. The tombstone limit ends a page in a paged query, but unpaged queries are allowed to scan tombstones until they time out or reach the memory limit. It is strongly recommended to use paged queries. #17241
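
As an example of the new LIST EFFECTIVE SERVICE LEVEL statement, a minimal sketch (sl_interactive and my_role are placeholder names, and the attributes shown depend on the configured service levels):

    -- Create a service level and attach it to a role:
    CREATE SERVICE LEVEL sl_interactive WITH timeout = 500ms;
    ATTACH SERVICE LEVEL sl_interactive TO my_role;

    -- Show the service level attributes that effectively apply to my_role,
    -- merged across all service levels attached to it and its granted roles:
    LIST EFFECTIVE SERVICE LEVEL OF my_role;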

Alternator

  • Alternator now responds correctly to the DeleteTable API. #14132
  • An ABBA-style deadlock was fixed for this scenario: if an Alternator table used strong write isolation and had a global secondary index, then index updates could wait on the base table update, and vice versa. The same deadlock could occur with materialized views on a table updated using lightweight transactions (LWT). #15844
  • Alternator now supports the ReturnValuesOnConditionCheckFailure feature. This makes handling contention more efficient. #14481
  • A correctness problem when using an Alternator Global Secondary Index with two key columns has been fixed. #17119

Images and packaging

  • The container image now avoids installing “suggested” packages, to reduce its size. #15579
  • Prometheus node_exporter, used to expose operating system metrics, was updated to version 1.7.0 to address a security vulnerability. #16085
  • Java has been removed from the Docker image. #18566

Tooling and REST API

  • A new asynchronous REST API endpoint for keyspace compaction was added: /tasks/compaction/keyspace_compaction/{keyspace}. It is similar to the existing synchronous storage_service API. #15092

  • Bundled tools now use the Seastar epoll reactor backend rather than linux-aio; this reduces the risk of startup failures.
  • The bundled cqlsh has been updated, with a fix for a COPY TO STDOUT regression. #17451
  • The REST API for reporting cache statistics is now more accurate. #9418
  • Some false-positives were eliminated from the scrub command. #16326
  • The iotune utility is used to measure a disk’s performance when installing ScyllaDB. It now works when executed on machines with a very large core count (208 cores). #18546

Configuration

  • There is now a mechanism to deprecate configuration options. #15887
  • The me sstable format is now mandatory. The system can still read older formats. #16551
  • The service level mechanism works by polling the internal tables used to represent it. The polling interval can now be configured. This is useful to speed up tests.
  • The scylla sstable tool now supports loading the schema of materialized views and indexes. #16492
  • The unchecked_tombstone_compaction compaction option has been implemented. Setting this option ignores tombstone_threshold (see the example after this list). #1487
  • The failure detector’s ping timeout is now configurable via ping_timeout, and has a higher default (600ms). #16607
  • Two new Encryption at Rest configuration parameters:
    • key_cache_expiry - the key cache expiry period, after which keys are re-requested from the server. Default: 600s.
    • key_cache_refresh - the key cache refresh period, i.e., how often the cache is checked for expired entries. Default: 1200s.
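
For example, the unchecked_tombstone_compaction option above can be set per table through the compaction options map, a minimal sketch (ks.tbl is a placeholder table; the option is shown here with Size Tiered Compaction Strategy but is not specific to it):

    -- Allow tombstone compaction to run without the tombstone_threshold check:
    ALTER TABLE ks.tbl WITH compaction = {
        'class': 'SizeTieredCompactionStrategy',
        'unchecked_tombstone_compaction': 'true'
    };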

Tracing and Auditing

  • Audit: The replication factor of the audit table was changed from 1 to 3 for better availability.
  • The large partition system tables now count range tombstones as regular rows, so that large partitions with mainly range tombstones are reported. #18346
  • Tracing: Large partition/row/cell warnings no longer print key contents. #18041

Monitoring

ScyllaDB Monitoring Stack 4.8.1 and later supports ScyllaDB 2024.2.

Deprecated and removed features

  • Probabilistic read repair has been removed (#3502). It had been deprecated for a long time. This mirrors CASSANDRA-13910.