ScyllaDB Enterprise Release 2023.1.0 - Part 3

Part 1
Part 2

Tools

  • The sstable tools gained Lua scripting. This is an expert feature intended for offline analysis of sstables. #9679

  • The scylla-types tool can now compute the token and shard of a partition key, using the tokenof and shardof subcommands.

  • The bundled cqlsh now uses the ScyllaDB Python driver (rather than the generic Cassandra driver) and supports Scylla Cloud Serverless connection bundles.

  • The bundled cqlsh now considers system_distributed_everywhere a system keyspace.

  • The bundled scylla types tool can now serialize a value to the sstable binary format.

  • The scylla-api-client tool is now documented. The tool is suitable for interactive usage as well as shell automation of the REST API. #11999.

  • scylla-api-cli is a lightweight command line tool interfacing with the ScyllaDB REST API. The tool can be used to list the different API functions and their parameters, and to print detailed help for each function.

    When invoking any function, scylla-api-cli performs basic validation on the function arguments
    and prints the result to the standard output. Note that JSON results may be pretty-printed using
    commonly available command line utilities. It is recommended to use scylla-api-cli for interactive
    usage of the REST API over plain HTTP tools, like curl, to prevent human errors.

  • The sstable utilities now emit JSON output.

  • There are two new sstable tools, validate-checksums and decompress, allowing for more offline inspection options for sstables (see the example after this list).

    • scylla-sstable validate-checksums: helps identify whether an sstable is intact by checking the digest and the per-chunk checksums against the data on disk.
    • scylla-sstable decompress: helps when one wants to manually examine the content of a compressed sstable.
  • The SSTableLoader code base has been updated to support “me” format sstables.

  • The sstable parsing tools usually need the schema to interpret an sstable’s data. For the special case of system tables, the tools can now use well-known schemas.

  • Nodetool was updated to fix IPv6 related errors (even when IPv4 is used) with updated JVMs. #10442

  • Cassandra-derived tooling such as cqlsh and cassandra-stress was synchronized with Cassandra 3.11.3.

  • The bundled Prometheus node_exporter, used to report OS-level metrics to the ScyllaDB Monitoring Stack, was upgraded to version 1.3.1.

  • Previously, repairs that were in their preparation stage could not be aborted. This is now fixed.

  • ScyllaDB documentation has been moved from the scylla-docs.git repository to scylla.git. This will allow us to provide versioned documentation.

  • The sstable tools gained a write operation that can convert a json dump of an sstable back into an sstable.
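
To make the new offline sstable operations above more concrete, here is a minimal sketch of invoking validate-checksums and decompress. The operation names come from this list; the sstable path is a placeholder, and the exact invocation and schema-discovery options may differ, so consult scylla sstable --help on your installation:

    # Verify the digest and per-chunk checksums of an sstable against the data on disk
    # (placeholder path; point it at a real sstable Data file)
    scylla sstable validate-checksums /var/lib/scylla/data/ks/t-0a1b2c3d/me-7-big-Data.db

    # Write a decompressed copy of a compressed sstable for manual inspection
    scylla sstable decompress /var/lib/scylla/data/ks/t-0a1b2c3d/me-7-big-Data.db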

Storage

  • “me” format sstables are now supported (and the default format).
  • ScyllaDB will now store the ScyllaDB version and build-id used to generate an sstable. This is helpful in tracking down bugs that altered persisted data.

Configuration

It is now possible to limit, and control in real time, the bandwidth of streaming and compaction.

These and more configuration updates are listed below:

  • Audit is now disabled by default.
  • It is now possible to limit I/O for repair and streaming to a user-defined bandwidth limit, using the new stream_io_throughput_mb_per_sec config value. The value throttles streaming I/O to the specified total throughput (in MiB/s) across the entire system. Streaming I/O includes that performed by repair and by both RBNO and legacy topology operations, such as adding or removing a node. Setting the value to 0 disables stream throttling (the default). The value can be updated in real time via the config virtual table or via configuration file hot-reload (see the examples following this list). It is recommended not to change this configuration from its default value, which dynamically determines the best bandwidth to use.
  • compaction_throughput_mb_per_sec: Throttles compaction to the specified total throughput across the entire system. The faster you insert data, the faster you need to compact in order to keep the SSTable count down. The recommended value is 16 to 32 times the rate of write throughput (in MB/s). Setting the value to 0 disables compaction throttling (the default). It is recommended not to change this configuration from its default value, which dynamically determines the best bandwidth to use.
  • It is now possible to disable updates to node configuration via the configuration virtual table. This is aimed at ScyllaDB Cloud, where users have access to CQL but not the node configuration. #9976
  • EC2MultiRegionSnitch will now honor broadcast_rpc_address if set in the configuration file. #10236
  • The permissions cache configuration is now live-updatable (via SIGHUP); and there is now an API to clear the authorization cache.
  • The compaction_static_shares and memtable_flush_static_shares configuration items, used to override the controllers, can now be updated without restarting the server.
  • column_index_auto_scale_threshold_in_kb was added to the configuration (defaults to 10MB). When the promoted index (serialized) size reaches this threshold, it is halved by merging each two adjacent blocks into one and doubling the desired_block_size.
  • commitlog_flush_threshold_in_mb: Threshold for commitlog disk usage. When used disk space goes above this value, ScyllaDB initiates flushes of memtables to disk for the oldest commitlog segments, removing those log segments. Adjusting this affects disk usage vs. write latency.
  • The Cassandra tombstone_warn_threshold configuration item (default: 1000) - the maximum number of tombstones a query can scan before a warning - is now respected, producing a warning if a query scans more tombstones than the threshold.
  • Messaging will now prevent 0.0.0.0 and its IPv6 equivalent from being used as a node IP address.
  • New config parameters:
    • restrict_future_timestamp - Controls whether to detect and forbid unreasonable USING TIMESTAMP values, more than 3 days into the future. See Sanity check for USING TIMESTAMP above.
    • replace_node_first_boot - The Host ID of a dead node to replace. An alternative to the old replace_address_first_boot, which uses the dead node's address. See the replace node docs.
    • WASM (experimental feature) related configs:
      • wasm_cache_memory_fraction
      • wasm_cache_timeout_in_ms
      • wasm_cache_instance_size_limit
      • wasm_udf_yield_fuel
      • wasm_udf_total_fuel
      • wasm_udf_memory_limit
    • consistent_cluster_management - replaces the Raft experimental flag (see Raft above)
    • x_log2_compaction_groups - new config for setting a static number of compaction groups
    • unspooled_dirty_soft_limit - replaces the old virtual_dirty_soft_limit.
    • compaction_collection_elements_count_warning_threshold - see large collection above.
    • cache_index_pages - Keep SSTable index pages in the global cache after an SSTable read
    • restrict_twcs_without_default_ttl - Controls whether to prevent creating TimeWindowCompactionStrategy tables without a default TTL. Can be true, false, or warn (default). See the examples following this list.
    • twcs_max_window_count - The maximum number of compaction windows allowed when using TimeWindowCompactionStrategy (default: 50)
    • task_ttl_seconds - Time for which information about finished tasks stays in memory (default 10s)
    • broadcast-tables - new experimental Raft feature for internal testing
    • query_tombstone_page_limit - The number of tombstones after which a query cuts a page, even if not full or even empty (default 10000)
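
As noted above, stream_io_throughput_mb_per_sec can be changed in real time via the config virtual table; the sketch below assumes compaction_throughput_mb_per_sec can be updated the same way. The numbers are arbitrary examples, and setting a value back to 0 restores the default behavior, where the bandwidth is determined dynamically:

    # Cap streaming I/O (repair, RBNO and legacy topology operations) at 100 MiB/s system-wide
    cqlsh -e "UPDATE system.config SET value = '100' WHERE name = 'stream_io_throughput_mb_per_sec';"

    # Cap compaction I/O at 320 MB/s system-wide
    cqlsh -e "UPDATE system.config SET value = '320' WHERE name = 'compaction_throughput_mb_per_sec';"

    # Inspect the current value
    cqlsh -e "SELECT name, value FROM system.config WHERE name = 'stream_io_throughput_mb_per_sec';"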
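
Similarly, with restrict_twcs_without_default_ttl in its default warn mode (or set to true), new TimeWindowCompactionStrategy tables are expected to carry a default TTL. A minimal cqlsh sketch; the keyspace, table, window settings, and TTL below are placeholders:

    # Create a TWCS table with a 7-day default TTL (604800 seconds)
    cqlsh -e "CREATE TABLE ks.sensor_data (
        id int,
        ts timestamp,
        value double,
        PRIMARY KEY (id, ts))
      WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1'}
      AND default_time_to_live = 604800;"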

Deprecated and removed features

  • The CQL binary protocol versions 1 and 2 are no longer supported. Version 3 and above have been supported for 9 years, so versions 1 and 2 are unlikely to be in real use. You can check for version 1 and 2 clients in the system.clients virtual table (see the example after this list). #10607
  • New DateTieredCompactionStrategy tables are now rejected by default. Users should switch to TimeWindowCompactionStrategy, as shown in the example after this list. Existing DateTieredCompactionStrategy tables are still supported, and it is still possible to configure the database to allow new DateTieredCompactionStrategy tables.
  • Thrift API - the legacy ScyllaDB (and Apache Cassandra) API is deprecated and will be removed in a followup release. Thrift is now disabled by default.
  • Compact Storage - a legacy table format used by Thrift and deprecated in Apache Cassandra - is deprecated and will be removed in a followup release.
  • In-Memory Tables - an enterprise-only feature - are deprecated.
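
For the protocol and compaction strategy changes above, a short cqlsh sketch may help with upgrade planning. The protocol_version column is assumed to be present in the system.clients schema, and the keyspace, table, and window settings are placeholders:

    # Check which CQL protocol version each connected client negotiated
    cqlsh -e "SELECT address, driver_name, driver_version, protocol_version FROM system.clients;"

    # Move a legacy DateTieredCompactionStrategy table to TimeWindowCompactionStrategy
    cqlsh -e "ALTER TABLE ks.events WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1'};"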

Monitoring and tracing

Scylla Monitoring Stack release 4.4 and later will support ScyllaDB Enterprise 2023.1.

Metrics-related updates are listed below:

  • Shard latencies are now reported as summaries. This is part of an effort to reduce the total number of generated metrics. In addition, empty histograms and summaries will not be reported. The overall result is a 5x reduction in the number of metrics #11173.

    This is what a summary looks like:

    scylla_storage_proxy_coordinator_read_latency_summary_count{scheduling_group_name="statement",shard="1"} 2
    scylla_storage_proxy_coordinator_read_latency_summary{quantile="0.990000",scheduling_group_name="statement",shard="1"} 640

  • There is now a metric that allows observation of update progress of materialized views from staging sstables.

  • There are now completion percentage metrics for node operations using streaming; previously the completion metrics were only available when using repair-based node operations. #11600

  • The sstable row_reads metric for m-format sstables is now properly incremented, instead of showing zeroes. #12406

  • The replica-side read metrics, which have been incorrect for some time, have been revamped. #10065

  • Slow query tracing only considered local time - the time from when a request first hit the replica - to determine whether a request needed to be traced. This could cause some parts of slow query traces to be missed. To fix that, slow queries on the replicas are now determined using the start time on the coordinator.

  • The system.large_partitions and similar system tables will now hold only the base name of the sstable, not the full path. This is to avoid confusion if the large partition is reported while the sstable is in one directory, but the sstable is later moved to another - for example, from staging to the main directory after view building is done, or into the quarantine subdirectory if it is found to be inconsistent by scrub. #10075

  • There are now metrics showing each node’s idea of how many live nodes and how many unreachable nodes there are. This aids understanding problems where failure detection is not symmetric. #10102

  • The system.clients table has been virtualized. This is a refactoring with no UX impact.

  • Aggregated queries that use an index are now properly traced.

  • The number of per-table metrics has been reduced by sending metric summaries instead of histograms and by not sending unused metrics.

Additional bug fixes

The following issues have been fixed on top of what was fixed in Scylla Open Source 5.2.0, with open source references where available. In addition, all relevant bug fixes from 2022.1.x and 2022.2.x are fixed in 2023.1.0.

  • Stability: an extremely rare case can cause iterator invalidation in lsa_partition_reader::reset_state(), followed by process exit #14696

  • Stability: mutation_reader_merger can overflow the stack when merging many empty readers. This may happen when running a second repair right after another. #14415

  • Stability: a large number of lsa-timing log messages during node replace caused cassandra-stress (c-s) to get stuck and abort. The fix updates the reactor shares for the default IO class from 1 to 200 #13753

  • DynamoDB API (Alternator) stability: assertion in output_stream when an exception occurs during response streaming #14453.

  • Stability: cached_file, used by index caching, can potentially cause a crash after OOM #14814

  • Stability: compaction: excessive reallocation during input list formatting #14071. The issue is more likely with off-strategy compaction.

  • Stability: deadlock caused by view update _registration_sem and streaming reader _streaming_concurrency_sem #14676

  • Stability: a failure when reading metrics, caused by a rare race condition when another node is down. (seastar::metrics::double_registration (registering metrics twice for metrics: storage_proxy_coordinator_background_replica_writes_failed_remote_node)) #11017

  • Stability: ICS compaction is not working in cleanup #14035 (introduced in 2022.2.0)

  • Stability: messaging: when upgrading OSS nodes to Enterprise, service-levels are matched to the default scheduling group #13841, #12552

  • Stability: Range-scans have a protection against using the wrong service-level to continue a suspended range-scan. This protection had a mistake, resulting in the node crashing when the protection mechanism was triggered. multishard_mutation_query: reader_context::lookup_readers() is not exception safe w.r.t. closing readers #13784

  • Stability: partitioned_sstable_set::insert might stall when called by table::make_reader_v2_excluding_sstables. The root cause is that view building from staging creates a reader from scratch for every partition, in order to calculate the diff between new staging data and data in the base sstable set, and then pushes the result to the view replicas. #14244

  • Setup: scylla-fstrim.timer is enabled but not started #14249

  • Setup: The installer now wipes filesystem signatures from the individual disks making up a RAID array, preventing problems with reuse of disks. #13737

  • Stability: bad_alloc (seastar - Failed to allocate 536870912 bytes) #13491. Root cause is a logic fault causing the reader to attempt to read all the data, consuming all memory. Can occur during sstableloader/nodetool refresh, repair or range scan.

  • Stability: stack-use-after-return in table::make_reader_v2_excluding_staging() #14812

  • Stability: View building crashes on large partitions with range tombstones. #14503

  • DynamoDB API (Alternator) stability: Yield while building large results in Alternator - rjson::print, executor::batch_get_item #13689

  • Setup: fix a regression in setup which overrode manual updates of perftune.yaml #11385 #10121

  • Setup: updates in perftune.py, improving performance for larger servers (32 cores and above):

    • introduce a generic auto_detect_irq_mask(cpu_mask) function
    • auto-select the same number of IRQ cores on each NUMA node
  • Stability: ‘sleep_aborted’ error during Scylla shutdown #13374

  • Stability: a rare failure in row_cache_test/test_concurrent_reads_and_eviction #12462

  • Stability: ALTER KEYSPACE can break tables with UDT columns #14139

  • Correctness: Decommission and removenode may lead to consistency issues if one of the nodes decides to abort during streaming #12989

  • UX: non-informative iotune warnings in scylla_kernel_check #13373

  • Stability: a race condition in scylla boot, when migration_manager::sync_schema failed with seastar::rpc::closed_error, causing repair to fail #12956, #12764

  • Stability: Node operations failures get masked by abort request failures #12798

  • Stability: Node operations may fail if prepare takes longer than heartbeat timeout #12969, #11011

  • Stability: Segmentation fault happened on live nodes while adding a new node to replace a terminated one #13368 (issue introduced in 5.2)

  • Stability: Shutting down auth service may hang #13545

  • Correctness: tables with the new tombstone_gc ‘immediate’ mode might delete TTL data that has not expired #13572

  • Stability: possible use-after-move in virtual table for secondary indexes #13396

  • Stability: possible use-after-move when initializing row cache with dummy entry #13400

  • Stability: possible use-after-move when making streaming reader #13397

  • Stability: possible use-after-move when reading from SSTable in reverse #13394

  • Stability: possible use-after-move when tracking view builder progress #13395

  • Stability: reactor stalls in commitlog replay path due to commit log regexp processing #11710

  • Stability: Replication of default auth settings may fail #2852

  • Stability: db/view: update view generator doesn’t close staging sstable reader on exceptions #13413

  • Stability: direct_failure_detector::ping_with_timeout() causes exceptions to be thrown every 100ms times the number of live nodes, which spam the logs, and might slow it down #13278

  • Stability: on_internal_error doesn’t log an error when not aborting #13786

  • Packaging: RPM package dependencies issue. When installing a specific version with yum/dnf, the scylla-python3 version will not match the specified version, but will be the latest one instead. #13222

  • Monitoring: new metric for CQL request and response sizes #13061

  • Audit: do not round timestamp in the audit table

  • Encryption at rest: rare deadlock when creating a table using encryption with replicated key provider (default) for the first time

  • Stability: Adding nodes to a large cluster (90+ nodes) may cause existing nodes to crash. The root cause is quadratic behavior in get_address_ranges function #12724

  • Stability: a rare crash due to null pointer dereference: clear_gently of disengaged unique_ptr dereferences nullptr #13636

  • Performance: Compaction manager “periodic reevaluation” is one-off. This means that compaction was not kicking in later for a table with little to no write activity whose data expired later (e.g., one hour after the evaluation). #13430

  • Stability: Internal error in a COUNT request with empty IN. The query “select count(*) from {table1} where p in ()” should result in a count of 0, because the empty p in () matches no rows. However, Scylla returned an internal error instead. #12475

  • Tools: the total disk space used metric incorrectly reported the amount of disk space ever used. It should report the size of all SSTables being used plus the ones waiting to be deleted. Live disk space used shouldn’t account for the ones waiting to be deleted, and live SSTable count shouldn’t count SSTables waiting to be deleted. #12717

  • Stability: Bootstrap fails during a replace operation while starting “off-strategy compaction”, and a huge number of “Error applying view update” errors were received #12693. The cause is the commit “repair: Reduce repair reader eviction with diff shard count”, introduced in 2022.2.1

  • Stability: CQL compression might cause reactor stalls on buffer allocation #13437

  • Stability: coredumps were not being generated. The fix increases the systemd coredump generation timeout #5430

  • Performance: Fix stalls caused by quadratic behavior when inserting sstables into the tracker on schema change #12499

  • Stability: abort_source::do_request_abort(std::optional<std::exception_ptr>): Assertion ‘_subscriptions’ failed during shutdown #12512

  • CQL: scylla: types: is_tuple() doesn’t handle reversed types. For example, a schema with a reversed clustering key component; this component will be incorrectly represented in the schema CQL dump: the UDT will lose the frozen attribute. When attempting to recreate this schema based on the dump, it will fail, as only frozen UDTs are allowed in primary key components. #12576

  • Stability: commitlog: segment recycling breaks on segment file removal #12645

  • Workload Prioritization improvements and bug fixes:

    • Stability: removing a service level during an sstable load could lead to reading deleted memory and process exit
    • Stability: some requests can ‘leak’ into the default service level just after authentication
    • Stability: a bug in the service level controller, introduced in 2022.1.4, might give the wrong priority to a task, resulting in timeouts.
  • Incremental Compaction Strategy (ICS) improvements and bug fixes:

    • Make ICS reshape more efficient for off-strategy compaction, including large data sets. This fixes a regression in the replace node operation, which uses Repair-Based Node Operations (RBNO)
    • Crash on compaction completion when ICS ends up with a run containing staging and non-staging files. Error log: “scylla: sstables/sstables.cc:2744: