Last week in scylladb.git master (issue #188; 2023-07-16)

This short report brings to light some interesting commits to scylladb.git master from the last week. Commits in the c41f0ebd2a…a93fd2b162 range are covered.

There were 104 non-merge commits from 17 authors in that period. Some notable commits:

A bug involving incorrect cross-shard access while performing a nodetool scrub command was fixed.

Usually, repair reconciles a shard’s data on one node with the data on the same shard in other nodes. When the number of shards on different nodes doesn’t match, repair has to pick small ranges from all shards on the remote nodes. This adds significant overhead which is most pronounced when there is little or no data in the table. This is common in tests and slows then down, so we now have an optimization for the little-data case.

A complication causing problems in bootstrap handling very recent changes in topology was fixed.

A recent change extending the scope of CQL Data Definition Language (DDL) transactions caused a significant performance regression, so it was reverted.

A production installation of ScyllaDB locks all memory so we don’t experience high latency due to page faults. However, this only applies from the first time the memory is accessed; the first access can still experience stalls, made larger by using transparent huge pages. To fix this, a Seastar update adds a prefault thread that attempts to access all memory ahead of the database, taking the latency hit on this new thread rather than user queries. This will be visible as increased CPU consumption during the first few seconds (up to a minute on large machines) during process start.

We now tune the Linux kernel’s caching of inodes (in-memory structure representing file metadata) to favor evicting inodes quickly. This aims to reduce kernel memory fragmentation when there are large numbers of sstables, as most files comprising an sstable aren’t accessed after the process starts.

When compaction completes, it reports the throughput it achieved. We now base it on the input bytes read rather than output bytes, as the latter gives incorrect results for overwrite or expiring workloads.

The messaging service, responsible for inter-node communication, now initializes transport-layer security (TLS) earlier, to account for the failure detector pinging the its own node.

Resharding is a process where an sstable is split into several sstables, each wholly-owned by a single shard. A recent change to integrate resharding into the task manager was found to crash the system, so it was reverted.

Cleanup is a process where an sstable is rewritten to discard all partitions that no longer belong to the node (for example, after bootstrap). It has gained an optimization where we skip over the unnecessary partitions rather than reading and discarding them.

Alternator, ScyllaDB’s implementation of the DynamoDB API, implemented the error path of the size() function incorrectly. This is now fixed.

Recently, we started merging adjacent schema and topology changes when controlled by Raft, but this merging had a subtle bug leading to incorrect merging. This is now fixed.

The system uses a reader_concurrency_semaphore to limit the number of concurrent reads, as each read can consume large amounts of memory when merging sstables. Repair has its own allocation of concurrent reads. We now limit the scope of a read more carefully, to allow new reads to issue more quickly.

A recent regression involving a crash in decommission was fixed.

There is now a configuration item to control the size of stream plans, as a fraction of the total number of token ranges to stream,

DateTieredCompactionStrategy was removed. Users should move to TimeWindowCompactionStrategy. It is already illegal to create new DateTieredCompactionStrategy tables.

See you in the next issue of last week in scylladb.git master!