Last 3 weeks in scylla-cluster-tests.git master (issue #33; 2024-01-05)

This brief report highlights some notable commits to scylla-cluster-tests.git master from the past three weeks, specifically within the 9016c20b…f5bc3186 range.

During this period, we had 89 non-merge commits from 13 authors. Here are some key updates:

The Operator’s NodeConfig CRD is now reused for the static volume provisioner, enabling the configuration of SSD disks into RAID arrays. This change lets us simplify the SCT code and eliminate redundant pods in EKS and GKE.

We have started testing Scylla ARM docker images with functional tests and one longevity test in the EKS environment. The EKS module can now detect ARM instance types and select the appropriate image for the VM.
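As a rough illustration of how ARM detection and image selection could work, here is a minimal sketch. It is not the SCT implementation: the helper names (guess_arch, pick_image) and the image mapping are hypothetical, and the detection is a heuristic based on the AWS naming convention in which a "g" among the instance-type attributes denotes a Graviton processor (with a1 as the special first-generation case). Names that do not follow the plain family-generation-attributes pattern are rejected rather than guessed.

```python
import re

# Hypothetical helper, not SCT's actual code. Splits an EC2 instance type
# such as "im4gn.large" into series ("im"), generation ("4") and
# attributes ("gn").
_INSTANCE_RE = re.compile(
    r"^(?P<series>[a-z]+)(?P<generation>\d+)(?P<attributes>[a-z]*)\."
)


def guess_arch(instance_type: str) -> str:
    """Guess the CPU architecture from an EC2 instance type name (heuristic)."""
    match = _INSTANCE_RE.match(instance_type.lower())
    if not match:
        raise ValueError(f"unrecognized instance type: {instance_type!r}")
    # a1 is the first-generation Graviton family and has no "g" attribute.
    if match["series"] == "a" and match["generation"] == "1":
        return "arm64"
    # Assumption: a "g" in the attributes marks a Graviton (ARM) processor.
    return "arm64" if "g" in match["attributes"] else "x86_64"


def pick_image(instance_type: str, images: dict) -> str:
    """Pick the image registered for the guessed architecture.

    `images` is a hypothetical mapping like {"arm64": ..., "x86_64": ...}.
    """
    return images[guess_arch(instance_type)]
```

A more robust approach would be to ask the cloud API for the supported architectures of an instance type instead of parsing its name; the name-based version is shown only because it is self-contained.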

Developers can add a config file that enables tablets in Jenkins jobs, making it easier to enable this feature in any SCT test/longevity.

The Scylla-operator upgrade test has been enhanced with new checks, including verifying new ‘UID’ values, ‘status.conditions’, and scylla-operator images.

The default monitoring branch was updated to 4.6.

We can now run the must-gather binary on any architecture on K8S, allowing us to collect logs from ARM instances.

The disrupt_memory_stress nemesis was disabled to avoid the uncontrolled situations it can cause.

Please take note of the new issue and pull-request templates when working with GitHub, and adhere to the new guidelines.

In some cases, such as twcs, we have a table with many tombstones and our method for getting a list of ks/cf with data could be inaccurate. The cfstats utility is now used, which may be slower but is more accurate.

The list of nemeses used in longevities is now cycled, so it never runs out and fills the entire test run. This eliminates the need to fine-tune cases based on their running length.
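The cycling idea can be sketched with itertools.cycle. This is only an illustration under stated assumptions, not the SCT code: run_nemeses, its parameters, and the deadline-based stop condition are hypothetical names chosen for the example.

```python
import itertools
import time


def run_nemeses(nemeses, deadline, clock=time.monotonic):
    """Run the given nemesis callables in order, repeating the list
    from the start until the deadline is reached.

    Hypothetical sketch: `nemeses` is a list of callables, `deadline` is
    a value comparable with what `clock()` returns.
    """
    executed = []
    # itertools.cycle restarts the list when it is exhausted, so the
    # sequence never ends no matter how long the test runs.
    for nemesis in itertools.cycle(nemeses):
        if clock() >= deadline:
            break
        nemesis()
        executed.append(nemesis.__name__)
    return executed
```

With this shape, a short list of nemeses covers a test of any duration, and the run length is controlled only by the deadline.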

We’ve been defaulting to syslog-ng for some time, and rsyslog was filling the disk even when not needed, complicating node config. We have removed rsyslog from SCT.

Improvements were made to NodeBootstrapAbortManager, correcting timeouts for operations to prevent races between threads and ensure all operations finish as expected.

We’ve introduced a new script that allows us to exclude incorrect performance results from ES history by modifying the result’s job-name.

The scale-cluster test has been stabilized by reducing the load, as the previous load was too heavy for the disk to manage.

The audit test cases now default to syslog. This change was made because the table option was causing issues and could lead to query failures when auditing with CL=ONE.

To streamline our workflow, any created or reopened SCT issues are now automatically added to the QA board, as we primarily work with boards.

In certain scenarios, we may need to adjust timeouts relative to the number of nodes. To facilitate this, we’ve started to collect the number of database nodes when measuring operation times.
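One way such a node-aware timeout might look is sketched below. The linear formula, the function name scaled_timeout, and its parameters are assumptions made for illustration; the commits only add the node count to the collected measurements and do not prescribe a scaling rule.

```python
def scaled_timeout(base_timeout, node_count, per_node_seconds=60):
    """Return a timeout in seconds that grows linearly with cluster size.

    Hypothetical sketch: `base_timeout` applies to a single-node cluster
    and every additional node adds `per_node_seconds`.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return base_timeout + per_node_seconds * (node_count - 1)
```

For example, scaled_timeout(600, 6) gives 900 seconds for a six-node cluster. The collected operation times per node count are the kind of data that could be used to choose realistic values for the base and per-node terms.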

Stay tuned for the next issue of last week in scylla-cluster-tests.git master!