Unexpected Reshards When Changing Root Volumes

We are working on a migration from CentOS to Ubuntu and ran into an issue with nodes trying to reshard, because the cpuset settings have changed since the nodes were created.

Background:

We’ve run into an issue when trying to upgrade nodes from CentOS to Ubuntu. The gist of what we are doing is using AWS EC2’s “Root Volume Replacement” to swap the CentOS root volume out for a new Ubuntu one, then handling any configuration before restarting the node. This has worked in all of our test clusters and all but 1 of our production clusters.

That production cluster uses i3en.6xlarge instance types, with only 1 other cluster using i3en.xlarge (the rest are on i3*). The original AMI for most of the nodes on this cluster is an old 4.4.8 Scylla AMI that has just been upgraded over the years. It was the last CentOS AMI available, so we were stuck there for a while. 2 nodes on the cluster have been rebuilt recently and are running on a custom CentOS AMI that is based on the 5.1+ Scylla AMIs. We have 1 node that has had the root volume replaced using the method above.

Problem:

The node that had the root volume replacement is undergoing a reshard of all of its data. This was unexpected, because no other cluster has done this. Digging into things, we found that the offending node went from 24 cores to 22 cores in Prometheus metrics. Using all CPUs minus 2 matches our other clusters, so the concern was more with why this cluster was using all 24 cores to begin with.
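(In case it helps anyone reproduce the check, this is roughly how we confirmed the per-node shard counts. The metric name and Prometheus address are assumptions based on the standard Scylla monitoring stack, so adjust for your setup.)

# One result per node, counting one time series per shard; the replaced node showed 22 here vs 24 on its peers.
curl -sG 'http://prometheus.example:9090/api/v1/query' \
     --data-urlencode 'query=count by (instance) (scylla_reactor_utilization)'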

The following configurations were found on all of the 4.4.8 AMI based nodes:
/etc/scylla.d/cpuset.conf

# DO NO EDIT
# This file should be automatically configure by scylla_cpuset_setup
#
# CPUSET="--cpuset 0 --smp 1"
CPUSET="--cpuset 0-23 "

/etc/scylla.d/perftune.yaml

cpu_mask: '0x00ffffff'
mode: mq
nic:
- eth0
tune:
- net

While the following was found on the newer nodes (the 2 rebuilt ones plus the root-volume-replaced node):
/etc/scylla.d/cpuset.conf

# DO NO EDIT
# This file should be automatically configure by scylla_cpuset_setup
#
# CPUSET="--cpuset 0 --smp 1"
CPUSET="--cpuset 1-11,13-23 "

/etc/scylla.d/perftune.yaml

cpu_mask: '0x00ffffff'
irq_core_auto_detection_ratio: 16  ### Missing on one of the newer rebuilds, not the replacement
irq_cpu_mask: '0x00001001'
nic:
- eth0
tune:
- net
- system
tune_clock: true

I’ve dug around and can’t find the reason, but I assume that something changed in how perftune is run on the i3en instances between 4.4 and 5.1.
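(For what it’s worth, the numbers line up with the masks above; a quick sanity check in bash:)

# Count the set bits in each mask from perftune.yaml.
popcount() { local m=$(($1)) n=0; while (( m )); do (( n += m & 1, m >>= 1 )); done; echo "$n"; }
popcount 0x00ffffff   # 24 -> CPUs 0-23, what Scylla used on the old nodes
popcount 0x00001001   # 2  -> bits 0 and 12, the CPUs now reserved for NIC IRQs
# 24 - 2 = 22 shards, which matches CPUSET="--cpuset 1-11,13-23" and the core count drop we saw.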

My question boils down to: what is the recommended course of action here? This cluster has 39 nodes with 400 TB+ of data on it, so resharding most of the nodes will probably take weeks, with a node being down the whole time it reshards. Is there a clean way we can keep the old CPU set on the old nodes for now? Will that have any negative impact?

Our main concern is getting off of CentOS, so if we can do that quickly and worry about the reshards over a long time period that would be preferable, assuming that doesn’t cause any major issues.

Another interesting side effect of this change was that the new node came under some very heavy load complaining about schema updates. We saw messages like the following on the changed node, aimed at every other node in the cluster, and on every other node in the cluster trying to reach the changed node:

migration_manager - Pulling schema from 10.*.*.*
migration_manager - Requesting schema pull from 10.*.*.*

I assumed this was related to a schema mismatch, just without an outright failure. Running nodetool describecluster showed the same schema version UUID across the cluster, with no changes.
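(For reference, this is the kind of check we ran; the node addresses are placeholders, and the exact output layout may differ between versions.)

for host in 10.0.0.1 10.0.0.2 10.0.0.3; do    # placeholder addresses, one per node
    echo "== $host =="
    nodetool -h "$host" describecluster | grep -A 3 'Schema versions'
done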

The node was still working, but it had high latency and was performing far worse overall than the other nodes. We decided to go with a rolling restart, as I know that can fix schema mismatches, and that resolved the issue.

This is mostly tangential to the main problem here; I am posting it mainly because it is an outcome I didn’t see explicitly stated anywhere. It seems that the change in CPU count and/or the reshard that followed interfered with schema agreement.

As the upgrade manual for 5.1 points out:

You can save the previous cpuset file, and it should still work.

This change was made to reserve some of the CPUs for network interface IRQs; using the same CPUs for both might affect Scylla’s performance.

Also, the recommended way of upgrading is one release at a time; jumping from 4.4 → 5.1 might introduce compatibility issues and isn’t a tested procedure.

Any change to the number of CPUs used means resharding; I don’t think there’s a way around it.
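Roughly, something like this around the root volume swap (the backup location is just a placeholder; the important part is that the old files are back in place before scylla-server starts on the new root volume):

# Before detaching the CentOS root volume: keep a copy of the tuning files off the node.
sudo cp /etc/scylla.d/cpuset.conf   /path/to/backup/cpuset.conf
sudo cp /etc/scylla.d/perftune.yaml /path/to/backup/perftune.yaml

# After the Ubuntu root volume is attached and configured, before Scylla's first start:
sudo cp /path/to/backup/cpuset.conf   /etc/scylla.d/cpuset.conf
sudo cp /path/to/backup/perftune.yaml /etc/scylla.d/perftune.yaml
sudo systemctl start scylla-server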

Thanks for the documentation; that is helpful. Can you clarify some things, though?

As using different modes across one cluster is not recommended

What’s the downside of this? Is it just a performance hit? We are currently running with 3 nodes in sq_split (assuming that is the default) and the rest in mq mode.

Also, the steps for the case of a changed cpuset (which I expect for our nodes) don’t do anything with the perftune.yaml backup; will Scylla recreate that file on restart?

Just making sure I know all the ins and outs of this method, as this is only impacting our production cluster and we don’t have a good way to test this process in another cluster.


To clarify the version change: the nodes were upgraded one release at a time; it was the underlying AMI that jumped from 4.4 to 5.2. So Scylla node setup and perftune were not rerun until we recreated the root volumes of the nodes for this migration.

Having a mixed setup, in which every node uses a different cpuset, is not just about performance.

It’s a less common setup that is less tested; there were a few issues with that kind of mixed setup in previous releases.

If there is an actual reason for doing such a mix, one could, but I wouldn’t recommend doing so just because of an upgrade issue.

As for an in-place replacement of the AMI root disk, that is also something that isn’t tested. We recommend an in-place upgrade of Scylla and the system, or replacing nodes with fresh nodes running a newer AMI.

OK, thanks for the info, Israel. I was wondering about the urgency of the mixed setup and how long we would be safe keeping it mixed. It sounds like it has just been problematic in the past and is a less-explored area of configuration.

We will do the resharding, but we have apps that read from the cluster that are very read-latency sensitive. They fall behind if latency gets worse than 3 or 4 ms. Having a node down does that to us easily, so our plan is to do a few reshards each weekend and work toward the desired state over time.

As for the upgrade process, the other options just seem worse in our opinion. An in-place OS change sounds like torture, and replacing each node with a new one and decommissioning the old ones means shuffling hundreds of TB of data around the network and incurring massive costs. We understand it is untested with Scylla, though.