We are working on a migration from CentOS to Ubuntu and have run into an issue with nodes trying to reshard, because the CPU set settings have changed since the nodes were created. The gist of what we are doing: we use AWS EC2's "Root Volume Replacement" to swap the CentOS root volume out for a new Ubuntu one, handle any configuration, and then restart the node. This has worked in all of our test clusters and all but one of our production clusters.
That production cluster uses i3en.6xlarge instance types; only one other cluster uses i3en.xlarge, and the rest are on i3*. The original AMI for most of the nodes on this cluster is an old 4.4.8 Scylla AMI that has just been upgraded over the years. It was the last CentOS AMI available, so we were stuck there for a while. Two nodes on the cluster have been rebuilt recently and are running a custom CentOS AMI based on 5.1+ Scylla AMIs. One node has had its root volume replaced, using the method above.
The node that had its root volume replaced is now resharding all of its data. This was unexpected, because no other cluster has done this. Digging into things, we found that the offending node went from 24 cores to 22 cores in Prometheus metrics. 22 cores (the CPU count minus 2) matches our other clusters, so the real question is why this cluster was using all 24 cores to begin with.
The following configurations were found on all of the 4.4.8 AMI based nodes:
```
# DO NO EDIT
# This file should be automatically configure by scylla_cpuset_setup
#
# CPUSET="--cpuset 0 --smp 1"
CPUSET="--cpuset 0-23 "
```
```yaml
cpu_mask: '0x00ffffff'
mode: mq
nic:
  - eth0
tune:
  - net
```
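As a sanity check of our own (not a Scylla tool), the hex `cpu_mask` can be decoded into the CPU list it covers with a few lines of Python:

```python
# Decode a perftune-style hex CPU mask into the list of CPU indices it covers.
def mask_to_cpus(mask_hex: str) -> list[int]:
    mask = int(mask_hex, 16)
    return [i for i in range(mask.bit_length()) if mask >> i & 1]

# cpu_mask from the 4.4.8-era nodes: all 24 cores, matching --cpuset 0-23.
cpus = mask_to_cpus("0x00ffffff")
print(len(cpus), cpus[0], cpus[-1])  # → 24 0 23
```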
While the following was found on the newer nodes (2 new plus the root volume replaced node):
```
# DO NO EDIT
# This file should be automatically configure by scylla_cpuset_setup
#
# CPUSET="--cpuset 0 --smp 1"
CPUSET="--cpuset 1-11,13-23 "
```
```yaml
cpu_mask: '0x00ffffff'
irq_core_auto_detection_ratio: 16   ### Missing on one of the newer rebuilds, not the replacement
irq_cpu_mask: '0x00001001'
nic:
  - eth0
tune:
  - net
  - system
tune_clock: true
```
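To confirm where the two missing cores went, we parsed the new `--cpuset` range and the `irq_cpu_mask` side by side (again, our own quick check, not anything Scylla ships):

```python
# Parse a cpuset range string like "1-11,13-23" into a sorted list of cores.
def cpuset_to_cpus(spec: str) -> list[int]:
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            cpus.extend(range(lo, hi + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)

# Decode a hex CPU mask into the list of CPU indices with set bits.
def mask_to_cpus(mask_hex: str) -> list[int]:
    mask = int(mask_hex, 16)
    return [i for i in range(mask.bit_length()) if mask >> i & 1]

new_cpuset = cpuset_to_cpus("1-11,13-23")
irq_cpus = mask_to_cpus("0x00001001")

print(len(new_cpuset))  # → 22 (was 24, hence the reshard)
print(irq_cpus)         # → [0, 12], exactly the two cores removed
print(sorted(new_cpuset + irq_cpus) == list(range(24)))  # → True
```

So the newer perftune output reserves cores 0 and 12 for IRQ handling and drops them from Scylla's cpuset, which changes the shard count from 24 to 22.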
I’ve dug around and can’t find the reason why, but I assume that something changed in how perftune is run on i3en instances between 4.4 and 5.1.
My question boils down to: what is the recommended course of action here? This cluster has 39 nodes with 400TB+ of data on it, so resharding most of the nodes will probably take weeks, with each node being down the whole time. Is there a clean way to keep the old CPU set on the old nodes for now? Would that have any negative impact?
Our main concern is getting off of CentOS, so if we can do that quickly and spread the reshards over a long time period, that would be preferable, assuming it doesn’t cause any major issues.
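For what it’s worth, the stopgap we are considering is restoring the old `CPUSET` line on each node right after the root volume swap, before Scylla starts, so the shard count stays at 24. This is only a sketch of that idea under our own assumptions (that Scylla reads `CPUSET` from `/etc/scylla.d/cpuset.conf` at startup and that hand-restoring the old value is safe until we are ready to reshard); it writes to a temp directory here so it has no side effects:

```python
# Sketch: pin the pre-5.1 cpuset on a replaced node so it keeps 24 shards.
# Assumption on our part: restoring the old CPUSET value before the first
# post-swap start is equivalent to what the old AMI's scylla_cpuset_setup
# produced. A temp dir stands in for /etc/scylla.d in this sketch.
import os
import tempfile

OLD_CPUSET_CONF = (
    "# DO NO EDIT\n"
    "# This file should be automatically configure by scylla_cpuset_setup\n"
    "#\n"
    'CPUSET="--cpuset 0-23 "\n'
)

etc_scylla_d = tempfile.mkdtemp()  # stand-in for /etc/scylla.d
conf_path = os.path.join(etc_scylla_d, "cpuset.conf")
with open(conf_path, "w") as f:
    f.write(OLD_CPUSET_CONF)

with open(conf_path) as f:
    print('--cpuset 0-23' in f.read())  # → True
```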