Originally from the User Slack
@rfurmanski: Hi, I was trying to enable consistent topology updates in my cluster (Scylla open source 6.0.3 everywhere, 3 DCs, 26 nodes) and it is stuck on build_coordinator_state. What might be the problem?
@Piotr_Smaroń: CC @Kamil_Braun
@rfurmanski: I verified all prerequisites and started the upgrade via curl (http://127.0.0.1:10000/storage_service/raft_topology/upgrade) after migrating from 5.4.7 to 6.0.3.
I followed the procedure described in here: https://opensource.docs.scylladb.com/branch-6.0/upgrade/upgrade-opensource/upgrade-guide-from-5.4-to-6.0/enable-consistent-topology.html
@Kamil_Braun: > What might be a problem?
Anything could be a problem. Without logs, it’s impossible to say.
cc @Piotr_Dulikowski (author of the upgrade procedure)
@rfurmanski: sure. Please let me know what to provide. I don’t see anything wrong in the logs, just:
Sep 10 14:27:28 ams-scylla1 scylla[2458175]: [shard 0:strm] api - Requested to schedule upgrade to raft topology
Sep 10 14:27:28 ams-scylla1 scylla[2458175]: [shard 0:strm] raft_topology - requesting to start upgrade to topology on raft
Sep 10 14:27:28 ams-scylla1 scylla[2458175]: [shard 0:strm] raft_topology - upgrade to raft topology has started
Sep 10 14:27:28 ams-scylla1 scylla[2458175]: [shard 0:strm] raft_topology - upgrade to topology on raft is scheduled
@Kamil_Braun: First try the recommendations from https://opensource.docs.scylladb.com/branch-6.0/upgrade/upgrade-opensource/upgrade-guide-from-5.4-to-6.0/enable-consistent-topology.html#what[…]tuck
If nothing works (including rolling restart) – I recommend opening an issue and attaching logs from all of your nodes from the moment you started the upgrade procedure, ±1h
Also nodetool status
output from one of the nodes
@Piotr_Dulikowski: > upgrade to topology on raft is scheduled
After this, the topology coordinator should have started. Can you see a “start topology coordinator fiber” log message on any of the nodes?
@rfurmanski: yes I see this on some of the nodes, but definitely not on all of them
@Piotr_Dulikowski: It’s expected - the topology coordinator fiber runs on the current raft leader, so this won’t be printed on all nodes. So, most likely the upgrade process has started but got stuck.
I think the best way forward would be to do what @Kamil_Braun suggested, so that we can analyze logs in more detail ourselves and look for more clues on what made it stuck.
@rfurmanski: looks like raft still sees 2 recently removed nodes:
[shard 0:strm] raft_topology - topology change coordinator fiber got error exceptions::unavailable_exception (Cannot achieve consistency level for cl ALL. Requires 28, alive 26)
nodetool status gives 26. These 2 nodes were removed before the upgrade to 6.0.3.
how to convince raft that these 2 nodes are removed?
ahhh! After changing the replication settings for the system tables, the upgrade was successful. Thank you guys!
@Piotr_Smaroń: @rfurmanski what exactly have you changed, can you please share the details?
@rfurmanski:
scyllamaster@cqlsh> alter keyspace system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'ams3': '6', 'ams4': '8', 'ash': '8', 'trn': '4'} AND durable_writes = 'true';
scyllamaster@cqlsh> alter keyspace system_distributed WITH replication = {'class': 'NetworkTopologyStrategy', 'ams3': '6', 'ams4': '8', 'ash': '8', 'trn': '4'} AND durable_writes = 'true';
scyllamaster@cqlsh> alter keyspace system_traces WITH replication = {'class': 'NetworkTopologyStrategy', 'ams3': '6', 'ams4': '8', 'ash': '8', 'trn': '4'} AND durable_writes = 'true';
previously I had 8 nodes in ams3
after executing these commands the raft upgrade went through
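For the record, the arithmetic lines up: a CL=ALL read must reach every replica, and the replica count is the sum of the per-DC replication factors in the keyspace's replication map. A minimal sketch of that calculation (the "old" RF values are inferred from the error's "Requires 28" and the remark about ams3 previously having 8 nodes; `replicas_required` is an illustrative helper, not a Scylla API):

```python
# Sum of per-DC replication factors = replicas a CL=ALL read must contact.
def replicas_required(replication: dict) -> int:
    return sum(int(rf) for dc, rf in replication.items() if dc != 'class')

# Old settings: ams3 presumably still had RF=8 from when that DC had 8 nodes.
old = {'class': 'NetworkTopologyStrategy', 'ams3': '8', 'ams4': '8', 'ash': '8', 'trn': '4'}
# New settings, taken from the ALTER KEYSPACE commands above.
new = {'class': 'NetworkTopologyStrategy', 'ams3': '6', 'ams4': '8', 'ash': '8', 'trn': '4'}

print(replicas_required(old))  # 28 -- but only 26 nodes alive, hence the unavailable_exception
print(replicas_required(new))  # 26 -- matches the live cluster, so the CL=ALL read can succeed
```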
@Kamil_Braun: ah, so it was probably trying to migrate from system_auth to the new auth v2 tables
and it was trying to read system_auth using CL=ALL
@rfurmanski: yes
@Kamil_Braun: we should probably improve that error message
[shard 0:strm] raft_topology - topology change coordinator fiber got error exceptions::unavailable_exception (Cannot achieve consistency level for cl ALL. Requires 28, alive 26)
so it’s more clear which table it is trying to query