I have a little problem. I'm trying to learn how to deal with and fix clustering issues in Scylla (currently 5.2.2). I created a cluster and then added 2 nodes from a different datacenter into it (so currently datacenter1 = 3 nodes, datacenter2 = 2 nodes; both files use local=true). I made a configuration error in the cassandra-rackdc.properties file when I tried to add a third node to datacenter2: I set the tag to datacenter1 instead. The node wouldn't join, it just stalled. I tried to re-add it, and at first it said the node already existed. After trying to figure this out, I wiped the data and tried again, and then it gave me an error that it couldn't resolve IP addresses for nodes by ID, listing multiple IDs that weren't in the cluster. Now I can't add any nodes. I created a new node, attempted to add it to the cluster, and it just froze. I rebooted and tried again, and now it says the node already exists in the cluster. However, it doesn't show up in nodetool status or system.peers.
How can I fix this issue so I can add nodes to the cluster?
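To illustrate the mistake, this is roughly the shape of the file on the node that was meant to join datacenter2 (a sketch, using the dc/rack names from this cluster; the GossipingPropertyFileSnitch property for preferring the local link is prefer_local):

```properties
# cassandra-rackdc.properties on the node meant for datacenter2
dc=datacenter2      # this is the line I had mistakenly set to datacenter1
rack=rack1
prefer_local=true
```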
Thank you for this. I read through it, and at the end it says: "If removenode returns an error like:
nodetool: Scylla API server HTTP POST to URL '/storage_service/remove_node' failed: std::runtime_error (removenode[12e7e05b-d1ae-4978-b6a6-de0066aa80d8]: Host ID 42405b3b-487e-4759-8590-ddb9bdcebdc5 not found in the cluster)
and you're sure that you're providing the correct Host ID, it means that the member was already removed and you don't have to clean up after it."
However, here's the output from my attempts to remove a node.
On 10.0.137.180:

$ nodetool info
ID : dd530cfb-7f4e-42c3-be6a-ce093d263b96
Gossip active : true
nodetool: Scylla API server HTTP GET to URL '/storage_service/rpc_server' failed: Not found
See 'nodetool help' or 'nodetool help '.

$ nodetool decommission
nodetool: Scylla API server HTTP POST to URL '/storage_service/decommission' failed: std::runtime_error (local node is not a member of the token ring yet)
See 'nodetool help' or 'nodetool help '.
On 10.0.130.77 (one of the nodes in the cluster):

$ nodetool describecluster
Cluster Information:
    Name: Veeps
    Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
    DynamicEndPointSnitch: disabled
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        6e0ec14b-1d4b-305b-9bf9-d420da03eb45: [10.0.137.180]
        7f3ff0e1-96d7-3145-8883-f6125b5522f6: [10.0.130.77, 10.0.137.241]
$ nodetool removenode dd530cfb-7f4e-42c3-be6a-ce093d263b96
nodetool: Scylla API server HTTP POST to URL '/storage_service/remove_node' failed: std::runtime_error (removenode[bf9d0aed-d300-48e0-ad70-8637d11972d6]: Node dd530cfb-7f4e-42c3-be6a-ce093d263b96 not found in the cluster)
See 'nodetool help' or 'nodetool help '.
OK, so it looks like actually stopping scylla-server on .180 removed it from that list. However, what are your thoughts on the other 2 listed? Those were already offline.
Hm, I was convinced that you had Raft enabled in your cluster (due to the error you said you were getting, that it cannot resolve IP addresses), but apparently not.
Do both nodes return this result when you connect to them with cqlsh (an empty system.raft_state table)?
What is in your conf/scylla.yaml? Do you have the consistent_cluster_management flag set there?
Yeah, that's part of where I was confused as well. I have consistent_cluster_management set to true on both nodes, and it has been from day 1. Both nodes return that result. Another thing to mention: the query select value from system.scylla_local where key = 'raft_group0_id'; doesn't return anything either, because that key doesn't exist in scylla_local. That query comes from the link you sent me earlier in the post.
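For anyone following along, these are the Raft checks discussed above, run from cqlsh on each node (queries as given in the linked procedure):

```sql
-- Should show the current group 0 members; here it came back empty
SELECT * FROM system.raft_state;

-- Should return the Raft group 0 ID; here the key was missing entirely
SELECT value FROM system.scylla_local WHERE key = 'raft_group0_id';
```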
OK, now I feel a little dumb. I went into recovery mode the night before last when trying to fix this. I have backed out of it, and now this is the result of that query:
 key                  | value
----------------------+--------------------------
 group0_upgrade_state | use_post_raft_procedures
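That row can be read back directly; a sketch, assuming the state lives in system.scylla_local (as the key/value shape of the output suggests):

```sql
SELECT key, value FROM system.scylla_local WHERE key = 'group0_upgrade_state';
```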
WDYM by "backed out"? Have you finished the recovery procedure? If you did anything from the procedure after the "enter recovery mode" step (like truncating some Raft tables), then you need to finish it. If you only entered recovery mode, but then set group0_upgrade_state back to use_post_raft_procedures without removing any other data etc., then I guess everything should be fine.
So if you did anything else besides just entering recovery mode – you should finish the recovery procedure (starting by going into recovery mode again), and then proceed.
Otherwise we should be able to proceed now.
If we proceed, then the next step is to check the results of select * from system.raft_state again.
So looks like nodetool status is consistent with raft state, there are no “ghost members” anymore.
Try to boot the new node again. Make sure you don’t use old work directories from previous boot attempts though – clear the old data from node which failed to boot if you do it on the same machine.
If it again gets stuck on “fail to resolve IP” please save the host ID, we’ll have to determine what node this host ID belongs to.
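Clearing the old work directories can be sketched like this (assuming the default /var/lib/scylla layout; check data_file_directories and commitlog_directory in scylla.yaml if they were customized):

```shell
# stop the node that failed to boot
sudo systemctl stop scylla-server

# remove leftover state from the failed bootstrap attempt
sudo rm -rf /var/lib/scylla/data/* \
            /var/lib/scylla/commitlog/* \
            /var/lib/scylla/hints/* \
            /var/lib/scylla/view_hints/*

# start again so the node bootstraps fresh
sudo systemctl start scylla-server
```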
OK, I did the following (noting it here for posterity):
1. Removed all the data from 10.0.137.180
2. Ensured cassandra-rackdc.properties was set to datacenter1/rack1
3. Ensured the seeds were correct in scylla.yaml and all the parameters matched (consistent_cluster_management, etc.)
4. Started Scylla
Now the node has successfully joined datacenter1. I am now going to create 3 new nodes and attempt to join datacenter2 to this cluster.
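For the new datacenter2 nodes, the relevant settings follow the same pattern (a sketch using names and addresses from this thread; listen/rpc addresses are per-node and omitted here):

```yaml
# scylla.yaml fragment on each new node
cluster_name: 'Veeps'
endpoint_snitch: GossipingPropertyFileSnitch
consistent_cluster_management: true   # must match the existing cluster
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.130.77,10.0.137.241"   # existing, healthy nodes
```

plus a cassandra-rackdc.properties on each of them with dc=datacenter2.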