Rebuilt Node Missing from Raft State, Not Starting

OK, so it’s been a week. I’m going to give the full timeline of events, in case something I don’t think is relevant actually is.

  • A node of our cluster (let’s call it Node A) filled its disk to 100% (we missed the alarms)
  • We stopped the writing applications and restarted the node; it compacted away a bit of space and stayed up and stable, so we bootstrapped a new node into that rack (AZ) (call it Node B)
  • Node B came up, we ran cleanups on Node A, eventually got back what seemed like a good chunk of free space, and started the applications again
  • Node A filled up again while the cleanups were still running, crashed/restarted a few times, and the EC2 instance became unresponsive, so we soft rebooted it
  • AWS ate the RAID/NVMes on Node A, so we backed out entirely and aimed to stop/start the instance and rebuild it from scratch in place with replace_node_first_boot (see the config sketch after this list)
  • That failed the next day because Node B had a random restart (which it recovered from on its own)
  • We restarted the replace from scratch on Node A (a restart of the service)
  • The same two steps repeated themselves the next day
  • We tried to continue the rebuild, which meant changing the node ID in the config since the rebuild had already gotten past that point; it seemed to find the data on disk and repair from there fine
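For reference, the replace itself is driven by a single scylla.yaml setting; a minimal sketch (placeholder host ID, not our real one):

  # scylla.yaml on the node being rebuilt -- only the relevant line.
  # Set it to the host ID of the dead node being replaced before the first start;
  # it has to be removed again before any later restart once the replace finishes.
  replace_node_first_boot: <old host ID of Node A>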

This morning:

  • Node A completed its bootstrap
  • Shortly after, it crashed and didn’t come back up because replace_node_first_boot was still set
  • We removed that and restarted the service; it never gets to “Serving”, stalling on “starting system distributed keyspace” in the systemd status (how we were watching it is sketched after this list)
  • It eventually just stops writing to the logs after a ton of compactions and repairs
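For context, we were watching it with the standard systemd commands (the unit name assumes the stock scylla-server packaging):

  # The systemd status line is where Scylla reports its startup stage;
  # it sat at "starting system distributed keyspace" instead of "Serving"
  systemctl status scylla-server

  # Follow the node's log output during startup
  journalctl -u scylla-server -f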

That brought me to this afternoon and to reading through this documentation on failed membership changes.

I ran both of the queries on a good node to get the cluster state; both returned 10 rows (which matches our current node count). I saw the state below:

cqlsh> select * from system.raft_state;

 group_id                             | disposition | server_id                            | can_vote
--------------------------------------+-------------+--------------------------------------+----------
 d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 |     CURRENT | 361040b2-f2ce-46ee-ae8f-5ccb7a516dae |     True
 d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 |     CURRENT | 49546c8d-6dcf-4980-9c86-5ef1e8c53fa7 |     True
 d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 |     CURRENT | 49deb87f-adbf-4524-9728-9d49aa58e36b |    False
 d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 |     CURRENT | 5b6298da-0961-4b71-8ca2-99d753722fa5 |     True
 d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 |     CURRENT | 7c458569-23c4-4caa-a2a8-f9faf069f3ec |     True
 d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 |     CURRENT | 92d1ab67-0faa-4120-b26a-106af738b355 |     True
 d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 |     CURRENT | 9ca8bf78-0167-4f58-91a7-9d2f1e5cf19f |     True
 d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 |     CURRENT | ad69d877-3262-41a5-88ef-ca973bddefce |     True
 d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 |     CURRENT | b7316efa-e8e8-4031-abfa-05943c753bda |     True
 d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 |     CURRENT | bf03fa3a-42e4-416a-b84c-b0704a801c12 |     True

(10 rows)
cqlsh> select * from system.cluster_status;

 peer     | dc      | host_id                              | load          | owns     | status | tokens | up
----------+---------+--------------------------------------+---------------+----------+--------+--------+------
 10.*.*.* | us-east | b7316efa-e8e8-4031-abfa-05943c753bda | 5947483845182 | 0.093779 | NORMAL |    256 | True
 10.*.*.* | us-east | 7c458569-23c4-4caa-a2a8-f9faf069f3ec | 6040102510775 | 0.105843 | NORMAL |    256 | True
 10.*.*.* | us-east | ad69d877-3262-41a5-88ef-ca973bddefce | 5278908551309 | 0.098916 | NORMAL |    256 | True
   NODE B | us-east | 9ca8bf78-0167-4f58-91a7-9d2f1e5cf19f | 4520013810048 | 0.093432 | NORMAL |    256 | True
   NODE A | us-east | 36593f45-cba6-4368-a287-b5e692b8bb7e |               | 0.112803 | NORMAL |    256 | True
 10.*.*.* | us-east | 49546c8d-6dcf-4980-9c86-5ef1e8c53fa7 | 6003262766104 | 0.103674 | NORMAL |    256 | True
 10.*.*.* | us-east | 92d1ab67-0faa-4120-b26a-106af738b355 | 5755106310906 | 0.100487 | NORMAL |    256 | True
 10.*.*.* | us-east | 361040b2-f2ce-46ee-ae8f-5ccb7a516dae | 5902387877491 | 0.103925 | NORMAL |    256 | True
 10.*.*.* | us-east | bf03fa3a-42e4-416a-b84c-b0704a801c12 | 5890694964817 | 0.093611 | NORMAL |    256 | True
 10.*.*.* | us-east | 5b6298da-0961-4b71-8ca2-99d753722fa5 | 5617537828836 |  0.09353 | NORMAL |    256 | True

49deb87f-adbf-4524-9728-9d49aa58e36b was only showing up in raft_state, was not in nodetool status or nodetool gossipinfo, and was only referenced in some log warnings about not being able to reach it. So it matched the definition of a ghost member, and I ran nodetool removenode on it, which completed quickly and removed it from the table above.
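For the record, the removal itself was just the following, run from a good node (the status subcommand is the standard nodetool one, included here for completeness):

  # Remove the ghost member by its host ID
  nodetool removenode 49deb87f-adbf-4524-9728-9d49aa58e36b

  # Check the progress of the removal (it finished almost immediately here)
  nodetool removenode status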

I gave Scylla on Node A a restart a while afterwards, since nothing on it had changed and no logs were coming out of it. It’s back to where it was before: logging a ton about compactions, but still stuck at “starting system distributed keyspace”.

So, the current state:

  • 10 nodes showing as UN in nodetool status
  • 10 nodes in system.cluster_status as up = True, with just Node A missing the load column
  • 9 nodes in system.raft_state, missing Node A (a quick cross-check is sketched after this list)
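For what it’s worth, this is roughly the cross-check I was doing by eye, written out as a sketch (assumes cqlsh works locally on a good node and a bash shell):

  # Print host IDs that appear in only one of the two system tables --
  # here it flags Node A, which is in cluster_status but not raft_state,
  # and it would have flagged the ghost member before it was removed.
  comm -3 \
    <(cqlsh -e "SELECT server_id FROM system.raft_state;" | grep -Eo '[0-9a-f-]{36}' | sort) \
    <(cqlsh -e "SELECT host_id FROM system.cluster_status;" | grep -Eo '[0-9a-f-]{36}' | sort)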

Is there some way to get Node A back into Raft?

I rebuilt Node A again, supplying 36593f45-cba6-4368-a287-b5e692b8bb7e as the ID, and had no hiccups during the bootstrap. I also identified that Node B had a regression in our configuration: it still had enable_node_aggregated_table_metrics enabled, which causes massive memory contention for us, since these nodes have tons of keyspaces/tables.
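For completeness, the fix on Node B was just reverting that setting in scylla.yaml, roughly:

  # scylla.yaml on Node B -- sketch, only the relevant line.
  # With thousands of keyspaces/tables, leaving this enabled was causing
  # heavy memory contention for us, so it goes back to disabled.
  enable_node_aggregated_table_metrics: false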

I’d still be curious if there is a quicker way to resolve this than waiting another day for a rebuild, but it’s not urgent anymore.

What version of Scylla are you using?

This cluster is using version 5.4.9

What is the host ID of node A? It prints it during startup.
By “rebuilt” do you mean that you bootstrapped the node from scratch, performing a replace operation?

Was 49deb87f-adbf-4524-9728-9d49aa58e36b perhaps the host ID of node A at some point?

In general, it appears you were using the replace-with-same-IP operation: as I understand from your description, you continued to use the same AWS instance where the previous incarnation of node A lived, so it kept the same IP.

Unfortunately, replace-with-same-IP is full of problems, and I recommend against using it in the future. If you lose a disk from an instance, either drop that instance (and start a new one) or, if possible, restart it with a different IP (I’m not familiar enough with AWS to know whether changing an instance’s IP like that is possible).

You can learn more details from this issue: Failure during replace-with-same-IP leaves the node without `STATUS` application_state (permanently), and `token_metadata` inconsistent (until restart) (applies to gossiper / "node-ops" based topology changes) · Issue #19975 · scylladb/scylladb · GitHub

@kbr, the core steps in the op are essentially the “Setup RAID Following a Restart” instructions from the Scylla docs. After losing the ephemeral storage of an i3 instance, they say to recreate the RAID volume and use replace_node_first_boot with the old host ID, roughly as sketched below.
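What we end up running on the instance looks something like this (a sketch only; device names and mount point are illustrative, and plain mdadm is shown here rather than whatever setup tooling the docs use):

  # Recreate the RAID-0 array over the fresh ephemeral NVMe devices
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1

  # New filesystem, mounted where Scylla keeps its data
  mkfs.xfs /dev/md0
  mount /dev/md0 /var/lib/scylla
  chown -R scylla:scylla /var/lib/scylla

  # Then set replace_node_first_boot in scylla.yaml to the old host ID of the
  # lost node and start the service to kick off the replace.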

I was aware that using IP addresses to replace nodes was discouraged and that host IDs were much more reliable. But are you saying that the “Setup RAID Following a Restart” process is still replace-with-same-IP because the IP is unchanged, in spite of using host IDs?

Yes, @prtolo is correct: we used the replace_node_first_boot option and have for a while. I think replace-by-IP is actually disabled now and yells at you to use the host ID if you attempt it (we found that out when it stopped working).

I do believe 49deb87f-adbf-4524-9728-9d49aa58e36b was an old ID of Node A, which is why I tried removing it, hoping the new ID would join Raft in its stead.

Whether you use host ID or IP to define the node you’re replacing in configuration matters little here. Internally Scylla will translate between them.

The issue I linked to happens when the IP address of the node being replaced and the node replacing it is the same. In 5.4 and earlier Scylla releases, and the corresponding Enterprise releases, this operation unfortunately has some quirks and handles failures non-gracefully (the cluster is put into a weird state that may require deep knowledge and manual steps to recover from).

The “Setup RAID Following a Restart” guide is precisely a guide to perform a replace operation of a node with the same IP address.

I think we should update the documentation to recommend against it, or include a step to change the IP address of the instance which lost its disk (if it is possible; if not, use a new instance).


CC @Anna with regards to documentation