Ok, so it’s been a week. I’m going to give the full timeline of events, in case something I don’t think is relevant actually is.
- A node in our cluster (let’s call it Node A) filled its disk to 100% (we missed the alarms)
- Stopped the writing applications and restarted the node; it compacted a bit of space and was able to stay up and stable, so we bootstrapped a new node into that rack (AZ) (call it Node B)
- Node B came up, and we tried to run cleanups on Node A; eventually we got what seemed like a good chunk of free space, so we started the applications again
- Node A filled up again while the cleanups were running, crashed/restarted a few times, and the EC2 instance became unresponsive, so we soft rebooted it
- AWS ate the RAID/NVMes on Node A, so we backed out entirely and did a stop/start, aiming to rebuild it from scratch in place with replace_node_first_boot (config sketch after this list)
- That failed the next day because Node B had a random restart (which it recovered from)
- Restarted the replace from scratch on Node A (a restart of the service)
- A repeat of the last two steps the next day
- We tried to continue the rebuild, having to change the node ID in the config since it had got past that point in the rebuild; it seemed to find the data on disk and repair from there fine
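For reference, the replace itself was driven by the standard scylla.yaml option, set on the replacement Node A before its first start; the host ID below is a placeholder, not the actual value we used:

# /etc/scylla/scylla.yaml on the replacement node, set before first start
replace_node_first_boot: <host-id-of-the-node-being-replaced>

We only removed it again later, after the bootstrap had finished (more on that below).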
This morning:
- Node A completed its bootstrap
- Shortly after, it crashed and didn’t come back up because replace_node_first_boot was still set
- Removed that and restarted the service; it never got to “Serving”, stalling on “starting system distributed keyspace” in the systemd status
- It eventually just stopped writing to the logs after a ton of compactions and repairs (how I was watching this is sketched just below)
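For completeness, I was watching all of this with nothing fancier than the usual systemd commands (assuming the default scylla-server unit name):

systemctl status scylla-server     # where the "starting system distributed keyspace" status line shows up
journalctl -u scylla-server -f     # tailing the compaction/repair output until it went quiet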
That brought me to this afternoon, reading through this documentation on failed membership changes.
I ran both of the queries on a good node to get the cluster state; both returned 10 rows (which matches our current node count). I saw the state below:
cqlsh> select * from system.raft_state;
group_id | disposition | server_id | can_vote
--------------------------------------+-------------+--------------------------------------+----------
d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 | CURRENT | 361040b2-f2ce-46ee-ae8f-5ccb7a516dae | True
d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 | CURRENT | 49546c8d-6dcf-4980-9c86-5ef1e8c53fa7 | True
d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 | CURRENT | 49deb87f-adbf-4524-9728-9d49aa58e36b | False
d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 | CURRENT | 5b6298da-0961-4b71-8ca2-99d753722fa5 | True
d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 | CURRENT | 7c458569-23c4-4caa-a2a8-f9faf069f3ec | True
d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 | CURRENT | 92d1ab67-0faa-4120-b26a-106af738b355 | True
d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 | CURRENT | 9ca8bf78-0167-4f58-91a7-9d2f1e5cf19f | True
d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 | CURRENT | ad69d877-3262-41a5-88ef-ca973bddefce | True
d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 | CURRENT | b7316efa-e8e8-4031-abfa-05943c753bda | True
d6f14d80-5cbc-11ef-9c2c-d7f4c1c778e5 | CURRENT | bf03fa3a-42e4-416a-b84c-b0704a801c12 | True
(10 rows)
cqlsh> select * from system.cluster_status;
peer | dc | host_id | load | owns | status | tokens | up
----------+---------+--------------------------------------+---------------+----------+--------+--------+------
10.*.*.* | us-east | b7316efa-e8e8-4031-abfa-05943c753bda | 5947483845182 | 0.093779 | NORMAL | 256 | True
10.*.*.* | us-east | 7c458569-23c4-4caa-a2a8-f9faf069f3ec | 6040102510775 | 0.105843 | NORMAL | 256 | True
10.*.*.* | us-east | ad69d877-3262-41a5-88ef-ca973bddefce | 5278908551309 | 0.098916 | NORMAL | 256 | True
NODE B | us-east | 9ca8bf78-0167-4f58-91a7-9d2f1e5cf19f | 4520013810048 | 0.093432 | NORMAL | 256 | True
NODE A | us-east | 36593f45-cba6-4368-a287-b5e692b8bb7e | | 0.112803 | NORMAL | 256 | True
10.*.*.* | us-east | 49546c8d-6dcf-4980-9c86-5ef1e8c53fa7 | 6003262766104 | 0.103674 | NORMAL | 256 | True
10.*.*.* | us-east | 92d1ab67-0faa-4120-b26a-106af738b355 | 5755106310906 | 0.100487 | NORMAL | 256 | True
10.*.*.* | us-east | 361040b2-f2ce-46ee-ae8f-5ccb7a516dae | 5902387877491 | 0.103925 | NORMAL | 256 | True
10.*.*.* | us-east | bf03fa3a-42e4-416a-b84c-b0704a801c12 | 5890694964817 | 0.093611 | NORMAL | 256 | True
10.*.*.* | us-east | 5b6298da-0961-4b71-8ca2-99d753722fa5 | 5617537828836 | 0.09353 | NORMAL | 256 | True
49deb87f-adbf-4524-9728-9d49aa58e36b was only showing up in raft_state, was not in nodetool status or nodetool gossipinfo, and was only referenced in some log warnings about not being able to reach it. It matched the definition of a ghost member, so I ran nodetool removenode on it, which quickly completed and removed it from the table above.
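Concretely, the check and the removal boiled down to something like this; the greps are only illustrative of how I confirmed it was absent, and the UUID is the ghost server_id from raft_state above:

cqlsh -e "SELECT server_id FROM system.raft_state;"       # 49deb87f-... shows up here
nodetool status | grep 49deb87f                           # no match
nodetool gossipinfo | grep 49deb87f                       # no match
nodetool removenode 49deb87f-adbf-4524-9728-9d49aa58e36b  # completed quickly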
I gave Scylla on Node A a restart a while afterwards, since nothing on it had changed and no logs were coming out of it. It’s back to where it was before: logging a ton about compactions, but still stuck at “starting system distributed keyspace”.
So, the current state:
- 10 nodes showing as UN in nodetool status
- 10 nodes in system.cluster_status as up = True, with just Node A missing the load column
- 9 nodes in system.raft_state, missing Node A
Is there some way to get Node A back into Raft?