Originally from the User Slack
@Joakim_Lindqvist: Hey
I am attempting to test Scylla 6.0.1 on a new cluster and running into an issue where my nodes seem stuck starting up in the "starting the view builder" phase.
The node that is starting up has IP 192.168.2.239 and is set to be the seed node.
This is on AWS using the Scylla AMIs in us-east-1, on i4i.large instances. This is meant as a small dev deployment, hence the small scale.
Scylla Monitoring is reporting the status as Starting. I have given this node about 2 hours and it still hasn’t moved.
scylla.yaml
# Generated by Scylla Machine Image at 2024-06-25 09:30:46.129293
# See '/etc/scylla/scylla.yaml.example' with the full list of supported configuration
# options and their descriptions.
api_address: 127.0.0.1
api_doc_dir: /opt/scylladb/api/api-doc/
api_port: 10000
api_ui_dir: /opt/scylladb/swagger-ui/dist/
auto_bootstrap: true
batch_size_fail_threshold_in_kb: 1024
batch_size_warn_threshold_in_kb: 128
broadcast_rpc_address: 192.168.2.239
cas_contention_timeout_in_ms: 1000
cluster_name: jupiter-dev
commitlog_segment_size_in_mb: 32
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_total_space_in_mb: -1
enable_tablets: true
endpoint_snitch: GossipingPropertyFileSnitch
listen_address: 192.168.2.239
maintenance_socket: ignore
murmur3_partitioner_ignore_msb_bits: 12
native_shard_aware_transport_port: 19042
native_transport_port: 9042
num_tokens: 256
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
read_request_timeout_in_ms: 5000
rpc_address: 0.0.0.0
rpc_port: 9160
schema_commitlog_segment_size_in_mb: 128
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.2.239"
strict_is_not_null_in_views: true
write_request_timeout_in_ms: 2000
cassandra-rackdc.properties
prefer_local=true
dc=us-east-1
rack=us-east-1b
Journal log
@dor: I am not sure but maybe it’s related to Raft not finding a majority, as it’s a single node cluster. It should work but maybe it’s an issue. If you add 2 more nodes, you’ll see whether I was right or not.
@Joakim_Lindqvist: I attempted to add one more node and that didn’t seem to help. I can add another one to see if it fixes itself.
The 3rd node is up, but it doesn’t seem to be making progress.
@dor: hmm, sorry, I didn’t see anything special in the logs. I assume the config is completely fresh (minus edits)
@Joakim_Lindqvist: Yeah, the config is all generated by the machine image except for my edits to the seed IP.
@avi: @Kamil_Braun please help
@Kamil_Braun: > I am not sure but maybe it’s related to Raft not finding a majority, as it’s a single node cluster.
“majority” is always calculated from the total number of nodes. If your total is 1 then your majority is 1. If the total is 3 then the majority is 2, etc.
So single node clusters work as usual.
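(For reference, this is just the standard Raft quorum arithmetic, not anything Scylla-specific: the majority is floor(N/2) + 1 of the current membership, so 1 node needs 1, 2 need 2, 3 need 2, 4 need 3, and 5 need 3. A single-node group can therefore always reach quorum by itself.)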
I suspect that you first tried to boot the node with a different seed by mistake, then stopped it and changed the seed. But the old seed was persisted by the cluster discovery algorithm, which now requires contacting that seed in order to proceed.
OR you concurrently tried to start a second node, which contacted this node and inserted itself as one of the participants of the cluster discovery algorithm, and then shut down. But it’s also one of the required contact points now.
In any case, the node looks to be stuck inside the cluster discovery algorithm.
Stop it, delete everything (the workdir, i.e. data, commitlog, etc.), make sure you have the correct seed setup (see the documentation), and try again.
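(For anyone following along: on a default install the workdir Kamil mentions is /var/lib/scylla, and the pieces to delete correspond to these scylla.yaml options and their stock values; this is the usual layout, so adjust if your configuration overrides any of these paths:)
workdir: /var/lib/scylla
data_file_directories:
    - /var/lib/scylla/data
commitlog_directory: /var/lib/scylla/commitlog
hints_directory: /var/lib/scylla/hints
view_hints_directory: /var/lib/scylla/view_hints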
@Joakim_Lindqvist: I did initially start it with 127.0.0.1 as the seed IP, yes.
I did start the second node after a while (as I suspected a similar issue to what Dor described, where we might need more nodes for it to start), but the first node should have had time to start up before that.
I will tear it all down and try again and see if I can repro it. Thanks for the suggestion!
I have tried tearing this down and setting up a new cluster, this time with a single node. It is set to have 127.0.0.1 as its seed (that seems to be what happens by default when I do not specify any seed IPs).
There are no other nodes attempting to connect.
I am still seeing the same issue with it stopping at the same point.
@Kamil_Braun: You cannot use 127.0.0.1 if your listen_address / broadcast_address etc. is different; you must use the same IP everywhere.
https://opensource.docs.scylladb.com/stable/operating-scylla/procedures/cluster-management/create-cluster.html
So if you used listen_address: 192.168.2.239, then you must use
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.2.239"
But now you have to tear down again. If you use seeds: "127.0.0.1" and then later change to seeds: "192.168.2.239", it won’t work, because the cluster discovery algorithm already persisted 127.0.0.1 and will continue trying to contact it (and never finish).
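(To spell that out against the file posted earlier, the address-related settings should all agree on the node’s own IP. A minimal sketch, assuming 192.168.2.239 remains the node’s private IP:)
listen_address: 192.168.2.239
broadcast_rpc_address: 192.168.2.239
rpc_address: 0.0.0.0
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.2.239"
(rpc_address: 0.0.0.0 with a routable broadcast_rpc_address matches the generated file above; the part that must not be the loopback is the seed list the node is first booted with.)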
@Joakim_Lindqvist: I didn’t explicitly set 127.0.0.1 as the seed or listen address; I set the seed to an empty string, which seems to then default to that IP.
The machine-image-generated YAML sets the listen address to the IP of the machine.
But I guess this may have been a result of my changes to how our bootstrapping works, where I do not allocate an EC2 IP ahead of time for our machines and instead try to update it after the machine has started and the IP has been allocated. But I am fairly sure we have been able to set up new clusters like this on 5.x (we never tried this with 4.x, as we had predefined IPs for those).
But this gives me some data to go on. I will try to see if I can rework our bootstrapping to work better without having to rely on predefined IPs (as that results in special-case handling of this initial seed node, which doesn’t really make sense given that seeds are not really a thing anymore once the cluster starts up).
I resolved this now. Not specifying a seed IP for the first node, e.g. with machine image user data like this:
{
    "scylla_yaml": {
        "cluster_name": "jupiter-dev",
        "endpoint_snitch": "GossipingPropertyFileSnitch"
    },
    "post_configuration_script": "<omitted>"
}
resolves this issue, as the machine image defaults to the private IP of the EC2 instance.
So this was really a problem in our EC2 setup. I am unclear on why this was not an issue for the new clusters we have set up with 5.x, but fundamentally we were not following the setup recommendations, and doing so resolves the issue.
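(A closing note on the extra nodes tried earlier in the thread: the second and third nodes still need the first node’s IP as their seed. Assuming the scylla_yaml block of the user data is merged as-is into the generated scylla.yaml, which is how the snippet above appears to work, that could look roughly like the sketch below; the exact nesting of seed_provider in user data is my assumption, not something verified in this thread:)
{
    "scylla_yaml": {
        "cluster_name": "jupiter-dev",
        "endpoint_snitch": "GossipingPropertyFileSnitch",
        "seed_provider": [
            {
                "class_name": "org.apache.cassandra.locator.SimpleSeedProvider",
                "parameters": [ { "seeds": "192.168.2.239" } ]
            }
        ]
    }
}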