Originally from the User Slack
@ahmed_grati: Hello,
We’re running ScyllaDB with 1 data center and 3 nodes across 3 AZs, with a replication factor of 2. These nodes are spot instances. The Scylla version is 5.2.9.
We started experiencing many timeout errors when one of the nodes gets de-scheduled because the underlying Kubernetes node is de-scheduled as well. It takes around 10 minutes to re-schedule another node (both the Kubernetes and the Scylla node). After the Scylla node is re-scheduled, we start seeing this kind of timeout error:
INFO 2024-04-04 09:31:27,374 [shard 1] rpc - client ip:port msg_id 409399: exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
WARN 2024-04-04 09:31:28,202 [shard 1] storage_proxy - Failed to apply mutation from 172.20.79.206#1: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
WARN 2024-04-04 09:31:29,147 [shard 1] storage_proxy - Failed to apply mutation from 172.20.79.206#1: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
INFO 2024-04-04 09:31:29,279 [shard 1] rpc - client ip:port msg_id 409696: exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
INFO 2024-04-04 09:31:29,284 [shard 1] rpc - client ip:port msg_id 409695: exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
INFO 2024-04-04 09:31:29,287 [shard 1] rpc - client ip:port msg_id 409694: exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
INFO 2024-04-04 09:31:29,312 [shard 1] rpc - client ip:port msg_id 409697: exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
INFO 2024-04-04 09:31:29,315 [shard 1] rpc - client ip:port msg_id 409698: exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
INFO 2024-04-04 09:31:29,317 [shard 1] rpc - client ip:port msg_id 409700: exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
WARN 2024-04-04 09:31:29,319 [shard 1] storage_proxy - Failed to apply mutation from 172.20.79.206#1: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
INFO 2024-04-04 09:31:29,319 [shard 1] rpc - client ip:port msg_id 409699: exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
WARN 2024-04-04 09:31:29,323 [shard 1] storage_proxy - Failed to apply mutation from 172.20.79.206#1: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
WARN 2024-04-04 09:31:29,326 [shard 1] storage_proxy - Failed to apply mutation from 172.20.79.206#1: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
WARN 2024-04-04 09:31:29,328 [shard 1] storage_proxy - Failed to apply mutation from 172.20.79.206#1: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
WARN 2024-04-04 09:31:29,329 [shard 1] storage_proxy - Failed to apply mutation from 172.20.79.206#1: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
WARN 2024-04-04 09:31:29,336 [shard 1] storage_proxy - Failed to apply mutation from 172.20.79.206#1: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
WARN 2024-04-04 09:31:29,336 [shard 1] storage_proxy - Failed to apply mutation from 172.20.79.206#1: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
WARN 2024-04-04 09:31:29,344 [shard 1] storage_proxy - Failed to apply mutation from 172.20.79.206#1: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
WARN 2024-04-04 09:31:29,350 [shard 1] storage_proxy - Failed to apply mutation from 172.20.79.206#1: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
WARN 2024-04-04 09:31:29,355 [shard 1] storage_proxy - Failed to apply mutation from 172.20.79.206#1: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
WARN 2024-04-04 09:31:29,356 [shard 1] storage_proxy - Failed to apply mutation from 172.20.79.206#1: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
WARN 2024-04-04 09:31:29,377 [shard 1] storage_proxy - Failed to apply mutation from 172.20.79.206#1: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
INFO 2024-04-04 09:31:33,379 [shard 1] reader_concurrency_semaphore - (rate limiting dropped 13 similar messages) Semaphore _read_concurrency_sem with 2/100 count and 236107/198096977 memory resources: timed out, dumping permit diagnostics:
permits  count  memory  table/description/state
1        1      214K    ourTableNameHere/data-query/active/blocked
1        1      17K     system.paxos/data-query/active/used
47       0      0B      system.paxos/data-query/waiting
49       2      231K    total
Total: 49 permits with 2 count and 231K memory resources
It should be noted that this happens only when we use LWT.
My assumption is that, since the replication factor is 2, every request has to be sent to both of the other nodes in the data center (since we have only 3 nodes). While a node is de-scheduled, those transactions would be stored as hinted handoffs, and once the node is re-scheduled it has to handle both live traffic and the hinted writes, which can grow its queue and cause timeouts for older requests.
Again, this is just an assumption; I’m reaching out to you for an explanation and a solution, since this is a blocker for running ScyllaDB in our production.
Thanks,
@avi: RF=3 is needed for LWT (and spot instances are very dangerous)
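For context: LWT statements go through Paxos, which needs answers from a majority (quorum) of replicas on every round, where quorum = floor(RF/2) + 1. With RF=2 the quorum is 2, i.e. both replicas, so a single de-scheduled node stalls every LWT; with RF=3 the quorum is still 2, so one node can be down. A minimal sketch of that arithmetic (the keyspace and data center names in the comment are hypothetical):

    # Paxos/LWT needs a majority of replicas to answer: quorum = floor(RF/2) + 1.
    def quorum(rf: int) -> int:
        return rf // 2 + 1

    for rf in (2, 3):
        q = quorum(rf)
        tolerated = rf - q  # replicas that may be down while LWT still succeeds
        print(f"RF={rf}: quorum={q}, replicas that may be down={tolerated}")

    # RF=2: quorum=2, replicas that may be down=0 -> one lost node means LWT timeouts
    # RF=3: quorum=2, replicas that may be down=1 -> one lost node is tolerated
    #
    # Moving an existing keyspace to RF=3 (hypothetical names), followed by a repair:
    #   ALTER KEYSPACE my_keyspace WITH replication =
    #       {'class': 'NetworkTopologyStrategy', 'datacenter1': 3};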
@ahmed_grati: @avi Can you elaborate on why spot instances are very dangerous for Scylla? (I saw some ScyllaDB blog posts about people deploying it on spot instances.)
@avi: They’re dangerous because AWS will take them away
@ahmed_grati: Yes, I know, but how does that affect Scylla?
And is that related to the issue we faced?
@avi: Ah it’s only dangerous if you use instances with local storage
@ahmed_grati: Nope, we’re using EBS volumes
@avi: EBS is okay (but can lose availability)
@ahmed_grati: Are the timeouts related to spot instances?
Or do you have another explanation?
@avi: If you lose quorum you’ll get timeouts
@ahmed_grati: Thanks, @avi, for being responsive. Just one last question: is there any way to run Scylla on spot instances without getting timeouts? From what I understood, running on spot decreases availability, which in turn causes timeouts. Please correct me if I’m wrong.
@avi: Loss of availability = timeouts
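On the client side, a lost quorum surfaces as read/write timeout exceptions on the LWT statements themselves. A minimal sketch with the Python driver, assuming a hypothetical table with id and value columns (the contact point, keyspace, table, and column names are all placeholders):

    from cassandra import ConsistencyLevel, WriteTimeout
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["scylla-host"])          # placeholder contact point
    session = cluster.connect("my_keyspace")    # placeholder keyspace

    # LWT: the IF NOT EXISTS condition is decided by Paxos at SERIAL consistency,
    # so it needs a live quorum of replicas for the partition.
    stmt = SimpleStatement(
        "INSERT INTO my_table (id, value) VALUES (%s, %s) IF NOT EXISTS",
        consistency_level=ConsistencyLevel.QUORUM,
        serial_consistency_level=ConsistencyLevel.SERIAL,
    )

    try:
        session.execute(stmt, (42, "hello"))
    except WriteTimeout as exc:
        # With RF=2 and one replica down, this is the error path you will hit.
        print("LWT timed out waiting for a quorum:", exc)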
@ahmed_grati: @avi, more on this: I’m still seeing the timeout errors even though no node has been de-scheduled and the load is at 20%. Any clue?
@avi: Check the advanced dashboard in metrics to see if I/O or CPU is overloaded
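If the monitoring stack isn’t handy, each Scylla node also exposes raw Prometheus metrics (by default on port 9180 at /metrics). A rough sketch of checking per-shard reactor (CPU) utilization directly, with the caveat that the hostname is a placeholder and exact metric names can vary between versions:

    import urllib.request

    # Default Scylla Prometheus endpoint; "scylla-host" is a placeholder,
    # and the port may differ if prometheus_port was changed.
    METRICS_URL = "http://scylla-host:9180/metrics"

    with urllib.request.urlopen(METRICS_URL) as resp:
        body = resp.read().decode()

    # Print per-shard reactor utilization lines; values near 100 mean the shard
    # is saturated, which can explain timeouts even at a low reported "load".
    for line in body.splitlines():
        if line.startswith("scylla_reactor_utilization"):
            print(line)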