I’m running into an issue with my Rust-based system using ScyllaDB and would appreciate some guidance from more experienced folks in this area.
System Overview:
I’ve built a ScyllaClient abstraction in Rust (sketched below) that:
- Initializes a single Session via OnceCell<Arc<ScyllaClient>>
- Prepares all queries up front during initialization
- Exposes methods like delete_from_simulation_result_le() that internally call session.execute_*(...)
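In rough outline, the abstraction looks like this (field names, the CQL string, and the table schema are simplified stand-ins for the real code; module paths vary between driver versions):

```rust
use std::sync::Arc;

use scylla::prepared_statement::PreparedStatement;
use scylla::transport::errors::QueryError;
use scylla::{Session, SessionBuilder};
use tokio::sync::OnceCell;

static CLIENT: OnceCell<Arc<ScyllaClient>> = OnceCell::const_new();

pub struct ScyllaClient {
    session: Session,
    // One field per statement; all prepared once at startup.
    delete_simulation_result_le: PreparedStatement,
}

impl ScyllaClient {
    /// Returns the process-wide client, initializing it on first use.
    pub async fn get() -> &'static Arc<ScyllaClient> {
        CLIENT
            .get_or_init(|| async {
                let session = SessionBuilder::new()
                    .known_node("127.0.0.1:9042")
                    .build()
                    .await
                    .expect("session creation failed");
                let delete_simulation_result_le = session
                    .prepare("DELETE FROM ks.simulation_result WHERE run_id = ? AND ts <= ?")
                    .await
                    .expect("statement preparation failed");
                Arc::new(ScyllaClient { session, delete_simulation_result_le })
            })
            .await
    }

    pub async fn delete_from_simulation_result_le(
        &self,
        run_id: i64,
        ts: i64,
    ) -> Result<(), QueryError> {
        self.session
            .execute_unpaged(&self.delete_simulation_result_le, (run_id, ts))
            .await?;
        Ok(())
    }
}
```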
The system includes two main binaries:
- data_manager
  - Spawns 2 threads, each polling a Redis queue in an infinite loop
  - When work arrives, it fetches the Scylla client (ScyllaClient::get()) and executes 3–7 queries
- simulator
  - Performs CPU-heavy simulations
  - Each simulation executes 8–10 Scylla queries
  - Previously ran ~25 million simulations per hour (before switching from file-based storage to Scylla)
Both programs use the same shared Scylla client implementation.
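Roughly, each data_manager worker looks like this (the Job type and pop_job are hypothetical stand-ins for the real Redis integration):

```rust
use std::time::Duration;

// Hypothetical job payload; the real one is deserialized from Redis.
struct Job {
    run_id: i64,
    cutoff_ts: i64,
}

// Hypothetical stand-in for the real blocking pop against the Redis queue.
async fn pop_job(_queue: &str) -> Option<Job> {
    unimplemented!()
}

// Spawned twice (one per polling thread); loops forever.
async fn worker(queue: &str) {
    loop {
        let Some(job) = pop_job(queue).await else {
            tokio::time::sleep(Duration::from_millis(50)).await;
            continue;
        };
        let client = ScyllaClient::get().await;
        // 3-7 queries per job; the observed hang happens inside one of these awaits.
        if let Err(e) = client
            .delete_from_simulation_result_le(job.run_id, job.cutoff_ts)
            .await
        {
            eprintln!("query failed: {e}");
        }
    }
}
```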
Problem:
The data_manager process occasionally hangs during a query execution. There’s no panic or error; the process just stalls silently. Logging shows that the last event before the hang is a Scylla query attempt.
I suspect this might be due to resource constraints on my local development setup (Dockerized Scylla cluster with 2 nodes, each using --smp 1, --memory 750M, and --developer-mode 1).
However, I’m surprised that a query could hang indefinitely with no error. I expected one of:
- An internal timeout
- A failed future
- A panic or log message from Scylla
Questions:
1. Is it expected behavior that Scylla queries can hang indefinitely, without error propagation?
2. Should I add a timeout/backoff/retry mechanism around all query executions (including execute_unpaged, execute_iter, batch, etc.)?
3. Is it better to reuse one session per binary, or is session-per-task sometimes more resilient under load?
4. Is there any specific logging or tracing I can enable on Scylla or the driver to diagnose query hangs?
I realize this might be a side-effect of my limited dev environment, but I’d like to understand best practices around timeouts and retry strategies when using Scylla in high-throughput systems.
As for question 1: definitely not. There are two kinds of timeouts involved in statement execution: a server-side timeout and a client-side timeout. Even if the driver has no client-side timeout configured (the default is 30 seconds), the server-side timeout is always set (5 seconds by default) and should make a request that runs for too long fail with a {Read,Write}Timeout error. It is therefore unexpected that your application hangs indefinitely.
As for question 2: both retry and timeout mechanisms are already there (backoff is not yet implemented, but is planned for the future):
- For retries, see RetryPolicy. DefaultRetryPolicy already does some retry logic; that logic, however, is only triggered on:
  - an error returned from the DB (which we know doesn’t happen in your case),
  - a broken connection (not your case either).
- For timeouts, see ExecutionProfileBuilder::request_timeout(). By configuring an execution profile, you can set many things, including timeouts, policies, etc.
- In your case, you could also benefit from enabling speculative execution (SpeculativeExecutionPolicy), which is off by default. Keep in mind that you need to mark the statement as idempotent to allow speculative execution. A combined sketch follows this list.
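Here is a sketch of wiring these pieces together. Module paths and whether policies are passed as Arc or Box differ between driver versions, and the contact point and CQL string are placeholders, so adjust to your setup:

```rust
use std::sync::Arc;
use std::time::Duration;

use scylla::transport::retry_policy::DefaultRetryPolicy;
use scylla::transport::speculative_execution::SimpleSpeculativeExecutionPolicy;
use scylla::{ExecutionProfile, SessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let profile = ExecutionProfile::builder()
        // Client-side timeout: the future fails instead of waiting forever.
        .request_timeout(Some(Duration::from_secs(5)))
        // Retries on DB errors and broken connections.
        .retry_policy(Arc::new(DefaultRetryPolicy::new()))
        // Fire a second attempt if the first one is slow to respond.
        .speculative_execution_policy(Some(Arc::new(SimpleSpeculativeExecutionPolicy {
            max_retry_count: 2,
            retry_interval: Duration::from_millis(200),
        })))
        .build();

    let session = SessionBuilder::new()
        .known_node("127.0.0.1:9042")
        .default_execution_profile_handle(profile.into_handle())
        .build()
        .await?;

    // Speculative execution only kicks in for statements marked idempotent.
    let mut prepared = session
        .prepare("DELETE FROM ks.simulation_result WHERE run_id = ? AND ts <= ?")
        .await?;
    prepared.set_is_idempotent(true);

    Ok(())
}
```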
As for question 3: the answer is simple. Every Session carries significant overhead:
- it owns the connection pools,
- it keeps a lot of cluster metadata up to date.

For this reason, it’s recommended to use a single Session. For customising various workloads, use multiple ExecutionProfiles in one Session.
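For instance (another sketch, with the same version caveats; the statements and timeout values are illustrative), a latency-sensitive delete and a long analytical scan can share one Session while using different profiles:

```rust
use std::time::Duration;

use scylla::query::Query;
use scylla::{ExecutionProfile, Session};

async fn run(session: &Session) -> Result<(), Box<dyn std::error::Error>> {
    // Two profiles, one Session: a tight timeout for OLTP-style statements,
    // a generous one for heavy scans.
    let fast = ExecutionProfile::builder()
        .request_timeout(Some(Duration::from_secs(2)))
        .build()
        .into_handle();
    let slow = ExecutionProfile::builder()
        .request_timeout(Some(Duration::from_secs(60)))
        .build()
        .into_handle();

    let mut delete = Query::new("DELETE FROM ks.simulation_result WHERE run_id = ?");
    delete.set_execution_profile_handle(Some(fast));

    let mut scan = Query::new("SELECT run_id, ts FROM ks.simulation_result");
    scan.set_execution_profile_handle(Some(slow));

    session.query_unpaged(delete, (42_i64,)).await?;
    session.query_iter(scan, ()).await?;
    Ok(())
}
```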