Query Hangs – Advice on Timeout/Retry Strategy

Hi everyone,

I’m running into an issue with my Rust-based system using ScyllaDB and would appreciate some guidance from more experienced folks in this area.

System Overview:

I’ve built a ScyllaClient abstraction in Rust that:

  • Initializes a single Session via OnceCell<Arc<ScyllaClient>>
  • Prepares all queries up front during initialization
  • Exposes methods like delete_from_simulation_result_le() that internally use session.execute_*(...)
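
Roughly, the shape of that abstraction is as follows. This is a minimal sketch: the node address, keyspace/table, and parameter types are illustrative, and module paths match a 0.x release of the Rust driver.

```rust
use std::sync::Arc;

use scylla::prepared_statement::PreparedStatement;
use scylla::transport::errors::QueryError;
use scylla::{Session, SessionBuilder};
use tokio::sync::OnceCell;

static CLIENT: OnceCell<Arc<ScyllaClient>> = OnceCell::const_new();

pub struct ScyllaClient {
    session: Session,
    delete_simulation_result_le: PreparedStatement,
}

impl ScyllaClient {
    /// Builds the Session once and prepares all statements up front;
    /// every later caller gets the same Arc back.
    pub async fn get() -> &'static Arc<ScyllaClient> {
        CLIENT
            .get_or_init(|| async {
                let session = SessionBuilder::new()
                    .known_node("127.0.0.1:9042")
                    .build()
                    .await
                    .expect("failed to create session");
                let delete_simulation_result_le = session
                    .prepare("DELETE FROM ks.simulation_result WHERE id = ? AND ts <= ?")
                    .await
                    .expect("failed to prepare statement");
                Arc::new(ScyllaClient {
                    session,
                    delete_simulation_result_le,
                })
            })
            .await
    }

    pub async fn delete_from_simulation_result_le(&self, id: i64, ts: i64) -> Result<(), QueryError> {
        // This is the kind of call that appears to hang: no error, no timeout,
        // the awaited future simply never resolves.
        self.session
            .execute_unpaged(&self.delete_simulation_result_le, (id, ts))
            .await?;
        Ok(())
    }
}
```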

The system includes two main binaries:

  1. data_manager
  • Spawns 2 threads, each polling a Redis queue in an infinite loop
  • When work arrives, it fetches the Scylla client (ScyllaClient::get()) and executes 3–7 queries
  2. simulator
  • Performs CPU-heavy simulations
  • Each simulation executes 8–10 Scylla queries
  • Previously ran ~25 million simulations per hour (before switching to Scylla from file-based storage)

Both programs use the same shared Scylla client implementation.

Problem:

The data_manager process occasionally hangs during a query execution. There’s no panic or error — it just stalls silently. Logging shows the last event before hanging is a Scylla query attempt.

I suspect this might be due to resource constraints on my local development setup (Dockerized Scylla cluster with 2 nodes, each using --smp 1, --memory 750M, and --developer-mode 1).

However, I’m surprised that a query could hang indefinitely with no error. I expected at least one of the following:

  • An internal timeout
  • A failed future
  • A panic or log message from Scylla

Questions:

  1. Is it expected behavior that Scylla queries can hang indefinitely, without error propagation?
  2. Should I add a timeout/backoff/retry mechanism around all query executions (including execute_unpaged, execute_iter, batch, etc.)?
  3. Is it better to reuse one session per binary, or is session-per-task sometimes more resilient under load?
  4. Is there any specific logging or tracing I can enable on Scylla or the driver to diagnose query hangs?

I realize this might be a side-effect of my limited dev environment, but I’d like to understand best practices around timeouts and retry strategies when using Scylla in high-throughput systems.

It absolutely does not get anywhere near the ~25 million simulations; it already hangs well below ~1,000 simulations.

Definitely not. There are two kinds of timeouts involved in statement execution: a server-side timeout and a client-side timeout. Even if the driver has no client-side timeout set (the default is 30 seconds), the server-side timeout is always set (5 seconds by default) and should make a request that runs for too long return a {Read,Write}Timeout error. It is therefore unexpected that your application hangs indefinitely.

Both retry and timeout mechanisms are already there (backoff is not yet there, but planned for the future):

  • for retries, see RetryPolicy. DefaultRetryPolicy already does some retry logic; that logic, however, is only triggered by:
    • an error returned from the DB (which we know doesn’t happen in your case), or
    • a broken connection (not your case either).
  • for timeouts, see ExecutionProfileBuilder::request_timeout(). By configuring an execution profile, you can set many things, including timeouts and policies.
  • in your case, you could also benefit from enabling speculative execution (SpeculativeExecutionPolicy), which is off by default. Keep in mind that you need to mark the statement as idempotent to allow speculative execution; see the sketch after this list.
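
For illustration, here is a minimal sketch of wiring these together: a client-side request timeout, speculative execution, and an idempotent statement. The node address and the statement are placeholders, and the module paths match a 0.x release of the driver.

```rust
use std::{sync::Arc, time::Duration};

use scylla::prepared_statement::PreparedStatement;
use scylla::transport::speculative_execution::SimpleSpeculativeExecutionPolicy;
use scylla::{ExecutionProfile, Session, SessionBuilder};

async fn build_session() -> Result<(Session, PreparedStatement), Box<dyn std::error::Error>> {
    // Client-side timeout of 5 s, plus up to 2 speculative attempts fired 100 ms apart.
    let profile = ExecutionProfile::builder()
        .request_timeout(Some(Duration::from_secs(5)))
        .speculative_execution_policy(Some(Arc::new(SimpleSpeculativeExecutionPolicy {
            max_retry_count: 2,
            retry_interval: Duration::from_millis(100),
        })))
        .build();

    let session: Session = SessionBuilder::new()
        .known_node("127.0.0.1:9042")
        .default_execution_profile_handle(profile.into_handle())
        .build()
        .await?;

    // Speculative execution only kicks in for statements marked idempotent.
    let mut delete_stmt = session
        .prepare("DELETE FROM ks.simulation_result WHERE id = ? AND ts <= ?")
        .await?;
    delete_stmt.set_is_idempotent(true);

    Ok((session, delete_stmt))
}
```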

The answer is simple. Every Session has a significant overhead:

  • it owns the connection pools,
  • it keeps a lot of cluster metadata up-to-date.

For this reason, it’s recommended to use a single Session. To customise different workloads, use multiple ExecutionProfiles within that one Session, as in the sketch below.
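
As an illustration of that last point, two profiles with different request timeouts can be attached to different statements served by the same Session. Statements and timeout values here are placeholders:

```rust
use std::time::Duration;

use scylla::prepared_statement::PreparedStatement;
use scylla::transport::errors::QueryError;
use scylla::{ExecutionProfile, Session};

// Hypothetical workload split: a tight timeout for data_manager's small writes,
// a more generous one for the simulator's heavier reads, both on one shared Session.
async fn prepare_statements(
    session: &Session,
) -> Result<(PreparedStatement, PreparedStatement), QueryError> {
    let fast = ExecutionProfile::builder()
        .request_timeout(Some(Duration::from_secs(2)))
        .build()
        .into_handle();
    let slow = ExecutionProfile::builder()
        .request_timeout(Some(Duration::from_secs(30)))
        .build()
        .into_handle();

    let mut delete_stmt = session
        .prepare("DELETE FROM ks.simulation_result WHERE id = ? AND ts <= ?")
        .await?;
    delete_stmt.set_execution_profile_handle(Some(fast));

    let mut read_stmt = session
        .prepare("SELECT payload FROM ks.simulation_result WHERE id = ?")
        .await?;
    read_stmt.set_execution_profile_handle(Some(slow));

    Ok((delete_stmt, read_stmt))
}
```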

  • For the driver, see the logging example in the repo.
  • ScyllaDB logs can help diagnose whether a node crashed. To see them, use docker logs <container_id>.
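
For the driver side, a minimal way to surface its tracing output, assuming the tracing_subscriber crate (with its env-filter feature) as in the repo’s logging example:

```rust
use tracing_subscriber::EnvFilter;

// The driver logs through the `tracing` crate; install a subscriber early in main()
// and run with e.g. RUST_LOG=scylla=trace to see per-request details.
fn init_logging() {
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::from_default_env())
        .init();
}
```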

Please let me know if the above helps, and don’t hesitate to share your findings on what caused the hanging.
