Originally from the User Slack
@Marko_Ćorić: Ok, quick question. Any chance we can increase the timeout on a DELETE request?
DELETE FROM game_rounds WHERE player_id = "XYZ" AND transaction_time < '2022-01-01 00:00:00.00000';
WITH TIMEOUT 500s
is not available here. We are using the Rust driver scylla = "0.12.0".
This is the error message:
DbError(WriteTimeout { consistency: Quorum, received: 1, required: 2, write_type: UnloggedBatch }, "Operation timed out for abc.game_rounds - received only 1 responses from 2 CL=QUORUM.")
@Piotr_Smaroń: have you tried setting requests timeout in the RUST driver?
From cqlsh you would do something like:
$ cqlsh --request-timeout=3600
cc @Piotr_Grabowski
@Marko_Ćorić:
let execution_profile = ExecutionProfile::builder()
.consistency(Consistency::Quorum)
.request_timeout(Some(Duration::from_secs(500)))
.build();
tried to add it to an execution profile, but no success. We are using a caching session, maybe that’s the problem?
it just gives up after 2 sec:
2024-04-25T11:22:07.189790Z INFO reporting_rs::scylla_db::game_rounds
2024-04-25T11:22:09.248792Z ERROR reporting_rs
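(For context, a minimal sketch of wiring such a profile into a CachingSession with scylla 0.12; the node address and cache capacity are placeholders. As explained further down the thread, this only raises the client-side timeout:)

use std::time::Duration;
use scylla::{CachingSession, SessionBuilder};
use scylla::statement::Consistency;
use scylla::transport::ExecutionProfile;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let profile = ExecutionProfile::builder()
        .consistency(Consistency::Quorum)
        .request_timeout(Some(Duration::from_secs(500)))
        .build();

    // The profile only takes effect if it is attached to the Session
    // that backs the CachingSession; building it alone does nothing.
    let session = SessionBuilder::new()
        .known_node("127.0.0.1:9042") // placeholder address
        .default_execution_profile_handle(profile.into_handle())
        .build()
        .await?;

    // Queries sent through the CachingSession inherit the session's
    // default execution profile. 256 is a placeholder cache capacity.
    let _caching_session = CachingSession::from(session, 256);
    Ok(())
}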
@Piotr_Smaroń: 2s looks suspicious, can you run this query with tracing enabled?
@Marko_Ćorić: yup, lemme get logs
2024-04-25T11:41:30.949374Z TRACE Request{kind="unprepared" contents="DELETE FROM game_rounds WHERE nick_name = ? AND transaction_time < ?"}:Executing query{node=10.22.65.69:9042}: scylla::transport::session: Execution started
2024-04-25T11:41:30.949506Z TRACE Request{kind="unprepared" contents="DELETE FROM game_rounds WHERE nick_name = ? AND transaction_time < ?"}:Executing query{node=10.22.65.69:9042}: scylla::transport::connection_pool: Selecting random connection
2024-04-25T11:41:30.949651Z TRACE Request{kind="unprepared" contents="DELETE FROM game_rounds WHERE nick_name = ? AND transaction_time < ?"}:Executing query{node=10.22.65.69:9042}: scylla::transport::connection_pool: Available connections="10.22.65.69:19042"
2024-04-25T11:41:30.949779Z TRACE Request{kind="unprepared" contents="DELETE FROM game_rounds WHERE nick_name = ? AND transaction_time < ?"}:Executing query{node=10.22.65.69:9042}: scylla::transport::connection_pool: Found connection for the target shard shard=2
2024-04-25T11:41:30.949944Z TRACE Request{kind="unprepared" contents="DELETE FROM game_rounds WHERE nick_name = ? AND transaction_time < ?" shard=2 shard=2}:Executing query{node=10.22.65.69:9042}: scylla::transport::session: Sending connection=10.22.65.69:19042
2024-04-25T11:41:30.950111Z TRACE scylla::transport::connection: Sending 1 requests; 307 bytes
2024-04-25T11:41:30.952708Z TRACE scylla::transport::connection: Sending 1 requests; 75 bytes
2024-04-25T11:41:32.954707Z TRACE Request{kind="unprepared" contents="DELETE FROM game_rounds WHERE nick_name = ? AND transaction_time < ?" shard=2 shard=2 request_size=41 request_size=41}:Executing query{node=10.22.65.69:9042}: scylla::transport::session: Query failed last_error=Database returned an error: Not enough nodes responded to the write request in time to satisfy required consistency level (consistency: Quorum, received: 0, required: 2, write_type: UnloggedBatch), Error message: Operation timed out for abc.game_rounds - received only 0 responses from 2 CL=QUORUM.
2024-04-25T11:41:32.954855Z TRACE Request{kind="unprepared" contents="DELETE FROM game_rounds WHERE nick_name = ? AND transaction_time < ?" shard=2 shard=2 request_size=41 request_size=41}:Executing query{node=10.22.65.69:9042}: scylla::transport::session: retry_decision="DontRetry"
2024-04-25T11:41:33.013485Z ERROR reporting_rs: GameRoundModel::remove_game_rounds error DbError(WriteTimeout { consistency: Quorum, received: 0, required: 2, write_type: UnloggedBatch }, "Operation timed out for abc.game_rounds - received only 0 responses from 2 CL=QUORUM.")
Error: "Database returned an error: Not enough nodes responded to the write request in time to satisfy required consistency level (consistency: Quorum, received: 0, required: 2, write_type: UnloggedBatch), Error message: Operation timed out for abc.game_rounds - received only 0 responses from 2 CL=QUORUM."
@Piotr_Smaroń: oh, I rather meant to enable query tracing - https://opensource.docs.scylladb.com/stable/using-scylla/tracing.html - by running
TRACING ON;
but on 2nd thought, IDK if that’d yield any info, since the query fails/times out
@Marko_Ćorić: ah, ok
@Michał_Chojnowski: There are two kinds of timeout:
• server-side timeout, which happens when replica nodes don’t respond to the coordinator in time
• client-side timeout, which happens when the coordinator doesn’t respond to the client in time
What you see here is a server timeout — the coordinator reported that only one replica (probably itself) has responded within the configured time, but at least two were needed to satisfy the quorum.
The .request_timeout setting in the Rust driver only controls the client-side timeout. If you make it larger, the client will wait longer for the coordinator, but the coordinator will still just give up after 2 seconds.
To change the coordinator timeout, you can either add USING TIMEOUT xyz to the query, or change it globally in the server config (e.g. via the write_request_timeout_in_ms entry in scylla.yaml). It is set to write_request_timeout_in_ms: 2000 by default, hence the 2 seconds.
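(For illustration, a minimal scylla.yaml excerpt for the global route; the raised value is only an example, and it applies to every write the node coordinates:)

# scylla.yaml
# Coordinator-side write timeout; the default is 2000 (2 seconds).
write_request_timeout_in_ms: 10000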
@Marko_Ćorić: can this help maybe?
[shard 3] querier - Read 0 live rows and 18897 tombstones for abc.game_rounds partition key "17958:nyx" {{1903035228094413201, pk{0008616263313739353800036e7978}}} (see tombstone_warn_threshold)
should I go for a “one by one” delete? Get the data with HistoryCollector and then remove the rows in a loop?
@Michał_Chojnowski I can’t add USING TIMEOUT in the delete query, tried that first
@Michał_Chojnowski: > should I go for “one by one” delete?
No, a range delete should usually perform better than thousands of single-row deletes. Though it depends on the details.
What Scylla version are you using? Lingering tombstones have been a major source of performance problems for a long time, but there have been improvements over time.
@Marko_Ćorić: That’s true, but can’t go further.
5.2.13-0.20240103.c57a0a7a46c6
that’s the version we are using atm
atm I’m doing a manual remove when the tombstone warning hits, applying the query to records day by day (running the query from DataGrip), but that takes time … also, sometimes a rollback happens, so removed records are not actually removed.
@Michał_Chojnowski: You can do USING TIMEOUT, but the syntax is that USING has to come before the WHERE. I.e.:
delete from ks.t using timeout 500s where key = x
will work
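(Applied to the query from the top of the thread, that would be the following; column names are as in the opening message, and note that CQL string literals use single quotes:)

DELETE FROM abc.game_rounds
USING TIMEOUT 500s
WHERE player_id = 'XYZ' AND transaction_time < '2022-01-01 00:00:00.00000';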
@Marko_Ćorić: hmmm, ok lemme try
@Michał_Chojnowski: I just checked that USING TIMEOUT works.
But increasing the timeout is only going to hide the underlying performance problem. (It’s hard to say what it is without the details). A delete shouldn’t take 2s in the first place. (Unless there is a materialized view involved, because then a single statement might have to delete any number of partitions).
@Marko_Ćorić: jesus, this is the solution. Thanks a lot m8.
yes, USING TIMEOUT before WHERE does the job
I tried it at the end of the query, following the documentation for SELECT
we have a materialized view, but it’s not used in this query
@Michał_Chojnowski: If an increased timeout is a good enough solution for you, then good.
But if the real problem (which causes things to happen slow enough that they exceed the timeout) is about having too many tombstones, then you might have to do something to deal with the tombstones.
Unfortunately, that isn’t very straightforward: you’d have to run repair, adjust tombstone GC settings, flush and compact, and then maybe even clean the cache, to flush all old tombstones out.
I see that https://opensource.docs.scylladb.com/stable/kb/tombstones-flush.html has some info about that. (But it seems somewhat outdated, because since Scylla 5.0 you’d probably want to use ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'}; before the repair, instead of altering gc_grace_seconds after the repair.)
Fiddling with tombstone GC settings requires you to know what you’re doing though, otherwise you might accidentally undelete data.
But again, it’s hard to know what the performance problem is without taking a close look.
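(A rough sketch of that sequence, with the keyspace and table taken from the thread; treat it as an outline under the caveats above, not a recipe:)

# 1. Switch the table to repair-based tombstone GC (Scylla 5.0+):
cqlsh -e "ALTER TABLE abc.game_rounds WITH tombstone_gc = {'mode': 'repair'};"
# 2. Repair the table, then flush and compact to purge old tombstones:
nodetool repair abc game_rounds
nodetool flush abc game_rounds
nodetool compact abc game_rounds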
@Marko_Ćorić: We just had to remove old data, so that’s good.
That will help us with large partitions
@Michał_Chojnowski: > I tried at the end of query, following documentation for SELECT
Can’t blame you for that; it’s weird that SELECT and DELETE have a different ordering between WHERE and USING.
I wonder how the syntax ended up like that. I guess the author of USING TIMEOUT just didn’t notice the inconsistency.
Maybe the grammar should be relaxed in a future release so that either ordering works.