Thanks for the response.
Are you sure that max_size_of_hints_in_progress applies to the size of hints on disk? As far as I can tell from manager::end_point_hints_manager::store_hint, it limits the size of hints that are currently in the process of being written locally, not what has already been stored. Please correct me if I am misreading this code:
bool manager::end_point_hints_manager::store_hint(schema_ptr s, lw_shared_ptr<const frozen_mutation> fm, tracing::trace_state_ptr tr_state) noexcept {
    try {
        // Future is waited on indirectly in `stop()` (via `_store_gate`).
        (void)with_gate(_store_gate, [this, s = std::move(s), fm = std::move(fm), tr_state] () mutable {
            ++_hints_in_progress;
            size_t mut_size = fm->representation().size();
            shard_stats().size_of_hints_in_progress += mut_size;
            return with_shared(file_update_mutex(), [this, fm, s, tr_state] () mutable -> future<> {
                return get_or_load().then([this, fm = std::move(fm), s = std::move(s), tr_state] (hints_store_ptr log_ptr) mutable {
                    commitlog_entry_writer cew(s, *fm, db::commitlog::force_sync::no);
                    return log_ptr->add_entry(s->id(), cew, db::timeout_clock::now() + _shard_manager.hint_file_write_timeout);
                }).then([this, tr_state] (db::rp_handle rh) {
                    auto rp = rh.release();
                    if (_last_written_rp < rp) {
                        _last_written_rp = rp;
                        manager_logger.debug("[{}] Updated last written replay position to {}", end_point_key(), rp);
                    }
                    ++shard_stats().written;
                    manager_logger.trace("Hint to {} was stored", end_point_key());
                    tracing::trace(tr_state, "Hint to {} was stored", end_point_key());
                }).handle_exception([this, tr_state] (std::exception_ptr eptr) {
                    ++shard_stats().errors;
                    manager_logger.debug("store_hint(): got the exception when storing a hint to {}: {}", end_point_key(), eptr);
                    tracing::trace(tr_state, "Failed to store a hint to {}: {}", end_point_key(), eptr);
                });
            }).finally([this, mut_size, fm, s] {
                --_hints_in_progress;
                shard_stats().size_of_hints_in_progress -= mut_size;
            });
        });
    } catch (...) {
        manager_logger.trace("Failed to store a hint to {}: {}", end_point_key(), std::current_exception());
        tracing::trace(tr_state, "Failed to store a hint to {}: {}", end_point_key(), std::current_exception());
        ++shard_stats().dropped;
        return false;
    }
    return true;
}
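To make my reading concrete, here is a minimal standalone sketch of the accounting as I understand it. It is not Scylla code: hint_stats and in_flight_hint are hypothetical names I made up, and the RAII guard only stands in for the += at the start of store_hint and the -= in its .finally() block.

```cpp
#include <cstddef>

// Hypothetical stand-in for shard_stats(); only the one counter discussed here.
struct hint_stats {
    std::size_t size_of_hints_in_progress = 0;
};

// Hypothetical guard: adds the mutation size when a hint write starts and
// subtracts it again when the write finishes, whether it succeeded or failed.
class in_flight_hint {
    hint_stats& _stats;
    std::size_t _size;
public:
    in_flight_hint(hint_stats& stats, std::size_t mut_size)
        : _stats(stats), _size(mut_size) {
        _stats.size_of_hints_in_progress += _size;
    }
    ~in_flight_hint() {
        _stats.size_of_hints_in_progress -= _size;
    }
    in_flight_hint(const in_flight_hint&) = delete;
    in_flight_hint& operator=(const in_flight_hint&) = delete;
};
```

If this model is right, the counter goes back to zero once all pending writes complete, no matter how many bytes of hints have accumulated on disk in the meantime.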
I was getting this error on the client side of my write operations. Since hints are not reliable anyway, I would expect them to be silently dropped instead of making the writes fail. Instead, the code throws an overloaded_exception, which seems pretty harsh:
if (cannot_hint(all, type)) {
    get_stats().writes_failed_due_to_too_many_in_flight_hints++;
    // avoid OOMing due to excess hints. we need to do this check even for "live" nodes, since we can
    // still generate hints for those if it's overloaded or simply dead but not yet known-to-be-dead.
    // The idea is that if we have over maxHintsInProgress hints in flight, this is probably due to
    // a small number of nodes causing problems, so we should avoid shutting down writes completely to
    // healthy nodes. Any node with no hintsInProgress is considered healthy.
    throw overloaded_exception(_hints_manager.size_of_hints_in_progress());
}
Can’t the hint simply be dropped when the threshold is exceeded?
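Roughly, this is the behavior I have in mind, as a simplified sketch. Everything here is hypothetical (hint_overflow_policy, overloaded_error, admit_hint, and the plain size comparison); the real cannot_hint check may well look at more than the total size.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical policy switch: what to do when the in-progress limit is hit.
enum class hint_overflow_policy { reject_write, drop_hint };

// Hypothetical stand-in for overloaded_exception.
struct overloaded_error : std::runtime_error {
    explicit overloaded_error(std::size_t in_progress)
        : std::runtime_error("too many in-flight hints"), in_progress(in_progress) {}
    std::size_t in_progress;
};

// Returns true if the hint may be stored. When the limit would be exceeded,
// either throws (the behavior I am seeing today) or counts the hint as
// dropped and returns false, so the write itself can still proceed.
bool admit_hint(std::size_t in_progress, std::size_t hint_size,
                std::size_t max_in_progress, hint_overflow_policy policy,
                std::size_t& dropped) {
    if (in_progress + hint_size <= max_in_progress) {
        return true;
    }
    if (policy == hint_overflow_policy::reject_write) {
        throw overloaded_error(in_progress);
    }
    ++dropped;
    return false;
}
```

With drop_hint the memory bound is still enforced, because the hint is never queued; the only difference is that the client write is not failed for it.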