Reader_concurrency_semaphore & p99 read latency

Currently chasing p99 read latency in our application, which queries time-series data over a period of time.

Useful information:
Scylla Open Source 5.1.14-0.20230716.753c9a4769be
16 core / 128 GB RAM
RAID0 SSD 3 TB
3 nodes

Schema definition:

CREATE TABLE trades_v1.tick_v1_desc (
    exchange_code text,
    symbol text,
    hour timestamp,
    datetime timestamp,
    id text,
    amount double,
    collected_at timestamp,
    origin text,
    price double,
    side tinyint,
    vwp double,
    PRIMARY KEY ((exchange_code, symbol, hour), datetime, id)
) WITH CLUSTERING ORDER BY (datetime DESC, id DESC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '1', 'compaction_window_unit': 'HOURS'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 259200
    AND gc_grace_seconds = 0
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

The application’s queries (using the golang gocqlx driver) are pretty straightforward: it queries data by aggregation (always 1 row returned, using basic aggregate functions) over a given time range, from current time minus P to current time, where P is in the range 1 min - 15 min (so it always queries the most recent data).

P99 read latency on the Scylla dashboard is low (1 ms - 10 ms).
On the gocql latency report, p95 is between 1 ms and 10 ms, BUT the p99 is very high: from 700 ms to 1 s!

I’ve tried many things: code profiling, tracing, but it seems that somehow ONE request among several takes forever to run (we issue a batch of reads every second to aggregate data).
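For reference, a minimal sketch of how per-query timing could be captured on the client side with gocql’s QueryObserver (gocqlx sits on top of gocql, so the same hook applies; the 100 ms threshold and contact points are placeholders):

package trades

import (
	"context"
	"log"
	"time"

	"github.com/gocql/gocql"
)

// slowQueryLogger logs any individual query that exceeds a threshold, which makes
// the occasional rogue read visible together with its statement and error (if any).
type slowQueryLogger struct {
	threshold time.Duration
}

func (l slowQueryLogger) ObserveQuery(_ context.Context, q gocql.ObservedQuery) {
	if d := q.End.Sub(q.Start); d > l.threshold {
		log.Printf("slow read: %v err=%v stmt=%q", d, q.Err, q.Statement)
	}
}

func newSession() (*gocql.Session, error) {
	cluster := gocql.NewCluster("10.0.0.1", "10.0.0.2", "10.0.0.3") // placeholder contact points
	cluster.Keyspace = "trades_v1"
	cluster.QueryObserver = slowQueryLogger{threshold: 100 * time.Millisecond}
	return cluster.CreateSession()
}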

I’ve noticed these messages in the scylla-server logs:

[shard 11] reader_concurrency_semaphore - Semaphore _read_concurrency_sem with 4/100 count and 109559/179264552 memory resources: timed out, dumping permit diagnostics:
                                                  permits        count        memory        table/description/state
                                                  3        3        74K        trades_v1.tick_v1_desc/data-query/inactive
                                                  1        1        33K        trades_v1.tick_v1_desc/data-query/active/used
                                                  1        0        0B        trades_v1.tick_v1_desc/data-query/waiting
                                                  1        0        0B        trades_v1.tick_v1_desc/data-query/waiting
                                                  144        0        0B        trades_v1.tick_v1_desc/data-query/waiting
                                                  1        0        0B        trades_v1.tick_v1_desc/data-query/waiting
                                                  1        0        0B        trades_v1.tick_v1_desc/data-query/waiting
                                                  1        0        0B        trades_v1.tick_v1_desc/data-query/waiting
                                                  1        0        0B        trades_v1.tick_v1_desc/data-query/waiting

However, they don’t correlate directly in time with the p99 spikes in the application, i.e. the spikes happen every 5-10 s whereas these messages occur less frequently.

To add more context on volume: it’s very low, and I expected Scylla to handle it easily:
6k writes/s, 2k reads/s.

It’s an isolated cluster so I can pinpoint the root cause; our prod cluster handles more req/s (~50-130k writes and 8-20k reads).

If anyone has an idea of where to look to eliminate those p99 latency spikes, that would be very helpful.

Thanks

The window size of 1 hour seems quite low. When using TWCS, ScyllaDB never compacts sstables across windows, so a small window size can lead to a huge number of sstables piling up and adversely impacting reads that have to touch multiple windows. Also, the sheer number of sstables will take an ever larger share of memory, squeezing out cached content.
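For reference, a minimal sketch of widening the window on the table from the original post (the 4-hour value is only an example; such schema changes are typically run from cqlsh, but it is shown here through gocql since that is what the application already uses):

package trades

import "github.com/gocql/gocql"

// widenTWCSWindow changes the compaction window so that reads over recent data
// touch fewer sstables; '4' HOURS is an arbitrary example value.
func widenTWCSWindow(s *gocql.Session) error {
	const alter = `ALTER TABLE trades_v1.tick_v1_desc WITH compaction = {
		'class': 'TimeWindowCompactionStrategy',
		'compaction_window_size': '4',
		'compaction_window_unit': 'HOURS'}`
	return s.Query(alter).Exec()
}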

What is the timeout for reader_concurrency_semaphore?
What is the result for the queries listed in the permit diagnostics dump? Are the queries dropped?

Only recent data (at most 15 min in the past) gets queried here. I’ve changed the window size to 4 hours but that doesn’t seem to affect anything.

I’m currently inspecting the slow query log to try to better understand what is going on.

Sorry, but could you rephrase your question please? I’m not sure I fully understand what you are asking.

The same as the timeout of the query. This is set in the config.

Timed out queries are dropped, the others proceed as usual.

For example, for a shard with read_request_timeout_in_ms: 5000:

  1. Data search for the first read request begins at 00:00:00.
  2. The second request is received at 00:00:00.010.
  3. The third request is received at 00:00:01.

Do I understand correctly that if the first request does not complete within five seconds, the second request then has only 10 ms left to execute, since it has spent 4990 ms waiting in the queue, and the third request has 1 second left for the same reason?

Yes, if the reads are cache reads, then this is what happens. Once a read goes to disk, another read can get started in parallel.
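To spell the arithmetic out (a toy illustration only, reusing the timestamps from the example above and a 5 s timeout; this is not driver or server code):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Toy illustration: time spent queued behind an earlier cache read eats into
	// each request's own budget (read_request_timeout_in_ms: 5000).
	timeout := 5 * time.Second
	base := time.Date(2023, 1, 1, 0, 0, 0, 0, time.UTC) // stands in for 00:00:00

	first := base                             // 00:00:00
	second := base.Add(10 * time.Millisecond) // 00:00:00.010
	third := base.Add(1 * time.Second)        // 00:00:01

	// Assume the first read only finishes right at its own deadline.
	firstDone := first.Add(timeout)

	fmt.Println(second.Add(timeout).Sub(firstDone)) // 10ms left for the second read
	fmt.Println(third.Add(timeout).Sub(firstDone))  // 1s left for the third read
}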

The timeout is 5 s for both server and client.
Queries do not seem to be dropped; as I explained, the client receives results but latency is very high (between 500 ms and 900 ms).
I see those rogue queries in my application log; it’s only about 1 out of the ~40 reads issued each second.
The workload in the test cluster has been reduced to the strict minimum for the application, so we have only 2k-3k writes/s and very few reads: 200 ops/s.

The problem still persists and I can’t find anything helpful in the slow query log.

Do you do deletes? The typical explanation for such rogue reads used to be tombstones. This was fixed in 5.2; upgrading to 5.2 might solve your problem.
Another typical explanation is stalls. Do you see any stalls in the logs?

Data is never deleted, only TTL’ed; we insert it (never update) in an append-only fashion.
The only information given in the logs is those “reader_concurrency_semaphore” messages.

TTL’d data also generates a tombstone when deleted. How long are your TTL and gc grace period, respectively?

Also, what do the queries look like, especially the one which times out?

You can find the schema definition in the initial post.
The hour field is the time bucket of the datetime’s hour.

The queries being run all look the same:

		SELECT count(id) as count, price as close, max(price) as high, min(price) as low, sum(cast(amount as decimal)) as volume, sum(cast(vwp as decimal)) as svwp, datetime as lastFetched
		FROM table
		WHERE exchange_code = ? and symbol = ? and hour = ? and datetime > ? and datetime < ?

The time range is always from current_time minus X seconds to current_time (so we are always requesting the same hour bucket; maybe we are in a hot-partition scenario?).
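In Go it looks roughly like this (a sketch, using plain gocql since gocqlx wraps it; the struct and function names are made up for illustration, and edge cases where the window crosses into the previous hour bucket are ignored):

package trades

import (
	"time"

	"github.com/gocql/gocql"
	"gopkg.in/inf.v0"
)

// aggRow is a hypothetical holder for the single aggregated row the query returns.
type aggRow struct {
	Count        int64
	Close        float64
	High, Low    float64
	Volume, SVWP inf.Dec
	LastFetched  time.Time
}

const aggStmt = `SELECT count(id), price, max(price), min(price),
                        sum(cast(amount as decimal)), sum(cast(vwp as decimal)), datetime
                 FROM trades_v1.tick_v1_desc
                 WHERE exchange_code = ? AND symbol = ? AND hour = ?
                   AND datetime > ? AND datetime < ?`

// readWindow aggregates the last `window` of trades for one (exchange_code, symbol),
// always hitting the single partition for the current wall-clock hour.
func readWindow(s *gocql.Session, exchange, symbol string, window time.Duration) (aggRow, error) {
	now := time.Now().UTC()
	hour := now.Truncate(time.Hour) // the "hour" time-bucket column of the partition key
	from := now.Add(-window)

	var a aggRow
	err := s.Query(aggStmt, exchange, symbol, hour, from, now).
		Scan(&a.Count, &a.Close, &a.High, &a.Low, &a.Volume, &a.SVWP, &a.LastFetched)
	return a, err
}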

I’m testing a new version where I only request the last second and add the truncated second as an extra clustering key column so I can GROUP BY second, but it gives me pretty much the same results.

For the record, we have ~13k distinct (exchange_code, symbol) pairs.

I’m thinking about using a consistent hash of the id field, modulo the number of shards (16), and adding it to the partition key for better distribution, to avoid hot partitions and uneven shard load.
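A minimal sketch of that idea (illustrative only; the bucket would become an extra partition-key column, written on insert and bound on read):

package trades

import "hash/fnv"

// tradeBucket maps an id deterministically onto one of nBuckets sub-partitions, so
// the same id always lands in the same bucket. FNV-1a is just one consistent choice
// of hash; nBuckets = 16 would mirror the shard count mentioned above.
func tradeBucket(id string, nBuckets uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(id)) // hash.Hash32's Write never returns an error
	return h.Sum32() % nBuckets
}

The trade-off is on the read side: an aggregate for one (exchange_code, symbol, hour) would then have to query all 16 buckets (e.g. concurrently) and merge the partial results.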

A hot partition is indeed a possibility. This query will read an entire partition, which should be fine on its own.

You can look for hot partitions by going to the Detailed dashboard in monitoring and switching to the by Instance,Shard mode. If you see a particular shard/node combo carrying significantly more load than the others, that is a good sign that a hot partition is involved.