reader_concurrency_semaphore & p99 read latency

Currently chasing p99 read latency in one of our applications that queries time-series data over a period of time.

Useful information:
Scylla Open Source 5.1.14-0.20230716.753c9a4769be
16 cores / 128 GB RAM
3 nodes

Schema definition:

CREATE TABLE trades_v1.tick_v1_desc (
    exchange_code text,
    symbol text,
    hour timestamp,
    datetime timestamp,
    id text,
    amount double,
    collected_at timestamp,
    origin text,
    price double,
    side tinyint,
    vwp double,
    PRIMARY KEY ((exchange_code, symbol, hour), datetime, id)
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '1', 'compaction_window_unit': 'HOURS'}
    AND compression = {'sstable_compression': ''}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 259200
    AND gc_grace_seconds = 0
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

The application queries (using the golang gocqlx driver) are pretty straightforward: each query aggregates data (always 1 row returned, using basic aggregation functions) over a given time range, from the current time minus P to the current time, where P is in the range 1 min - 15 min (so we always query the most recent data).
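
For reference, a minimal sketch of what one of these reads looks like (plain gocql rather than gocqlx; the contact points, consistency, timeout and symbol are placeholders, and the window is assumed to fit inside the current hour partition; a window crossing the hour boundary would need a second query against the previous bucket):

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder contact points, consistency and timeout; the real values differ.
	cluster := gocql.NewCluster("10.0.0.1", "10.0.0.2", "10.0.0.3")
	cluster.Keyspace = "trades_v1"
	cluster.Consistency = gocql.LocalQuorum
	cluster.Timeout = 2 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Hypothetical exchange/symbol; the application derives these from its own feed.
	exchange, symbol := "binance", "BTC/USDT"
	now := time.Now().UTC()
	hour := now.Truncate(time.Hour)   // partition bucket
	from := now.Add(-5 * time.Minute) // P = 5 min here

	// Single-partition aggregation over the most recent few minutes.
	var avgPrice float64
	err = session.Query(
		`SELECT avg(price) FROM tick_v1_desc
		   WHERE exchange_code = ? AND symbol = ? AND hour = ?
		     AND datetime >= ? AND datetime <= ?`,
		exchange, symbol, hour, from, now,
	).Scan(&avgPrice)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("avg price over the last 5 minutes: %f\n", avgPrice)
}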

The p99 read latency on the Scylla dashboard is low (1 ms - 10 ms).
On the gocql latency report, the p95 is also between 1 ms and 10 ms, BUT the p99 is very high: from 700 ms to 1 s!

I’ve tried many things: code profiling, tracing, but it seems that somehow ONE request among several takes forever to run (we issue a batch of reads every second to aggregate data).
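
A sketch of one way to pinpoint the slow request from the client side, using gocql's QueryObserver hook (the 100 ms threshold and the contact point are arbitrary placeholders):

package main

import (
	"context"
	"log"
	"time"

	"github.com/gocql/gocql"
)

// slowQueryObserver logs any query whose client-side latency exceeds a
// threshold, together with the coordinator host it was sent to and any error.
type slowQueryObserver struct {
	threshold time.Duration
}

func (o slowQueryObserver) ObserveQuery(_ context.Context, q gocql.ObservedQuery) {
	latency := q.End.Sub(q.Start)
	if latency > o.threshold || q.Err != nil {
		log.Printf("slow query: %q host=%v latency=%s err=%v", q.Statement, q.Host, latency, q.Err)
	}
}

func main() {
	cluster := gocql.NewCluster("10.0.0.1") // placeholder contact point
	cluster.Keyspace = "trades_v1"
	cluster.QueryObserver = slowQueryObserver{threshold: 100 * time.Millisecond}

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
	// ... run the usual 1-second read loop; slow outliers get logged above.
}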

I’ve noticed these errors in the scylla-server logs:

[shard 11] reader_concurrency_semaphore - Semaphore _read_concurrency_sem with 4/100 count and 109559/179264552 memory resources: timed out, dumping permit diagnostics:
    permits  count  memory  table/description/state
    3        3      74K     trades_v1.tick_v1_desc/data-query/inactive
    1        1      33K     trades_v1.tick_v1_desc/data-query/active/used
    1        0      0B      trades_v1.tick_v1_desc/data-query/waiting
    1        0      0B      trades_v1.tick_v1_desc/data-query/waiting
    144      0      0B      trades_v1.tick_v1_desc/data-query/waiting
    1        0      0B      trades_v1.tick_v1_desc/data-query/waiting
    1        0      0B      trades_v1.tick_v1_desc/data-query/waiting
    1        0      0B      trades_v1.tick_v1_desc/data-query/waiting
    1        0      0B      trades_v1.tick_v1_desc/data-query/waiting

However, they don’t correlate directly in time with the p99 spikes in the application, i.e. the spikes happen every 5-10 s whereas these errors occur less frequently.

To add more context on the volume, it’s very low and I expected Scylla to handle it easily:
6k writes/s, 2k reads/s

It’s an isolated cluster so that I can pinpoint the root cause; our prod cluster handles more req/s (~50-130k writes/s and 8-20k reads/s).

If anyone has an idea of where to look to eliminate those p99 latency spikes, that would be very helpful.


The window size of 1 hour seems quite small. When using TWCS, ScyllaDB never compacts sstables across windows, so a small window size can lead to a huge number of sstables piling up, adversely impacting reads that have to touch multiple windows. Also, the sheer number of sstables will take an ever larger share of memory, squeezing out cached content.
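
For illustration, widening the window could look like this (a sketch; the 12-hour value is only an example and should be tuned to your workload). With the table’s default_time_to_live of 259200 s (72 hours), 12-hour windows keep the table at roughly 6-7 windows instead of ~72:

ALTER TABLE trades_v1.tick_v1_desc
    WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                       'compaction_window_size': '12',
                       'compaction_window_unit': 'HOURS'};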