I am trying to learn better data modelling with Scylla and want to understand the optimal setup for a time-series use case. As an example, say I have data coming in from multiple sensors (each sensor has a unique ID) for a few hours every day, at 20k rps. I need to store the raw sensor data in a table, and also store 10s, 1min, 10min, 1hour, 1d, 1w, 1month, and 1year aggregates (first, last, min, max, avg), which the ingestor will compute and INSERT.
For the raw sensor data, I think a good setup would be a sensor_raw table with PRIMARY KEY ((sensor_id, date), timestamp), with timestamp in descending clustering order?
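As a sketch, that table could look like the following in CQL (the text date format and the value column are my assumptions, not from the original post):

```cql
CREATE TABLE sensor_raw (
    sensor_id text,
    date      text,       -- daily bucket, e.g. '2024-05-01' (assumed format)
    timestamp timestamp,
    value     double,     -- hypothetical payload column
    PRIMARY KEY ((sensor_id, date), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
```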
For the aggregates,
I could have a single aggregates table with PRIMARY KEY ((sensor_id, period), timestamp) (ordered descending), but I believe this would be bad since each partition would keep growing forever.
I could have a single aggregates table with PRIMARY KEY ((sensor_id, period, date), timestamp). This is probably better from a partition-size perspective, but for higher periods like 1d, 1w, 1month, etc., the client-side logic would be very messy, and I would have to force a single compaction strategy on all partitions.
The other option is to have multiple aggregates tables like aggregates_&lt;period&gt; with a PRIMARY KEY ((sensor_id, bucket), timestamp) structure, where the bucket for lower timeframes is yyyy-mm-dd, and for higher timeframes like 1d, 1w it is yyyy-mm, etc. This way, compaction can be time-windowed for the lower timeframes and size-tiered for the higher ones.
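For illustration, one of the per-period tables under this option might look like this (the table name, aggregate column names, and one-day compaction window are assumptions):

```cql
CREATE TABLE aggregates_10s (
    sensor_id text,
    bucket    text,       -- 'yyyy-mm-dd' for this fine-grained period (assumed)
    timestamp timestamp,
    first_val double, last_val double,
    min_val double, max_val double, avg_val double,
    PRIMARY KEY ((sensor_id, bucket), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
  AND compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': '1'
  };
```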
What is the best practice here?
P.S.: This is a repost from Slack (general channel)
For optimal time-series data modeling in Scylla handling sensor data at high write rates with multiple aggregation levels, best practices include:
For raw sensor data:
Use a table like sensor_raw with primary key PRIMARY KEY ((sensor_id, date), timestamp) ordered descending by timestamp. This partitions data by sensor and date, avoiding excessively large partitions.
For aggregate data:
Avoid a single aggregates table partitioned only by (sensor_id, period) because partitions grow indefinitely.
Using (sensor_id, period, date) as partition key is better for partition size management but complicates client logic and forces uniform compaction strategy.
A recommended approach is to create multiple aggregate tables for different periods, e.g., aggregates_10s, aggregates_1min, aggregates_1d, each with a primary key like PRIMARY KEY ((sensor_id, bucket), timestamp).
Define the bucket differently per aggregation period: for finer granularities like seconds or minutes, use daily buckets (e.g., yyyy-mm-dd), while for coarser granularities like daily, weekly, monthly, use larger buckets such as yyyy-mm or yyyy.
This allows tailoring compaction strategies (time-tiered for frequent writes, size-tiered for less frequent ones) per aggregation table, optimizing read/write performance and data retention.
In summary, multiple aggregation tables segmented by appropriate time buckets are preferable for time series in Scylla, since they provide better partition size control, flexible compaction strategies, and cleaner client-side logic for higher-level aggregates.
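The per-table compaction choice described above can be sketched as follows (the table names are assumptions for illustration):

```cql
-- Frequently-written fine-grained aggregates: time-window compaction
ALTER TABLE aggregates_1min WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1'
};

-- Sparsely-written coarse aggregates: size-tiered compaction
ALTER TABLE aggregates_1d WITH compaction = {
    'class': 'SizeTieredCompactionStrategy'
};
```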
For anything ≤ 1 min aggregation, I’m using monthly bucketed partitions (small tier). For our use case, since data input isn’t 24/7, each partition should stay under ~15 MB even at 5 s aggregation.
From 1 min up to (but not including) 1 day, I use yearly buckets (medium tier).
1 day aggregations go into a century bucket with a hardcoded key 2000-01-01 (large tier).
So, three tables total, with primary key ((sensor_id, aggregation_size, bucket), timestamp).
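A sketch of one of these tier tables under that scheme (the table name, column types, and aggregate column names are my assumptions):

```cql
CREATE TABLE aggregates_small (
    sensor_id        text,
    aggregation_size text,      -- e.g. '5s', '10s', '1min' (assumed encoding)
    bucket           text,      -- monthly bucket, e.g. '2024-05' (assumed)
    timestamp        timestamp,
    first_val double, last_val double,
    min_val double, max_val double, avg_val double,
    PRIMARY KEY ((sensor_id, aggregation_size, bucket), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
```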
I’m using TWCS for all bucket tables. The compaction interval is roughly aligned with the bucket period. Does that make sense, or would a daily compaction interval be better?
Also, any tips for choosing the caching settings?
Right now, I’ve set caching = { 'keys': 'ALL' }.
Are there best practices for rows_per_partition (set to ALL or a specific number)? I couldn’t find clear details in the OSS docs.
It’s good that you are using TimeWindowCompactionStrategy (TWCS) and bucketing for partition size control. However, high query latencies for 1-minute aggregates can be due to several factors:
1. Compaction Window/Partition Alignment
Your compaction window is set to 31 days but your buckets for 1-min aggregations are yearly—this mismatch might leave many SSTables open for reads, especially for recent data.
Try aligning your compaction window to your bucket granularity: for yearly buckets, set compaction_window_size to 365 with compaction_window_unit as DAYS; for monthly buckets, use 30-31 days. This helps TWCS drop expired SSTables more efficiently and reduces SSTable fan-out on reads.
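A sketch of that alignment for the yearly-bucket table (the table name is an assumption):

```cql
-- Yearly bucket: one compaction window spanning the whole bucket
ALTER TABLE aggregates_medium WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '365'
};
```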
2. Partition Size
If your partitions (with yearly buckets for 1-min data) grow too large, any read (such as “last 1000 rows”) may need to scan too many SSTables and tombstones.
Monitor partition size with nodetool cfstats or Scylla Monitoring to see if you should reduce bucket duration even further.
3. Caching Settings
Setting 'rows_per_partition': '500' is a reasonable start for "last 1000 rows" access, but ALL isn't always best for every read pattern if partitions get large.
Experiment with increasing 'rows_per_partition' to match your most common query size, or consider key-only caching if reads always load different ranges.
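For example, caching a bounded number of rows per partition could be set like this (the table name is assumed; 1000 matches the "last 1000 rows" query pattern mentioned above):

```cql
ALTER TABLE aggregates_small WITH caching = {
    'keys': 'ALL',
    'rows_per_partition': '1000'
};
```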
4. Speculative Retry and Latency Spikes
Setting speculative_retry to '99.0PERCENTILE' is aggressive—check your actual latency profile with Scylla Monitoring and only use this if requests are timing out frequently.
5. Bloom Filter and Read Statistics
Make sure you have enough RAM for bloom filters and row cache, especially if you are running queries on wide partitions.
6. SSTable Expiry and Query Filtering
If old SSTables aren’t expiring due to TTL misconfiguration or an alignment mismatch, your queries may hit many files per read.
Summary Table: Possible Tuning Points

| Setting | Current Value | Recommended for 1-min Aggregates |
|---|---|---|
| compaction_window_size | 31 | Align with bucket (e.g., 365) |
| compaction_window_unit | DAYS | Match bucket (DAYS) |
| bucket granularity | Year | Consider shorter (e.g., month) |
| rows_per_partition | 500 | Match typical query batch size |
| caching | ALL | Consider PARTIAL if partitions are big |
| speculative_retry | 99.0PERCENTILE | Default/monitor actual needs |
In short:
Align the compaction window and bucket length.
Check partition sizes and reduce the bucket if queries still slow down.
Match caching rows to your “last N rows” queries.
Monitor with Scylla Monitoring for hot partitions or pending compactions.
Thank you for the detailed response. Really appreciate it!
Compactions:
My 1-minute aggregations are in the monthly bucket only (the small-tier table with monthly buckets is used for <=1-minute aggregations), and their compaction window is set to 31 days. The other tables also have their compaction windows aligned with their bucketing windows.
But with cfstats I can see a "high" number of SSTables (>100 is considered high, I believe?). For example, right now for the medium-tier table (yearly bucket), I see SSTable count: 334; SSTables in each level: [334/4].
Is compaction not running frequently enough? (Running a major compaction manually drops this significantly.)
What should I consider as a healthy sstable count per table here?
Speculative Retries:
Would it be a good idea to remove this on the server side and use the client-side speculative execution option as needed? (We use the Rust scylla driver.)
Is there any metric in scylla monitoring that indicates how many speculative executions are being done by coordinator?
Caching, TTL & Partition Sizes
I am experimenting with rows_per_partition – will continue on this and see what works for us.
We do not use TTL on the table since our data archival process has certain business logic as well.
In the previous schema of the db we had, there was no time bucketing and many partitions had reached ~100MB size and we were seeing latencies increasing day by day.. Now, maximum possible partition size is at ~20MB.
Healthy SSTable Count: For TWCS tables, 100+ SSTables isn’t unusual for active, recent buckets, but sustained high counts (e.g., 334) may indicate compaction backlog, especially if data is uniformly loaded. It’s normal for older buckets (with mostly expired or cold data) to eventually compact down to fewer files. Try monitoring compaction throughput and queue length via Scylla Monitoring—what matters most is that compactions catch up regularly, especially for recent data.
Manual Major Compaction: If major compaction reduces SSTable count sharply, frequent manual runs shouldn’t be needed unless you see prolonged backlog or latency spikes—ideally, TWCS should handle active buckets automatically.
Guidelines: There isn’t a strict “healthy” number; aim for <100 SSTables per active bucket for most read patterns, but higher counts are okay if read latency, bloom filter hits, and compaction metrics are all healthy.
Speculative Retry
Client vs. Server: Using client-side speculative retries (via Rust driver) is a solid strategy—this places low-latency control closer to your application. If you aren’t facing frequent server-side timeouts, consider disabling server-side speculative_retry and tuning this in your client instead.
Metric Visibility: Scylla Monitoring exposes server-side speculative executions through coordinator metrics (e.g., scylla_storage_proxy_coordinator_speculative_data_reads and scylla_storage_proxy_coordinator_speculative_digest_reads). Watch for spikes in these metrics that align with your app's traffic and tail-latency events.
Caching, TTL, Partition Size
Experimenting with rows_per_partition: Definitely keep adjusting—ideally, set this to match your “last N rows” query size for fastest cache-based lookup.
TTL handling: It's fine to skip TTL when your archival process has its own business logic; just ensure you don't accumulate obsolete data in cold buckets.
Partition Size: Reducing partition size to ~20 MB (from 100 MB) is excellent—you should see improved query performance and compaction health.
I have applied all these changes (disable server-side speculative retry, tune caching, etc.)
But I still see very big spikes in some cases. One suspicion I have right now is that the connection is getting overloaded: we have 600 rps of inbound API requests, which turns into ~1.6k actual queries per second to Scylla due to the bucketing. Is it possible that this is choking the single connection we have to each node?
It is possible that the connection saturation and latency spikes you observe are related to the single connection per node setup in your session pool, which can become a bottleneck under high query rates. Although the official documentation recommends a pool size of 1, this is typically for lighter workloads. For higher throughput scenarios like yours—600 inbound API requests translating to 1.6k queries per second—it is recommended to increase the connection pool size to multiple connections per node to better distribute load and avoid connection choking.
You can configure this in your Rust driver setup via the SessionBuilder's pool_size method, which takes a PoolSize (PerShard or PerHost) from the driver's transport module; the value 2 below is just an example to tune:

SessionBuilder::new()
    .known_nodes(nodes)
    .default_execution_profile_handle(profile_handle)
    // PoolSize::PerShard(2): open two connections to each shard of every node
    .pool_size(PoolSize::PerShard(NonZeroUsize::new(2).unwrap()))
    .build()
    .await
    .context("failed to setup session")
Additionally, it is crucial to use the official Scylla Rust driver to ensure proper connection pooling, shard awareness, and optimized concurrency handling tailored for ScyllaDB’s architecture. Non-official or generic drivers may lack these optimizations, leading to bottlenecks. The official Scylla Rust driver is fully asynchronous, shard-aware, and designed for high-performance ScyllaDB access.
Increasing the connection pool size beyond 1 is advisable to handle your high query rate and avoid choking a single connection.
Verify you are using the official Scylla Rust driver to leverage the best connection management and performance features.
Monitor client and server metrics to tune connection pool size and concurrency for your workload.
This approach should help reduce latency spikes and improve throughput under load. If problems persist, sharing detailed metrics and client logs will help diagnose further.
Increasing the connection pool size will allow better concurrency and load balancing across each node, directly addressing the observed latency spikes in your high-throughput time-series application.