Originally from the User Slack
@Igor_Q: hello, everyone
i have a running scylla cluster storing some features in a table (id, field, data, updated_at) with primary key (id, field)
this is great for saving realtime updates by (id, field) and querying the whole wide row by (id)
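For reference, a minimal sketch of that table and the two access patterns, using the Python driver; the keyspace/table names, contact point, and sample values are assumptions rather than details from the thread:

```python
# Minimal sketch of the schema and access patterns described above.
# Keyspace/table names, contact point and sample values are illustrative.
from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute("CREATE KEYSPACE IF NOT EXISTS ks WITH replication = "
                "{'class': 'SimpleStrategy', 'replication_factor': 1}")  # local test only
session.execute("""
    CREATE TABLE IF NOT EXISTS ks.features (
        id         text,
        field      text,
        data       blob,
        updated_at timestamp,
        PRIMARY KEY (id, field)
    )""")

# realtime upsert of a single (id, field) cell
upsert = session.prepare(
    "INSERT INTO ks.features (id, field, data, updated_at) VALUES (?, ?, ?, ?)")
session.execute(upsert, ("user-42", "clicks_1h", b"\x01", datetime.utcnow()))

# read the whole wide row for one id
for row in session.execute(
        "SELECT field, data, updated_at FROM ks.features WHERE id = %s", ("user-42",)):
    print(row.field, row.updated_at)
```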
now what i want to do is to also upload batch data (daily aggregates). upserting by (id, field) works fine here too, but the problem is that these aggregates are meant to be fully refreshed with each upload, so i need some way to expire stale records from the table
i have considered the following solutions:
• implement a full scan as described in https://www.scylladb.com/2017/03/28/parallel-efficient-full-table-scan-scylla/ with BYPASS CACHE (sketched after @avi's reply below). this could expire records from all aggregate uploads by (field, updated_at) and also collect some data metrics at the same time. but i am concerned about the performance penalty; there is probably a non-negligible limit on how often i can run full scans
• use a secondary index to filter by field, but that still essentially runs a full scan over all records with a matching field (see the first sketch after this list)
• set a short TTL (3-6 hours) on records updated from aggregates and constantly re-upload them to refresh the TTL (see the second sketch after this list). the downside is that if the constant re-uploading breaks, we lose the data from scylla
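A minimal sketch of the secondary-index option; the index name, field value, and cutoff are assumptions. Note that the indexed query still has to visit every node holding rows for that field, which is why it behaves like a scan:

```python
# Minimal sketch of the secondary-index option. Index name, field value and
# cutoff are illustrative; the query still fans out across the cluster.
from datetime import datetime, timedelta
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute("CREATE INDEX IF NOT EXISTS features_field_idx ON ks.features (field)")

stale = session.prepare("SELECT id, field, updated_at FROM ks.features WHERE field = ?")
purge = session.prepare("DELETE FROM ks.features WHERE id = ? AND field = ?")

cutoff = datetime.utcnow() - timedelta(days=1)          # e.g. start of the latest batch
for row in session.execute(stale, ("daily_clicks",)):   # hypothetical aggregate field
    if row.updated_at < cutoff:                         # filter client-side, no ALLOW FILTERING
        session.execute(purge, (row.id, row.field))
```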
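And a minimal sketch of the TTL option: every batch upsert carries a short TTL, so rows that stop being re-uploaded simply expire. The 6-hour TTL and sample values are assumptions:

```python
# Minimal sketch of the TTL option: each batch upsert refreshes a short TTL,
# so rows that stop being re-uploaded expire on their own. Values are illustrative.
from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
batch_upsert = session.prepare(
    "INSERT INTO ks.features (id, field, data, updated_at) "
    "VALUES (?, ?, ?, ?) USING TTL 21600")   # 6 hours; pick longer than the upload interval plus slack

session.execute(batch_upsert, ("user-42", "daily_clicks", b"\x2a", datetime.utcnow()))
```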
is there something i am missing? what would you suggest?
@avi: Full scan is a good solution here, especially with workload prioritization pushing it to only use idle time
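A minimal sketch of that full-scan approach, splitting the token ring into ranges (per the blog post linked above) and reading with BYPASS CACHE; the range count, concurrency, field names, and cutoff are assumptions. With ScyllaDB Enterprise workload prioritization, running this under a role attached to a low-shares service level keeps it on idle capacity, as @avi suggests (the exact service-level DDL depends on the version, so it is not shown here):

```python
# Minimal sketch of a parallel token-range full scan (not the blog's exact code)
# that purges aggregate rows older than the latest batch. Range count, concurrency,
# field names and cutoff are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta
from cassandra.cluster import Cluster

MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1             # murmur3 token ring bounds
N_RANGES = 256                                       # tune to roughly nodes * shards * 3
AGGREGATE_FIELDS = {"daily_clicks", "daily_views"}   # hypothetical aggregate field names
cutoff = datetime.utcnow() - timedelta(days=1)       # e.g. start of the latest batch

session = Cluster(["127.0.0.1"]).connect()
scan = session.prepare(
    "SELECT id, field, updated_at FROM ks.features "
    "WHERE token(id) >= ? AND token(id) <= ? BYPASS CACHE")
purge = session.prepare("DELETE FROM ks.features WHERE id = ? AND field = ?")

def scan_range(lo, hi):
    # stream one token range; delete aggregate rows that predate the latest batch
    for row in session.execute(scan, (lo, hi)):
        if row.field in AGGREGATE_FIELDS and row.updated_at < cutoff:
            session.execute(purge, (row.id, row.field))

step = (MAX_TOKEN - MIN_TOKEN) // N_RANGES
ranges = [(MIN_TOKEN + i * step,
           MAX_TOKEN if i == N_RANGES - 1 else MIN_TOKEN + (i + 1) * step - 1)
          for i in range(N_RANGES)]

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(scan_range, lo, hi) for lo, hi in ranges]
    for f in futures:
        f.result()   # surface any per-range errors
```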