I have the following schema, which I tried to tailor to optimize data deletion by partitioning on the expiry and bucket fields. My thinking was that this makes bulk deletes by expiry date cheap, since everything in a given partition gets marked as deleted with a single partition-level tombstone (or so I assumed), while the bucket spreads the data across partitions for balanced storage.
Queries are offloaded to materialized views designed for particular access patterns; they look rows up by id and compare expiry against the current date. Is it a recognized pattern in distributed databases for the base table to exist purely to optimize deletion and never to be read directly? My data is time-based and must be purged after a certain period, even if it is never read through the materialized view. Sketches of how I write, purge, and read follow the schema below.
I appreciate any advice or suggestions.
Thank you
CREATE TABLE records (
    expiry date,              // e.g., '2024-02-06'
    bucket smallint,          // derived from id at insert time, hash(id) % 1000 for 1000 buckets
    id text,                  // unique id
    some_relevant_id tinyint, // some id which belongs to id
    PRIMARY KEY ((expiry, bucket), id, some_relevant_id)
);
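For context, this is roughly how I write and purge rows. A minimal sketch assuming the DataStax Python driver (cassandra-driver); the contact point, the keyspace name "mykeyspace", and the md5-based bucket hash are illustrative choices, not my actual code:

import hashlib

from cassandra.cluster import Cluster

NUM_BUCKETS = 1000

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("mykeyspace")

insert_stmt = session.prepare(
    "INSERT INTO records (expiry, bucket, id, some_relevant_id) "
    "VALUES (?, ?, ?, ?)"
)
delete_stmt = session.prepare(
    "DELETE FROM records WHERE expiry = ? AND bucket = ?"
)

def bucket_for(record_id):
    # Derive the bucket from the id at insert time. Python's built-in
    # hash() is not stable across processes, so a deterministic hash
    # (md5 here) stands in for hash(id) % 1000.
    digest = hashlib.md5(record_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

def insert_record(expiry, record_id, some_relevant_id):
    session.execute(
        insert_stmt, (expiry, bucket_for(record_id), record_id, some_relevant_id)
    )

def purge_expiry(expiry):
    # Bulk deletion: each statement drops a whole (expiry, bucket)
    # partition with a single partition-level tombstone.
    for bucket in range(NUM_BUCKETS):
        session.execute(delete_stmt, (expiry, bucket))

The point of the layout is that purging a day costs one partition delete per bucket (1000 statements here) rather than one tombstone per row.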
CREATE MATERIALIZED VIEW records_by_id AS
    SELECT *
    FROM records
    WHERE expiry IS NOT NULL
      AND bucket IS NOT NULL
      AND id IS NOT NULL
      AND some_relevant_id IS NOT NULL
    PRIMARY KEY (id, expiry, bucket, some_relevant_id)
    WITH CLUSTERING ORDER BY (expiry ASC, bucket ASC, some_relevant_id ASC);
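And reads only ever go through the view, along these lines (same assumed driver and session as above; the expiry >= today restriction is valid because expiry is the first clustering column of the view):

import datetime

lookup_stmt = session.prepare(
    "SELECT expiry, bucket, some_relevant_id "
    "FROM records_by_id WHERE id = ? AND expiry >= ?"
)

def live_rows_for(record_id):
    # Fetch the not-yet-expired rows for a given id.
    return list(session.execute(lookup_stmt, (record_id, datetime.date.today())))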