Originally from the User Slack
@Mosca_Careca: This might be a really easy and dumb question, but it’s my first time modeling with CQL so please keep that in mind
Let’s say I have a Table about Books. Each book has certain attributes and one of those is the price. I want to be able to efficiently query by price.
If I was querying by something like tags, I was taught that I could create a secondary table like “books_by_tag” where the PK is the tag name instead of the book ID, and each tag has the books associated with it, somewhat like a FK in SQL.
The thing is, I cannot do that using a float. How would I go about querying by price? Is the only way available to use an index on the price attribute? If so, wouldn’t that be “bad” because it’s potentially scanning a large portion of the dataset?
@Felipe_Cardeneti_Mendes: Normally price would be a clustering key within some partition grouping all the entries you want to retrieve (i.e., the tag in your example).
However, if you want to retrieve all items matching a specific price (this is a bit weird - I don’t see people or websites where you enter “give me all books which match $1 USD”), then an index is the way.
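For reference, that index route might look roughly like this (a sketch assuming a books table with a price column; the index name is arbitrary):

-- A secondary index makes equality lookups on price possible;
-- range scans over the index would still need ALLOW FILTERING.
CREATE INDEX books_price_idx ON books (price);

SELECT * FROM books WHERE price = 9.99;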
@Mosca_Careca: So you’re saying usually it would be something like:
CREATE TABLE books (
    book_id UUID,
    title TEXT,
    author TEXT,
    price FLOAT,
    PRIMARY KEY (book_id, price)
);
where I could then query like so:
SELECT * FROM books WHERE price >= X AND price <= Y;
?
Yeah sorry, index was not what I meant to say. Like you said, it makes no sense to use an index here, because I want price ranges and not exact pricing.
@Felipe_Cardeneti_Mendes: > PRIMARY KEY (book_id, price)
No, this isn’t good, because a book_id uniquely identifies a book, and thus it can have only one price at a time.
You probably want to follow the books_by_tag approach, where you have:
CREATE TABLE books_by_category (
    category_id uuid,
    category_name text,
    book_id uuid,
    book_price decimal,
    PRIMARY KEY (category_id, book_price)
);
And this will allow you to retrieve books from within a category where the price matches a range:
SELECT book_id FROM books_by_category WHERE category_id = ? AND book_price >= ? AND book_price <= ?;
@Mosca_Careca: Ok, makes sense. In my previous approach I would be creating one partition per book, right?
I understand the approach you’re taking, I just don’t understand why that’s needed. I can see that adding a category to encompass larger groups of books is good. Could I add an “all” category and query from there?
I also have a question about how much redundancy should be present. Say I have the books table and the books_by_tag table. Is it a good idea to have all the info of a book both in the books table and in the books_by_tag table, or is it enough to have just the book_id of a book in the books_by_tag table?
@Felipe_Cardeneti_Mendes: Actually, the PRIMARY KEY would be
PRIMARY KEY(category_id, book_price, book_id)
– as there can be multiple books falling under the same price. But this creates the need to delete and re-insert a book’s row whenever its price changes. So there are some tradeoffs even with that approach.
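To make that tradeoff concrete, a price change under that key layout might look roughly like this (a sketch; the batch is optional and just pairs the two writes):

-- book_price is part of the clustering key, so a price change cannot be
-- an in-place UPDATE: the old row is deleted and a new one inserted.
BEGIN BATCH
    DELETE FROM books_by_category
        WHERE category_id = ? AND book_price = ? AND book_id = ?;
    INSERT INTO books_by_category (category_id, category_name, book_id, book_price)
        VALUES (?, ?, ?, ?);
APPLY BATCH;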
Another way would be to simply get rid of book_price as a clustering column and do an ALLOW FILTERING query instead, if the group is small (as it should be).
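That variant might look roughly like this (a sketch assuming book_price becomes a regular column, so a price change is a plain in-place UPDATE):

CREATE TABLE books_by_category (
    category_id uuid,
    category_name text,
    book_id uuid,
    book_price decimal,
    PRIMARY KEY (category_id, book_id)
);

-- The price range is filtered inside the single category partition,
-- which stays cheap as long as the category is reasonably small.
SELECT book_id FROM books_by_category
    WHERE category_id = ? AND book_price >= ? AND book_price <= ?
    ALLOW FILTERING;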
> Could I add an “all” category and query from there?
You can, but the efficiency of it will depend on how many books there are to scan. Why not break “all” into smaller queries per group?
For your last question, you probably want the minimum data you need in auxiliary tables. For example, you don’t need to hold the book description on books_by_tag, because ultimately you would present that data to the end user only after he clicked on the book. However, you may want to keep the path to a thumbnail, as the usual case for retrieving a price range involves listing the results.
You can also make it as small as possible, and for every book_id matched you simply read from the main table. There’s room for you to play with what works best for you.
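In other words, something along these lines (a sketch assuming the main books table is keyed by book_id alone):

-- Step 1: the narrow auxiliary table yields only the matching ids.
SELECT book_id FROM books_by_category
    WHERE category_id = ? AND book_price >= ? AND book_price <= ?;

-- Step 2: fetch the full book details from the main table, one read per id.
SELECT * FROM books WHERE book_id = ?;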
@Mosca_Careca: Uhm, but then with your approach you’re assuming that the categories are finite, which is not the case for user-input categories (like in my case, I have user-input tags). I still don’t totally understand PRIMARY KEYs in CQL.
I have to admit that I didn’t understand much of this last approach you said. I’ll have to investigate further and get back to it later.
> You can, but the efficiency of it will depend on how many books there are to scan. Why not break “all” into smaller queries per group?
I have just shy of 1M entries in that table. Wouldn’t breaking into smaller groups force me to execute multiple queries, one per tag or category and then unite the result sets? Is that more efficient than a big query?
> You can also make it as small as possible, and for every book_id matched you simply read from the main table. There’s room for you to play with what works best for you.
A gain could be made by avoiding the need for this secondary query, but then the schema wouldn’t be as “universal”.
@Felipe_Cardeneti_Mendes: ah 1M should be ok.