Originally from the User Slack
@Mosca_Careca: This might be a really easy and dumb question, but it’s my first time modeling with CQL so please keep that in mind
Let’s say I have a Table about Books. Each book has certain attributes and one of those is the price. I want to be able to efficiently query by price.
If I was querying by something like tags, I was taught that I could create a secondary table like “books_by_tag” where the PK is the tag name instead of the book ID, and each tag has the books associated with it, somewhat like a FK in SQL.
The thing is, I cannot do that using a float. How would I go about querying by price? Is the only way available to use an index on the price attribute? If so, wouldn’t that be “bad” because it’s potentially scanning a large portion of the dataset?
@Felipe_Cardeneti_Mendes: Normally price would be a clustering key within some partition grouping all the entries you want to retrieve (i.e., the tag in your example).
However, if you want to retrieve all items matching a specific price (this is a bit weird - I don’t see people or websites where you enter “give me all books which match $1 USD”), then an index is the way.
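For reference, that index route might look roughly like this (a sketch assuming a books table with a price column; the index name is arbitrary):

-- A secondary index makes equality lookups on price possible;
-- range scans over the index would still need ALLOW FILTERING.
CREATE INDEX books_price_idx ON books (price);

SELECT * FROM books WHERE price = 9.99;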
@Mosca_Careca: So you’re saying usually it would be something like:
CREATE TABLE books (
    book_id UUID,
    title TEXT,
    author TEXT,
    price FLOAT,
    PRIMARY KEY (book_id, price)
);
where I could then query like so:
SELECT * FROM books WHERE price >= X AND price <= Y;
?
Yeah sorry, index was not what I meant to say. Like you said, it makes no sense to use an index here, because I want price ranges and not exact pricing.
@Felipe_Cardeneti_Mendes: > PRIMARY KEY (book_id, price)
No, this isn’t good, because a book_id uniquely identifies a book, and thus it can have only one price at a time.
You probably want to follow the books_by_tag approach, where you have:
CREATE TABLE books_by_category (
    category_id uuid,
    category_name text,
    book_id uuid,
    book_price decimal,
    PRIMARY KEY (category_id, book_price)
);
And this will allow you to retrieve books from within a category where the price matches a range:
SELECT book_id FROM books_by_category WHERE category_id = ? AND book_price >= ? AND book_price <= ?;
@Mosca_Careca: Ok, makes sense. In my previous approach I would be creating one partition per book, right?
I understand the approach you’re taking, I just don’t understand why that’s needed. I can see that adding a category to encompass larger groups of books is good. Could I add an “all” category and query from there?
I also have a question about how much redundancy should be present. Say I have the books table and the books_by_tag table. Is it a good idea to have all the info of a book both in the books table and in the books_by_tag table, or is it enough to have just the book_id of a book in the books_by_tag table?
@Felipe_Cardeneti_Mendes: Actually, the PRIMARY KEY would be
PRIMARY KEY(category_id, book_price, book_id)
– as there can be multiple books falling under the same price. But this creates the need to delete and re-insert a book’s row whenever its price changes. So there are some tradeoffs even with that approach.
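To make that tradeoff concrete, a price change under that key layout might look roughly like this (a sketch; the batch is optional and just pairs the two writes):

-- book_price is part of the clustering key, so a price change cannot be
-- an in-place UPDATE: the old row is deleted and a new one inserted.
BEGIN BATCH
    DELETE FROM books_by_category
        WHERE category_id = ? AND book_price = ? AND book_id = ?;
    INSERT INTO books_by_category (category_id, category_name, book_id, book_price)
        VALUES (?, ?, ?, ?);
APPLY BATCH;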
Another way would be to simply get rid of book_price as a clustering column and do an ALLOW FILTERING query instead, if the group is small (as it should be).
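That variant might look roughly like this (a sketch assuming book_price becomes a regular column, so a price change is a plain in-place UPDATE):

CREATE TABLE books_by_category (
    category_id uuid,
    category_name text,
    book_id uuid,
    book_price decimal,
    PRIMARY KEY (category_id, book_id)
);

-- The price range is filtered inside the single category partition,
-- which stays cheap as long as the category is reasonably small.
SELECT book_id FROM books_by_category
    WHERE category_id = ? AND book_price >= ? AND book_price <= ?
    ALLOW FILTERING;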
> Could I add an “all” category and query from there?
You can, but the efficiency of it will depend on how many books there are to scan. Why not break “all” into smaller queries per group?
For your last question, you probably want the minimum data you need in auxiliary tables. For example, you don’t need to hold the book description on books_by_tag, because ultimately you would present that data to the end user only after he clicked on the book. However, you may want to keep the path to a thumbnail, as the usual case for retrieving a price range involves listing the results.
You can also make it as small as possible, and for every book_id matched you simply read from the main table. There’s room for you to play with what works best for you.
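In other words, something along these lines (a sketch assuming the main books table is keyed by book_id alone):

-- Step 1: the narrow auxiliary table yields only the matching ids.
SELECT book_id FROM books_by_category
    WHERE category_id = ? AND book_price >= ? AND book_price <= ?;

-- Step 2: fetch the full book details from the main table, one read per id.
SELECT * FROM books WHERE book_id = ?;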
@Mosca_Careca: Uhm, but then with your approach you’re assuming that the categories are finite, which is not the case for user-input categories (like in my case, I have user-input tags). I still don’t totally understand PRIMARY KEYs in CQL.
I have to admit that I didn’t understand much of this last approach you said. I’ll have to investigate further and get back to it later.
> You can, but the efficiency of it will depend on how many books there are to scan. Why not break “all” into smaller queries per group?
I have just shy of 1M entries in that table. Wouldn’t breaking into smaller groups force me to execute multiple queries, one per tag or category and then unite the result sets? Is that more efficient than a big query?
> You can also make it as small as possible, and for every book_id matched you simply read from the main table. There’s room for you to play with what works best for you.
A gain could be made by avoiding the need for this secondary query, but then the schema wouldn’t be as “universal”.
@Felipe_Cardeneti_Mendes: ah 1M should be ok.