Column design and UDT recommendations for a specific problem

Hi, I’m learning about Cassandra/Scylla and I have a question. Let’s say I’m storing books and each book has N fields, like title, alt title, synopsis, etc. Each of these fields is a string and has an associated language (or “unknown”/null). What would be the most efficient way to store this kind of data? I was thinking about two solutions.

The first one consists of a UDT column with title map<text, frozen<set<text>>>, desc map<text, frozen<set<text>>>, and so on: the key is the language, for example “en” or “English”, and the set holds all variants (a language may have a couple). I would add a field to the UDT for every possible field I can think of (title and desc are 2 of 16), like the example below:

CREATE TYPE raw_scraps.fields (
    title frozen<map<text, list<text>>>,
    alt_title frozen<map<text, list<text>>>,
    synopsis_or_description frozen<map<text, list<text>>>,
    background_or_context frozen<map<text, list<text>>>,
    status frozen<map<text, list<text>>>,
    publication frozen<map<text, list<text>>>,
    author frozen<map<text, list<text>>>,
    artist frozen<map<text, list<text>>>,
    serialization frozen<map<text, list<text>>>,
    tag_genre frozen<map<text, list<text>>>,
    tag_theme frozen<map<text, list<text>>>,
    tag_rating frozen<map<text, list<text>>>,
    tag_demographic frozen<map<text, list<text>>>,
    tag_format frozen<map<text, list<text>>>,
    tag_uncategorized frozen<map<text, list<text>>>,
    content_url frozen<map<text, list<text>>>,
    original_language frozen<list<text>>
);

The other idea consists of a single map<text, frozen<fields>> column, where the text key is the language and fields is a UDT with the 16 fields (title set<text>, desc set<text>, and so on) inside.
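
To make the second idea concrete, it would look roughly like this (the names are just placeholders; I froze the inner sets so the UDT can be used as a map value):

CREATE TYPE raw_scraps.fields_per_language (
    title frozen<set<text>>,
    alt_title frozen<set<text>>,
    synopsis_or_description frozen<set<text>>
    -- ... the remaining 13 fields, all frozen<set<text>>
);

CREATE TABLE raw_scraps.books (
    id uuid PRIMARY KEY,
    by_language map<text, frozen<fields_per_language>>  -- key is the language
);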

A naive approach would be to store the language string alongside every possible field, so a UDT with 16 pairs of language + field, but that would consume a lot of disk space. A book may have 10-20 fields in 1-4 languages.
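
Spelled out, the naive version would be something like this (localized_text is just a name I made up):

CREATE TYPE raw_scraps.localized_text (
    language text,  -- e.g. 'en', or null when unknown
    value text
);

-- the book UDT would then repeat this pair for all 16 fields:
--   title set<frozen<localized_text>>,
--   alt_title set<frozen<localized_text>>,
--   ...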

Note that I’m mostly considering disk storage, since I expect Scylla to use fewer resources for the language string if I group fields by their language in an associative collection, but I want to know the general approach for performance, both CPU and disk, and how you would implement this.

Thanks!

My first preference is to use clustering keys, not collections.

e.g.

CREATE TABLE books (
    id uuid,
    attribute_language text,
    title text,
    synopsis text,
    artist set<text>,
    tag set<text>,
    PRIMARY KEY (id, attribute_language)
);

This stores just one instance of attribute_language per (book, language) pair, rather than one per (book, language, attribute) triplet. It also allows selecting just one language, but that may not be in your requirements.
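
For example, fetching just the English attributes of one book becomes a plain query (the uuid is a placeholder):

SELECT title, synopsis, artist, tag
FROM books
WHERE id = 123e4567-e89b-12d3-a456-426614174000
  AND attribute_language = 'en';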

Also: prefer set<> to list<>. Use frozen<> where you can, but be aware it’s less suitable for updating.

Your suggestion might not work for my use case, since not all fields have the same set of languages. For example, I may have 2 titles in English, 1 in Spanish and 4 in Chinese, but 1 synopsis in English, 3 in French and 1 in Russian.

So your suggestion can be suitable, but kind of painful to work with, especially when doing queries: first I’d change title and synopsis to set<text>, and then every book would need N rows (N being the total number of distinct languages). For every field in a row I would store the subset of texts in that language, using null/an empty set when a field has no texts in that language.

This works perfectly if each field uses the same (or almost the same) languages, but fails when that can’t be guaranteed, like in my case :C Maybe with this new information you have another suggestion?

P.S. I’ve read about set being suggested in place of list, but for my use case I won’t be updating the data, only inserting, reading and deleting (AFAIK list degrades on update). I’m not sure about disk space for list vs. set.

thanks :smiley:

Not sure what the problem is: if there are no titles in some language, leave the column NULL. It consumes almost no space.
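
For example, with the table above you simply omit the columns that have no values in a given language (the uuid is a placeholder):

-- English row: has a title and artists, but no synopsis
INSERT INTO books (id, attribute_language, title, artist)
VALUES (123e4567-e89b-12d3-a456-426614174000, 'en', 'Some Title', {'Some Artist'});

-- French row: only a synopsis; title, artist and tag stay NULL at no cost
INSERT INTO books (id, attribute_language, synopsis)
VALUES (123e4567-e89b-12d3-a456-426614174000, 'fr', 'Un résumé');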

list and set are almost the same. Lists only cause trouble if unfrozen.

Anyway, your original proposal is fine too; it just duplicates the language for each field and leads to messy query results, but that’s not a problem for a machine.