Hi, I’m learning about Cassandra/Scylla and I have a question. Let’s say I’m storing books and each book has N fields, like title, alt title, synopsis, etc. Each of these fields is a string and has an associated language (or “unknown”/null). What would be the most efficient way to store this kind of data? I was thinking about two solutions. The first one consists of a UDT column with: title map<text, frozen<set<text>>>, desc map<text, frozen<set<text>>>...
The key is the language, for example “en” or “English”, and the set holds all variants (a language may have a couple). I would add a field to the UDT for every possible field I can think of (title and desc are 2 of 16), like the example below:
raw_scraps.fields(
    title frozen<map<text, list<text>>>,
    alt_title frozen<map<text, list<text>>>,
    synopsis_or_description frozen<map<text, list<text>>>,
    background_or_context frozen<map<text, list<text>>>,
    status frozen<map<text, list<text>>>,
    publication frozen<map<text, list<text>>>,
    author frozen<map<text, list<text>>>,
    artist frozen<map<text, list<text>>>,
    serialization frozen<map<text, list<text>>>,
    tag_genre frozen<map<text, list<text>>>,
    tag_theme frozen<map<text, list<text>>>,
    tag_rating frozen<map<text, list<text>>>,
    tag_demographic frozen<map<text, list<text>>>,
    tag_format frozen<map<text, list<text>>>,
    tag_uncategorized frozen<map<text, list<text>>>,
    content_url frozen<map<text, list<text>>>,
    original_language frozen<list<text>>
)
The other idea consists of a single map<text, frozen<fields>> where the text is the language and fields is a UDT with the 16 fields inside: title set<text>, desc set<text>...
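As a rough sketch, the second idea could look something like this in CQL. The table name books, the key book_id and the type name fields_by_lang are placeholders made up for illustration, and only two of the 16 fields are shown. desc is spelled description here because DESC is a reserved word in CQL, and the “unknown” language is stored as a plain string key since map keys cannot be null.

```cql
-- Hypothetical UDT holding every field for ONE language.
-- Only two fields are shown; the rest would follow the same pattern.
CREATE TYPE IF NOT EXISTS raw_scraps.fields_by_lang (
    title set<text>,
    description set<text>
);

-- Hypothetical table: the map key is the language code,
-- the value is the frozen UDT with that language's fields.
CREATE TABLE IF NOT EXISTS raw_scraps.books (
    book_id uuid PRIMARY KEY,
    fields map<text, frozen<fields_by_lang>>
);

-- Example row: each language string is stored once per book,
-- no matter how many fields exist in that language.
INSERT INTO raw_scraps.books (book_id, fields) VALUES (
    uuid(),
    {
        'en': { title: {'Example Title', 'Example Alt Spelling'},
                description: {'An example synopsis.'} },
        'unknown': { title: {'Untagged Title'} }
    }
);
```

Since the outer map is not frozen in this sketch, a single language entry can be replaced with something like SET fields['en'] = {...} without rewriting the other languages, but the frozen UDT value for that language always has to be written as a whole.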
A naive approach would be to store the language string alongside every field, i.e. a UDT with 16 language+field pairs, but this would consume a lot of disk space. A book may have 10-20 fields in 1-4 languages.
Note that I’m mostly concerned about disk storage, since I expect Scylla to use fewer resources for the language string if I group fields by language in an associative collection. I’d also like to know the general approach for performance, for both CPU and disk, and how you would implement this.
Thanks!