Data modeling question: save space using lookup tables?

I am in the process of finalizing my proof of concept for switching from PostgreSQL to ScyllaDB for the commercial project that I’ve been working on for the past 3 years.
I have working code for the data migration and for most of the main operations that the project currently performs against the existing PostgreSQL database.
I am a bit stuck on one dilemma.

My database has been growing steadily by approximately 21-28 GB per day over the past year, which is roughly 50-70 million rows per day.
Each row contains 12 numeric and timestamp fields, as well as 4 text fields whose values vary between 1 and 6 UTF-8 characters in length.
Those text fields contain values from 2 different lists of strings, each of which contains fewer than 300 unique string values.
One of those 4 fields is going to be part of the composite partition key in my ScyllaDB table (the other part of the partition key is a date).

In other words, a substantial percentage of my database contains repeated values in the NoSQL/denormalized schema.

Obviously, in a relational database those fields would contain foreign keys to one of the two lookup tables.

Space is not an issue yet, but I anticipate it will become one in less than half a year.

Would you recommend that I create and use (in the client-side code) the lookup tables, or would I be better off letting ScyllaDB split the table across multiple servers based on partitions?
I estimate that the cost of renting some extra servers (if the schema is not optimized for space savings) is roughly equivalent (or at least is comparable to) the cost of dealing with the increased code complexity if the schema is optimized for space.
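
For illustration, the lookup-table variant I have in mind would look roughly like this (sketch only; the table and column names below are placeholders, not my real schema), with the client code keeping both lookup tables cached in memory and translating strings to codes on write and codes back to strings on read:

create table if not exists mykeyspace.lookup_list_a
(
    code smallint,
    value text,
    primary key (code)
);

create table if not exists mykeyspace.lookup_list_b
(
    code smallint,
    value text,
    primary key (code)
);

-- Each list has fewer than 300 unique values, so a 2-byte smallint code is more than enough.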

So, each row is something close to 0.3 - 0.5 KB in size.
How significant would database space savings be if I replace roughly 12 UTF8 characters with 4 short-integer foreign key values, but add a database overhead of 2 extra tables and all related compaction and other maintenance etc.?

I do not have nearly enough experience with ScyllaDB to figure this out, and I would greatly appreciate a nudge in either of these 2 directions.

Thank you very much in advance.

Can you please post the current schema plan here? It’s hard to extract the tables from the text above.

Hi,
I replaced the real column names with more readable ones and sorted the list by type, which also added to readability:

create table if not exists mykeyspace.mytablename
(
    key_1_string text,
    key_2_string text,
    id int,
    long1 bigint,
    long2 bigint,
    long3 bigint,
    long4 bigint,
    text1 text,
    text2 text,
    text3 text,
    text4 text,
    text5 text,
    text6 text,
    text7 text,
    text8 text,
    text9 text,
    double1 double,
    double2 double,
    double3 double,
    double4 double,
    int1 int,
    int2 int,
    int3 int,
    int4 int,
    primary key ((key_1_string, key_2_string), id)
)
with comment = 'One huge table with per-key1-per-key2 partitioning.'
and caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
and compaction = {'class': 'SizeTieredCompactionStrategy'}
and compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
and dclocal_read_repair_chance = 0
and speculative_retry = '99.0PERCENTILE';

In this modified schema, the values of key_2_string (which is part of the composite partitioning key) and text1, text2, and text3 look like ‘A’ or ‘BB’ or ‘ABC’ or ‘DDDD’ and sometimes 5 characters like ‘ABCDE’.
All characters are Latin and case-insensitive (so I keep them all as capital letters).
Those 4 values in each row can be replaced with foreign keys (with or without a real FK relationship, depending on whether I am using PostgreSQL or ScyllaDB), and those key values will always fit in an int16 type.
The remaining 6 TEXT columns always have 1-character values, so there is no need to optimize them.
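
To make this concrete, the space-optimized variant would change only those 4 columns, roughly like this (sketch only; the _code column names are placeholders):

create table if not exists mykeyspace.mytablename_optimized
(
    key_1_string text,
    key_2_code smallint,   -- was: key_2_string text
    id int,
    text1_code smallint,   -- was: text1 text
    text2_code smallint,   -- was: text2 text
    text3_code smallint,   -- was: text3 text
    -- ...all the other bigint, double, int and 1-character text columns stay exactly as before...
    primary key ((key_1_string, key_2_code), id)
);

-- Per row, ~12 Latin UTF-8 characters (~12 bytes of text payload) become 4 x 2-byte smallint codes (8 bytes).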

The questions I am asking for help with are:

  1. How significant would database space savings be if I replace roughly 12 UTF8 characters with 4 short-integer foreign key values, but add a database overhead of 2 extra tables and all related compaction and other maintenance etc.?

  2. Is there a chance that a query that retrieves a whole partition (with paging, of course: some partitions have 2.5-3 million rows each) will speed up considerably because there will be less data (fewer bytes) to read and transfer?

Thank you very much!

EDITED: I added a query-performance-related question because my C++-based proof-of-concept code using the ScyllaDB C++ driver, running against a pretty decent local (VirtualBox) virtual machine (6 GB of DDR4 RAM, 4 CPU cores, with the VM dedicated to ScyllaDB and running nothing else except the OS), currently performs only about 5%-10% better than the same query running on PostgreSQL in a very “apples-to-apples” scenario.
I am trying to optimize for speed first, and hopefully for storage (but that is not necessary, although desirable).
Thank you!

  1. How significant would database space savings be if I replace roughly 12 UTF8 characters with 4 short-integer foreign key values, but add a database overhead of 2 extra tables and all related compaction and other maintenance etc.?

I think a lookup table makes sense if your data somehow follows a Gaussian distribution and the data is immutable.

One aspect to keep in mind (and perhaps play with before trying any optimizations) is that data will get compressed only within a chunk. By default, chunk_length_in_kb is 4 KB. See: Data Definition | ScyllaDB Docs

On disk SSTables are compressed by block (to allow random reads). This defines the size (in KB) of the block. Bigger values may improve the compression rate, but increases the minimum size of data to be read from disk for a read.

Therefore, the higher the chunk size, the better compression rates you may get; however, reads will require more IOPS to retrieve a block. This may be fine (or not), especially if you frequently need to scan through several rows, as you mentioned in your question (2).
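
For example, something along these lines (sketch only; 16 KB is just an arbitrary value to start experimenting with, not a recommendation):

alter table mykeyspace.mytablename
with compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor',
                    'chunk_length_in_kb': 16};

Note that the new setting only applies to SSTables written after the change; existing SSTables pick it up as they get rewritten by compaction.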

https://www.scylladb.com/2017/08/01/compression-chunk-sizes-scylla/ also has a very nice table near the end, plus some earlier benchmarks showing results across different chunk sizes.

Depending on your latency requirements, you may even experiment further with different compression algorithms; the main trade-off here is CPU utilization. Compression in ScyllaDB, Part One - ScyllaDB discusses the basics, and Part Two (linked at the bottom of that post) discusses the results and the impact of different compression algorithms at different chunk sizes. You may want to experiment with these to find a good balance.
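
For instance, assuming your ScyllaDB version supports Zstd for SSTable compression (check the docs for your release), switching the algorithm is just another ALTER (again, a sketch to experiment with, not a recommendation):

alter table mykeyspace.mytablename
with compression = {'sstable_compression': 'ZstdCompressor',
                    'chunk_length_in_kb': 16};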

One last thing that isn’t clear is whether you frequently update text1-text9 (or text1-text3, or text4-text9). If the answer is no, then you may also want to try a UDT. See If You Care About Performance Use UDT's
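
A rough sketch of what that could look like, assuming the text columns are always written together (the type and column names here are just placeholders):

create type if not exists mykeyspace.text_bundle
(
    text1 text,
    text2 text,
    text3 text,
    text4 text,
    text5 text,
    text6 text,
    text7 text,
    text8 text,
    text9 text
);

-- The nine separate text columns would then become a single frozen column, e.g.:
--   texts frozen<text_bundle>
-- A frozen UDT is serialized and stored as one cell, which is where the savings come from.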

  2. Is there a chance that a query that retrieves a whole partition (with paging, of course: some partitions have 2.5-3 million rows each) will speed up considerably because there will be less data (fewer bytes) to read and transfer?

As you may guess, “considerably” is hard to promise. But yes, reading and transferring fewer bytes should improve the query times.

ScyllaDB C++ driver, running against a pretty decent local (VirtualBox) virtual machine (6 GB of DDR4 RAM, 4 CPU cores, with the VM dedicated to ScyllaDB and running nothing else except the OS), currently performs only about 5%-10% better than the same query running on PostgreSQL in a very “apples-to-apples” scenario.

Nitpick: there is indeed very little room for gains on commodity hardware. The reason is that it is relatively simple to maximize utilization with the specs you mentioned. You should expect a much higher percentage on our recommended instances (~8 GB of RAM per vCPU and locally attached SSDs). So consider making it apples-to-apples in a more realistic scenario ;-)

I don’t even know where to begin, to thank you for your reply…
Every paragraph… heck, every sentence is eye-opening and more than helpful. I have books on my shelf that contain less useful information than your reply.

The most humbling thing for me is: your reply contains answers to questions that I did not even know I had to ask.

Thanks to multiple pieces of advice that you gave me, it looks like I have my work cut out for me and just need to do several things now.

The only thing I want to add is that my work on the local virtual machines is for testing only; the production server is a dedicated hardware machine with 64 GB of DDR4 RAM, and now I understand (thanks to your comment) why the speed improvement over PostgreSQL was smaller than I had expected.

THANK YOU VERY MUCH!!!
