Need help with "transaction" modeling

I apologize if I've picked the wrong category; this is my first topic.

Preface
I have a relational SQL database that works pretty well, but scaling it is a real pain,
so ScyllaDB came to my attention, and I saw in it a potential salvation.

To the point
I studied the ScyllaDB documentation and figured out how to migrate everything but one item:
in my relational database, I use transactions to allocate usage to data.
To oversimplify, it looks like this:

BEGIN;

SELECT id, line FROM table_with_important_rows WHERE usages < 123 AND concurrent_usages < 10 LIMIT 50 FOR UPDATE;
-- this locks the 50 matching rows until they are updated or the transaction is rolled back

UPDATE table_with_important_rows SET usages = usages + 1, concurrent_usages = concurrent_usages + 1 WHERE id IN (...);
-- this then increments the usage counts and releases the locks

COMMIT;

After reading probably every article on the subject of LWT and filtering in ScyllaDB, here's the best I've come up with:

CREATE TABLE test.table_with_important_rows (
    id uuid,
    usages int,
    concurrent_usages int,
    line text,
    PRIMARY KEY (id));

CREATE INDEX resource_usages_idx ON test.table_with_important_rows((id), usages);

CREATE INDEX resource_concurrent_usages_idx ON test.table_with_important_rows((id), concurrent_usages);

But the problem still remains. In order to "grab" the rows I need, I have to do a read first,
because a secondary index doesn't support update and delete operations; and even if it did,
there would be no way to limit the number of rows that get updated. In addition, a secondary index cannot be used with the
counter type, which means there must be a read in order to increment or decrement the value.
The problem is that I don't see any way to perform the read and the subsequent update atomically, and if I don't do it atomically, there's no way to avoid data corruption.
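For example, with a plain read followed by an unguarded write, two clients can interleave and silently lose an increment (a sketch of the failure I'm worried about):

-- clients A and B both read the same row and both see usages = 9
SELECT usages, concurrent_usages FROM table_with_important_rows WHERE id = ...;

-- both then write back 10, and one of the two increments is lost
UPDATE table_with_important_rows SET usages = 10 WHERE id = ...;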

The only solution I see is to restrict concurrent work on any single partition at the service level,
and work with the database without locking.

How can I solve my problem at the database level, and if there is a solution, how resource-intensive is it?
This is a very popular table in my database, and there will be a lot of similar operations; I would say this is the main type of load.
Will ScyllaDB perform well under this kind of load?

Thanks in advance for the answer

Hi @cmandra, welcome to the ScyllaDB community!

I don't understand the data modeling you've come up with. Neither of the two indexes you created will allow you to retrieve rows without specifying id, unlike in your SQL example. Using an index therefore seems redundant, as pretty much every partition in your proposed base table will always contain a single row. Thus, if you are always specifying id in your filtering clause (which is unlike the SQL transaction you demonstrated), you could just as well use ALLOW FILTERING instead.
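That is, something like this (a sketch against your original table; since the full partition key is given, the filtering only ever touches that one row):

SELECT id, line FROM test.table_with_important_rows
WHERE id = ? AND usages < 123 AND concurrent_usages < 10
ALLOW FILTERING;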

That said, your table also contains a column of type text (line), and this, rather than the indexes you created, seems to be the main impediment to using a counter table.

Lastly, we get to your main problem. At this point, I assume (given your previous feedback) that you have had a chance to read and navigate through the code in Getting the Most out of Lightweight Transactions in ScyllaDB, where we demonstrate how a transfers table is used for a client to claim ownership of a transfer using a TTL.
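In that post, the claim itself is an LWT update with a TTL, roughly along these lines (a sketch; the table and column names here are illustrative rather than the article's exact schema):

-- claim the transfer for this client, but only if nobody else holds it;
-- the TTL makes the claim expire automatically if the client dies mid-way
UPDATE transfers USING TTL 60
SET client_id = ?
WHERE transfer_id = ?
IF client_id = null;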

Happy to discuss further.

Hi. Thank you very much for your reply!
Yes, when I got to testing I realized that some of my assumptions were wrong.
Also, my table definition was incorrect; here is a corrected version:

CREATE TABLE test.table_with_important_rows (
    resource_id uuid,
    id uuid,
    usages int,
    concurrent_usages int,
    line text,
    PRIMARY KEY (resource_id, id));

The data will be partitioned by resource_id; that's about 1k–100k records per partition, so ALLOW FILTERING doesn't apply here.

I read the article. It's a great implementation that looks like a solution to the problem,
but my main problem is with the conditional read. I need the following condition to be satisfied:
usages < 123 AND concurrent_usages < 10
But I don’t see a way to accomplish this with database tools without resorting to redundant reads.
Here’s the next thing I came up with:

CREATE TABLE test.table_with_important_rows (
    resource_id uuid,
    id uuid,
    usages int,
    concurrent_usages int,
    line text,
    PRIMARY KEY (resource_id, usages, concurrent_usages, id));

Actually, this approach falls apart right away, as I've run into another problem: if I try
to apply a WHERE < clause to usages and concurrent_usages, which are now part of the primary key, I get an error:
PRIMARY KEY column "concurrent_usages" cannot be restricted as preceding column "usages" is not restricted
and there seems to be no way around this restriction.

But, if this restriction could be bypassed, I would probably do something like this
SELECT id FROM test.table_with_important_rows WHERE resource_id = 00000000-0000-0000-0000-000000000000 AND usages < 123 AND concurrent_usages < 10 LIMIT 100
DELETE FROM test.table_with_important_rows WHERE resource_id = 00000000-0000-0000-0000-000000000000 AND usages != -1 AND concurrent_usages != -1 AND id IN (...) IF usages < 123 AND concurrent_usages < 10
Oh… That won’t work either
One or more errors occurred. (PRIMARY KEY column 'concurrent_usages' cannot have IF conditions)

I think I’ve taken a wrong turn somewhere, please let me know if you see a solution to my problem.

Thanks for your time!

You cannot have usages or concurrent_usages as part of your clustering key, because key columns cannot be updated. Therefore, your latter table isn't going to work, as you noticed.

The first table you've shown is closer to what you actually want. A PRIMARY KEY (resource_id, id) will result in all ids being grouped inside a resource_id partition; in other words, every id will be a distinct row for a given resource_id. If I understood you correctly, this is how you want your data organized.
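For instance (values here are illustrative), several ids inserted under the same resource_id all land in one partition and come back together:

INSERT INTO test.table_with_important_rows (resource_id, id, usages, concurrent_usages, line)
VALUES (9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d, uuid(), 0, 0, 'first');
INSERT INTO test.table_with_important_rows (resource_id, id, usages, concurrent_usages, line)
VALUES (9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d, uuid(), 0, 0, 'second');

SELECT id, line FROM test.table_with_important_rows
WHERE resource_id = 9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d;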

This brings us to the second part of your problem, related to your reads.

As you have already realized, the base table doesn't have a clustering key on either usages or concurrent_usages, and therefore doesn't allow you to efficiently run WHERE usages < ? AND concurrent_usages < ? (given that your previous input indicated the need to walk through potentially 100k records).

You may overcome that by relying on a Materialized View. Like this:

CREATE MATERIALIZED VIEW test.usages_view AS
    SELECT *
    FROM test.table_with_important_rows
    WHERE resource_id IS NOT NULL AND id IS NOT NULL AND usages IS NOT NULL AND concurrent_usages IS NOT NULL
    PRIMARY KEY (resource_id, usages, id);

This will allow you to run queries like: SELECT * FROM test.usages_view WHERE resource_id = ? AND usages < ?

What about concurrent_usages? Well, you filter on it:

SELECT * FROM usages_view WHERE resource_id = ? AND usages < ? AND concurrent_usages < ? ALLOW FILTERING

The database will efficiently look up your usages restriction and then simply filter the matching rows on concurrent_usages until it reaches your LIMIT. Bingo!

And yes, you can have more than one view. In particular, if you need to restrict only by concurrent_usages, another view would very likely be more efficient than relying on the ALLOW FILTERING trick here.
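Such a second view could look like this (a sketch; the view name is illustrative):

CREATE MATERIALIZED VIEW test.concurrent_usages_view AS
    SELECT *
    FROM test.table_with_important_rows
    WHERE resource_id IS NOT NULL AND id IS NOT NULL AND usages IS NOT NULL AND concurrent_usages IS NOT NULL
    PRIMARY KEY (resource_id, concurrent_usages, id);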

However, it is hard to say for certain what will work best and what will not without actually testing with your data.

After that is done, you go back to your base table and issue the mutation statements you need.
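Putting it together, the flow could look roughly like this (a sketch; the values bound in the IF clause are the ones you just read, so each update is an optimistic compare-and-set, and the returned [applied] column tells you whether you won the race):

-- 1. pick candidates from the view
SELECT id, usages, concurrent_usages FROM test.usages_view
WHERE resource_id = ? AND usages < 123 AND concurrent_usages < 10
LIMIT 50 ALLOW FILTERING;

-- 2. for each returned row, conditionally increment on the base table
UPDATE test.table_with_important_rows
SET usages = ?, concurrent_usages = ?        -- the values read in step 1, plus one
WHERE resource_id = ? AND id = ?
IF usages = ? AND concurrent_usages = ?;     -- the values read in step 1
-- if [applied] comes back false, another client got there first; re-read and retry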

Note: you have probably realized by now that ScyllaDB doesn't do any sort of locking. As such, reads from your view may be inconsistent if concurrent clients modify the base table data. Given this data model, you ideally want to ensure that only a single client updates a given resource_id at a time; that way you stay conflict-free.

If you would like to discuss more on potential other solutions and are willing to share additional details of your use case, feel free to DM me.
Full disclosure, I am a ScyllaDB employee. :slight_smile:
