I currently use a recursive relational-database CTE that walks parentage information to find all ancestors or all descendants of a given entity. Even on a small sample (150,000 entities), it can take over an hour to return a result. I want to query 10 to 20 million entities and return a result in under a second, and I cannot get anywhere close by iterating in response to a user query. I think Scylla may come to the rescue by storing the mass of information produced by iterating as entities are added, so user queries don’t have to wait for the iteration. Is this possible with Scylla?
I am considering the following approach:
Each group of interrelated entities will have its own table (probably needing a compound primary key?). One thing I’ve noticed: GROUP is a reserved word in CQL, so a table literally named group would need double quotes; I’m using the name as a placeholder here:
CREATE TABLE relations.group (
    entity_id INT,
    ancestor_id INT,
    measure DECIMAL,
    PRIMARY KEY (entity_id, ancestor_id)
) WITH CLUSTERING ORDER BY (ancestor_id DESC);
If I understand the model right, this table will in some cases have as many as 8 million partitions (one per entity). Each partition will potentially hold thousands of rows, each pairing an ancestor with a high-precision measure.
If it didn’t hurt performance or functionality, I might add a few columns of entity metadata, but the functionality really just depends on this core.
A table will need to service the following types of query:
Ancestor query:
-- I realize ORDER BY in CQL only works on clustering columns, so this
-- probably needs measure in the clustering key (see the sketch below)
SELECT *
FROM relations.group
WHERE entity_id = 101010
ORDER BY measure DESC;
I’ll want to paginate the result set, showing every ancestor of the entity in descending order of the measure.
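A minimal sketch of what I mean, assuming measure can be promoted into the clustering key (the table name group_by_measure is just my placeholder; ancestor_id is kept as a tiebreaker so rows stay unique):
CREATE TABLE relations.group_by_measure (
    entity_id INT,
    ancestor_id INT,
    measure DECIMAL,
    PRIMARY KEY (entity_id, measure, ancestor_id)
) WITH CLUSTERING ORDER BY (measure DESC, ancestor_id ASC);

-- ancestors now come back already sorted by measure, ready for paging
SELECT ancestor_id, measure
FROM relations.group_by_measure
WHERE entity_id = 101010;
The trade-off, as I understand it, is that changing an ancestor’s measure would mean deleting the old row and inserting a new one, since measure would now be part of the key.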
Descendant query:
-- since ancestor_id isn’t the partition key, I gather this needs a
-- secondary index, a materialized view, or ALLOW FILTERING (see below)
SELECT entity_id, measure
FROM relations.group
WHERE ancestor_id = 101010;
If I added other columns about the entity, I would order the result set by one of those additional columns. Otherwise I would post-process both of these SELECT results with additional metadata from the relational database.
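For the descendant lookup itself, a materialized view keyed by ancestor seems like the natural fit; a sketch, assuming the base table above (the view name group_by_ancestor is just a placeholder):
CREATE MATERIALIZED VIEW relations.group_by_ancestor AS
    SELECT entity_id, ancestor_id, measure
    FROM relations.group
    WHERE entity_id IS NOT NULL AND ancestor_id IS NOT NULL
PRIMARY KEY (ancestor_id, entity_id);

-- all descendants of a given ancestor in one partition read
SELECT entity_id, measure
FROM relations.group_by_ancestor
WHERE ancestor_id = 101010;
The catch I can see is the size estimate above: an ancestor with millions of descendants becomes a single multi-million-row partition in the view, which is presumably something to watch.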
Finally, I would ideally input new entities with something like a relational database’s INSERT ... SELECT, but it seems that can’t be done in CQL? In which case I was wondering if something along the lines of the following can be done on the database side, before running the INSERT:
-- as I understand it, this would collapse everything into one sum:
-- CQL’s GROUP BY is limited to primary-key prefixes, so per-ancestor
-- sums across partitions can’t be computed server-side
SELECT ancestor_id, sum(measure)
FROM relations.group
WHERE entity_id IN (111111, 100000);
The idea being that the measures of each ancestor the chosen entities have in common would be added together. That aside, one way or another, the immediate ancestors would be selected and the measures of their common ancestors aggregated to assemble the ancestors and associated measures for the new entity; my fallback is to do the merge client-side, as sketched below.
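Since CQL can’t aggregate across partitions, the database side would presumably be just one partition read per immediate parent, with the per-ancestor sums computed in the application:
-- one cheap partition read per immediate parent; the application then
-- sums measure per ancestor_id across the result sets and inserts the
-- combined rows under the new entity’s entity_id
SELECT ancestor_id, measure FROM relations.group WHERE entity_id = 111111;
SELECT ancestor_id, measure FROM relations.group WHERE entity_id = 100000;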
New entity creation:
-- IF NOT EXISTS makes this a lightweight transaction (Paxos), which is
-- much slower than a plain write; since CQL INSERTs are upserts anyway,
-- it may be unnecessary here
INSERT INTO relations.group (entity_id, ancestor_id, measure)
VALUES (124234, 124344, 0.002543251651324851654654654654684351643435464684684656351)
IF NOT EXISTS;
With potentially thousands of ancestors per entity, there is doubtless a way to upload new entities more efficiently; I just haven’t been able to track it down yet. One possibility is sketched below.
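A sketch of that possibility, assuming all rows for a new entity share its partition key (they do here, since entity_id is the partition key), so an unlogged batch lands on a single partition; the IDs and measures are made up for illustration:
BEGIN UNLOGGED BATCH
    INSERT INTO relations.group (entity_id, ancestor_id, measure)
    VALUES (124234, 124344, 0.002543251651324851);
    INSERT INTO relations.group (entity_id, ancestor_id, measure)
    VALUES (124234, 124345, 0.000812730445120078);
    -- ... one INSERT per ancestor of the new entity
APPLY BATCH;
Even so, with thousands of ancestors I’d presumably chunk this into several batches, or issue concurrent asynchronous inserts, since servers cap batch sizes by default.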
Overall
I anticipate users being able to query among tens of millions of entities, each with thousands of ancestors, and return associated measure info for either:
- one partition with all of its ancestor rows (ancestor query), or
- every row associated with one ancestor (descendant query).
They also need to be able to add new entities, each with thousands of ancestor rows built by aggregating a small number (at least two) of existing entities.
As long as results can be sorted and paged, no query should be onerous. However, the sort may occasionally have to take into account up to 8 million rows in a descendants query.
My initial test case will be smaller: about 200,000 entities, each with hundreds to thousands of ancestors, running on a single virtual private server.
Obviously, iterating up front generates a lot of data. I’m hoping ScyllaDB can handle it without keeping my users waiting long.
Relational DBs definitely cannot handle the iterate-per-request approach in any kind of reasonable time at any necessary scale.