Originally from the User Slack
@Arthur: I have a question: in a scenario where I am making an editor application, where projects are a tree of objects, and where I should be able to export the whole tree regularly, what would be the best way to model that?
My assumption would be that I would need to use a single table for all projects, and to use the project uid as the partition key. That takes care of the requirement of exporting the whole tree often, since having everything for a project in a single partition means a single node can read it all, bypassing a lot of unnecessary round-trips. I am not sure, though, whether that is prone to the emergence of hot partitions: is it bad if a single partition becomes too big, since it will make a single node do a lot of work by itself, leaving little resources for other projects it might be responsible for?
Then, since I would need to identify each project item, I would use a ULID as the sorting key. Since the project might be edited from two clients at once, this guarantees uniqueness and gets the items sorted from oldest to newest, which could be useful for some application features. I assume it would also be better for performance, since the latest created objects in the tree, which are the most likely to be actively modified by the user, will be at the bottom of the partition, making them “easy” for Scylla to find? Really unsure about that, though.
One thing I am really unsure about is that for this model to work, each field of each object type would have to be represented in the table’s rows, which could mean ending up with large rows, which is again bad modelling…
Have I gotten things backwards? Should I split up tables per item type instead, to keep rows and partitions smaller, despite the fact that it will also fragment the data for a single project all across the cluster, knowing that reading the entirety of a project would be a relatively frequent operation?
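(For reference, here is a minimal sketch of the single-table model Arthur describes, using the Python cassandra-driver against a local ScyllaDB node. The keyspace `editor`, table `project_items`, and the non-key columns are illustrative assumptions, not from the thread.)

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS editor
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# One partition per project; items clustered by ULID, oldest first.
# A ULID's 26-character Crockford-base32 form sorts lexicographically
# in creation order, so a plain text column is enough.
session.execute("""
    CREATE TABLE IF NOT EXISTS editor.project_items (
        project_id uuid,
        item_id    text,   -- ULID string
        item_type  text,
        payload    text,   -- every item type's fields flattened here (the "large rows" worry)
        PRIMARY KEY (project_id, item_id)
    ) WITH CLUSTERING ORDER BY (item_id ASC)
""")

# Exporting a whole project is then one single-partition scan:
export = session.prepare(
    "SELECT * FROM editor.project_items WHERE project_id=?")
```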
@Felipe_Cardeneti_Mendes: Yes for a single table holding all the projects, but then perhaps break it down into some granularity and incrementally export the tree contents. For example, `GET /project/?` could retrieve all the nested structure (folders?), where `.` would always be your root and always exist.
From there you can have projects with a small number of elements, or a lot, but in general bounded (i.e., a project with 1,000 folders would already be an outlier, but not necessarily an issue).
At this point the client can already mirror the source structure, while you asynchronously fetch inner objects. For example, `GET /project/?/identifier` would then fetch all inner objects within a specific project and identifier, sorted.
The good thing about this approach is that it allows you to asynchronously fetch from multiple sources at once, and you don’t care about the ordering, as ultimately each object is unique. Even if the project has thousands of nested structures, you can just run thousands of queries concurrently and have them spread across the cluster to maximize its capacity.
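(A sketch of that fan-out, assuming the `projects` and `objects` tables defined further down in this answer, and using the Python driver's concurrent-execution helper:)

```python
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

session = Cluster(["127.0.0.1"]).connect("editor")

def export_project(session, project_id):
    """project_id is a uuid.UUID; returns {path: [rows]}."""
    # Step 1: one cheap single-partition read mirrors the structure.
    paths = [row.path for row in session.execute(
        "SELECT path FROM projects WHERE project_id=%s", (project_id,))]

    # Step 2: fan out one query per path. The driver runs them
    # concurrently, and each lands on whichever replicas own that
    # (project_id, path) partition, spreading load across the cluster.
    stmt = session.prepare(
        "SELECT * FROM objects WHERE project_id=? AND path=?")
    results = execute_concurrent_with_args(
        session, stmt, [(project_id, p) for p in paths],
        concurrency=128, raise_on_first_error=False)

    tree = {}
    for success, rows_or_exc in results:
        if not success:
            raise rows_or_exc   # a failed per-path query surfaces here
        for row in rows_or_exc:
            tree.setdefault(row.path, []).append(row)
    return tree
```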
We then get to the last part of your question: identifying each project item. The use of a ULID as a sorting key will definitely guarantee uniqueness, and you may control the sorting via the `CLUSTERING ORDER BY` clause (ideally, would you prefer it sorted from newest or oldest?). But then, whenever you traverse the tree you would always read not just the latest objects but also the inactive ones (this alone brings up questions such as merging, losing updates, etc., but I’ll keep those aside for now). Worse: unless you keep track of each object in its own separate table, you would get a timestamp-ordered list of changes of mixed objects! Consider a folder with 100 files. You may have 10K changes on top of 99 files, while 1 file was never modified since its creation, and you will only reach it at the top or bottom (depending on your clustering order) of your scan. What would you do with the remaining 9.9K rows you just read?
So perhaps an ideal approach is this:
Table projects uniquely identifies a project and has a sorting key identifying the structure within that project: `PRIMARY KEY (project_id, path)`, where the `.` path always exists and is the parent of all others.
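(One possible DDL for this table; the column types and the `editor` keyspace are assumptions:)

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("editor")
session.execute("""
    CREATE TABLE IF NOT EXISTS projects (
        project_id uuid,
        path       text,   -- '.' is the root and always exists
        PRIMARY KEY (project_id, path)
    )
""")

# The whole structure of a project (e.g. to serve GET /project/?)
# is one single-partition query:
structure = session.prepare(
    "SELECT path FROM projects WHERE project_id=?")
```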
Table objects contains all fresh objects within a given path for a specific project: `PRIMARY KEY ((project_id, path), object_id)`. Note we do not store the timestamp as part of the key here. This allows a simple `SELECT * FROM objects WHERE project_id=? AND path=?` to always retrieve the latest state of each object_id.
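(A sketch of that table; the non-key columns are assumed. Because object_id is the only clustering column, re-saving an object is an upsert that overwrites the previous row in place:)

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("editor")
session.execute("""
    CREATE TABLE IF NOT EXISTS objects (
        project_id uuid,
        path       text,
        object_id  text,   -- ULID string; unique per object
        body       text,
        PRIMARY KEY ((project_id, path), object_id)
    )
""")

# The "checkout" read: each object appears exactly once, in its
# latest state, since writes to the same object_id overwrite.
checkout = session.prepare(
    "SELECT object_id, body FROM objects WHERE project_id=? AND path=?")
```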
Then you can have a third table where you keep track of changes: `PRIMARY KEY ((project_id, path), object_id, ts)`. Now you have versioning, but the main “checkout” part of the process is much more lightweight.
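(And a sketch of the history table, again with assumed non-key columns; clustering additionally on `ts` turns each write into a new row instead of an overwrite:)

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("editor")
session.execute("""
    CREATE TABLE IF NOT EXISTS object_changes (
        project_id uuid,
        path       text,
        object_id  text,
        ts         timestamp,
        body       text,
        PRIMARY KEY ((project_id, path), object_id, ts)
    ) WITH CLUSTERING ORDER BY (object_id ASC, ts DESC)
""")

# Full version history of a single object, newest first:
history = session.prepare(
    "SELECT ts, body FROM object_changes "
    "WHERE project_id=? AND path=? AND object_id=?")
```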
I may have understood some things wrong about your use case, but I hope this helps.