I’m running into issues while modeling the data for an existing application and I’m looking for help and/or additional insights on how to approach my use case.
I ran through the related Scylla university topics, but my model seems way different to the heartbeat monitoring use case there and I’m even starting to doubt whether Scylla is a good fit for this project.
The issue is that
my model has 3 different unique ID’s that I need to be able to query on, while two of them are optional.
I need to be able to query on a timestamp, but the timestamp value is a single value per partition key. I don’t want to break upserts and end up with multiple records with different timestamps for one partition key.
This is what my domain looks like:
USERS
ID (guid: unique + required) => PARTITION KEY
User App ID (guid: unique + optional)
User Device ID (guid: unique + optional)
Last Visit (timestamp: required)
Info (text: required)`
Here’s the queries I need to run:
SELECT
ID = ?
App ID = ?
Device ID = ?
Last Visit < ? (to prepare a delete by ID)
~full table scan~ (token ranges seem to make this feasible?)
INSERT/UPSERT UPDATE
ID = ?
DELETE
ID = ?
Last Visit < ? (if not possible, I would select on this and then delete by ID, as referenced above)
(The user count could run up to 50 million.)
Do I need to create more than one table to be able to contain the objects with all 3 "ID"s? If so, should I copy the data or just store the ID’s so I can fetch them from the main table?
You would use create index to create what is a secondary/table index that allows direct lookups. It’s not easy by default to do this, because the point is people need to learn that this is slow (relatively speaking), but when you need to do it you can do it.
create table person (
user_id text,
app_id text,
user_id_app_id text,
name text,
) with default_time_to_live = 3600;
create index person_user_id on person(user_id);
create index person_app_id on person(app_id);
create index person_user_id_app_id on person(user_id_app_id);
create index person_user_id_app_id on person(user_id, app_id);
You don’t need to manually search for and delete old data, just use the default_time_to_live to make the data magically disappear after the maximum age. This saves needing to scan your entire database regularly.
For a full table scan, you can do this using the primary key and limit, i.e. Fetch the first 1000 records, then look at the uid of record 999, to do the next query, ie:
select * from person where uid > {last_seen_id} limit 1000`
Assuming of course your querying ordered data. It just depends on what your primary key is and how its sharded.
I had no idea Scylla supports indexes, not sure how I missed that in the docs.
My post may have not made it entirely clear, but selects on app ID and device ID are also full table scans. So I wonder -especially given your remark about the perf- if an index will be fast enough for that. Do you have any idea if the index solution can cover more than single selects?
TLL would indeed be a good option for the cleanup of old data.
All three ID properties are GUID’s, so I don’t think it can be considered ordered data for which I can do greater than checks.
Additionally, do you think that approach would be faster than the token range scans?
Regarding querying rows in an ordered way, I’ve not had to do this very often, but one common solution is to use a timeuuid. It has some downsides, but it is a quick and easy way to make sure the record uuid’s can be scanned easily:
With regards to index performance, it’s not too much different to mysql, it simply means a second copy of your data is being stored. It’s not that much different to how mysql handles indexes.
The questions you are asking are standard cql (nosql) problems, and have fairly standard answers. Some Youtube video tutorials on how to design a nosql database might be helpful. (Thats how I learn’t all this stuff, back in the day there were some interesting ones, like “how would you model YouTube in Cassandra” that contained a whole lot of helpful insight into how to solve these types of problems in a distributed database)