Whats the correct model schema for my use case?

Hello, I am the owner of a roblox extension and I had recently encounter dificulties with ScyllaDB that made me consider switching to MongoDB. My application is stable right now but I want to know what I did wrong in my context.

I have the following database:

await client.execute(`
    CREATE TABLE IF NOT EXISTS game_servers_new (
      server_id text,
      last_updated timestamp,
      place_id bigint,
      first_seen timestamp,
      datacenter int,
      PRIMARY KEY (server_id)
    )
  `);

My queries are to SELECT based solely on the following fields server_id , place_id and last_updated. I had created secondary queries for those fields like this:

await client.execute(`
CREATE INDEX IF NOT EXISTS game_servers_new_last_updated_idx ON game_servers_new (last_updated );
`);

My issue is with this query:
SELECT game_id, active_players, last_updated
FROM games
WHERE last_updated < ? LIMIT ? ALLOW FILTERING

So my question is:
Even if i have secondary index on last_updated why does it says that my query can only run if ALLOW FILTERING is present?

Where is my mistake? How should be my data constructed?

To answer your question about the correct model schema for your ScyllaDB use case and why your query requires ALLOW FILTERING even with a secondary index:


Summary of the issue:

  • You created a table with server_id as the primary key, and a secondary index on last_updated.

  • Query:

    SELECT game_id, active_players, last_updated FROM games WHERE last_updated < ? LIMIT ? ALLOW FILTERING 
    

    The database says you must use ALLOW FILTERING.


Explanation:

  • In ScyllaDB (and Cassandra), secondary indexes only allow efficient equality queries (WHERE last_updated = ?), not range queries (WHERE last_updated < ?).

  • For range queries on a secondary index (<, >, etc.), ScyllaDB does a full table scan behind the scenes and applies the filter after collecting the data. This operation is risky and may be expensive, so the system forces you to use ALLOW FILTERING to acknowledge the load and risk.​

    • For equality:

      SELECT...WHERElast_updated = ?-- Fast with index

    • For range:

      SELECT...WHERElast_updated < ? ALLOW FILTERING-- Slow, requires filter

How to fix/model properly:

  • If you need to frequently query by ranges (e.g., last_updated < ?), you should change your schema so that last_updated is part of your primary key (partition key or clustering key). For example:

    CREATE TABLEgames ( game_idtext, last_updated timestamp, ... PRIMARY KEY(game_id, last_updated) )

  • Or, if you want to efficiently select by last_updated only:

    • Use a materialized view with last_updated as the base (ScyllaDB supports materialized views natively).

    • Example:

      CREATEMATERIALIZEDVIEWgames_by_last_updatedAS SELECT*FROMgames WHERElast_updated IS NOT NULL PRIMARY KEY(last_updated, game_id);

    • Then you can query:

      SELECT...FROMgames_by_last_updatedWHERElast_updated < ?


Secondary Index Limitations:

  • Secondary indexes work for equality.

  • Range queries over secondary indexes will almost always require ALLOW FILTERING, indicating a potentially slow, unoptimized query.

  • Best Practice: Model your schema so that your most common query fields are part of your primary key (partition/clustering), or use materialized views if you need alternative fast access patterns.​


2 Likes

You are a genius. I didnt know that. I perfectly got it now.
I have 1 more easy question about the data modeling:
server_id is a random uuid, and it shall be unique
place_id is a number

Multiple server ids can have the same place_id.

My question is the following:
Why did you opted for PRIMARY KEY(game_id, last_updated) ) instead of PRIMARY KEY(server_id, last_updated) )

My queries are the following:
server_id - exact query
place_id - exact query
only last_updated is a range query

Therefore Is it better for my db to be structured like this? :
PRIMARY KEY(server_id, last_updated) )
So we have unique server id
and secondary index on place_id where exact querys will work by default?

Again thank you for your much appreciated help.

Great to hear you got the concept, thanks for the compliments, it’s just about knowing how the DB operates and proper modeling :slight_smile:

Regarding your question about the choice of primary key — whether to use PRIMARY KEY (game_id, last_updated) or PRIMARY KEY (server_id, last_updated) — here are some considerations:

  1. Uniqueness & Query Patterns:
  • Since server_id is a unique random UUID and you will query by it with exact matches, using PRIMARY KEY (server_id, last_updated) is a good design choice. It ensures rows are uniquely identified by server_id and locally clustered by last_updated for efficient range queries per server.

  • In your case, game_id vs server_id as partition key depends on which identifier you query mostly and which uniquely identifies the row. If your intent is to store data keyed by servers (each server having multiple records distinguished by timestamp), then (server_id, last_updated) makes more sense.

  1. Handling place_id queries:
  • Since place_id can be shared by many servers, an exact query on place_id would require a secondary index (or materialized view). Scylla handles exact queries on secondary indexes fine. Just be mindful of the performance implications if cardinality of place_id is very high.
  1. To Summarize:
  • The schema should reflect your most common query patterns and uniqueness constraints.

  • If you frequently query by server_id and last_updated ranges, structure your table with PRIMARY KEY (server_id, last_updated).

  • Use a secondary index or materialized view to support queries by place_id since it is not part of the primary key.

This schema aligns well with your use case of unique servers queried by exact ID and timestamp ranges, plus location filtering via secondary indexing.

1 Like

Thank you so much. Now everything makes way more sense to me. I think I will get back to scylla now that I know what my mistake was. It had incredible performance but I got upset since I didnt understeand what I did wrong.

Have a wonderfull day!

1 Like

Glad to help and good luck!

Let us know if you have further questions, you can also reach-out on our Scylla Users slack - http://slack.scylladb.com/

And for other ways, take a look at this post - Engaging with the ScyllaDB Community

1 Like

I also tried right now to use a materialized view but I got:

[CENTRALIZED CACHE] Failed to refill server cache: Only EQ and IN relation are supported on the partition key (unless you use the token() function or ALLOW FILTERING)
ResponseError: Only EQ and IN relation are supported on the partition key (unless you use the token() function or ALLOW FILTERING)
at FrameReader.readError (/root/BetterBloxAPI/node_modules/cassandra-driver/lib/readers.js:389:17)
at Parser.parseBody (/root/BetterBloxAPI/node_modules/cassandra-driver/lib/streams.js:209:66)
at Parser._transform (/root/BetterBloxAPI/node_modules/cassandra-driver/lib/streams.js:152:10)
at Transform._write (node:internal/streams/transform:171:8)
at writeOrBuffer (node:internal/streams/writable:572:12)
at _write (node:internal/streams/writable:501:10)
at Writable.write (node:internal/streams/writable:510:10)
at Protocol.ondata (node:internal/streams/readable:1009:22)
at Protocol.emit (node:events:518:28)
at addChunk (node:internal/streams/readable:561:12) {
info: ‘Represents an error message from the server’,
code: 8704,
query: ‘SELECT server_id, place_id, last_updated, datacenter \n’ +
’ FROM game_servers_by_last_updated \n’ +
’ WHERE last_updated < ?’
}

I really dont understeand why this doesnt work…

Edit: I have structured my materialized view as PRIMARY KEY (server_id, last_updated);

The main table only has PRIMARY KEY (server_id) in order to automatically filter duplicates
with index on place_id

This made all my queries work. Hopefully I didnt made a mistake

Based on ScyllaDB best practices and your handling of the secondary index issue: If your changes remove the need for ALLOW FILTERING by designing queries around primary and clustering keys or by modeling data to access patterns, your solution is aligned with recommended ScyllaDB schema strategies. Secondary indexes are best used for low-cardinality, infrequently updated fields; for anything dynamic or high-volume, it’s better to rely on a schema designed for efficient partition and clustering key reads.​

If your approach only needs secondary indexes for rare access paths and keeps core queries fast, this is considered acceptable and in line with current ScyllaDB guidance. If you encounter further performance issues or need a more opinionated schema review, please share your exact table definitions and query patterns—happy to offer more targeted advice!

cc @GuyCarmin if you have further insights.

1 Like