How to get Node Sharding information using Scylla Java Driver?

porunov · July 13, 2023, 8:36pm

In JanusGraph we are currently working on a new feature that allows users to execute a SELECT query for multiple partition keys using the IN operator.
Using the IN operator across different partitions is usually an anti-pattern (I think that’s why ScyllaDB has a default protection, max-partition-key-restrictions-per-query: Scylla FAQ | ScyllaDB Docs ).
However, on the JanusGraph side we want to group those partition keys based on either Token Ranges only (for Cassandra / AstraDB / Amazon Keyspaces) or on Token Ranges plus Sharding information.
We often need to fetch specific columns for many partition keys together. For now we execute a separate asynchronous CQL query for each partition key. This often leads to channel congestion when we need to select those columns for, let’s say, thousands of keys, which results in thousands of CQL queries.
We found that combining keys into limited groups (let’s say 100 partition keys per IN operator) that belong to the same token range (i.e. keys which live on the same Nodes) often results in slightly better performance on the Cassandra side. It also makes running JanusGraph much cheaper on some serverless deployments like AstraDB, where you pay 1 RRU for each CQL query that returns 4 KB of data, because a JanusGraph query usually fetches far less than 4 KB.
We were able to make an implementation (see here: Group CQL keys into a single query [cql-tests] [tp-tests] by porunov · Pull Request #3879 · JanusGraph/janusgraph · GitHub) which groups partition keys based on Token Ranges. I.e. we convert a partition key into a Token and then search for its TokenRange using TokenMap. Any partition keys which belong to the same TokenRange are grouped together (into small groups, so that we have a chance to touch multiple replicas, but let’s omit the existence of replicas for simplicity and assume each TokenRange has only 1 Node).
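The grouping idea described above can be sketched without the driver. This is only an illustration, not the code from the pull request: tokens are plain long values, each range is identified by its end token and treated as (previousEnd, end] with wrap-around, and the names TokenRangeGrouping, owningRangeEnd, group and maxGroupSize are hypothetical. In a real implementation the token and range lookup would go through the driver's TokenMap instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class TokenRangeGrouping {

    // Returns the end token of the range that owns `token`. Ranges are modeled
    // as (previousEnd, end]; tokens past the last end wrap to the first range.
    static long owningRangeEnd(TreeSet<Long> rangeEnds, long token) {
        Long end = rangeEnds.ceiling(token);
        return end != null ? end : rangeEnds.first();
    }

    // Groups tokens by owning range, then splits every group into chunks of at
    // most maxGroupSize so that a single IN query stays small.
    static List<List<Long>> group(TreeSet<Long> rangeEnds, List<Long> tokens, int maxGroupSize) {
        Map<Long, List<Long>> byRange = new LinkedHashMap<>();
        for (long token : tokens) {
            byRange.computeIfAbsent(owningRangeEnd(rangeEnds, token), k -> new ArrayList<>())
                   .add(token);
        }
        List<List<Long>> groups = new ArrayList<>();
        for (List<Long> sameRange : byRange.values()) {
            for (int i = 0; i < sameRange.size(); i += maxGroupSize) {
                int to = Math.min(i + maxGroupSize, sameRange.size());
                groups.add(new ArrayList<>(sameRange.subList(i, to)));
            }
        }
        return groups;
    }
}
```

Each resulting group would then become one SELECT with an IN restriction over the partition keys that produced those tokens.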
The implementation works well for Cassandra, but we would like to have an improved implementation for ScyllaDB, because ScyllaDB routes queries not only by a RoutingToken / RoutingKey, but also by the Shard Id of that RoutingKey, meaning that the request is routed directly to the correct CPU for processing.
As we group multiple partition keys together via IN operator, we have a small challenge of: “How do we determine if two partition keys, which belong to the same TokenRange are going to be routed to the same Shard?”.
One way I’m thinking of is simply getting all nodes for a partition key: if the ShardId matches on ALL the nodes for two different partition keys, then they can be grouped together. The problem is that I don’t know how exactly to get the ShardId if all I have is a CQLSession, a Node, and a partition key.

If anyone knows how to properly get the Shard Id (or shard information) using Node + partition key via the ScyllaDB Java Driver, it would be really great if you could share that.
Any other concerns and / or suggestions are highly appreciated.

piotr · July 14, 2023, 1:53pm

In Java Driver 3.x, you can calculate the shard ID with the following code:

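// Needs imports: java.nio.ByteBuffer and, from com.datastax.driver.core,
// DataType, Host, ProtocolVersion, Token, TypeCodec.
Host someHost = session.getCluster().getMetadata().getAllHosts().iterator().next();
// Unfortunately the driver (as of right now) doesn't expose its
// internal Token classes, so we have to re-implement it here: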
Token token = new Token() {
    private Long value = 123L;
    @Override
    public DataType getType() {  return DataType.bigint(); }
    @Override
    public Object getValue() { return value; }
    @Override
    public ByteBuffer serialize(ProtocolVersion protocolVersion) { return TypeCodec.bigint().serialize(value, protocolVersion); }
    @Override
    public int compareTo(Token o) { return value.compareTo((Long) o.getValue()); }
};
int shardId = someHost.getShardingInfo().shardId(token);
System.out.println("shardId = " + shardId);

Unfortunately, since we’ve made some Token classes private in the driver, we have to re-implement Token in the snippet. The situation is worse in Java Driver 4.x, where the ShardingInfo class is not public. I created this tracking issue for possibly making those APIs public: Expose API for calculating shard ID from token · Issue #232 · scylladb/java-driver · GitHub

The calculation of shard id is performed here in the driver: https://github.com/scylladb/java-driver/blob/d291df6b35f7903c0b2d935754aebcb5b35bcd81/driver-core/src/main/java/com/datastax/driver/core/ShardingInfo.java#L57-L67. Some of the values it uses (e.g. shardingIgnoreMSB) are read from the initial handshake the driver makes with ScyllaDB.
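For readers who cannot call that method directly, here is a minimal standalone sketch of the same kind of computation, based on my understanding of ScyllaDB's shard-aware routing: bias the signed token into the unsigned range, shift out the ignored most significant bits, and take the upper 64 bits of the product with the shard count. The class name ShardIdSketch and its parameter names are hypothetical, and the linked driver source remains the authoritative version. The shard count and shardingIgnoreMSB have to come from the node itself, so the values used when calling it are assumptions unless read from the handshake.

```java
public class ShardIdSketch {

    // token:             signed 64-bit Murmur3 token of the partition key
    // shardsCount:       number of shards on the node (assumed known, below 2^31)
    // shardingIgnoreMsb: number of most significant bits to ignore (assumed known)
    static int shardId(long token, int shardsCount, int shardingIgnoreMsb) {
        // Bias the token so that Long.MIN_VALUE maps to unsigned 0.
        long biased = token + Long.MIN_VALUE;
        // Drop the ignored most significant bits.
        biased <<= shardingIgnoreMsb;
        // Compute the upper 64 bits of (unsigned biased) * shardsCount
        // using two 32-bit halves to avoid overflow.
        long lo = biased & 0xffffffffL;
        long hi = (biased >>> 32) & 0xffffffffL;
        long mulLo = lo * shardsCount;
        long mulHi = hi * shardsCount; // logically shifted left by 32 bits
        long sum = (mulLo >>> 32) + mulHi;
        return (int) (sum >>> 32);
    }
}
```

With this in hand, two tokens in the same token range could be compared by checking whether shardId returns the same value for them on every replica, using each replica's own shard count and shardingIgnoreMSB.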

porunov · July 14, 2023, 2:07pm

Hi @piotr !
Thank you so much for chiming in! Sorry for the confusion, but at JanusGraph we are using the 4.x driver. So I think we will need to wait until the ticket you opened is implemented.
Appreciate your quick response!
