Originally from the User Slack
@Yukkuri: Hi, in case there is no natural information available to use as a partition key, am I understanding correctly that hashing clustering keys and dividing the hash space into artificial partition keys is the right approach?
When write loads can vary drastically from one application instance to another (from one row every few weeks to hundreds of rows per second), dynamic hash space partitioning may also be required. In my use case those insert rates can also gradually grow for any application instance over time (but probably never substantially decrease; they just hold a level for a while).
My current approach is to hold some partitioning generation counters, and whenever load on an application instance exceeds a certain threshold, introduce a new generation with more hash space partitions: copy all the data so it is redistributed according to the new partitioning scheme, send all write operations issued while redistribution is in progress to both the old and new generations, and once redistribution is complete, range-delete the old generation partitions.
Basically the primary key is `((domain, generation, hash_bucket), apid)`, where `domain` (the origin of those `apid`s) can be derived from the `apid`, `apid` is hashed into a 64-bit value `H`, and `hash_bucket` is determined using `H / (H_max / 2^generation)`.
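A minimal sketch of that bucket computation (Python; the hash function here is just an illustrative stand-in for whatever 64-bit hash is actually used):

```python
import hashlib

H_MAX = 2 ** 64  # size of the 64-bit hash space


def apid_hash(apid: str) -> int:
    """Hash an apid into a 64-bit value H (here: top 8 bytes of SHA-256, as an example)."""
    return int.from_bytes(hashlib.sha256(apid.encode()).digest()[:8], "big")


def hash_bucket(apid: str, generation: int) -> int:
    """Map H onto one of 2**generation equal slices of the hash space: H / (H_max / 2**generation)."""
    return apid_hash(apid) // (H_MAX // (2 ** generation))


# Generation 0 has a single bucket covering the whole hash space;
# each further generation doubles the number of buckets.
print(hash_bucket("https://example.com/object/foobar", 0))  # always 0
print(hash_bucket("https://example.com/object/foobar", 1))  # 0 or 1
```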
My questions are:
• Is periodically querying `system.large_partitions` on every node a reliable/reasonable way to detect partitions growing beyond the current generation's bounds, or would it be preferable to maintain my own metric of hash bucket fill per `domain` per `generation`?
• Is there anything to be aware of, maybe other approaches or potential improvements?
@Felipe_Cardeneti_Mendes: yeap, `large_partitions` is quite effective. Note it is NOT immediate however. `large_partitions` are populated as part of compactions. `large_rows` and other `large_*` tables may be relevant for you.
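For reference, a minimal sketch of the periodic check being discussed, using the Python driver (the `system.large_partitions` column names and the threshold are assumptions, not verified against a particular ScyllaDB version; the table is node-local, so the check would run against every node):

```python
from cassandra.cluster import Cluster

# Connect to one node; system.large_partitions is populated per node
# (during compactions), so in practice this check is repeated for every node.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("system")

SIZE_THRESHOLD = 100 * 1024 * 1024  # e.g. flag partitions above 100 MiB (assumed policy)

# Assumed columns: keyspace_name, table_name, partition_key, partition_size.
rows = session.execute(
    "SELECT keyspace_name, table_name, partition_key, partition_size "
    "FROM large_partitions"
)

for row in rows:
    if row.partition_size > SIZE_THRESHOLD:
        # A partition above the threshold is a signal to bump the generation
        # for the affected domain and start redistribution.
        print(row.keyspace_name, row.table_name, row.partition_key, row.partition_size)
```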
I think the only aspect I don’t understand is why you would copy data around from one place to another. Why don’t you simply keep track of your generations/buckets under a separate table?
@Yukkuri: > yeap, `large_partitions` is quite effective. Note it is NOT immediate however. `large_partitions` are populated as part of compactions. `large_rows` and other `large_*` tables may be relevant for you.
Thanks.
> Why don’t you simply keep track of your generations/buckets under a separate table?
That would require keeping track of every `apid` from past generations there, which could end up being subject to its own partitioning pressure; there is no way to know an `apid`'s generation without a full scan of that accounting table.
Say there is apid `https://example.com/object/foobar`; under the current generation 0, which allows only one hash bucket (the whole hash space), its `hash_bucket` is 0.
Now if we increase the generation to 1, allowing 2 partitions of the hash space, its `hash_bucket` may end up being either 0 or 1 depending on the chosen hash function, despite having the exact same hash value. To account for this, we must copy the row `(('example.com', 0, 0), 'https://example.com/object/foobar')` to `(('example.com', 1, $new_hash_bucket), 'https://example.com/object/foobar')` so we can continue relying on hash space partitioning to unambiguously map `apid` hash values to partitioning buckets. That requires an extra dance around current/pending generations to keep both in sync while migration is in progress, but otherwise it seems to work.
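A rough sketch of that copy-and-dual-write dance, assuming a hypothetical table `objects_by_bucket (domain, generation, hash_bucket, apid)` and reusing `hash_bucket()` from the earlier sketch; none of these names come from the actual schema:

```python
from typing import Optional

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("myks")  # hypothetical keyspace


def write_apid(domain: str, apid: str, current_gen: int, pending_gen: Optional[int]) -> None:
    """While redistribution is in progress, write to both the current and pending generations."""
    for gen in (g for g in (current_gen, pending_gen) if g is not None):
        session.execute(
            "INSERT INTO objects_by_bucket (domain, generation, hash_bucket, apid) "
            "VALUES (%s, %s, %s, %s)",
            (domain, gen, hash_bucket(apid, gen), apid),
        )


def migrate_bucket(domain: str, old_gen: int, old_bucket: int, new_gen: int) -> None:
    """Copy every apid from one old-generation bucket into its new-generation bucket."""
    rows = session.execute(
        "SELECT apid FROM objects_by_bucket "
        "WHERE domain = %s AND generation = %s AND hash_bucket = %s",
        (domain, old_gen, old_bucket),
    )
    for row in rows:
        session.execute(
            "INSERT INTO objects_by_bucket (domain, generation, hash_bucket, apid) "
            "VALUES (%s, %s, %s, %s)",
            (domain, new_gen, hash_bucket(row.apid, new_gen), row.apid),
        )
    # Once every bucket of old_gen is copied and reads have switched to new_gen,
    # each old partition can be removed, e.g.:
    #   DELETE FROM objects_by_bucket WHERE domain = ? AND generation = ? AND hash_bucket = ?
```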
Also one small change from what I described initially: it turned out to be beneficial to store the combined `generation`+`hash_bucket` value in a single column, since that also allows IN queries. In my case both values fit nicely into a 128-bit uuid.
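One way such packing could look; the bit layout here (generation in the high 64 bits, bucket in the low 64 bits) is only an assumption, the point is just that one 128-bit column makes IN queries convenient:

```python
import uuid


def bucket_id(generation: int, hash_bucket: int) -> uuid.UUID:
    """Pack generation and hash_bucket into one 128-bit value (layout is illustrative)."""
    return uuid.UUID(int=(generation << 64) | hash_bucket)


def unpack(bucket: uuid.UUID) -> tuple[int, int]:
    """Recover (generation, hash_bucket) from the packed value."""
    return bucket.int >> 64, bucket.int & ((1 << 64) - 1)


# With a single bucket column, all buckets of one generation can be listed
# and used in a single ... WHERE domain = ? AND bucket IN (...) query:
gen = 2
buckets = [bucket_id(gen, b) for b in range(2 ** gen)]
print(buckets)
```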
@Felipe_Cardeneti_Mendes: LGTM