Why use update_entry in view_updates?

Installation details
#ScyllaDB version: master
#Cluster size: 1
os (RHEL/CentOS/Ubuntu/AWS AMI): centos

I noticed that when a view update is generated, the code chooses between create_entry and update_entry based on the existing and updated rows of the base table. When both the existing row and the update are present and live, update_entry is called. update_entry computes the difference between the existing row and the update, whereas create_entry writes the entire updated row.

void view_updates::generate_update(
        data_dictionary::database db,
        const partition_key& base_key,
        const clustering_or_static_row& update,
        const std::optional<clustering_or_static_row>& existing,
        gc_clock::time_point now) {

    xxx
        if (existing && existing->is_live(*_base)) {
            if (update.is_live(*_base)) {
                update_entry(db, base_key, update, *existing, now);
            } else {
                delete_old_entry(db, base_key, *existing, update, now);
            }
        } else if (update.is_live(*_base)) {
            create_entry(db, base_key, update, now);
        }
        return;
    }

xxx

Suppose there is a 3-node cluster with a replication factor of 2, where nodes A and B are the two replicas for some piece of data. Replica A holds data 1 and 2, and replica B holds data 1 and 3. Running repair on node A then produces the merged data 1, 2 and 3 on that node. update_entry is used when generating the view update: it computes the diff and gets 3. Node A's view table previously held data 1 and 2, and with the newly generated 3 added, a read returns 1, 2 and 3.

But suppose that base table data 1 and 2 on node A failed to propagate to the view for some reason. Then a read of the view table returns only 3. Assume also that the index table has not been repaired for this token in the meantime. In Alternator, the index table is read at consistency level ONE, and read repair is not triggered at consistency level ONE.
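To make the scenario concrete, here is a minimal sketch of the difference between writing only the diff and writing the whole row. It is not Scylla code: the row type, diff_cells and apply_to_view are hypothetical names, and I am assuming that "data 1, 2 and 3" means three cells (c1, c2, c3) of a single row.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical model: a row is a map from column name to value.
using row = std::map<std::string, int>;

// Roughly what a diff-based update does: keep only the cells of `update`
// that are absent from `existing` or have a different value there.
row diff_cells(const row& existing, const row& update) {
    row out;
    for (const auto& [col, val] : update) {
        auto it = existing.find(col);
        if (it == existing.end() || it->second != val) {
            out.emplace(col, val);
        }
    }
    return out;
}

// Apply a set of cells to a view row, overwriting cells that already exist.
void apply_to_view(row& view, const row& cells) {
    for (const auto& [col, val] : cells) {
        view[col] = val;
    }
}
```

With existing = {c1, c2} and update = {c1, c2, c3}, diff_cells yields only c3. If the view row is empty because c1 and c2 never reached it, applying the diff leaves the view with c3 alone, while applying the full update would give it all three cells.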

Since index tables in ScyllaDB only support eventual consistency in Alternator, this is not a problem in itself. However, can we replace update_entry with create_entry? That would avoid generating this intermediate data. What is the purpose of using update_entry? Is it to save some space overhead?

Hello @nyh and @Botond_Denes, can you help me answer this? Thanks!

There are several separate issues here, and I’m not sure this forum is the best place for them (the mailing list, scylla-dev@googlegroups.com, is probably a better venue for prolonged discussion threads).

First of all, the main reason update_entry exists separately from a delete_entry/create_entry pair is that a delete followed by a create with the same timestamp will not work - the delete wins when the timestamps tie, and the data disappears instead of being updated. This is why we need this separate update_entry() function.
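The timestamp tie can be illustrated with a small sketch of last-write-wins reconciliation. This is a simplified model, not Scylla's actual types or code: cell_version and reconcile are hypothetical names, and the rule for breaking ties between two live values is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical, simplified model of one version of a cell.
struct cell_version {
    int64_t timestamp;
    std::optional<std::string> value; // std::nullopt models a tombstone (deletion)
};

// Last-write-wins: the higher timestamp wins. On a timestamp tie the
// deletion wins, which is the behavior described above.
cell_version reconcile(const cell_version& a, const cell_version& b) {
    if (a.timestamp != b.timestamp) {
        return a.timestamp > b.timestamp ? a : b;
    }
    if (!a.value) {
        return a; // a is a tombstone: deletion wins the tie
    }
    if (!b.value) {
        return b; // b is a tombstone: deletion wins the tie
    }
    // Both live with equal timestamps: pick the larger value
    // (assumption of this sketch).
    return a.value >= b.value ? a : b;
}
```

Reconciling a tombstone at timestamp 10 with a live value at timestamp 10 returns the tombstone regardless of argument order, so a delete plus create issued with the same timestamp leaves the row deleted.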

The second issue is why update_entry() needs to have this “optimization” where if we believe that the view already contains some data, we don’t write it again. In the specific case you mentioned, update_entry() (view row key is known and hasn’t changed), I think you’re right and the optimization isn’t necessary, although to be honest I don’t remember every detail, so please use “git blame” on the relevant line of code and see if it comes from a commit that explains why this optimization was added.

A third question is why update_entry() and other code can assume it begins with the view and base replicas having matching data - so we can read from the base table to decide what to do to the view table, assuming we know what’s there. Well, as I already noted elsewhere, we often don’t have any way NOT to make this assumption. Consider the case where the view’s key is not the same as the base’s - for example, the base key is p and the view key is p, x. Now imagine a write setting SET x=3 WHERE p=7. We need to not only write the new row in the view, (7,3), we also need to delete the previous row, say (7,2) (if the previous value of x was 2) - but how will we know the previous value of x was 2? We need to read it from the base table, and then assume that the view table matches it and has this row (7,2), and delete that row - not any other row.
So Scylla needs to assume that “paired” base and view replicas match up. The fact this assumption can become wrong on unrepaired tables and because of other problems is not lost on us, but we don’t have good documentation of when exactly this can happen, or whether repairs of base and/or view tables can fix some of these problems - or whether it can fix all of them. This definitely needs more work, but I’m afraid that we might not be able to fix all these more and more obscure bugs without completely changing the materialized views algorithm and dropping the “paired replicas” approach.
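The p, x example above can be sketched as follows. This is only an illustration under stated assumptions: view_key, view_mutation and view_updates_for_set_x are hypothetical names, not Scylla's API, and existing_x stands for whatever the base replica returns when it is read before the write.

```cpp
#include <cassert>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical view key (p, x) for a base table keyed by p alone.
using view_key = std::pair<int, int>;

enum class mutation_kind { delete_row, insert_row };

struct view_mutation {
    mutation_kind op;
    view_key key;
};

// View mutations for "SET x = new_x WHERE p = p". The old view row can only
// be located through the value of x read from the base replica.
std::vector<view_mutation> view_updates_for_set_x(
        int p, std::optional<int> existing_x, int new_x) {
    std::vector<view_mutation> out;
    if (existing_x && *existing_x != new_x) {
        // Delete the view row the base replica believes exists.
        out.push_back({mutation_kind::delete_row, {p, *existing_x}});
    }
    out.push_back({mutation_kind::insert_row, {p, new_x}});
    return out;
}
```

For p=7 with a base read of x=2 and a new value of 3, this yields a delete of (7,2) followed by an insert of (7,3). If the base replica returned a different x than the one the view row was built from, the delete would target a row that is not there and the real old view row would be left behind, which is why the paired replicas have to match.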

Finally, you are right that reading an unrepaired table with a consistency level of ONE - which is the only way to read a GSI in Alternator - is a problem, because such reads never benefit from read-repair. I don’t know what to do about this. In the past, we actually believed that the lack of read-repair was a good thing (see scylladb/scylladb issue #3933, “Disable read-repair in materialized-view tables”), but I no longer believe this to be the case.
