What is the impact of enabling the SSTable data integrity check on the cluster?

I see that the enable_sstable_data_integrity_check parameter controls whether integrity checks on SSTables are enabled. Its default value is false, and turning it on will affect performance.

I saw its implementation in the commit “sstables: introduce file interposer for integrity check”. The check occurs before writing the SSTable file, after writing the SSTable file, and after reading the SSTable file.

From this I infer that it affects write performance. May I ask if it has any other impacts? Thanks!

It has a noticeable impact on performance.
It’s not recommended to enable it by default for a production cluster / node.
In case of a severe issue, a ScyllaDB developer may suggest enabling it temporarily in order to run some validations on the sstables.

If you want to check sstable integrity once in a while, you can use the “nodetool scrub” command with its options.

The best way to get integrity checking for your data on disk is to enable SSTable compression when creating your tables (it can also be enabled later via ALTER TABLE). SSTable compression stores checksums next to the data in the SSTable data file, and these checksums are verified every time the data is read.

Tables have compression on by default, so unless you disabled compression for your tables when creating them, you already have integrity checking.
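
To make the idea of checksums stored next to compressed data more concrete, here is a minimal conceptual sketch in Python. It is not ScyllaDB's on-disk format or code: the chunk size, the use of zlib and CRC32, and the function names write_chunks and read_chunks are all assumptions chosen for illustration only.

```python
import zlib
from typing import List, Tuple

CHUNK_SIZE = 4096  # illustrative chunk length, not ScyllaDB's actual setting


def write_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[Tuple[bytes, int]]:
    """Compress data chunk by chunk and keep a checksum next to each chunk."""
    chunks = []
    for start in range(0, len(data), chunk_size):
        compressed = zlib.compress(data[start:start + chunk_size])
        # The checksum covers the compressed bytes, as they would sit on disk.
        chunks.append((compressed, zlib.crc32(compressed)))
    return chunks


def read_chunks(chunks: List[Tuple[bytes, int]]) -> bytes:
    """Verify every chunk's checksum before decompressing it."""
    out = bytearray()
    for index, (compressed, stored) in enumerate(chunks):
        if zlib.crc32(compressed) != stored:
            raise ValueError(f"checksum mismatch in chunk {index}")
        out += zlib.decompress(compressed)
    return bytes(out)
```

Because the check is part of every read in this sketch, corruption is noticed as soon as the affected chunk is read, without a separate validation pass.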

If you want to check sstable integrity once in a while, you can use the “nodetool scrub” command with its options.

Currently, this only checks checksums on compressed sstables. We have plans to change this in the near future, so that nodetool scrub --mode=VALIDATE can be used to force a checksum check on all sstables, compressed or not.

Hi!
I saw that scylla sstable scrub was added in 5.4. What is the difference between it and nodetool scrub?

Currently, this only checks checksums on compressed sstables

And what is the meaning of compressed sstables?
Thanks!

I saw the addition of function scylla sstable scrub in 5.4, what is the difference between it and function nodetool scrub?

scylla-sstable scrub is just an offline version of nodetool scrub, meaning that you don’t need a running ScyllaDB process to do the scrub. It was developed for the case where sstables in a backup are found to be corrupt and the backup cannot be restored because ScyllaDB refuses the sstables. In that case, the sstables can be fixed before loading them into ScyllaDB.

And what is the meaning of compressed sstables?

Simply put, sstables belonging to tables for which sstable compression is enabled. You can tell whether an sstable is compressed by checking whether its CompressionInfo.db component file exists.
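
The CompressionInfo.db check can be scripted. Below is a minimal sketch in Python, assuming that the components of one sstable share a file name prefix and differ only in the trailing component name (for example a hypothetical me-1-big-Data.db next to me-1-big-CompressionInfo.db). The function name is_compressed_sstable is made up for this example.

```python
from pathlib import Path


def is_compressed_sstable(data_file: Path) -> bool:
    """Return True if a CompressionInfo.db component sits next to the Data.db file."""
    name = data_file.name
    suffix = "-Data.db"
    if not name.endswith(suffix):
        raise ValueError(f"not an sstable Data.db component: {name}")
    # Assumption: sibling components share the prefix before the component name.
    companion = data_file.with_name(name[: -len(suffix)] + "-CompressionInfo.db")
    return companion.exists()
```

Point it at a Data.db file inside a table directory; the file names used above are only examples of the naming pattern.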

If you want to check sstable integrity once in a while, you can use the “nodetool scrub” command with its options.

Hi, I would like to know what impact long-term, continuous scheduling of nodetool scrub will have on the cluster. Does it have the same impact on the cluster as compaction? I ask because I see in the code that it is a task scheduled by CompactionManager.

Yes, scrub is just a type of compaction and it will have the same effect as an ongoing compaction.

How do you use Scrub in production to ensure data security? If I schedule repairs during the cycle, is it still necessary to schedule Scrub?

@bo_li please start a new discussion for follow-up questions. That would make it easier for other community members to find.

Currently we only use scrub when we identify bad sstables that scrub is able to fix. We also use nodetool scrub --mode=VALIDATE to find bad sstables. Currently, scrub is quite limited in what it can fix and identify, but we have plans to make it more capable, both at detecting bad sstables and as a general tool for making sure bad sstables are identified and quarantined.
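
For anyone who wants to script a validation run, here is a minimal sketch in Python that wraps nodetool scrub --mode=VALIDATE. The helper names and the keyspace and table names are hypothetical, and passing the keyspace and tables as positional arguments is an assumption based on the usual nodetool scrub usage, so check nodetool help scrub on your version first.

```python
import subprocess
from typing import List, Sequence


def validate_scrub_command(keyspace: str, tables: Sequence[str] = ()) -> List[str]:
    """Build the argument list for a validate-only scrub (nothing is executed here)."""
    return ["nodetool", "scrub", "--mode=VALIDATE", keyspace, *tables]


def run_validate_scrub(keyspace: str, tables: Sequence[str] = ()) -> int:
    """Run the scrub on the local node and return nodetool's exit code."""
    return subprocess.run(validate_scrub_command(keyspace, tables), check=False).returncode
```

For example, run_validate_scrub("my_ks", ["my_table"]) would run the scrub for one hypothetical table. Since scrub is a type of compaction, a run like this competes with regular compaction for resources, so it is probably best scheduled outside peak hours.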

ok, I have started another topic here.
