Unknown disk usage of /var/lib/scylla

When we spin up a fresh EC2 instance using the Scylla AMI, the disk usage for /var/lib/scylla is 12GB, with no data at all on that node.

On a node with approximately 3TB of data, the /var/lib/scylla usage is ~3.2TB (~200GB unaccounted for).

scyllaadm@ip-127.0.0.1:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/nvme1n1 6.9T 3.2T 3.8T 46% /var/lib/scylla

scyllaadm@ip-127.0.0.1:/var/lib/scylla$ du -sh *
87G commitlog
0 coredump
2.9T data
0 hints
1.0K logs
0 saved_caches
0 view_hints

What’s causing this extra disk usage on the Scylla AWS AMI?

There are no snapshots, and the data size reported by nodetool status does not match du -sh /var/lib/scylla. This behaviour is observed on multiple nodes and even across multiple clusters.
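To put a number on the mismatch, here is a minimal sketch, assuming Linux and Python 3, that compares what the filesystem reports as used (what df shows) with the blocks actually allocated to files under a directory (what du shows). The function names fs_used_bytes, du_bytes and report are made up for this illustration; they are not part of Scylla or any of its tools.

```python
import os

def fs_used_bytes(path):
    """Used space of the filesystem containing path, as df would report it."""
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize

def du_bytes(path):
    """Allocated bytes of everything under path, roughly what du -s reports.

    Assumes st_blocks is in 512-byte units (true on Linux). Hard-linked
    files are counted once, as du does.
    """
    seen = set()
    total = 0

    def add(p):
        nonlocal total
        try:
            st = os.lstat(p)  # do not follow symlinks
        except OSError:
            return  # file vanished or is unreadable; skip it
        key = (st.st_dev, st.st_ino)
        if key in seen:
            return
        seen.add(key)
        total += st.st_blocks * 512

    add(path)
    for root, dirs, files in os.walk(path):
        for name in dirs + files:
            add(os.path.join(root, name))
    return total

def report(path):
    """Return filesystem usage, per-file usage, and the difference, in bytes."""
    fs_used = fs_used_bytes(path)
    du_used = du_bytes(path)
    return {"fs_used": fs_used, "du_used": du_used, "gap": fs_used - du_used}

# Example (run as a user that can read the whole tree):
#   print(report("/var/lib/scylla"))
```

The "gap" value is only meaningful when the path is the mount point of its own filesystem, as /var/lib/scylla is in the df output above.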

When a new node is added to an existing cluster, it will start receiving data from other nodes right away, as part of its bootstrap. Even if the node is started on its own, many internal tables are created, although 12GB for these is too much.

As for the other case, the difference seems to come mainly from the commitlog. It contains data that is currently held in memtables, and it will be removed once those memtables are flushed.

The first node I mentioned is a single-node cluster; it is not part of any existing cluster.

For the other case I’ve shown, even if you add the commitlog size, there is still a gap of about 200GB.
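The arithmetic behind that gap can be checked with a small sketch. It uses only the rounded figures from the df -h and du -sh output above and assumes both tools report in powers of 1024 (as GNU df -h and du -h do). The function name unexplained_gib is made up for this illustration.

```python
# Figures copied from the df -h and du -sh output above (already rounded by those tools).
GIB_PER_TIB = 1024

df_used_tib = 3.2       # "Used" column of df -h for /var/lib/scylla
du_data_tib = 2.9       # du -sh data
du_commitlog_gib = 87   # du -sh commitlog

def unexplained_gib(df_used_tib, du_data_tib, du_commitlog_gib):
    """Space df reports as used that du does not attribute to data or commitlog."""
    accounted = du_data_tib * GIB_PER_TIB + du_commitlog_gib
    return df_used_tib * GIB_PER_TIB - accounted

gap = unexplained_gib(df_used_tib, du_data_tib, du_commitlog_gib)
print(f"unaccounted: ~{gap:.0f} GiB")
```

This comes out at roughly 220 GiB. Since each T-figure is rounded to one decimal place, the true value could be several tens of GiB higher or lower, which is consistent with the ~200GB quoted above.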