Originally from the User Slack
@Serhii_DevOps: Hello everyone, I’m not sure if it’s worth opening an issue, so I wanted to ask here first.
I deployed Scylla in Kubernetes clusters, and it’s performing very poorly compared to the local database. I wanted to ask if this is normal, but looking at the numbers, it seems like it’s not. Can anyone suggest what might be the problem or if I should go ahead and create an issue on GitHub?
Summary of Scylla Testing in Kubernetes
Server Specifications:
• Processor: Intel Core i9-14900K (8 performance cores at 3.2GHz, up to 6.0GHz, and 16 efficiency cores at 2.4GHz, up to 4.4GHz)
• Memory: 128GB RAM DDR5
• Storage: 2x 2TB NVMe
• Software RAID: RAID 0
• File System: ext4
Kubernetes Infrastructure:
• System: Vanilla Kubernetes, Istio Service Mesh, Flannel, OpenEBS Storage Provider
• Tested Modes: Local tests, Kubernetes tests, Kubernetes tests with local databases.
• Testing showed that the results on the DigitalOcean Kubernetes cluster align with those on bare metal Kubernetes.
Test Description: A Go application was run to clear and populate the database by processing data from another source.
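The application itself was not shared, so the following is only a minimal sketch of the kind of clear-and-populate harness described, assuming the gocql driver and a hypothetical bench.items table (the real schema, hosts, and data source are unknown). Note that a loop of sequential INSERTs like this makes total runtime roughly proportional to per-operation round-trip latency.

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Hosts, keyspace, and table names below are hypothetical placeholders.
	cluster := gocql.NewCluster("scylla-0.example.local", "scylla-1.example.local")
	cluster.Keyspace = "bench"
	cluster.Consistency = gocql.LocalQuorum
	cluster.Timeout = 5 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer session.Close()

	// Clear previous data.
	if err := session.Query(`TRUNCATE bench.items`).Exec(); err != nil {
		log.Fatalf("truncate: %v", err)
	}

	// Populate the table; the real app reads rows from another source,
	// which is stubbed out here with a constant payload.
	start := time.Now()
	for i := 0; i < 100000; i++ {
		if err := session.Query(
			`INSERT INTO bench.items (id, payload) VALUES (?, ?)`,
			i, "example-payload",
		).Exec(); err != nil {
			log.Fatalf("insert %d: %v", i, err)
		}
	}
	log.Printf("inserted 100000 rows in %s", time.Since(start))
}
```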
Test Results:
• Test 1:
◦ Insert Service: Minimum resources
◦ Scylla: 5 racks in two different data centers, each rack with 2 CPUs and 18GB RAM, 200GB storage
◦ Execution Time: 6 minutes 46 seconds
• Test 2:
◦ Insert Service: Minimum resources
◦ Scylla: 5 racks in two different data centers, each rack with 6 CPUs and 18GB RAM, 200GB storage
◦ Execution Time: 6 minutes 41 seconds
• Test 3:
◦ Insert Service: 4 CPUs, 4GB RAM
◦ Scylla: 5 racks in two different data centers, each rack with 6 CPUs and 18GB RAM, 200GB storage
◦ Execution Time: 5 minutes 45 seconds
• Test 4:
◦ Insert Service: 4 CPUs, 4GB RAM
◦ Scylla: 1 rack in a single data center, 6 CPUs and 18GB RAM, 200GB storage
◦ Execution Time: 6 minutes 20 seconds
• Test 5 (Local Installation):
◦ Insert Service: All services in Docker on a server with the same specifications (Single Scylla)
◦ Execution Time: 2.4 seconds
• Test 6:
◦ Insert Service: 4 CPUs, 4GB RAM
◦ Scylla: 1 rack in a single data center, 6 CPUs and 18GB RAM, 200GB storage
◦ Istio Service Mesh disabled
◦ Execution Time: 6 minutes 48 seconds
• Test 7:
◦ Insert Service in Kubernetes: 4 CPUs, 4GB RAM
◦ Scylla: One Scylla instance deployed in Docker on a server
◦ Execution Time: 10.5 seconds
Summary and Comparison:
• The local Docker installation (Test 5) produced the best result, taking only 2.4 seconds, significantly faster than all other tests.
• The Kubernetes tests (Tests 1-4, plus Test 6 without Istio) had much longer execution times, ranging from 5 minutes 45 seconds to 6 minutes 48 seconds.
• Disabling Istio Service Mesh (Test 6) did not significantly affect the result, as the execution time remained close to previous tests.
• The last test (Test 7), with the insert service in Kubernetes and Scylla running in Docker on the server, showed a time of 10.5 seconds — more than an order of magnitude faster than the full Kubernetes deployments, though still well behind the all-local run (Test 5).
Percentage Comparison:
• Local Docker installation vs. Kubernetes: 2.4 seconds vs. 6 minutes 46 seconds (406 seconds) ≈ 99.4% faster
• Kubernetes without Istio vs. with Istio: 6 minutes 48 seconds vs. 6 minutes 41 seconds ≈ 1.7% difference
• Docker (single Scylla) vs. Kubernetes (5 racks): 10.5 seconds vs. 6 minutes 46 seconds (406 seconds) ≈ 97.4% faster
@dor: ScyllaDB expects the disk and the CPU not to be overcommitted. Ideally, no other container/process would run on the CPUs that Scylla is scheduled on; otherwise, huge latency will occur. So you should pin your containers. Alternatively, you can avoid that requirement by passing the --overprovisioned flag to the scylla executable.
In a similar way, Scylla expects to know the disk capacity; that’s why we have the iotune script. Its result is io.conf, the maximum I/O capacity of a Scylla node.
If you don’t use it, or use data that doesn’t match the hardware, huge latencies will occur.
So this setup is almost the worst-case scenario for Scylla.
The best next step would be to run a single test, optimize it with the above suggestions, share the results, and continue from there.
Istio also makes it worse…
@Serhii_DevOps: Thank you very much for your response. Everything you mentioned sounds quite reasonable.
However, I’m a bit unclear about your comments regarding Istio. Based on the tests, the results are as follows:
Kubernetes without Istio vs. with Istio: 6 minutes 48 seconds vs. 6 minutes 41 seconds = 1.7% difference.
Disabling the Istio Service Mesh (Test 6) did not significantly affect the results, as the execution time remained close to previous tests.
So, did you mean that it causes some slight delays, or are you suggesting not to use it at all?
Additionally, I’d like to ask for your opinion as a Scylla expert. In our case, where we need multiple clusters across different regions, would you recommend deploying Scylla on dedicated servers, or should we continue trying to optimize it in Kubernetes to achieve comparable performance?
@dor: I misread the Istio results; I agree that 1.7% is a negligible difference. It will affect latency more, but let’s ignore it for the time being.
You can run Scylla in k8s, and we have an operator project too. As long as you follow the tuning guidance, it’s fine. Ideally, you run a single container per host; if you run multiple containers, just make sure they are pinned and configured accordingly.
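As a quick sanity check for the pinning advice, one can verify from inside a container which CPUs the kernel actually allows the process to use. A minimal, Linux-only sketch (purely illustrative); on a properly pinned container it should print only the dedicated cores, not every core on the host:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	// /proc/self/status exposes the CPU affinity mask of the current process.
	f, err := os.Open("/proc/self/status")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "Cpus_allowed_list:") {
			fmt.Println(line) // e.g. "Cpus_allowed_list:   0-5"
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```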
@Serhii_DevOps: Hello @dor, thank you very much for your help.
Unfortunately, the Kubernetes cluster configuration didn’t help much.
So, I set up a Scylla node on a separate server.
Processor: single Intel E-2276G (6 cores x 3.8GHz)
Memory: 64GB RAM DDR4
Storage: 1x 1TB NVMe
I split the disk into two 300GB partitions for Scylla.
The Ansible role successfully reformatted them for Scylla’s data.
Unfortunately, the result was only about a 2-minute improvement in processing time: where it previously took over 6 minutes, it now takes over 4 minutes.
Tomorrow I will continue testing and investigating this issue, but if you have any suggestions, I would appreciate them. I just don’t think the latency between servers alone could increase the processing time this much.
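One way to test whether per-operation round-trip latency could explain a gap of this size is to time individual inserts from the client side. A rough sketch, assuming the gocql driver and a hypothetical bench.probe table (names are placeholders):

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Host, keyspace, and table are hypothetical.
	cluster := gocql.NewCluster("scylla.example.local")
	cluster.Keyspace = "bench"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer session.Close()

	const n = 1000
	var total time.Duration
	for i := 0; i < n; i++ {
		start := time.Now()
		if err := session.Query(
			`INSERT INTO bench.probe (id, ts) VALUES (?, ?)`, i, time.Now(),
		).Exec(); err != nil {
			log.Fatalf("insert: %v", err)
		}
		total += time.Since(start)
	}
	// If the mean is a few milliseconds and the application issues its
	// operations one at a time, total runtime scales linearly with that
	// latency, which alone can turn seconds into minutes.
	log.Printf("mean per-insert latency: %s", total/time.Duration(n))
}
```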
@dor: Did you run scylla setup? Why did you split the disk?
@Serhii_DevOps: I’m installing Scylla with the Ansible role.
I split the disk because this server has a single 1TB NVMe disk (one device), but the Ansible role requires two devices, formatted to XFS and merged into one md0 device.
hey @dor
After numerous tests and experiments, I can conclude that our current issue does not lie with Scylla. It is configured and working perfectly.
Thank you very much for your help!
@dor: Thanks for the report. It’s hard to figure out what the problem is. Yes, k8s can be complex, but I don’t know whether this is a high-op/s use case, real-time or batch, or why it didn’t work.
@Serhii_DevOps: Yes, I completely agree, especially since the issue itself is quite strange.
What takes 5 seconds locally took over 6 minutes in the multi-region clusters. There were many potential culprits, ranging from latency and Scylla (this is our first experience with Scylla) to Redis and Citus.
After numerous test cases, I managed to achieve a 5-second response time with Scylla across multiple regions. This suggests that the issue is not so much with latency or Scylla but rather with Redis or Citus.
So now I’ll focus on those. Once again, thank you so much! Also, moving Scylla from Kubernetes to a standard installation will be beneficial in the future.
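For isolating which component is responsible, timing each stage of the pipeline separately (source read, Redis, Citus, Scylla) is usually enough. A small sketch shown only as a pattern; the stage functions here are empty placeholders standing in for the real code paths in the insert service:

```go
package main

import (
	"log"
	"time"
)

// timed runs one stage and logs how long it took.
func timed(name string, stage func() error) error {
	start := time.Now()
	err := stage()
	log.Printf("%-20s %s (err=%v)", name, time.Since(start), err)
	return err
}

func main() {
	// Each closure would call the corresponding real code path.
	stages := []struct {
		name string
		run  func() error
	}{
		{"read from source", func() error { return nil }},
		{"redis lookups", func() error { return nil }},
		{"citus queries", func() error { return nil }},
		{"scylla inserts", func() error { return nil }},
	}
	for _, s := range stages {
		if err := timed(s.name, s.run); err != nil {
			log.Fatalf("stage %q failed: %v", s.name, err)
		}
	}
}
```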
@dor: Got it, thanks for the answer.