Performance issues with cluster

I’m having some performance issues with my cluster. It seems like some queries are taking way too long.

How can I troubleshoot this?

One of the first things you should do is make sure you have Monitoring in place. Take a look at the Monitoring dashboards, and understand what’s happening with your cluster.

Another useful feature is Tracing. This feature allows you to debug queries that perform poorly.
Tracing lets you analyze internal data flows in a cluster and observe how specific queries behave. It can help you look into network issues, slow queries, data transfers, and more.
There are two types of tracing: client-side tracing and server-side tracing (probabilistic tracing). In addition, there is Slow Query Logging.

To use client-side tracing:
cqlsh> TRACING ON

Which returns:
Now Tracing is enabled

Continue with normal queries. Each query will now return its result as before, along with a tracing session.
This is useful if you have a query that you suspect is causing problems and want to examine it.
The tracing data is stored in the system_traces keyspace, which can also be queried directly, for example:

cqlsh> select * from system_traces.sessions where session_id=227aff60-4f21-11e6-8835-000000000000;

cqlsh> select * from system_traces.events where session_id=227aff60-4f21-11e6-8835-000000000000;

Probabilistic Tracing randomly chooses requests to trace with a pre-defined probability. This is set, per node, using the nodetool settraceprobability command; see the nodetool documentation for details.
To set this for an entire cluster, use the command on all nodes.
For example, to trace 0.01% of all the queries on the node, use:
nodetool settraceprobability 0.0001

Note that tracing has an impact on performance, so use it carefully in production.
Some example values for the settraceprobability command:

  • 1 = 100% (all of the queries are being traced)
  • 0.1 = 10%
  • 0.01 = 1%
  • 0.001 = 0.1%
  • 0.0001 = 0.01%
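
To get a feel for what these values mean in practice, you can multiply a node's request rate by the probability. The sketch below is only illustrative: expected_traces_per_sec is a hypothetical helper (not part of nodetool), and the 50,000 ops/sec figure is a made-up example.

```shell
# Hypothetical helper, not part of nodetool:
# expected traced requests per second = ops/sec on the node * trace probability.
expected_traces_per_sec() {
    awk -v ops="$1" -v p="$2" 'BEGIN { printf "%.2f\n", ops * p }'
}

# A node handling 50,000 ops/sec with probability 0.0001 (0.01%):
expected_traces_per_sec 50000 0.0001   # prints 5.00
```

Each traced request also writes rows to the system_traces keyspace, so even a small probability adds up on a busy node.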

So what number should you set? That depends on your workload and on the ops/second in your cluster. Typically you want to turn this on for a specific time window, so don’t forget to turn this off. For example, to collect information for 5 minutes only and then disable tracing (value 0 stops collecting information):

Connect to the cluster and on each node run:
nodetool settraceprobability 0.001; sleep 5m; nodetool settraceprobability 0
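
If you would rather drive this from a single machine, something like the following might work. It is a minimal sketch, assuming passwordless ssh to every node and nodetool on the remote PATH; trace_window and the node names are hypothetical.

```shell
# Hypothetical helper: enable probabilistic tracing on every listed node,
# wait for the given duration, then set the probability back to 0 everywhere.
# Assumes passwordless ssh to each node and nodetool on the remote PATH.
trace_window() {
    probability="$1"
    duration="$2"
    shift 2
    for node in "$@"; do
        ssh "$node" nodetool settraceprobability "$probability"
    done
    sleep "$duration"
    for node in "$@"; do
        ssh "$node" nodetool settraceprobability 0
    done
}

# Example (node names are placeholders):
# trace_window 0.001 5m node1 node2 node3
```

If the script is interrupted during the sleep, tracing stays enabled, so it is worth running nodetool settraceprobability 0 on each node afterward to be sure.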

Traces are stored in the system_traces keyspace for 24 hours. The keyspace consists of two tables:

  • Sessions table contains a single row for each tracing session.
  • Events table contains a single row for each trace point.

So in the above example, you can retrieve the traced sessions or event data using the following statements:
SELECT * FROM system_traces.sessions;
SELECT * FROM system_traces.events;
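
Selecting everything from the events table can be noisy. As a sketch, you could narrow the output to one session and a few columns from the shell. trace_events_query is a hypothetical helper, and the column names (activity, source, source_elapsed) are assumed from the usual system_traces.events schema, so check them with DESCRIBE TABLE on your version.

```shell
# Hypothetical helper: print the CQL that lists the events of one tracing session.
# Column names are assumed from the usual system_traces.events schema.
trace_events_query() {
    printf 'SELECT activity, source, source_elapsed FROM system_traces.events WHERE session_id = %s;\n' "$1"
}

# Example, reusing the session id shown earlier:
# cqlsh -e "$(trace_events_query 227aff60-4f21-11e6-8835-000000000000)"
```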

Slow Query Logging captures queries that take more time than the given threshold.
This is useful if you’re not sure what’s happening in the cluster and you want to find out what’s causing performance issues.
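
As a rough sketch of how this might be switched on: the example below assumes ScyllaDB's REST API on its default port 10000 and the /storage_service/slow_query endpoint, with the threshold given in microseconds and the record TTL in seconds. slow_query_url is a hypothetical helper, and the exact parameters should be checked against the Docs for your version.

```shell
# Hypothetical helper: build the REST URL that toggles slow query logging on one node.
# Assumes the REST API on port 10000 and the /storage_service/slow_query endpoint.
# Arguments: host, enable (true/false), threshold in microseconds, ttl in seconds.
slow_query_url() {
    printf 'http://%s:10000/storage_service/slow_query?enable=%s&threshold=%s&ttl=%s\n' "$1" "$2" "$3" "$4"
}

# Example: log queries slower than 500 ms on the local node, keep records for a day.
# curl -X POST "$(slow_query_url 127.0.0.1 true 500000 86400)"
```

Like tracing, this is a per-node setting, so it would need to be applied on every node you want to observe.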

Whenever you use tracing, remember that it has a performance impact. Don’t enable it by default, and use it only for short periods of time.

More information about Tracing is available in this ScyllaDB University lesson and in the Docs, including more advanced topics like Lightweight slow-queries logging mode, Large Partition Tracing, finding Hot Partitions, and more.