Probabilistic Tracing with ScyllaDB

I’ve been exploring different debugging techniques to improve the performance and reliability of my applications, specifically those using ScyllaDB.

Recently, I came across the concept of Probabilistic Tracing, but I’m not entirely clear on how to effectively implement and utilize it.

Could someone explain how Probabilistic Tracing can be used for debugging purposes? Additionally, I would appreciate some concrete examples demonstrating its application in real-world scenarios.

Tracing is a tool to debug and analyze internal flows in the cluster. It is useful for observing the behavior of specific queries and troubleshooting network issues, data transfer, and data replication problems. There are different types of tracing that you can use:

  • User-defined CQL query tracing - One example of such a flow is CQL request processing. You can start tracing by setting a tracing flag on a CQL request; in cqlsh this is done with the TRACING ON command (see the example after this list).
  • Probabilistic Tracing - randomly chooses a request to be traced with some defined probability.
  • Slow Query Logging - records queries with handling time above the specified threshold.
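
For user-defined tracing, a minimal cqlsh example looks like this (the keyspace and table names are placeholders); TRACING ON makes cqlsh print the trace of every statement that follows until TRACING OFF is issued:

TRACING ON;
SELECT * FROM my_keyspace.my_table WHERE id = 1;
TRACING OFF;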

The tracing documentation section includes instructions and examples for debugging common issues (e.g., hot partitions, large partitions, slow queries).

When probabilistic tracing is enabled, trace rows are inserted into the system_traces.sessions and system_traces.events tables. To minimize the impact on cluster performance, choose the probability carefully; the value maps to the fraction of queries that will be traced:

  • 1 = 100% (all of the queries are traced)
  • 0.1 = 10%
  • 0.01 = 1%
  • 0.001 = 0.1%
  • 0.0001 = 0.01%
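
For example, to trace roughly 1% of requests on a node and then confirm the setting (nodetool gettraceprobability reports the value currently in effect):

nodetool settraceprobability 0.01
nodetool gettraceprobability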

It is important to remember that the trace probability is a node-level setting, so if you want to collect information for the entire cluster, the command must be run on every cluster node.
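
One way to apply the same probability to every node from a single host is a small loop; this is only a sketch, assuming passwordless SSH and a hypothetical nodes.txt file with one node address per line:

for node in $(cat nodes.txt); do
    ssh "$node" 'nodetool settraceprobability 0.001'
done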

The following steps are an example of collecting information for 5 minutes and then disabling tracing (setting the probability back to 0 stops the collection):

  1. Connect to the cluster nodes via SSH.

  2. For each node, open tmux (Getting Started · tmux/tmux Wiki · GitHub) to run the command listed below.

nodetool settraceprobability 0.001; sleep 5m; nodetool settraceprobability 0

  3. Retrieve traced sessions or event data using the following statements:

SELECT * FROM system_traces.sessions;

SELECT * FROM system_traces.events;
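
To turn the traced sessions into the CSV file used in the analysis below, one option is cqlsh's COPY TO command; for example (the output file name is just an example), this matches the ';'-delimited, header-included format expected there:

COPY system_traces.sessions TO 'sessions.csv' WITH DELIMITER=';' AND HEADER=TRUE;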

Analysis

The commands below expect a CSV file with a “;” delimiter and the parameters in the 6th column.

That can be verified with:

head -2 sessions.csv

session_id;client;command;coordinator;duration;parameters;request;request_size;response_size;started_at;username

5bf9e2b0-bd06-11ed-9fe1-000000000005;172.18.11.33;QUERY;172.21.2.55;619;"{'consistency_level': 'ONE', 'page_size': '5000', 'param[0]': '676858', 'param[1]': 'US', 'query': 'SELECT country_id,source_content_id,similar_content_id,similar_program_type,rank FROM content_knn_addtolist_1678202913010 WHERE source_content_id=:source_content_id AND country_id=:country_id;', 'serial_consistency_level': 'SERIAL'}";Execute CQL3 prepared query [90396392c416623ea59ed5c27f675922];51;4431;2023-03-07 16:37:32.891+0000;scylla

Some examples:

  • How to get sorted output with the table names hit the most by INSERT statements:

awk 'BEGIN{FS=";"} {print $6}' sessions.csv | sed "s/.*'query': '//" | sed 's/\\n//g' | grep -i 'INSERT INTO' | sed 's/.*INSERT INTO//i' | sed 's/(.*//i' | sort | uniq -c | sort -nr

  • How to get sorted output with the table names hit the most by SELECT statements:

awk 'BEGIN{FS=";"} {print $6}' sessions.csv | sed "s/.*'query': '//" | sed 's/\\n//g' | grep -i 'select' | sed 's/.*FROM//i' | sed 's/.WHERE.*//i' | sort | uniq -c | sort -nr

  • How to get all SELECT queries into a queries.txt file:

awk 'BEGIN{FS=";"} {print $6}' sessions.csv | grep -i 'SELECT' | sed "s/.*'query': '//" | sed 's/\\n//g' | sort | uniq > queries.txt

  • Same as above, but with an occurrence count for each query:

awk 'BEGIN{FS=";"} {print $6}' sessions.csv | grep -i 'SELECT' | sed "s/.*'query': '//" | sed 's/\\n//g' | sort | uniq -c | sort -nr > queries.txt
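
In the same spirit, since the coordinator address is the 4th column (see the header above), the following counts traced sessions per coordinator, which can help spot unbalanced nodes:

awk 'BEGIN{FS=";"} NR>1 {print $4}' sessions.csv | sort | uniq -c | sort -nr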

More info: