Originally from the User Slack
@Varun_Nagrare: Hi all, I’m using Spark Cassandra Connector 2.5.2 with Spark 2.4.7. Is there a way to avoid a full table scan and read only the required partitions? I have a dataframe containing my partition keys, and I want to fetch the data for only those keys. Can Spark read just that specific data without doing a full table scan? The scan is also overloading Scylla, which ultimately causes timeout errors.
@dor: There should be a way, but I really don’t know Spark. We have a three-part blog series about it; you may need to tweak some code.
@Varun_Nagrare: @dor Can you please provide a link to that blog series? I’m trying the tweaks from the connector configuration page, but it’s still scanning the whole table and overloading Scylla.
@dor: https://www.scylladb.com/2018/07/31/spark-scylla/
@Varun_Nagrare: @Botond_Dénes Can you please help? Your guidance has solved most of my errors before.
@Botond_Dénes: Unfortunately, I also don’t know anything about Spark. @Lubos is our Spark guy; he may be able to help.
@Varun_Nagrare: @Lubos Requesting your help
@Varun_Nagrare: @Botond_Dénes Never mind, I found the solution. I had to filter on my partition keys in a for loop and then union the resulting list of dataframes, so Spark reads only the specific partitions into a single dataframe instead of scanning the whole table, and Scylla no longer gets overloaded. This brought my application’s runtime down from 7+ hours to just 10–15 minutes.
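For reference, here is a minimal sketch of the loop-and-union approach described above, in Scala with the Spark Cassandra Connector. The keyspace `ks`, table `tbl`, and partition key column `pk` are hypothetical placeholders, as is the small inline key list. The idea is that an equality filter on a partition key column is pushed down by the connector as a single-partition query, so Spark never issues a full table scan:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical dataframe of partition keys to fetch (in practice this
// would be your existing dataframe of keys).
val keysDf = Seq("k1", "k2", "k3").toDF("pk")

// Lazy reference to the Scylla table via the connector.
val table: DataFrame = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "tbl"))
  .load()

// Collect the key values to the driver; this assumes the key list is
// small enough to fit in driver memory.
val keys = keysDf.select("pk").distinct().collect().map(_.get(0))

// One filtered dataframe per partition key. The equality predicate on
// the partition key is pushed down to Scylla as a single-partition read.
val perKey: Seq[DataFrame] = keys.toSeq.map(k => table.filter(col("pk") === k))

// Union the per-partition reads into a single dataframe.
// (reduce assumes at least one key; guard with headOption in real code.)
val result = perKey.reduce(_ union _)

result.show()
```

You can verify the pushdown with `result.explain()`, which should list the partition key predicates under the pushed filters rather than showing a plain scan. If the key list is large, a common variant is to batch keys into chunks and filter with `isin` per chunk, which the connector can also push down as an IN query on the partition key.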