Originally from the User Slack
@Varun_Nagrare: Hi all, I’m using Spark Cassandra Connector 2.5.2 with Spark 2.4.7. Is there a way to avoid a full table scan and read only the required partitions? I have a dataframe containing my partition keys, and I want to fetch the data for only those keys. Can Spark read just that specific data without doing a full table scan? The scan is also overloading Scylla, which ultimately causes timeout errors.
@dor: There should be a way, but I really don’t know Spark. We have a three-part blog series about it; you may need to tweak some code.
@Varun_Nagrare: @dor Can you please provide a link to that blog series? I’m trying the tweaks from the connector configuration page, but it’s still scanning the whole table and overloading Scylla.
@dor: https://www.scylladb.com/2018/07/31/spark-scylla/
@Varun_Nagrare: @Botond_Dénes Can you please help? Your guidance has solved most of my errors before.
@Botond_Dénes: Unfortunately, I also don’t know anything about Spark. @Lubos is our Spark guy; he may be able to help.
@Varun_Nagrare: @Lubos Requesting your help
@Varun_Nagrare: @Botond_Dénes Never mind, I found the solution. I had to filter on my partition keys in a for loop and then union the resulting list of dataframes, so Spark reads only the specific partitions into a single dataframe instead of scanning the whole table, and Scylla no longer gets overloaded. This brought my application’s runtime down from 7+ hours to just 10–15 minutes.
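For reference, here is a minimal sketch of the loop-and-union approach described above, in Scala with the Spark Cassandra Connector. The keyspace `ks`, table `tbl`, and partition key column `pk` are hypothetical placeholders, as is the small inline key list. The idea is that an equality filter on a partition key column is pushed down by the connector as a single-partition query, so Spark never issues a full table scan:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical dataframe of partition keys to fetch (in practice this
// would be your existing dataframe of keys).
val keysDf = Seq("k1", "k2", "k3").toDF("pk")

// Lazy reference to the Scylla table via the connector.
val table: DataFrame = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "tbl"))
  .load()

// Collect the key values to the driver; this assumes the key list is
// small enough to fit in driver memory.
val keys = keysDf.select("pk").distinct().collect().map(_.get(0))

// One filtered dataframe per partition key. The equality predicate on
// the partition key is pushed down to Scylla as a single-partition read.
val perKey: Seq[DataFrame] = keys.toSeq.map(k => table.filter(col("pk") === k))

// Union the per-partition reads into a single dataframe.
// (reduce assumes at least one key; guard with headOption in real code.)
val result = perKey.reduce(_ union _)

result.show()
```

You can verify the pushdown with `result.explain()`, which should list the partition key predicates under the pushed filters rather than showing a plain scan. If the key list is large, a common variant is to batch keys into chunks and filter with `isin` per chunk, which the connector can also push down as an IN query on the partition key.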