Single Spark application consumes all resources – good or bad for your cluster?

While working with Spark, I hear it so many times: a client or someone on my team "complains" that a single Spark job is taking all the resources. So is that bad for your cluster? Whether to consider it good or bad depends entirely on the type of applications you are running.

There can be two scenarios:

  • You have memory available in the cluster and Spark is not using it, so the cluster is under-utilised.
  • You have multiple applications running on the cluster and the first application consumes most of the resources, so the cluster does not achieve efficient parallelism.

Why does a single application take all the resources?

Since dynamic allocation of executors (spark.dynamicAllocation.enabled) was introduced, I have observed that most people keep it enabled. This is actually good, especially when you are not fully aware of the resource requirements of your applications. Now, when we start a Spark session, it asks for an initial number of executors, and the three properties below determine it:

  1. spark.dynamicAllocation.initialExecutors
  2. spark.dynamicAllocation.minExecutors
  3. spark.executor.instances

The maximum value among the above three properties determines the number of initial executors an application gets at start. So if you think that manually setting "num-executors" or "spark.executor.instances" lets you control the initial number of executors, you need a small correction: that only works when dynamic allocation is disabled. With dynamic allocation enabled, you have to set all three properties above to make it behave as expected.
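
To make this concrete, here is a rough sketch of the selection logic in Python. The property names are the real Spark ones; the fallback of 0 and the sample values are assumptions purely for illustration.

def initial_executor_count(conf):
    # Spark starts with the maximum of these three properties when
    # dynamic allocation is enabled.
    return max(
        int(conf.get("spark.dynamicAllocation.initialExecutors", 0)),
        int(conf.get("spark.dynamicAllocation.minExecutors", 0)),
        int(conf.get("spark.executor.instances", 0)),
    )

# Example: you pass --num-executors 10, but minExecutors is 30 in
# spark-defaults.conf, so the job still starts with 30 executors.
conf = {
    "spark.executor.instances": "10",
    "spark.dynamicAllocation.minExecutors": "30",
    "spark.dynamicAllocation.initialExecutors": "5",
}
print(initial_executor_count(conf))  # prints 30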

How to set Spark properties?

If you do not configure a Spark property, it takes its default value, which is either the one specified in the "spark-defaults.conf" file at /etc/spark/conf or the built-in default for the Spark version running on your cluster. Any change you make to spark-defaults.conf is not reflected in an already created session, so abort the job and start a new session to pick up the updated configuration.
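
If you are unsure which values a running session actually picked up, you can inspect its live configuration. A small PySpark sketch (the property names are real; the app name is just a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("check-conf").getOrCreate()

# Print the resolved values for the properties discussed above
for key, value in spark.sparkContext.getConf().getAll():
    if key.startswith("spark.dynamicAllocation.") or key == "spark.executor.instances":
        print(key, "=", value)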

What is the best number of initial executors?

There is no definitive answer to this question, but I can share how I generally configure it. If I am running "heavy lifting" jobs that need a lot of memory but the number of jobs is small, I set a relatively large minimum, depending on the YARN configuration and the overall vcores available in the cluster. Say 30 initial executors via the two properties below:

spark.dynamicAllocation.initialExecutors 30
spark.dynamicAllocation.minExecutors 30

Now, when I set "num-executors" in my spark-submit job, I will keep it at 30 or more. If I know my job needs more threads to run quickly, I may pass num-executors as 50. In this manner, the single job I run will take most of the resources in the cluster, and that is perfectly fine considering the type of work I want to complete via Spark. With these properties set, my application will always have at least 30 executors. As I mentioned earlier, this is good when you want to run a few jobs that require a lot of memory.
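
As an illustration, the session for such a heavy-lifting job could be configured roughly like this. The app name and the exact numbers are assumptions for a hypothetical cluster, so treat it as a sketch rather than a recipe:

from pyspark.sql import SparkSession

# Equivalent spark-submit flags: --num-executors 50
#   --conf spark.dynamicAllocation.initialExecutors=30
#   --conf spark.dynamicAllocation.minExecutors=30
spark = (
    SparkSession.builder
    .appName("heavy-lifting-job")                        # placeholder name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.initialExecutors", "30")
    .config("spark.dynamicAllocation.minExecutors", "30")
    .config("spark.executor.instances", "50")            # same role as --num-executors 50
    .getOrCreate()
)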

However, this approach is not good if you want to run many applications that are not very memory intensive but require good parallelism.

So for this scenario I generally set the two properties below to a smaller number, like 3:

spark.dynamicAllocation.initialExecutors 3
spark.dynamicAllocation.minExecutors 3

Again, I will set num-executors to 3 or more if I know the application may require a bit more memory. In this manner I can have more jobs running in parallel.

I hope the above two scenarios help you determine which configuration is better for your applications.

Bonus point to consider while setting the number of executors

I was recently working on a project to extract data from an RDBMS to HDFS via a Spark JDBC connection. With JDBC you can specify the number of partitions to use while reading data from an RDBMS table. In this case, since the tables were small, I kept num-executors at 5 but ran multiple jobs in parallel. The problem I then faced was not with Spark but with my database: since I was running many applications in parallel and each application was opening multiple connections to the database, the database became the performance bottleneck. In such a case it is better to give more memory to each executor and ask for fewer executors.
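
A minimal sketch of that kind of JDBC read in PySpark; the URL, table, column names and paths are made-up placeholders, while the options shown are the standard Spark JDBC ones. Each partition opens its own connection, so numPartitions (multiplied by the number of parallel jobs) is roughly the number of concurrent connections hitting the database:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdbms-to-hdfs").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/sales")  # placeholder URL
    .option("dbtable", "public.orders")                    # placeholder table
    .option("user", "etl_user")
    .option("password", "secret")
    .option("partitionColumn", "order_id")  # numeric column to split the read on
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "5")  # at most 5 parallel connections for this job
    .load()
)

df.write.mode("overwrite").parquet("hdfs:///landing/orders")  # placeholder path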

Hence we can say that a single Spark application consuming all the resources can be good or bad depending on the task you are trying to accomplish with Spark. If you don't have many applications to run in parallel, or you want to run jobs sequentially, then it is perfectly fine to utilise all the memory available in the cluster. If you want more parallelism instead, keep the properties mentioned above at lower values and share the resources equally among parallel jobs.
