How to set an optimised value for Spark executor memory? Spark memory configuration, especially spark.executor.memory, is one of the most important aspects of working with Spark. Without any changes to your code, these settings can make or break the performance of your application.
One question I ask myself every day while working on Spark is: what is the best executor memory value for this application? There really is no single answer that fits all situations.
In this post I am sharing an online tool you can use to generate "Balanced" values for the three main Spark executor configurations: executor cores, executor memory, and number of executors.
Try the online utility now!
Generate Optimised Spark Configuration for your cluster
Balanced Executor Configuration
Tiny Executor Configuration
Big Executor Configuration
Note: These are only recommendations, and you may have to test them in your environment.
I have followed a fairly standard process to come up with these numbers. I reserve 4 GB per node for the operating system and other services that run on the node, and I cap executor-cores at a maximum of 5. This is in line with a study Cloudera published a few years back: five cores per executor gives a good balance between I/O and compute. The process is explained in detail in this post.
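The sizing process above can be sketched in a few lines of Python. This is my own illustrative reconstruction, not the tool's exact algorithm: it assumes 4 GB and one core are reserved per node, executor-cores are capped at 5, and roughly 10% of each executor's memory is set aside for off-heap overhead (the default behaviour of spark.executor.memoryOverhead).

```python
def balanced_executor_config(node_cores, node_memory_gb, num_nodes,
                             reserved_memory_gb=4, reserved_cores=1,
                             max_cores_per_executor=5,
                             overhead_fraction=0.10):
    """Sketch of a 'Balanced' executor sizing calculation (assumed,
    not the tool's exact algorithm)."""
    usable_cores = node_cores - reserved_cores        # leave a core for the OS/daemons
    usable_memory = node_memory_gb - reserved_memory_gb  # leave 4 GB for the OS

    # Pack as many <=5-core executors on a node as the cores allow.
    executors_per_node = max(1, usable_cores // max_cores_per_executor)
    executor_cores = usable_cores // executors_per_node

    # spark.executor.memory excludes the off-heap overhead, so carve
    # out ~10% from each executor's share of the node memory.
    memory_per_executor = usable_memory / executors_per_node
    executor_memory_gb = int(memory_per_executor * (1 - overhead_fraction))

    return {
        "spark.executor.cores": executor_cores,
        "spark.executor.memory": f"{executor_memory_gb}g",
        "num_executors": executors_per_node * num_nodes - 1,  # one slot for the driver
    }

# Hypothetical cluster: 5 nodes, each with 16 cores and 64 GB RAM.
print(balanced_executor_config(node_cores=16, node_memory_gb=64, num_nodes=5))
# -> {'spark.executor.cores': 5, 'spark.executor.memory': '18g', 'num_executors': 14}
```

The "Tiny" and "Big" presets would vary the same inputs: Tiny maximises the executor count with 1 core each, while Big gives one executor per node all of the usable memory.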
The kind of workload you are running in Spark should guide the values you choose for these parameters. For most applications the "Balanced" approach works well: it gives you good parallelism while still leaving each executor enough memory to complete its tasks successfully.
In some cases you may want the Big executor configuration, for example when you are running one complex job on the cluster and want to allocate all of the memory to that single application.
How to set Spark memory configurations?
You can add or edit the values of spark.executor.cores and spark.executor.memory in the /etc/spark/conf/spark-defaults.conf file.
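For example, the relevant entries in spark-defaults.conf look like this (the values shown are illustrative; use the ones generated for your cluster):

```
# /etc/spark/conf/spark-defaults.conf  (example values)
spark.executor.cores    5
spark.executor.memory   18g
```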
You can also pass the memory configuration to the spark-submit command at run time:
spark-submit --executor-memory 6G --executor-cores 4 mypyspark_app.py
You may want to check the two posts below, which give more insight into setting configurations in Spark:
5 settings for better Spark environment
PySpark script example and how to run pyspark script
The spark.dynamicAllocation.enabled configuration decides whether the number of executors allocated to an application can scale at run time. In upstream Apache Spark it defaults to false, although many distributions enable it out of the box; I prefer to keep it set to true. The reason is that a cluster runs different types of jobs with varying loads, and this setting lets Spark decide how many executors to allocate based on demand and the resources available. With dynamic allocation enabled, the num-executors setting loses much of its significance.
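A typical spark-defaults.conf fragment for enabling dynamic allocation is sketched below. Note that dynamic allocation needs a way to serve shuffle files from removed executors, hence the external shuffle service line; the min/max executor bounds are illustrative values you should tune for your cluster.

```
# spark-defaults.conf — dynamic allocation (example values)
spark.dynamicAllocation.enabled        true
spark.shuffle.service.enabled          true
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   20
```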
Let me know your comments on this online tool. Hope it helps.