[EMR] 5 settings for better Spark environment

I have been working on Spark for many years now. Initially I started with working on on-premises Hadoop cluster using CDH or HDP. In the past few years, I have been working a lot on EMR primarily for Spark or PySpark tasks. In this post, I would like to share some of the general settings I prefer to modify to make Spark environment more task friendly. Please be mindful that all configuration changes may not be applicable to your Spark environment so do the changes with some discretion. Below are the 5 settings which a beginner can use in EMR to tune Spark environment.

Step 1 : Changes to Spark Default configuration file

First thing I do is whenever I login into a Spark environment I check the "spark-defaults.conf" file. The file is generally present at "/etc/spark/conf" path. Some of the parameters for considerations are :

  • spark.executor.memory
  • spark.driver.memory
  • spark.port.maxRetries

Let's talk about the configurations* I use for above mentioned properties. You may want to switch to root user to update spark-default conf file.

spark.master : Default value is yarn and I keep it like that.
spark.dynamicAllocation.enabled: Default value is true and I keep it like that. You can set it to false and explicitly pass the num-executors for your spark application. This is good when you have clear understanding of type and volume of data you are handing in spark application.
spark.executor.cores: I generally keep it to 4. I change it to 4 if it is 1 or 2.
spark.executor.memory: I generally keep it as 6144M (1.5GB/core , 1.5*4 = 6GB)
spark.driver.memory: I generally keep it as 4096M. If the master node is of good configuration (more than 32GB) I generally bump it to more memory depending on the Spark tasks I am going to perform. Also I use deploy-mode as client so that master node is used as driver.
spark.port.maxRetries: I generally keep it 75. It depends on the parallel tasks you will be running. If I am running many spark applications and I do have cluster resources under-utilised I will further increase concurrent applications. Default value is 16 so I bump this number. Add this property if it already does not exists.

*assuming that the nodes are of at-least 32GB memory.

There are other properties too like setting java options for driver & executors. However I am not considering it in this post.

Step 2 : Download the required jar files into Spark LIB

You can find spark library at "/usr/lib/spark/jars" path in EMR.

If you are using Spark to connect to any databases via jdbc, make sure that you add relevant jar files into this path. Also if you are using any third party library , place the jar files at this path. The change will reflect only on new spark sessions created after placing jar files.

Step 3: Move the jar archive to hdfs

You can move the required jar files to hdfs directory. This will avoid one step where spark creates archive of jar files and puts in all the nodes for each spark application. This step may take more time like a minute if the cluster is really busy with concurrent applications. Even if you save 15 seconds for each application run, you can eventually save lot of time when running thousands of applications.

Use below commands

jar cv0f spark-libs.jar -C /usr/lib/spark/jars .
hdfs dfs -put spark-libs.jar hdfs:///var/

Now add the property in spark-defaults.conf file.

spark.yarn.archive  hdfs:///var/spark-libs.jar

Step 4 : Set the namenode log level

You may encounter space related problems where namenode log file starts taking lot of space. It generally is on rotation basis and the older log files are compressed. But if you are not working on something really critical and not interested in debugging with name node logs information , you may want to change level of some classes that push data into namenode log files. This shall reduce growing size of namenode log file considerably.

You can find namenode log file at path "/mnt/var/log/hadoop-hdfs/" and if the file size if growing quickly and in high numbers you can change log level to ERROR from default INFO for some identified classes. I will not recommend to change the root log level to ERROR.
Use below commands to set log level to ERROR for identified classes.

#handle log size by setting level to ERROR

var_ip=`hostname -I | awk '{print $1}'`
hadoop daemonlog -setlevel ${var_ip}:50070 org.apache.hadoop.hdfs.protocol.BlockStoragePolicy ERROR
hadoop daemonlog -setlevel ${var_ip}:50070 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager  ERROR
hadoop daemonlog -setlevel ${var_ip}:50070 org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy ERROR
hadoop daemonlog -setlevel ${var_ip}:50070 org.apache.hadoop.hdfs.server.blockmanagement.BlockReportLeaseManager ERROR
hadoop daemonlog -setlevel ${var_ip}:50070 org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor ERROR
hadoop daemonlog -setlevel ${var_ip}:50070 org.apache.hadoop.hdfs.server.blockmanagement.DecommissionManager ERROR
hadoop daemonlog -setlevel ${var_ip}:50070 org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager ERROR
hadoop daemonlog -setlevel ${var_ip}:50070 org.apache.hadoop.hdfs.server.namenode.FSDirectory ERROR
hadoop daemonlog -setlevel ${var_ip}:50070 org.apache.hadoop.hdfs.server.namenode.FSEditLog ERROR
hadoop daemonlog -setlevel ${var_ip}:50070 org.apache.hadoop.hdfs.server.namenode.FSNamesystem ERROR
hadoop daemonlog -setlevel ${var_ip}:50070 org.apache.hadoop.hdfs.server.namenode.LeaseManager ERROR
hadoop daemonlog -setlevel ${var_ip}:50070 org.apache.hadoop.hdfs.server.namenode.NameNode ERROR
hadoop daemonlog -setlevel ${var_ip}:50070 org.apache.hadoop.hdfs.StateChange ERROR
hadoop daemonlog -setlevel ${var_ip}:50070 org.apache.hadoop.net.NetworkTopology ERROR
hadoop daemonlog -setlevel ${var_ip}:50070 org.apache.hadoop.util.HostsFileReader ERROR

You can add or remove more classes from the above list.

Step 5: Set the JAVA HOME path

If you are using YARN related or some OS related operations in PySpark that uses java processes and those commands are failing with JAVA HOME error then you have to set JAVA_HOME path in hadoop to fix the problem.

Error: JAVA_HOME is not set and could not be found.

In order to fix this problem use below command

java_path=$(readlink -f /usr/bin/java | sed "s:bin/java::")
echo ${java_path}
#/usr/lib/jvm/java-1.8.0-amazon-corretto.x86_64/jre/

Once you have the java path go to hadoop conf path at "/etc/hadoop/conf" and open file "hadoop-env.sh" and add below line to this file.

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-amazon-corretto.x86_64/jre/

*Replace the java path which you get from above command.

I have shared in this post, 5 very common Spark Settings which I check and change in any given Spark environment. These are really helpful for beginners.

Leave a Reply

Your email address will not be published. Required fields are marked *