Situation: Someone on my team ran a Spark application on EMR and the job failed. The user is new to EMR and does not know how to check the Spark logs. Now he has asked me to debug […]
In this post, we will see how you can run a Spark application on an existing EMR cluster using Apache Airflow. The most basic way of scheduling jobs on EMR is crontab. But if you have worked with crontab, you know how […]
I have been working with Spark for many years now. I started out on on-premises Hadoop clusters running CDH or HDP. In the past few years, I have been working a lot on EMR, primarily for Spark or […]
The most common reason for the NameNode to go into safe mode is under-replicated blocks. This is generally caused by storage issues on HDFS, or by jobs such as Spark applications being aborted suddenly, which leaves temp files that are […]
I was recently running some PySpark jobs on EMR and encountered a “No space left on device” error. The error seems obvious: the system has run out of storage space and requires some cleanup. […]