I was recently running some PySpark jobs on EMR when I hit a "No space left on device" error. The cause seems obvious: the system has run out of storage space and needs some cleanup. Still, I want to share a step-by-step method for fixing this problem.
Step 1: Identify the drive which is full
It is quite possible that the cluster as a whole still has storage available while one specific mounted drive is full. If your Hadoop/Spark application writes to that drive, it will throw the "No space left on device" error. To find the drive that is full, run the command below:
df -h
In my case, "/mnt" was 100% full even though the cluster as a whole had plenty of storage available. Since I had pointed the Spark directories to /mnt, the Spark, YARN, and Hadoop log files were all being created on this drive.
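The check above can also be scripted so that only the nearly full mount points are printed. This is a minimal sketch, not part of the original setup: the function name full_mounts and the 90% default threshold are my own assumptions, and it relies on the POSIX output format of df -P, where Use% is the fifth column and the mount point is the sixth.

```shell
# List mount points whose usage is at or above a threshold (default 90%).
# Hypothetical helper; assumes `df -P` columns: $5 = Use%, $6 = mount point.
full_mounts() {
    threshold="${1:-90}"
    df -P | awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the percent sign before comparing
        if (use + 0 >= t) print $6, $5
    }'
}
```

Calling full_mounts 95 on a node like the one described here would print one line per mount point at 95% or more, in the form "mount_point Use%". Mount points that contain spaces are not handled by this sketch.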
Step 2: Identify the big directories/files for cleanup
Once we know which drive is full, the next step is to identify the big files that we can remove to free some space. Run the command below to find them:
sudo du -chs /mnt/*
Then drill down one directory at a time to identify the exact file or directory that is taking up most of the space.
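To make the drill-down quicker, the entries can be sorted by size so the biggest ones come first. The sketch below assumes GNU du, as found on typical Linux nodes, because --max-depth is a GNU option; the function name largest_entries and its defaults are hypothetical.

```shell
# Show the largest entries directly under a directory, biggest first.
# Hypothetical helper; assumes GNU du. Sizes are printed in KiB.
# The first line is the total for the directory itself.
largest_entries() {
    dir="${1:-/mnt}"
    count="${2:-10}"
    # -a: include files, -x: stay on one filesystem, -k: sizes in KiB
    du -axk --max-depth=1 -- "$dir" 2>/dev/null | sort -rn | head -n "$count"
}
```

Run it first on the full drive (for example largest_entries /mnt), then again on the biggest subdirectory it reports, and repeat until you reach the file that is eating the space. Prefix the call with sudo-capable privileges if some directories are not readable by your user.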
Step 3: Remove the file to make some space
Once you have identified the file taking up most of the space, delete it with the command below. First make sure it is not a file the node needs in order to operate. If you are unsure, start by deleting other, smaller files from the drive to free some space.
sudo rm -v /path_to_big_file/filename
After deleting the file, check the free space again and verify that the file is actually gone.
If the space has been freed and Use% is no longer 100%, you can continue running your applications. If the space has not been freed even after deleting the file, a process is probably still holding the file open: the space is not released until that process closes the file, so you may have to kill the process.
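One way to tell whether a deletion really released space is to compare the available space before and after the rm. The following is only a sketch under my own assumptions: avail_kb and rm_and_report are hypothetical helper names, and the parsing relies on df -P printing the available space in the fourth column.

```shell
# Print the available space (in KiB) on the filesystem holding the given path.
# Hypothetical helper; assumes `df -Pk` output with "Available" in column 4.
avail_kb() {
    df -Pk -- "$1" | awk 'NR == 2 { print $4 }'
}

# Delete one file and report how much available space changed.
# A change near zero for a large file suggests a process still holds it open.
rm_and_report() {
    target="$1"
    parent=$(dirname -- "$target")
    before=$(avail_kb "$parent")
    rm -v -- "$target" || return 1
    after=$(avail_kb "$parent")
    echo "freed_kb=$((after - before))"
}
```

The reported number is approximate, since other processes may write to the same drive between the two measurements.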
Step 4: Kill the process to free space
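To do this, run the command below to see which process is holding the file you just deleted, then kill that process to free the space. Replace "file keyword" with part of the file name, and pid with the process ID shown in the lsof output.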
sudo lsof | grep "file keyword"
kill -9 pid
Once you have killed the process, check the free space again. You should now see free space available on the drive.
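If lsof is slow or not installed, the same information can be read from /proc on Linux, where the link target of an open but deleted file ends in " (deleted)". This is a rough sketch assuming a Linux node with /proc mounted; the function name deleted_open_files is hypothetical, and without root it only sees processes owned by the current user.

```shell
# List files that were deleted but are still held open by some process.
# Hypothetical helper; assumes Linux with /proc mounted.
deleted_open_files() {
    for fd in /proc/[0-9]*/fd/*; do
        # Processes can exit while we scan, so ignore readlink failures.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *" (deleted)")
                pid=${fd#/proc/}      # strip the leading /proc/
                pid=${pid%%/*}        # keep only the PID component
                echo "pid=$pid fd=${fd##*/} $target"
                ;;
        esac
    done
}
```

Each output line gives the PID to pass to kill, along with the file descriptor number and the original path of the deleted file. On systems that do have lsof, sudo lsof +L1 lists open files whose link count is below one, which gives a similar view.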
Step 5: Worst-case scenario
The worst case is when none of the above commands work and the node is effectively frozen: even the "ls" command hangs and you can't run any command on the node. In that case, take a backup of all the required files and go ahead with the "reboot" command. Again, this involves some risk, but if nothing is responding there is not much you can do other than try a reboot.