In this post, we will see how to create your first PySpark script and then run it in batch mode. Many people use notebooks such as Jupyter or Zeppelin, but you may want to write a PySpark script and run it on a schedule instead. This is especially useful when you want to run an ETL-like process with PySpark on a fixed schedule.
How to write PySpark Script
Let’s create a simple PySpark script that reads data from a path and writes the first 10 records to HDFS. The script also shows how to define another dummy function alongside the main function, how to call that function from inside main, and how to print some information along the way.
Save the file as “run_sample_pyspark.py”
from pyspark.sql import SparkSession


# another dummy function to add 2 numbers
def add_2_no(a, b):
    return a + b


# main function
def main():
    spark = SparkSession.builder.appName("Check_Parameters").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    df = spark.read.parquet("hdfs:///raw/data/products/product_category=Shoes/")
    df.printSchema()
    df.show(5)

    print("Calling another function now.")
    sum_a_b = add_2_no(2, 3)
    print("Output Value is :" + str(sum_a_b))

    df.limit(10).write.mode("overwrite").parquet("hdfs:///var/shoes/")
    print("hdfs write completed")

    spark.stop()
    return None


# entry point for PySpark ETL application
if __name__ == '__main__':
    main()
How to run PySpark Script
You can run a PySpark script using spark-submit, which is used to run or submit PySpark applications to the cluster. You may also want to create a dedicated log file for each script execution. Use the command below to run the script we created above on the cluster.
spark-submit filename
spark-submit run_sample_pyspark.py > run_sample_pyspark.log 2>&1 &
The above statement runs the PySpark script in the background by calling spark-submit. It also creates a log file in which you can see all the print statement output along with other Spark log information. We set the logging level to ERROR in the script above; you can change it to INFO, DEBUG, or WARN as well.
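For example, if you want more detail in the log file, simply change the level passed to setLogLevel in the script:

# inside main(), right after the SparkSession is created
spark.sparkContext.setLogLevel("INFO")  # use "WARN" or "DEBUG" as needed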
You can also pass parameters to the script in the spark-submit command and set Spark-level configuration as command-line arguments. Below is a sample example of how to execute a PySpark script.
spark-submit --master yarn --executor-memory 2G --executor-cores 3 run_sample_pyspark.py > run_sample_pyspark.log 2>&1 &
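To also pass application-level parameters (for example, the input path), you can list them after the script name in spark-submit and read them with sys.argv inside the script. Below is a minimal sketch; the argument here is just the product path used earlier, and you can pass whatever your script expects.

spark-submit --master yarn --executor-memory 2G --executor-cores 3 run_sample_pyspark.py hdfs:///raw/data/products/product_category=Shoes/ > run_sample_pyspark.log 2>&1 &

Inside the script, the argument would be picked up like this:

import sys

from pyspark.sql import SparkSession


def main():
    # first argument after the script name is the input path
    input_path = sys.argv[1]
    spark = SparkSession.builder.appName("Check_Parameters").getOrCreate()
    df = spark.read.parquet(input_path)
    df.show(5)
    spark.stop()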
As part of this post, I wanted to show how easily you can create your first PySpark script and run it on the cluster. I have kept it simple here and will cover more options in the next post.
Summary
We saw how easy it is to create a PySpark script. We also saw how to create multiple functions in the same script and call one from another. You could put all the logic in a single main function, though I would not encourage you to do so.
To execute a PySpark script, you pass it to spark-submit, which takes care of running the logic on the cluster. You can also create a dedicated log file for each run, which makes later reference and debugging easier.
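One simple way to get a separate log file per run is to add a timestamp to the log file name when calling spark-submit, for example:

spark-submit run_sample_pyspark.py > run_sample_pyspark_$(date +%Y%m%d_%H%M%S).log 2>&1 &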
Now that you are comfortable with this very basic sample PySpark script and know how to run it, I strongly recommend reading the following post to see more options and configurations you can use in a PySpark script.
Read More : PySpark script example and how to run pyspark script (intermediate level)