PySpark - Convert SQL queries to DataFrame

In PySpark, you can run DataFrame commands, or, if you are comfortable with SQL, you can run SQL queries too. In this post, we will see how to run different variations of SELECT queries on a table built on Hive, along with the corresponding DataFrame commands that replicate the same output as each SQL query.

Let’s first create a DataFrame for the table “sample_07”, which we will use throughout this post.

>>> df_sample_07 = spark.sql("select * from sample_07")

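If the “sample_07” Hive table is not available in your environment, here is a minimal sketch that builds an equivalent DataFrame from a handful of the sample rows shown below, so you can still follow along (it assumes a SparkSession is already available as “spark”, as it is in the PySpark shell):

#Hypothetical stand-in for the Hive table, built from a few sample rows
>>> rows = [("00-0000", "All Occupations", 134354250, 40690),
...         ("11-0000", "Management occupations", 6003930, 96150),
...         ("11-1011", "Chief executives", 299160, 151370)]
>>> df_sample_07 = spark.createDataFrame(rows, ["code", "description", "total_emp", "salary"])
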
pyspark select all columns

In pyspark, if you want to select all columns, you don’t need to specify the column list explicitly. You can refer to the dataframe directly and apply whatever transformations/actions you want on it. In this example, we will simply display the content of the table via pyspark sql or the pyspark dataframe API. To display the content of a dataframe in pyspark, use the “show()” method.

>>> spark.sql("select * from sample_07").show()
#Dataframe 
>>> df_sample_07.select("*").show()
+-------+--------------------+---------+------+
|   code|         description|total_emp|salary|
+-------+--------------------+---------+------+
|00-0000|     All Occupations|134354250| 40690|
|11-0000|Management occupa...|  6003930| 96150|
|11-1011|    Chief executives|   299160|151370|
|11-1021|General and opera...|  1655410|103780|
|11-1031|         Legislators|    61110| 33880|
|11-2011|Advertising and p...|    36300| 91100|
|11-2021|  Marketing managers|   165240|113400|
|11-2022|      Sales managers|   322170|106790|
|11-2031|Public relations ...|    47210| 97170|
|11-3011|Administrative se...|   239360| 76370|
|11-3021|Computer and info...|   264990|113880|
|11-3031|  Financial managers|   484390|106200|
|11-3041|Compensation and ...|    41780| 88400|
|11-3042|Training and deve...|    28170| 90300|
|11-3049|Human resources m...|    58100| 99810|
|11-3051|Industrial produc...|   152870| 87550|
|11-3061| Purchasing managers|    65600| 90430|
|11-3071|Transportation, s...|    92790| 81980|
|11-9011|Farm, ranch, and ...|     3480| 61030|
|11-9012|Farmers and ranchers|      340| 42480|
+-------+--------------------+---------+------+
only showing top 20 rows
By default, the pyspark cli prints only 20 records. You can also see that the values are truncated after 20 characters.

To see the full column content, you can specify “truncate=False” in the show() method.

>>> spark.sql("select * from sample_07").show(truncate=False)
#Dataframe
>>> df_sample_07.select("*").show(truncate=False)
+-------+--------------------------------------------------+---------+------+
|code   |description                                       |total_emp|salary|
+-------+--------------------------------------------------+---------+------+
|00-0000|All Occupations                                   |134354250|40690 |
|11-0000|Management occupations                            |6003930  |96150 |
|11-1011|Chief executives                                  |299160   |151370|
|11-1021|General and operations managers                   |1655410  |103780|
|11-1031|Legislators                                       |61110    |33880 |
|11-2011|Advertising and promotions managers               |36300    |91100 |
|11-2021|Marketing managers                                |165240   |113400|
|11-2022|Sales managers                                    |322170   |106790|
|11-2031|Public relations managers                         |47210    |97170 |
|11-3011|Administrative services managers                  |239360   |76370 |
|11-3021|Computer and information systems managers         |264990   |113880|
|11-3031|Financial managers                                |484390   |106200|
|11-3041|Compensation and benefits managers                |41780    |88400 |
|11-3042|Training and development managers                 |28170    |90300 |
|11-3049|Human resources managers, all other               |58100    |99810 |
|11-3051|Industrial production managers                    |152870   |87550 |
|11-3061|Purchasing managers                               |65600    |90430 |
|11-3071|Transportation, storage, and distribution managers|92790    |81980 |
|11-9011|Farm, ranch, and other agricultural managers      |3480     |61030 |
|11-9012|Farmers and ranchers                              |340      |42480 |
+-------+--------------------------------------------------+---------+------+
only showing top 20 rows
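
In recent Spark versions, “truncate” also accepts an integer, in which case strings longer than that many characters are cut at that length instead of the default 20. A quick sketch:

#truncate long strings at 30 characters instead of the default 20
>>> df_sample_07.show(5, truncate=30)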

If you want to show more or fewer rows, you can pass the number as the first parameter of the show() method.
Let’s see how to show only 5 rows of a pyspark dataframe with full column content.

>>> spark.sql("select * from sample_07").show(5,truncate=False)
#Dataframe
>>> df_sample_07.show(5,truncate=False)
+-------+-------------------------------+---------+------+
|code   |description                    |total_emp|salary|
+-------+-------------------------------+---------+------+
|00-0000|All Occupations                |134354250|40690 |
|11-0000|Management occupations         |6003930  |96150 |
|11-1011|Chief executives               |299160   |151370|
|11-1021|General and operations managers|1655410  |103780|
|11-1031|Legislators                    |61110    |33880 |
+-------+-------------------------------+---------+------+
only showing top 5 rows

pyspark select multiple columns from the table/dataframe

>>> spark.sql("select code,total_emp,salary from sample_07").show(5,truncate=False)
#Dataframe
>>> df_sample_07.select("code","total_emp","salary").show(5,truncate=False)
+-------+---------+------+
|code   |total_emp|salary|
+-------+---------+------+
|00-0000|134354250|40690 |
|11-0000|6003930  |96150 |
|11-1011|299160   |151370|
|11-1021|1655410  |103780|
|11-1031|61110    |33880 |
+-------+---------+------+
only showing top 5 rows
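
Note that select() also accepts a Python list of column names (or Column objects), which is handy when the column list is built programmatically. A minimal sketch:

#the column list can come from a variable
>>> cols = ["code", "total_emp", "salary"]
>>> df_sample_07.select(cols).show(5,truncate=False)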

pyspark pick first 10 rows from the table

>>> spark.sql("select code,total_emp,salary from sample_07 limit 10 ").show(truncate=False)
#Dataframe
>>> df_sample_07.limit(10).select("code","total_emp","salary").show(truncate=False)
+-------+---------+------+
|code   |total_emp|salary|
+-------+---------+------+
|00-0000|134354250|40690 |
|11-0000|6003930  |96150 |
|11-1011|299160   |151370|
|11-1021|1655410  |103780|
|11-1031|61110    |33880 |
|11-2011|36300    |91100 |
|11-2021|165240   |113400|
|11-2022|322170   |106790|
|11-2031|47210    |97170 |
|11-3011|239360   |76370 |
+-------+---------+------+

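Note that limit(10) returns a new DataFrame capped at 10 rows, whereas show(10) only controls how many rows get printed. If you instead want the rows back on the driver as Python objects, take() is the usual approach; a small sketch:

#take() returns a list of Row objects instead of printing
>>> df_sample_07.select("code","total_emp","salary").take(10)
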
pyspark filter on column value

>>> spark.sql("select * from sample_07 where code='00-0000'").show(5,truncate=False)
#Dataframe
>>> df_sample_07.filter("code='00-0000'").show(5,truncate=False)
#Dataframe --> you can use "where" in place of "filter"; where() is an alias for filter()
>>> df_sample_07.where("code='00-0000'").show(5,truncate=False)

+-------+---------------+---------+------+
|code   |description    |total_emp|salary|
+-------+---------------+---------+------+
|00-0000|All Occupations|134354250|40690 |
+-------+---------------+---------+------+
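
Besides SQL-style expression strings, filter() also accepts Column expressions built with col(). A minimal sketch of the same filter:

>>> from pyspark.sql.functions import col
>>> df_sample_07.filter(col("code") == "00-0000").show(5,truncate=False)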

pyspark filter multiple conditions

>>> spark.sql("select * from sample_07 where total_emp>50000 and salary>30000").show(5,truncate=False)
#Dataframe
>>> df_sample_07.filter("total_emp>50000 and salary>30000").show(5,truncate=False)
+-------+-------------------------------+---------+------+
|code   |description                    |total_emp|salary|
+-------+-------------------------------+---------+------+
|00-0000|All Occupations                |134354250|40690 |
|11-0000|Management occupations         |6003930  |96150 |
|11-1011|Chief executives               |299160   |151370|
|11-1021|General and operations managers|1655410  |103780|
|11-1031|Legislators                    |61110    |33880 |
+-------+-------------------------------+---------+------+
only showing top 5 rows

pyspark filter multiple conditions with OR

>>> spark.sql("select * from sample_07 where total_emp>50000 or salary>30000").show(5,truncate=False)
#Dataframe
>>> df_sample_07.filter("total_emp>50000 or salary>30000").show(5,truncate=False)
+-------+-------------------------------+---------+------+
|code   |description                    |total_emp|salary|
+-------+-------------------------------+---------+------+
|00-0000|All Occupations                |134354250|40690 |
|11-0000|Management occupations         |6003930  |96150 |
|11-1011|Chief executives               |299160   |151370|
|11-1021|General and operations managers|1655410  |103780|
|11-1031|Legislators                    |61110    |33880 |
+-------+-------------------------------+---------+------+
only showing top 5 rows
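
Both the AND and OR variants can also be written with Column expressions, using & for AND and | for OR. Note that each condition must be wrapped in parentheses because of Python operator precedence; a sketch:

>>> from pyspark.sql.functions import col
#AND
>>> df_sample_07.filter((col("total_emp") > 50000) & (col("salary") > 30000)).show(5,truncate=False)
#OR
>>> df_sample_07.filter((col("total_emp") > 50000) | (col("salary") > 30000)).show(5,truncate=False)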

pyspark filter multiple conditions with IN

>>> spark.sql("select * from sample_07 where description IN ('Chief executives','Legislators')").show(5,truncate=False)
#Dataframe
>>> df_sample_07.filter("description IN ('Chief executives','Legislators')").show(5,truncate=False)
+-------+----------------+---------+------+
|code   |description     |total_emp|salary|
+-------+----------------+---------+------+
|11-1011|Chief executives|299160   |151370|
|11-1031|Legislators     |61110    |33880 |
+-------+----------------+---------+------+
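
The same IN filter can be written with the isin() method on a Column; a minimal sketch:

>>> from pyspark.sql.functions import col
>>> df_sample_07.filter(col("description").isin("Chief executives","Legislators")).show(5,truncate=False)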

pyspark dataframe distinct rows

>>> spark.sql("select distinct * from sample_07").show(5,truncate=False)
#Dataframe
>>> df_sample_07.distinct().show(5,truncate=False)
#Dataframe - we can use drop_duplicates() in place of distinct() too
>>> df_sample_07.drop_duplicates().show(5,truncate=False)
+-------+------------------------------------------------------------------+---------+------+
|code   |description                                                       |total_emp|salary|
+-------+------------------------------------------------------------------+---------+------+
|25-1031|Architecture teachers, postsecondary                              |6070     |71480 |
|25-1123|English language and literature teachers, postsecondary           |60910    |60580 |
|25-2031|Secondary school teachers, except special and vocational education|1058870  |52450 |
|47-2161|Plasterers and stucco masons                                      |49310    |39190 |
|51-9082|Medical appliance technicians                                     |11900    |36010 |
+-------+------------------------------------------------------------------+---------+------+
only showing top 5 rows
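
Unlike distinct(), drop_duplicates() can also deduplicate on a subset of columns, keeping one arbitrary row per key. A quick sketch:

#keep one row per distinct value of "code"
>>> df_sample_07.drop_duplicates(["code"]).show(5,truncate=False)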

pyspark select distinct multiple columns

>>> spark.sql("select distinct code,total_emp,salary  from sample_07").show(5,truncate=False)
#Dataframe
>>> df_sample_07.select("code","total_emp","salary").distinct().show(5,truncate=False)
+-------+---------+------+
|code   |total_emp|salary|
+-------+---------+------+
|13-2011|1115010  |63180 |
|19-3021|220740   |66980 |
|23-2092|31160    |40880 |
|27-1011|32290    |83230 |
|43-5041|46220    |33520 |
+-------+---------+------+
only showing top 5 rows

pyspark sort multiple columns

>>> spark.sql("select * from sample_07 order by code asc,salary desc").show(5,truncate=False)
#Dataframe
>>> df_sample_07.orderBy(["code","salary"],ascending=[1,0]).show(5,truncate=False)
+-------+-------------------------------+---------+------+
|code   |description                    |total_emp|salary|
+-------+-------------------------------+---------+------+
|00-0000|All Occupations                |134354250|40690 |
|11-0000|Management occupations         |6003930  |96150 |
|11-1011|Chief executives               |299160   |151370|
|11-1021|General and operations managers|1655410  |103780|
|11-1031|Legislators                    |61110    |33880 |
+-------+-------------------------------+---------+------+
only showing top 5 rows
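
Equivalently, you can pass Column expressions with asc()/desc() to orderBy(), which avoids the parallel "ascending" list; a sketch:

>>> from pyspark.sql.functions import col
>>> df_sample_07.orderBy(col("code").asc(), col("salary").desc()).show(5,truncate=False)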

I will continue to add more pyspark sql and dataframe examples over time.
