PySpark SQL queries & DataFrame commands – Part 1



In PySpark, you can run DataFrame commands, or if you are comfortable with SQL, you can run SQL queries too. In this post, we will see how to run different variations of SELECT queries on a table built on Hive, along with the corresponding DataFrame commands that replicate the output of each SQL query.

Let’s first create a DataFrame for the table “sample_07”, which we will use throughout this post.


df_sample_07 = spark.sql("select * from sample_07")

So now I have the table “sample_07” and a DataFrame “df_sample_07”.

Query 1: Select all the columns


>>> spark.sql("select * from sample_07").show()

+-------+--------------------+---------+------+
| code| description|total_emp|salary|
+-------+--------------------+---------+------+
|00-0000| All Occupations|134354250| 40690|
|11-0000|Management occupa...| 6003930| 96150|
|11-1011| Chief executives| 299160|151370|
|11-1021|General and opera...| 1655410|103780|
|11-1031| Legislators| 61110| 33880|
|11-2011|Advertising and p...| 36300| 91100|
|11-2021| Marketing managers| 165240|113400|
|11-2022| Sales managers| 322170|106790|
|11-2031|Public relations ...| 47210| 97170|
|11-3011|Administrative se...| 239360| 76370|
|11-3021|Computer and info...| 264990|113880|
|11-3031| Financial managers| 484390|106200|
|11-3041|Compensation and ...| 41780| 88400|
|11-3042|Training and deve...| 28170| 90300|
|11-3049|Human resources m...| 58100| 99810|
|11-3051|Industrial produc...| 152870| 87550|
|11-3061| Purchasing managers| 65600| 90430|
|11-3071|Transportation, s...| 92790| 81980|
|11-9011|Farm, ranch, and ...| 3480| 61030|
|11-9012|Farmers and ranchers| 340| 42480|
+-------+--------------------+---------+------+
only showing top 20 rows
>>>
>>> df_sample_07.show()
+-------+--------------------+---------+------+
| code| description|total_emp|salary|
+-------+--------------------+---------+------+
|00-0000| All Occupations|134354250| 40690|
|11-0000|Management occupa...| 6003930| 96150|
|11-1011| Chief executives| 299160|151370|
|11-1021|General and opera...| 1655410|103780|
|11-1031| Legislators| 61110| 33880|
|11-2011|Advertising and p...| 36300| 91100|
|11-2021| Marketing managers| 165240|113400|
|11-2022| Sales managers| 322170|106790|
|11-2031|Public relations ...| 47210| 97170|
|11-3011|Administrative se...| 239360| 76370|
|11-3021|Computer and info...| 264990|113880|
|11-3031| Financial managers| 484390|106200|
|11-3041|Compensation and ...| 41780| 88400|
|11-3042|Training and deve...| 28170| 90300|
|11-3049|Human resources m...| 58100| 99810|
|11-3051|Industrial produc...| 152870| 87550|
|11-3061| Purchasing managers| 65600| 90430|
|11-3071|Transportation, s...| 92790| 81980|
|11-9011|Farm, ranch, and ...| 3480| 61030|
|11-9012|Farmers and ranchers| 340| 42480|
+-------+--------------------+---------+------+
only showing top 20 rows

By default, the PySpark shell prints only 20 records. You can also see that values are truncated after 20 characters. To fix this, specify “truncate=False” in the show method.

>>> spark.sql("select * from sample_07").show(truncate=False)
+-------+--------------------------------------------------+---------+------+
|code |description |total_emp|salary|
+-------+--------------------------------------------------+---------+------+
|00-0000|All Occupations |134354250|40690 |
|11-0000|Management occupations |6003930 |96150 |
|11-1011|Chief executives |299160 |151370|
|11-1021|General and operations managers |1655410 |103780|
|11-1031|Legislators |61110 |33880 |
|11-2011|Advertising and promotions managers |36300 |91100 |
|11-2021|Marketing managers |165240 |113400|
|11-2022|Sales managers |322170 |106790|
|11-2031|Public relations managers |47210 |97170 |
|11-3011|Administrative services managers |239360 |76370 |
|11-3021|Computer and information systems managers |264990 |113880|
|11-3031|Financial managers |484390 |106200|
|11-3041|Compensation and benefits managers |41780 |88400 |
|11-3042|Training and development managers |28170 |90300 |
|11-3049|Human resources managers, all other |58100 |99810 |
|11-3051|Industrial production managers |152870 |87550 |
|11-3061|Purchasing managers |65600 |90430 |
|11-3071|Transportation, storage, and distribution managers|92790 |81980 |
|11-9011|Farm, ranch, and other agricultural managers |3480 |61030 |
|11-9012|Farmers and ranchers |340 |42480 |
+-------+--------------------------------------------------+---------+------+
only showing top 20 rows

>>> df_sample_07.show(truncate=False)
+-------+--------------------------------------------------+---------+------+
|code |description |total_emp|salary|
+-------+--------------------------------------------------+---------+------+
|00-0000|All Occupations |134354250|40690 |
|11-0000|Management occupations |6003930 |96150 |
|11-1011|Chief executives |299160 |151370|
|11-1021|General and operations managers |1655410 |103780|
|11-1031|Legislators |61110 |33880 |
|11-2011|Advertising and promotions managers |36300 |91100 |
|11-2021|Marketing managers |165240 |113400|
|11-2022|Sales managers |322170 |106790|
|11-2031|Public relations managers |47210 |97170 |
|11-3011|Administrative services managers |239360 |76370 |
|11-3021|Computer and information systems managers |264990 |113880|
|11-3031|Financial managers |484390 |106200|
|11-3041|Compensation and benefits managers |41780 |88400 |
|11-3042|Training and development managers |28170 |90300 |
|11-3049|Human resources managers, all other |58100 |99810 |
|11-3051|Industrial production managers |152870 |87550 |
|11-3061|Purchasing managers |65600 |90430 |
|11-3071|Transportation, storage, and distribution managers|92790 |81980 |
|11-9011|Farm, ranch, and other agricultural managers |3480 |61030 |
|11-9012|Farmers and ranchers |340 |42480 |
+-------+--------------------------------------------------+---------+------+
only showing top 20 rows

If you want to show more or fewer rows, you can pass the count as the first parameter of the show method.
Let’s see how to print only 5 records without truncation.

>>> spark.sql("select * from sample_07").show(5,truncate=False)
+-------+-------------------------------+---------+------+
|code |description |total_emp|salary|
+-------+-------------------------------+---------+------+
|00-0000|All Occupations |134354250|40690 |
|11-0000|Management occupations |6003930 |96150 |
|11-1011|Chief executives |299160 |151370|
|11-1021|General and operations managers|1655410 |103780|
|11-1031|Legislators |61110 |33880 |
+-------+-------------------------------+---------+------+
only showing top 5 rows

>>> df_sample_07.show(5,truncate=False)
+-------+-------------------------------+---------+------+
|code |description |total_emp|salary|
+-------+-------------------------------+---------+------+
|00-0000|All Occupations |134354250|40690 |
|11-0000|Management occupations |6003930 |96150 |
|11-1011|Chief executives |299160 |151370|
|11-1021|General and operations managers|1655410 |103780|
|11-1031|Legislators |61110 |33880 |
+-------+-------------------------------+---------+------+
only showing top 5 rows

Query 2: Pick only a few columns from the table


>>> spark.sql("select code,total_emp,salary from sample_07").show(5,truncate=False)
+-------+---------+------+
|code |total_emp|salary|
+-------+---------+------+
|00-0000|134354250|40690 |
|11-0000|6003930 |96150 |
|11-1011|299160 |151370|
|11-1021|1655410 |103780|
|11-1031|61110 |33880 |
+-------+---------+------+
only showing top 5 rows

>>> df_sample_07.select("code","total_emp","salary").show(5,truncate=False)
+-------+---------+------+
|code |total_emp|salary|
+-------+---------+------+
|00-0000|134354250|40690 |
|11-0000|6003930 |96150 |
|11-1011|299160 |151370|
|11-1021|1655410 |103780|
|11-1031|61110 |33880 |
+-------+---------+------+
only showing top 5 rows

Query 3: Pick top 10 rows from the table


>>> spark.sql("select code,total_emp,salary from sample_07 limit 10 ").show(truncate=False)
+-------+---------+------+
|code |total_emp|salary|
+-------+---------+------+
|00-0000|134354250|40690 |
|11-0000|6003930 |96150 |
|11-1011|299160 |151370|
|11-1021|1655410 |103780|
|11-1031|61110 |33880 |
|11-2011|36300 |91100 |
|11-2021|165240 |113400|
|11-2022|322170 |106790|
|11-2031|47210 |97170 |
|11-3011|239360 |76370 |
+-------+---------+------+

>>> df_sample_07.limit(10).select("code","total_emp","salary").show(truncate=False)
+-------+---------+------+
|code |total_emp|salary|
+-------+---------+------+
|00-0000|134354250|40690 |
|11-0000|6003930 |96150 |
|11-1011|299160 |151370|
|11-1021|1655410 |103780|
|11-1031|61110 |33880 |
|11-2011|36300 |91100 |
|11-2021|165240 |113400|
|11-2022|322170 |106790|
|11-2031|47210 |97170 |
|11-3011|239360 |76370 |
+-------+---------+------+

Query 4: Apply filter conditions


>>> spark.sql("select * from sample_07 where code='00-0000'").show(5,truncate=False)
+-------+---------------+---------+------+
|code |description |total_emp|salary|
+-------+---------------+---------+------+
|00-0000|All Occupations|134354250|40690 |
+-------+---------------+---------+------+

>>> df_sample_07.filter("code='00-0000'").show(5,truncate=False)
+-------+---------------+---------+------+
|code |description |total_emp|salary|
+-------+---------------+---------+------+
|00-0000|All Occupations|134354250|40690 |
+-------+---------------+---------+------+

You can also use “where” in place of “filter” when writing DataFrame code; the two methods are aliases of each other.

>>> df_sample_07.where("code='00-0000'").show(5,truncate=False)
+-------+---------------+---------+------+
|code |description |total_emp|salary|
+-------+---------------+---------+------+
|00-0000|All Occupations|134354250|40690 |
+-------+---------------+---------+------+

Query 5: Apply multiple filter conditions with AND condition


>>> spark.sql("select * from sample_07 where total_emp>50000 and salary>30000").show(5,truncate=False)
+-------+-------------------------------+---------+------+
|code |description |total_emp|salary|
+-------+-------------------------------+---------+------+
|00-0000|All Occupations |134354250|40690 |
|11-0000|Management occupations |6003930 |96150 |
|11-1011|Chief executives |299160 |151370|
|11-1021|General and operations managers|1655410 |103780|
|11-1031|Legislators |61110 |33880 |
+-------+-------------------------------+---------+------+
only showing top 5 rows

>>> df_sample_07.filter("total_emp>50000 and salary>30000").show(5,truncate=False)
+-------+-------------------------------+---------+------+
|code |description |total_emp|salary|
+-------+-------------------------------+---------+------+
|00-0000|All Occupations |134354250|40690 |
|11-0000|Management occupations |6003930 |96150 |
|11-1011|Chief executives |299160 |151370|
|11-1021|General and operations managers|1655410 |103780|
|11-1031|Legislators |61110 |33880 |
+-------+-------------------------------+---------+------+
only showing top 5 rows

Query 6: Apply multiple filter conditions with OR condition


>>> spark.sql("select * from sample_07 where total_emp>50000 or salary>30000").show(5,truncate=False)
+-------+-------------------------------+---------+------+
|code |description |total_emp|salary|
+-------+-------------------------------+---------+------+
|00-0000|All Occupations |134354250|40690 |
|11-0000|Management occupations |6003930 |96150 |
|11-1011|Chief executives |299160 |151370|
|11-1021|General and operations managers|1655410 |103780|
|11-1031|Legislators |61110 |33880 |
+-------+-------------------------------+---------+------+
only showing top 5 rows

>>> df_sample_07.filter("total_emp>50000 or salary>30000").show(5,truncate=False)
+-------+-------------------------------+---------+------+
|code |description |total_emp|salary|
+-------+-------------------------------+---------+------+
|00-0000|All Occupations |134354250|40690 |
|11-0000|Management occupations |6003930 |96150 |
|11-1011|Chief executives |299160 |151370|
|11-1021|General and operations managers|1655410 |103780|
|11-1031|Legislators |61110 |33880 |
+-------+-------------------------------+---------+------+
only showing top 5 rows

Query 7: Filter rows on a column using the “IN” condition


>>> spark.sql("select * from sample_07 where description IN ('Chief executives','Legislators')").show(5,truncate=False)
+-------+----------------+---------+------+
|code |description |total_emp|salary|
+-------+----------------+---------+------+
|11-1011|Chief executives|299160 |151370|
|11-1031|Legislators |61110 |33880 |
+-------+----------------+---------+------+

>>> df_sample_07.filter("description IN ('Chief executives','Legislators')").show(5,truncate=False)
+-------+----------------+---------+------+
|code |description |total_emp|salary|
+-------+----------------+---------+------+
|11-1011|Chief executives|299160 |151370|
|11-1031|Legislators |61110 |33880 |
+-------+----------------+---------+------+

Query 8: Fetch only DISTINCT rows from the table


>>> spark.sql("select distinct * from sample_07").show(5,truncate=False)
+-------+------------------------------------------------------------------+---------+------+
|code |description |total_emp|salary|
+-------+------------------------------------------------------------------+---------+------+
|25-1031|Architecture teachers, postsecondary |6070 |71480 |
|25-1123|English language and literature teachers, postsecondary |60910 |60580 |
|25-2031|Secondary school teachers, except special and vocational education|1058870 |52450 |
|47-2161|Plasterers and stucco masons |49310 |39190 |
|51-9082|Medical appliance technicians |11900 |36010 |
+-------+------------------------------------------------------------------+---------+------+
only showing top 5 rows

>>> df_sample_07.distinct().show(5,truncate=False)
+-------+------------------------------------------------------------------+---------+------+
|code |description |total_emp|salary|
+-------+------------------------------------------------------------------+---------+------+
|25-1031|Architecture teachers, postsecondary |6070 |71480 |
|25-1123|English language and literature teachers, postsecondary |60910 |60580 |
|25-2031|Secondary school teachers, except special and vocational education|1058870 |52450 |
|47-2161|Plasterers and stucco masons |49310 |39190 |
|51-9082|Medical appliance technicians |11900 |36010 |
+-------+------------------------------------------------------------------+---------+------+
only showing top 5 rows

You can also use “drop_duplicates” or “dropDuplicates” to get the same result.

>>> df_sample_07.drop_duplicates().show(5,truncate=False)
+-------+------------------------------------------------------------------+---------+------+
|code |description |total_emp|salary|
+-------+------------------------------------------------------------------+---------+------+
|25-1031|Architecture teachers, postsecondary |6070 |71480 |
|25-1123|English language and literature teachers, postsecondary |60910 |60580 |
|25-2031|Secondary school teachers, except special and vocational education|1058870 |52450 |
|47-2161|Plasterers and stucco masons |49310 |39190 |
|51-9082|Medical appliance technicians |11900 |36010 |
+-------+------------------------------------------------------------------+---------+------+
only showing top 5 rows
>>> df_sample_07.dropDuplicates().show(5,truncate=False)
+-------+------------------------------------------------------------------+---------+------+
|code |description |total_emp|salary|
+-------+------------------------------------------------------------------+---------+------+
|25-1031|Architecture teachers, postsecondary |6070 |71480 |
|25-1123|English language and literature teachers, postsecondary |60910 |60580 |
|25-2031|Secondary school teachers, except special and vocational education|1058870 |52450 |
|47-2161|Plasterers and stucco masons |49310 |39190 |
|51-9082|Medical appliance technicians |11900 |36010 |
+-------+------------------------------------------------------------------+---------+------+
only showing top 5 rows

Query 9: Fetch distinct values for a subset of columns from the table


>>> spark.sql("select distinct code,total_emp,salary from sample_07").show(5,truncate=False)

+-------+---------+------+
|code |total_emp|salary|
+-------+---------+------+
|13-2011|1115010 |63180 |
|19-3021|220740 |66980 |
|23-2092|31160 |40880 |
|27-1011|32290 |83230 |
|43-5041|46220 |33520 |
+-------+---------+------+
only showing top 5 rows

>>> df_sample_07.select("code","total_emp","salary").distinct().show(5,truncate=False)
+-------+---------+------+
|code |total_emp|salary|
+-------+---------+------+
|13-2011|1115010 |63180 |
|19-3021|220740 |66980 |
|23-2092|31160 |40880 |
|27-1011|32290 |83230 |
|43-5041|46220 |33520 |
+-------+---------+------+
only showing top 5 rows

Query 10: Sort rows by columns in ascending & descending order


>>> spark.sql("select * from sample_07 order by code asc,salary desc").show(5,truncate=False)
+-------+-------------------------------+---------+------+
|code |description |total_emp|salary|
+-------+-------------------------------+---------+------+
|00-0000|All Occupations |134354250|40690 |
|11-0000|Management occupations |6003930 |96150 |
|11-1011|Chief executives |299160 |151370|
|11-1021|General and operations managers|1655410 |103780|
|11-1031|Legislators |61110 |33880 |
+-------+-------------------------------+---------+------+
only showing top 5 rows

>>> df_sample_07.orderBy(["code","salary"],ascending=[1,0]).show(5,truncate=False)
+-------+-------------------------------+---------+------+
|code |description |total_emp|salary|
+-------+-------------------------------+---------+------+
|00-0000|All Occupations |134354250|40690 |
|11-0000|Management occupations |6003930 |96150 |
|11-1011|Chief executives |299160 |151370|
|11-1021|General and operations managers|1655410 |103780|
|11-1031|Legislators |61110 |33880 |
+-------+-------------------------------+---------+------+
only showing top 5 rows

In the next post, we will see some more examples in PySpark.
