PySpark Demo Day 2

PySpark Tutorial – Distinct, Filter, Sort on Dataframe

This is the second part of the PySpark Tutorial series. In this post, we will cover:

  • How to fetch unique values from a dataframe in PySpark
  • How to use Filter to select specific records from a dataframe in PySpark, using:
    • AND
    • OR
    • LIKE
    • IN
    • BETWEEN
    • NULL
  • How to SORT data on the basis of one or more columns, in ascending or descending order.

In the previous post, we covered the following points; if you haven’t read it yet, I strongly recommend reading it first.

  1. Check Hadoop/Python/Spark version
  2. Connect to PySpark CLI
  3. Read CSV file into Dataframe and check some/all columns & rows in it.
  4. Check schema and copy schema from one dataframe to another
  5. Basic Metadata info of Dataframe

Let’s pick up where we left off in the previous post, in which we created the dataframe “df_category”. We will use the same dataframe for the different examples in this post too.

+-----+--------+---------+------------------------------------------+
|catid|catgroup|catname  |catdesc                                   |
+-----+--------+---------+------------------------------------------+
|1    |Sports  |MLB      |Major League Baseball                     |
|2    |Sports  |NHL      |National Hockey League                    |
|3    |Sports  |NFL      |National Football League                  |
|4    |Sports  |NBA      |National Basketball Association           |
|5    |Sports  |MLS      |Major League Soccer                       |
|6    |Shows   |Musicals |Musical theatre                           |
|7    |Shows   |Plays    |All non-musical theatre                   |
|8    |Shows   |Opera    |All opera and light opera                 |
|9    |Concerts|Pop      |All rock and pop music concerts           |
|10   |Concerts|Jazz     |All jazz singers and bands                |
|11   |Concerts|Classical|All symphony, concerto, and choir concerts|
+-----+--------+---------+------------------------------------------+

You can download the data file used in the examples below for practice. I have uploaded the file to S3, and not everyone may have access to it, so please change the path to point to wherever you have placed the data file.
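
If you need to recreate “df_category” from scratch, a minimal sketch is shown below; the bucket name and file path are placeholders, so point them at your own copy of the data. In the PySpark CLI the “spark” session already exists, and the header option assumes the file’s first row holds the column names.

# Placeholder path -- change it to your own S3 or local location
df_category = spark.read.option('header', 'true').csv('s3://your-bucket/category.csv')

# Without inferSchema, every column (including catid) is read as a string,
# which matters for sorting later in this post
df_category.printSchema()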

Case 1: PySpark Distinct Rows

You can select all columns by specifying “*” and then apply the distinct() function on the result to fetch distinct rows.

df_category.select('*').distinct().show(truncate=False)
+-----+--------+---------+------------------------------------------+
|catid|catgroup|catname  |catdesc                                   |
+-----+--------+---------+------------------------------------------+
|1    |Sports  |MLB      |Major League Baseball                     |
|2    |Sports  |NHL      |National Hockey League                    |
|3    |Sports  |NFL      |National Football League                  |
|4    |Sports  |NBA      |National Basketball Association           |
|5    |Sports  |MLS      |Major League Soccer                       |
|6    |Shows   |Musicals |Musical theatre                           |
|7    |Shows   |Plays    |All non-musical theatre                   |
|8    |Shows   |Opera    |All opera and light opera                 |
|9    |Concerts|Pop      |All rock and pop music concerts           |
|10   |Concerts|Jazz     |All jazz singers and bands                |
|11   |Concerts|Classical|All symphony, concerto, and choir concerts|
+-----+--------+---------+------------------------------------------+
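
An equivalent way to deduplicate rows is dropDuplicates(), which also accepts an optional subset of columns to deduplicate on. A minimal sketch:

# Equivalent to distinct() when called without arguments
df_category.dropDuplicates().show(truncate=False)

# With a subset, it keeps one full row per duplicate group (an arbitrary one)
df_category.dropDuplicates(['catgroup']).show(truncate=False)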

Case 2: PySpark Distinct on one column

If you want to check the distinct values of a single column, mention that column in the select and then apply distinct() on it.

df_category.select('catgroup').distinct().show(truncate=False)
+--------+
|catgroup|
+--------+
|Sports  |
|Concerts|
|Shows   |
+--------+
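
If you only need the number of distinct values rather than the values themselves, countDistinct from pyspark.sql.functions is a handy shortcut:

from pyspark.sql.functions import countDistinct

# Returns the count of distinct category groups (3 for this data)
df_category.select(countDistinct('catgroup').alias('group_count')).show()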

Case 3: PySpark Distinct multiple columns

If you want to check distinct combinations of values across multiple columns, add those columns in the select and then apply distinct() on it.

df_category.select('catgroup','catname').distinct().show(truncate=False)
+--------+---------+
|catgroup|catname  |
+--------+---------+
|Sports  |NBA      |
|Sports  |MLS      |
|Sports  |NHL      |
|Shows   |Plays    |
|Concerts|Pop      |
|Concerts|Jazz     |
|Shows   |Opera    |
|Sports  |NFL      |
|Shows   |Musicals |
|Concerts|Classical|
|Sports  |MLB      |
+--------+---------+

Case 4: PySpark Filter by column value

You can use the FILTER or WHERE function to apply a filter condition on a column by specifying the column value; the two functions are aliases and return the same output.

df_category.filter("catgroup = 'Sports'").select('*').show(truncate=False)
+-----+--------+-------+-------------------------------+
|catid|catgroup|catname|catdesc                        |
+-----+--------+-------+-------------------------------+
|1    |Sports  |MLB    |Major League Baseball          |
|2    |Sports  |NHL    |National Hockey League         |
|3    |Sports  |NFL    |National Football League       |
|4    |Sports  |NBA    |National Basketball Association|
|5    |Sports  |MLS    |Major League Soccer            |
+-----+--------+-------+-------------------------------+
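
The same filter can also be written as a column expression instead of a SQL string, using the col function:

from pyspark.sql.functions import col

# Column-expression equivalent of the SQL-string filter above
df_category.filter(col('catgroup') == 'Sports').show(truncate=False)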

PySpark Filter on multiple columns or multiple conditions

You can combine conditions on multiple columns in the filter function to get the required results, joining them with “AND” or “OR”.

Case 5: PySpark Filter on multiple conditions with AND

df_category.where("catgroup = 'Sports' and catname='NBA' ").select('*').show(truncate=False)
+-----+--------+-------+-------------------------------+
|catid|catgroup|catname|catdesc                        |
+-----+--------+-------+-------------------------------+
|4    |Sports  |NBA    |National Basketball Association|
+-----+--------+-------+-------------------------------+

Case 6: PySpark Filter on multiple conditions with OR

df_category.where("catgroup = 'Sports' or catname='NBA' ").select('*').show(truncate=False)
+-----+--------+-------+-------------------------------+
|catid|catgroup|catname|catdesc                        |
+-----+--------+-------+-------------------------------+
|1    |Sports  |MLB    |Major League Baseball          |
|2    |Sports  |NHL    |National Hockey League         |
|3    |Sports  |NFL    |National Football League       |
|4    |Sports  |NBA    |National Basketball Association|
|5    |Sports  |MLS    |Major League Soccer            |
+-----+--------+-------+-------------------------------+
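
If you prefer column expressions over SQL strings, the same conditions use & for AND and | for OR; each condition needs its own parentheses because of Python operator precedence. A sketch:

from pyspark.sql.functions import col

# AND: both conditions must hold
df_category.where((col('catgroup') == 'Sports') & (col('catname') == 'NBA')).show(truncate=False)

# OR: either condition may hold
df_category.where((col('catgroup') == 'Sports') | (col('catname') == 'NBA')).show(truncate=False)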

Case 7: PySpark Filter with LIKE operator

You can use LIKE in filter conditions to match column values that start with, end with, or contain a specific character or string pattern.

df_category.where("catgroup like 'S%' or catname='NBA' ").select('*').show(truncate=False)
+-----+--------+--------+-------------------------------+
|catid|catgroup|catname |catdesc                        |
+-----+--------+--------+-------------------------------+
|1    |Sports  |MLB     |Major League Baseball          |
|2    |Sports  |NHL     |National Hockey League         |
|3    |Sports  |NFL     |National Football League       |
|4    |Sports  |NBA     |National Basketball Association|
|5    |Sports  |MLS     |Major League Soccer            |
|6    |Shows   |Musicals|Musical theatre                |
|7    |Shows   |Plays   |All non-musical theatre        |
|8    |Shows   |Opera   |All opera and light opera      |
+-----+--------+--------+-------------------------------+
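
The example above covers only the starts-with pattern; sketches of the ends-with and contains patterns, plus the equivalent like() column method, are below:

# Ends-with: '%' before the pattern
df_category.where("catname like '%L'").show(truncate=False)

# Contains: '%' on both sides
df_category.where("catdesc like '%League%'").show(truncate=False)

# Column-expression equivalent using the like() method
from pyspark.sql.functions import col
df_category.where(col('catgroup').like('S%')).show(truncate=False)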

Case 8: PySpark Filter with NOT NULL

df_category.where("catgroup is not null ").select('*').show(truncate=False)
+-----+--------+---------+------------------------------------------+
|catid|catgroup|catname  |catdesc                                   |
+-----+--------+---------+------------------------------------------+
|1    |Sports  |MLB      |Major League Baseball                     |
|2    |Sports  |NHL      |National Hockey League                    |
|3    |Sports  |NFL      |National Football League                  |
|4    |Sports  |NBA      |National Basketball Association           |
|5    |Sports  |MLS      |Major League Soccer                       |
|6    |Shows   |Musicals |Musical theatre                           |
|7    |Shows   |Plays    |All non-musical theatre                   |
|8    |Shows   |Opera    |All opera and light opera                 |
|9    |Concerts|Pop      |All rock and pop music concerts           |
|10   |Concerts|Jazz     |All jazz singers and bands                |
|11   |Concerts|Classical|All symphony, concerto, and choir concerts|
+-----+--------+---------+------------------------------------------+

Case 9: PySpark Filter with NULL

df_category.where("catgroup is null ").select('*').show(truncate=False)
+-----+--------+-------+-------+
|catid|catgroup|catname|catdesc|
+-----+--------+-------+-------+
+-----+--------+-------+-------+

Case 10: PySpark Filter BETWEEN two values

You can use BETWEEN in a filter condition to fetch a range of values from the dataframe. Always give the range from the minimum value to the maximum value, otherwise you will get no results. Both bounds are inclusive, and the filter works between two integers, two dates, or any other ordered values.

df_category.where("catid between 3 and 8 ").select('*').show(truncate=False)
+-----+--------+--------+-------------------------------+
|catid|catgroup|catname |catdesc                        |
+-----+--------+--------+-------------------------------+
|3    |Sports  |NFL     |National Football League       |
|4    |Sports  |NBA     |National Basketball Association|
|5    |Sports  |MLS     |Major League Soccer            |
|6    |Shows   |Musicals|Musical theatre                |
|7    |Shows   |Plays   |All non-musical theatre        |
|8    |Shows   |Opera   |All opera and light opera      |
+-----+--------+--------+-------------------------------+
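
The column API offers an equivalent between() method, which is also inclusive of both bounds:

from pyspark.sql.functions import col

# Column-expression equivalent of the SQL BETWEEN above
df_category.where(col('catid').between(3, 8)).show(truncate=False)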

Case 11: PySpark Filter IN condition

You can also pass a list of values in the filter condition to fetch the required rows using IN.

df_category.where("catid in ( 3 ,8, 10) ").select('*').show(truncate=False)
+-----+--------+-------+--------------------------+
|catid|catgroup|catname|catdesc                   |
+-----+--------+-------+--------------------------+
|3    |Sports  |NFL    |National Football League  |
|8    |Shows   |Opera  |All opera and light opera |
|10   |Concerts|Jazz   |All jazz singers and bands|
+-----+--------+-------+--------------------------+
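
The column-expression equivalent is the isin() method, which accepts either individual values or a Python list:

from pyspark.sql.functions import col

# Individual values
df_category.where(col('catid').isin(3, 8, 10)).show(truncate=False)

# Or a list
ids = [3, 8, 10]
df_category.where(col('catid').isin(ids)).show(truncate=False)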

Case 12: PySpark SORT by column value

You can sort the dataframe output by one or more columns. The default sort order is ascending.

df_category.sort('catgroup').select('*').show(truncate=False)
+-----+--------+---------+------------------------------------------+
|catid|catgroup|catname  |catdesc                                   |
+-----+--------+---------+------------------------------------------+
|11   |Concerts|Classical|All symphony, concerto, and choir concerts|
|10   |Concerts|Jazz     |All jazz singers and bands                |
|9    |Concerts|Pop      |All rock and pop music concerts           |
|6    |Shows   |Musicals |Musical theatre                           |
|8    |Shows   |Opera    |All opera and light opera                 |
|7    |Shows   |Plays    |All non-musical theatre                   |
|1    |Sports  |MLB      |Major League Baseball                     |
|2    |Sports  |NHL      |National Hockey League                    |
|3    |Sports  |NFL      |National Football League                  |
|4    |Sports  |NBA      |National Basketball Association           |
|5    |Sports  |MLS      |Major League Soccer                       |
+-----+--------+---------+------------------------------------------+
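
orderBy() is an alias for sort(), so either spelling works:

# Identical to df_category.sort('catgroup')
df_category.orderBy('catgroup').show(truncate=False)

The output is the same as the sort() call above.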

Case 13: PySpark SORT by column value in Descending Order

However, if you want to sort in descending order, you have to use the “desc()” function. Before you can apply it, import the “col” function, which turns a column name into a Column object that desc() can be called on. A plain Python string has no desc() method, so calling it directly on the column name throws an error:

df_category.sort('catgroup'.desc()).select('*').show(truncate=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'desc'

To fix the error, import col from pyspark.sql.functions, use it to get a Column object, and then apply the desc() function on it.

from pyspark.sql.functions import col

df_category.sort(col('catgroup').desc()).select('*').show(truncate=False)
+-----+--------+---------+------------------------------------------+
|catid|catgroup|catname  |catdesc                                   |
+-----+--------+---------+------------------------------------------+
|3    |Sports  |NFL      |National Football League                  |
|1    |Sports  |MLB      |Major League Baseball                     |
|4    |Sports  |NBA      |National Basketball Association           |
|2    |Sports  |NHL      |National Hockey League                    |
|5    |Sports  |MLS      |Major League Soccer                       |
|7    |Shows   |Plays    |All non-musical theatre                   |
|8    |Shows   |Opera    |All opera and light opera                 |
|6    |Shows   |Musicals |Musical theatre                           |
|9    |Concerts|Pop      |All rock and pop music concerts           |
|10   |Concerts|Jazz     |All jazz singers and bands                |
|11   |Concerts|Classical|All symphony, concerto, and choir concerts|
+-----+--------+---------+------------------------------------------+
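
Two equivalent ways to express the same descending sort, if you prefer them: the attribute style on the dataframe, or the desc() helper from pyspark.sql.functions, which takes the column name as a string:

from pyspark.sql.functions import desc

# Attribute style: columns are also reachable as dataframe attributes
df_category.sort(df_category.catgroup.desc()).show(truncate=False)

# desc() helper from pyspark.sql.functions
df_category.sort(desc('catgroup')).show(truncate=False)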

Case 14: PySpark dataframe Sort by multiple columns

You can pass multiple columns to the sort function and set the order style per column – ascending (default) or descending.

df_category.sort(col('catgroup').desc(),col('catid')).select('*').show(truncate=False)
+-----+--------+---------+------------------------------------------+
|catid|catgroup|catname  |catdesc                                   |
+-----+--------+---------+------------------------------------------+
|1    |Sports  |MLB      |Major League Baseball                     |
|2    |Sports  |NHL      |National Hockey League                    |
|3    |Sports  |NFL      |National Football League                  |
|4    |Sports  |NBA      |National Basketball Association           |
|5    |Sports  |MLS      |Major League Soccer                       |
|6    |Shows   |Musicals |Musical theatre                           |
|7    |Shows   |Plays    |All non-musical theatre                   |
|8    |Shows   |Opera    |All opera and light opera                 |
|10   |Concerts|Jazz     |All jazz singers and bands                |
|11   |Concerts|Classical|All symphony, concerto, and choir concerts|
|9    |Concerts|Pop      |All rock and pop music concerts           |
+-----+--------+---------+------------------------------------------+
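
Notice that within the Concerts group the catid values come out as 10, 11, 9: the CSV read left catid as a string, so it sorts lexicographically rather than numerically. Casting it to an integer restores numeric order; sort() also accepts an ascending parameter with one boolean per column. A sketch of both:

from pyspark.sql.functions import col

# Cast catid to int so it sorts numerically: catgroup descending, catid ascending
df_category.sort(col('catgroup').desc(), col('catid').cast('int')).show(truncate=False)

# Alternative: one boolean per sort column via the ascending parameter
df_category.sort(['catgroup', 'catid'], ascending=[False, True]).show(truncate=False)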

In the next post, we will see how to calculate aggregates like COUNT, MIN, and MAX on a PySpark dataframe.
