Home » PySpark Filter – 25 examples to teach you everything

PySpark Filter – 25 examples to teach you everything

PySpark Filter is used to specify conditions and only the rows that satisfies those conditions are returned in the output. You can use WHERE or FILTER function in PySpark to apply conditional checks on the input rows and only the rows that pass all the mentioned checks will move to output result set.

PySpark WHERE vs FILTER

There is NO difference between FILTER or WHERE function in PySpark. FILTER echos more with the people coming from programming background like Scala and WHERE is more popular with people like me coming from SQL background. So you can use WHERE or FILTER which ever you wish to use in PySpark and there is absolutely no difference between the two.

I have used "FILTER" in the examples below but you can use "WHERE" as well.

PySpark Filter Condition Options

In this post , we will look into 10 most commonly used filter options available in PySpark.

  • equal
  • isin / in
  • like
  • rlike
  • between
  • not
  • isNull
  • contains
  • and
  • or

Refer to below diagram for easy reference to the multiple options available in PySpark Filter conditions.

PySpark FILTER / WHERE Conditions

Points to consider during FILTER

  • NOT : NOT is special conditional check which can reverse the obvious output. It basically negates the actual output by applying opposite FILTER condition on the input dataset to yield output. Also NOT can be clubbed along with most of the other FILTER condition operators like IN –> NOT IN , LIKE –> NOT LIKE etc.
  • AND/OR : AND and OR are used to merge multiple conditional checks together as single statement. When you have more than one condition for your input dataset then you will use AND/OR to connect individual conditions into single statement.
  • Case Sensitivity : Please be very careful while running filter conditions on string columns/values. It is case sensitive. So values 'SHOES','Shoes','shoes','shOes' all are different.

PySpark – create dataframe for testing

For this post, we will use amazon product review dataset for shoes category. We will create dataframe for it and then we will run different filter conditions on the dataframe rows and see the output.

>>> from pyspark.sql.functions import col
>>> df_shoes = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Shoes/")
>>> df_shoes.printSchema()
root
 |-- marketplace: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: string (nullable = true)
 |-- product_title: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)
 |-- verified_purchase: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: date (nullable = true)
 |-- year: integer (nullable = true)

pyspark select where

All the examples below apply some where condition and select only the required columns in the output. You can use select * to get all the columns else you can use select column_list to fetch only required columns.

PySpark filter equal

This is the most basic form of FILTER condition where you compare the column value with a given static value. If the value matches then the row is passed to output else it is restricted. In PySpark, you can use "==" operator to denote equal condition.

syntax :: filter(col("marketplace")=='UK')

df_shoes.filter(col("marketplace")=='UK').select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         UK|Crocs Professiona...|          5|          1|                Y| 2014-04-14|
|         UK|Crocs Professiona...|          4|          0|                Y| 2014-04-15|
|         UK|Crocs Unisex Blit...|          5|         24|                N| 2010-01-04|
|         UK|Crocs Unisex Baya...|          5|         34|                N| 2010-02-23|
|         UK|Crocs Jayna, Wome...|          4|          0|                Y| 2014-04-26|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark filter not equal

You can use not equal condition to fetch only those rows which does not match to any given value. NOT Equal to is denoted by "!=" operator in PySpark.

syntax :: filter(col("marketplace") != 'UK')

df_shoes.filter(col("marketplace") != 'UK').select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         US|Reebok Men's Oakl...|          5|          0|                Y| 2009-05-17|
|         US|Dr. Scholl's Wome...|          1|          1|                Y| 2014-04-09|
|         US|Kenneth Cole REAC...|          4|          1|                Y| 2009-05-17|
|         US|Rockport Cobb Hil...|          4|          0|                Y| 2014-04-09|
|         US|Yak Pak Megu Hand...|          5|          0|                Y| 2009-05-17|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark isin

When you have to make sure that the given column has value which is in list of allowed values only then you can use "isin" filter to get required rows in the result-set. Unlike "equal" to operator , in "isin" you can give list of values to compare and if the column value matches to anyone value in the list then it is passed. This is equivalent to "in" operation we do in SQL. Like in the below example we only want rows where marketplace is in UK or FR.

syntax :: filter(col("marketplace").isin('UK','FR'))

df_shoes.filter(col("marketplace").isin('UK','FR')).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         FR|Crocs Jayna, Ball...|          4|          8|                Y| 2014-04-11|
|         UK|Crocs Professiona...|          5|          1|                Y| 2014-04-14|
|         UK|Crocs Professiona...|          4|          0|                Y| 2014-04-15|
|         UK|Crocs Unisex Blit...|          5|         24|                N| 2010-01-04|
|         UK|Crocs Unisex Baya...|          5|         34|                N| 2010-02-23|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark isin list

isin function accepts the list of values so you can also pass list directly to the isin function in place of individual values. Let's create a list (li) and pass that to isin function to get the output.

syntax :: filter(col("marketplace").isin(*li))

li=['UK','FR']

df_shoes.filter(col("marketplace").isin(*li)).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         FR|Crocs Jayna, Ball...|          4|          8|                Y| 2014-04-11|
|         UK|Crocs Professiona...|          5|          1|                Y| 2014-04-14|
|         UK|Crocs Professiona...|          4|          0|                Y| 2014-04-15|
|         UK|Crocs Unisex Blit...|          5|         24|                N| 2010-01-04|
|         UK|Crocs Unisex Baya...|          5|         34|                N| 2010-02-23|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark NOT isin

As I mentioned above , NOT operator can be clubbed to any existing condition and it basically reverses the output. In PySpark, you can use "~" symbol to represent NOT operation on existing condition. So it is like in place of checking FALSE , you are checking NOT TRUE. You can use this negation operator along with most of the filter conditions and operators available. So in the previous example , let us add "~" in filter condition and see the output.

syntax :: filter(~col("marketplace").isin(*li))

li=['UK','FR']

df_shoes.filter(~col("marketplace").isin(*li)).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         US|Reebok Men's Oakl...|          5|          0|                Y| 2009-05-17|
|         US|Dr. Scholl's Wome...|          1|          1|                Y| 2014-04-09|
|         US|Kenneth Cole REAC...|          4|          1|                Y| 2009-05-17|
|         US|Rockport Cobb Hil...|          4|          0|                Y| 2014-04-09|
|         US|Yak Pak Megu Hand...|          5|          0|                Y| 2009-05-17|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark LIKE

LIKE is used to do pattern matching in the filter condition. When you are looking for specific string pattern in the column value then LIKE operator is used. You have to use "%" in LIKE to represent rest of the STRING which can be anything and is of not much interest in filter condition. Also you can identify any given character/s at the start or at the end or somewhere in the input string using LIKE function.

syntax :: filter(col("product_title").like('Nike%'))

df_shoes.filter(col("product_title").like('Nike%')).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         US|Nike Boy's Sunray...|          5|          0|                N| 2014-04-10|
|         US|Nike Golf Women's...|          5|          1|                Y| 2014-04-10|
|         US|Nike Jordan Men's...|          5|          1|                Y| 2014-04-10|
|         US|Nike Air Max 2014...|          5|          0|                Y| 2014-04-10|
|         US|Nike Revive Inter...|          5|          0|                Y| 2009-06-01|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark LIKE multiple values

You can also apply multiple conditions using LIKE operator on same column or different column by using "|" operator for each condition in LIKE. Unlike isin , LIKE does not accept list of values. Hence we have to separately pass the different values to LIKE function.

syntax :: filter(col("product_title").like('Nike%')|col("product_title").like('adidas%'))

df_shoes.filter(col("product_title").like('Nike%')|col("product_title").like('adidas%')).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         US|adidas Little Kid...|          5|          0|                Y| 2009-05-19|
|         US|Nike Boy's Sunray...|          5|          0|                N| 2014-04-10|
|         US|adidas Women's Su...|          5|          0|                N| 2009-05-21|
|         US|adidas Men's Pred...|          5|          1|                N| 2009-05-21|
|         US|adidas Men's Abso...|          5|          0|                Y| 2009-05-21|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark NOT LIKE

Similarly, we can use negation (~) operator in front of LIKE condition to make it NOT LIKE. It will return all the rows which does not match the pattern mentioned in the LIKE condition.

syntax :: filter(~col("product_title").like('Nike%'))

df_shoes.filter(~col("product_title").like('nike%')).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         US|Reebok Men's Oakl...|          5|          0|                Y| 2009-05-17|
|         US|Dr. Scholl's Wome...|          1|          1|                Y| 2014-04-09|
|         US|Kenneth Cole REAC...|          4|          1|                Y| 2009-05-17|
|         US|Rockport Cobb Hil...|          4|          0|                Y| 2014-04-09|
|         US|Yak Pak Megu Hand...|          5|          0|                Y| 2009-05-17|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark RLIKE

RLIKE is Regex-LIKE which means you can use regular expression in LIKE operator to make more powerful search in the given string. This makes pattern searching extremely flexible , useful and dangerous at the same time. So please be very careful while using regular expression in filter condition as it may give unexpected result as the input data changes. In the below example I have used REGEX to pick only those records which starts with Reebok and then anything after that till the end of string.

syntax :: filter(col("product_title").rlike("^(Reebok)(.*)+$"))

df_shoes.filter(col("product_title").rlike("^(Reebok)(.*)+$")).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         US|Reebok Men's Oakl...|          5|          0|                Y| 2009-05-17|
|         US|Reebok Ziglite Ru...|          5|          1|                Y| 2014-04-09|
|         US|Reebok San Franci...|          2|          0|                Y| 2009-05-21|
|         US|Reebok Women's Re...|          3|          0|                Y| 2014-04-10|
|         US|Reebok Men's BB 4...|          2|          0|                Y| 2014-04-10|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark RLIKE multiple values

In RLIKE , you can very easily specify multiple values to check for in the string. You just have to separate multiple values using a "|" delimiter. In the below example, I am using regex to identify only those records which starts either with Reebok or Nike.

syntax :: filter(col("product_title").rlike("^(Reebok|Nike)(.*)+$"))

df_shoes.filter(col("product_title").rlike("^(Reebok|Nike)(.*)+$")).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         US|Reebok Men's Oakl...|          5|          0|                Y| 2009-05-17|
|         US|Reebok Ziglite Ru...|          5|          1|                Y| 2014-04-09|
|         US|Nike Boy's Sunray...|          5|          0|                N| 2014-04-10|
|         US|Reebok San Franci...|          2|          0|                Y| 2009-05-21|
|         US|Nike Golf Women's...|          5|          1|                Y| 2014-04-10|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark RLIKE case insensitive

You can make RLIKE search case insensitive by adding "(?i)" to the search pattern. This will enable case-insensitive mode on. In the example below I have given "nike" in lowercase however I have added ?i to regex search and you can see output has Nike,NIKE both in the output.

syntax :: filter(col("product_title").rlike("(?i)^(nike(.*)+$"))

df_shoes.filter(col("product_title").rlike("(?i)^(nike)(.*)+$")).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         US|NIKE Air Pegasus+...|          5|          2|                N| 2009-05-18|
|         US|Nike Boy's Sunray...|          5|          0|                N| 2014-04-10|
|         US|Nike Golf Women's...|          5|          1|                Y| 2014-04-10|
|         US|Nike Jordan Men's...|          5|          1|                Y| 2014-04-10|
|         US|Nike Air Max 2014...|          5|          0|                Y| 2014-04-10|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark NOT RLIKE

Similarly, we can use negation (~) operator in front of RLIKE condition to make it NOT RLIKE. It will return all the rows which does not match the pattern mentioned in the RLIKE condition.

syntax :: filter(~col("product_title").rlike("^(Reebok)(.*)+$"))

df_shoes.filter(~col("product_title").rlike("^(Reebok)(.*)+$")).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         US|Dr. Scholl's Wome...|          1|          1|                Y| 2014-04-09|
|         US|Kenneth Cole REAC...|          4|          1|                Y| 2009-05-17|
|         US|Rockport Cobb Hil...|          4|          0|                Y| 2014-04-09|
|         US|Yak Pak Megu Hand...|          5|          0|                Y| 2009-05-17|
|         US|Crocs Kids' Handl...|          5|          0|                Y| 2014-04-09|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark LIKE vs RLIKE

If you refer to above mentioned examples of LIKE & RLIKE, you can clearly see the difference between the two. RLIKE supports regular expressions thereby making string lookups and searches very powerful. However if you don't have good command on regex then you may end up getting wrong results. Also regex is very heavily dependent on input data. So if input data is changed RLIKE may not give accurate data.

LIKE is simple and pretty straight forward. However you cannot have many different search patterns supported by LIKE. LIKE supports more of static value searches.

PySpark filter contains

PySpark contains filter condition is similar to LIKE where you check if the column value contains any give value in it or not. Basically you check if the sub-string exists in the string or not. PySpark "contain" function return true if the string is present in the given value else false.

syntax :: filter(col("product_title").contains('Nike'))

df_shoes.filter(col("product_title").contains('Nike')).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         US|Nike Boy's Sunray...|          5|          0|                N| 2014-04-10|
|         US|Nike Golf Women's...|          5|          1|                Y| 2014-04-10|
|         US|New Nike Rx Presc...|          4|          2|                N| 2014-04-10|
|         US|Nike Jordan Men's...|          5|          1|                Y| 2014-04-10|
|         US|Nike Air Max 2014...|          5|          0|                Y| 2014-04-10|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark filter not contains

We can use negation (~) operator in front of contains condition to make it NOT contains. It will return all the rows which does not match the pattern mentioned in the contain condition.

syntax :: filter(~col("product_title").contains('Nike'))

df_shoes.filter(~col("product_title").contains('Nike')).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         US|Reebok Men's Oakl...|          5|          0|                Y| 2009-05-17|
|         US|Dr. Scholl's Wome...|          1|          1|                Y| 2014-04-09|
|         US|Kenneth Cole REAC...|          4|          1|                Y| 2009-05-17|
|         US|Rockport Cobb Hil...|          4|          0|                Y| 2014-04-09|
|         US|Yak Pak Megu Hand...|          5|          0|                Y| 2009-05-17|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark Filter between

When we have to check if the value for any given column lies between range of value then we use BETWEEN function. Always give the range starting from MIN to MAX. Both the value which you pass i.e. MIN & MAX value are inclusive. So even if the value is equal to boundary value then also it is considered as pass.

In this example we have done filter on date between the given range by specifying MINIMUM & MAXIMUM value.

syntax :: filter(col("review_date").between('2014-01-01','2014-12-31'))

df_shoes.filter(col("review_date").between('2014-01-01','2014-12-31')).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         US|Dr. Scholl's Wome...|          1|          1|                Y| 2014-04-09|
|         US|Rockport Cobb Hil...|          4|          0|                Y| 2014-04-09|
|         US|Crocs Kids' Handl...|          5|          0|                Y| 2014-04-09|
|         US|Liebeskind Berlin...|          5|          3|                Y| 2014-04-09|
|         US|Clarks Men's Skyw...|          4|          7|                Y| 2014-04-09|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark Filter NOT between

We can use negation (~) operator in front of between condition to make it NOT between. It will return all the rows which does not fall in the range mentioned in the between condition.

syntax :: filter(~col("review_date").between('2014-01-01','2014-12-31'))

df_shoes.filter(~col("review_date").between('2014-01-01','2014-12-31')).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         US|Reebok Men's Oakl...|          5|          0|                Y| 2009-05-17|
|         US|Kenneth Cole REAC...|          4|          1|                Y| 2009-05-17|
|         US|Yak Pak Megu Hand...|          5|          0|                Y| 2009-05-17|
|         US|Teva Women's Mush...|          5|          9|                N| 2009-05-17|
|         US|Rachel Shoes Litt...|          5|          0|                N| 2009-05-17|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark filter isNull

PySpark isNull is used to identify the rows for which given column has NULL value. It returns true if the value is NULL else False.

syntax :: filter(col("review_date").isNull())

df_shoes.filter(col("review_date").isNull()).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+-------------+-----------+-----------+-----------------+-----------+
|marketplace|product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+-------------+-----------+-----------+-----------------+-----------+
+-----------+-------------+-----------+-----------+-----------------+-----------+

PySpark filter isNotNull

PySpark isNotNull is used to identify the rows for which given column has NOT NULL value. It returns true if the value is NOT NULL else False.

syntax :: filter(col("review_date").isNotNull())

df_shoes.filter(col("review_date").isNotNull()).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         US|Reebok Men's Oakl...|          5|          0|                Y| 2009-05-17|
|         US|Dr. Scholl's Wome...|          1|          1|                Y| 2014-04-09|
|         US|Kenneth Cole REAC...|          4|          1|                Y| 2009-05-17|
|         US|Rockport Cobb Hil...|          4|          0|                Y| 2014-04-09|
|         US|Yak Pak Megu Hand...|          5|          0|                Y| 2009-05-17|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark Filter multiple conditions

As mentioned earlier , we can merge multiple filter conditions in PySpark using AND or OR operators. Below set of example will show you how you can implement multiple where conditions in PySpark.

PySpark Filter multiple conditions using AND

If we want all the conditions to be true then we have to use AND operator. So when we have multiple filter conditions then we can use "&" operator which denotes AND to merge multiple conditions into single statement.

syntax :: filter(col("marketplace").isin('UK','FR') & col("review_date").isNotNull())

df_shoes.filter(col("marketplace").isin('UK','FR') & col("review_date").isNotNull()).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         FR|Crocs Jayna, Ball...|          4|          8|                Y| 2014-04-11|
|         UK|Crocs Professiona...|          5|          1|                Y| 2014-04-14|
|         UK|Crocs Professiona...|          4|          0|                Y| 2014-04-15|
|         UK|Crocs Unisex Blit...|          5|         24|                N| 2010-01-04|
|         UK|Crocs Unisex Baya...|          5|         34|                N| 2010-02-23|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

PySpark Filter multiple conditions using OR

If we want any one of the condition to be true then we have to use OR operator. So when we have multiple filter conditions then we can use "|" operator which denotes OR to merge multiple conditions into single statement.

syntax :: filter(col("marketplace").isin('UK','FR') | col("review_date").isNotNull())

df_shoes.filter(col("marketplace").isin('UK','FR') | col("review_date").isNotNull()).select("marketplace","product_title","star_rating","total_votes","verified_purchase","review_date").show(5)

+-----------+--------------------+-----------+-----------+-----------------+-----------+
|marketplace|       product_title|star_rating|total_votes|verified_purchase|review_date|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
|         US|Reebok Men's Oakl...|          5|          0|                Y| 2009-05-17|
|         US|Dr. Scholl's Wome...|          1|          1|                Y| 2014-04-09|
|         US|Kenneth Cole REAC...|          4|          1|                Y| 2009-05-17|
|         US|Rockport Cobb Hil...|          4|          0|                Y| 2014-04-09|
|         US|Yak Pak Megu Hand...|          5|          0|                Y| 2009-05-17|
+-----------+--------------------+-----------+-----------+-----------------+-----------+
only showing top 5 rows

This is all I wanted to cover in this post. It took lot of effort to write this one. Hope it helps. Let me know if you want me to cover anything more in PySpark Filter list and I will be happy to add more examples.

Leave a Reply

Your email address will not be published. Required fields are marked *