Apache Spark

Spark Dataframe Filter

As the name suggests, the Spark DataFrame FILTER function is used in Spark SQL to filter out records that meet a given condition. If you do not need the complete data set and only want to fetch the records that satisfy some condition, you can use the FILTER function. It is equivalent to the SQL “WHERE” clause and is commonly used in Spark SQL.

Let’s fetch all the presidents who were born in New York, i.e. filter the Spark DataFrame by a column value. In this example the column is the birth state (“pres_bs”) and the value is “New York”.

Spark Filter Condition on one column

scala> df_pres.filter($"pres_bs" === "New York")
  .select($"pres_name", $"pres_dob".alias("Date Of Birth"), $"pres_bs")
  .show()

+--------------------+-------------+--------+
|           pres_name|Date Of Birth| pres_bs|
+--------------------+-------------+--------+
|    Martin Van Buren|   1782-12-05|New York|
|    Millard Fillmore|   1800-01-07|New York|
|  Theodore Roosevelt|   1858-10-27|New York|
|Franklin D. Roose...|   1882-01-30|New York|
|        Donald Trump|   1946-06-14|New York|
+--------------------+-------------+--------+
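
Since FILTER is equivalent to the SQL “WHERE” clause, you can get the same result by registering the DataFrame as a temporary view and running plain SQL. A minimal sketch (the view name “presidents” is just an illustration):

scala> df_pres.createOrReplaceTempView("presidents")  // register the DataFrame as a SQL view

scala> spark.sql("""
  SELECT pres_name, pres_dob AS `Date Of Birth`, pres_bs
  FROM presidents
  WHERE pres_bs = 'New York'
  """).show()

This returns the same five rows as the filter version above.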

Spark multiple conditions in filter using OR (||)

scala> df_pres
  .filter($"pres_bs" === "New York" || $"pres_bs" === "Ohio")
  .select($"pres_name", $"pres_dob".alias("Date Of Birth"), $"pres_bs")
  .show()

+--------------------+-------------+--------+
|           pres_name|Date Of Birth| pres_bs|
+--------------------+-------------+--------+
|    Martin Van Buren|   1782-12-05|New York|
|    Millard Fillmore|   1800-01-07|New York|
|    Ulysses S. Grant|   1822-04-27|    Ohio|
| Rutherford B. Hayes|   1822-10-04|    Ohio|
|   James A. Garfield|   1831-11-19|    Ohio|
|   Benjamin Harrison|   1833-08-20|    Ohio|
|    William McKinley|   1843-01-29|    Ohio|
|  Theodore Roosevelt|   1858-10-27|New York|
| William Howard Taft|   1857-09-15|    Ohio|
|   Warren G. Harding|   1865-11-02|    Ohio|
|Franklin D. Roose...|   1882-01-30|New York|
|        Donald Trump|   1946-06-14|New York|
+--------------------+-------------+--------+

Spark Dataframe Multiple conditions in Filter using AND (&&)

scala> df_pres
  .filter($"pres_bs" === "New York" && $"pres_dob" > "1850-01-01")
  .select($"pres_name", $"pres_dob".alias("Date Of Birth"), $"pres_bs")
  .show()

+--------------------+-------------+--------+
|           pres_name|Date Of Birth| pres_bs|
+--------------------+-------------+--------+
|  Theodore Roosevelt|   1858-10-27|New York|
|Franklin D. Roose...|   1882-01-30|New York|
|        Donald Trump|   1946-06-14|New York|
+--------------------+-------------+--------+
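
In the example above, the comparison $"pres_dob" > "1850-01-01" relies on Spark implicitly casting the string literal. If you prefer to be explicit, you can build a proper date literal instead. A minimal sketch, assuming pres_dob is a DateType column:

scala> import org.apache.spark.sql.functions.{lit, to_date}

scala> df_pres
  .filter($"pres_bs" === "New York" && $"pres_dob" > to_date(lit("1850-01-01")))  // explicit date literal
  .select($"pres_name", $"pres_dob".alias("Date Of Birth"), $"pres_bs")
  .show()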

If required, you can also use ALIAS column names in the FILTER condition. This works because the filter is applied to the output of select, where the alias has already been defined.

Let’s check the example below:

scala> df_pres.select($"pres_id",
  $"pres_dob".alias("DOB"),
  $"pres_bs").filter($"DOB" > "1900-01-01").show()

+-------+----------+-------------+
|pres_id|       DOB|      pres_bs|
+-------+----------+-------------+
|     35|1917-05-29|Massachusetts|
|     36|1908-08-27|        Texas|
|     37|1913-01-09|   California|
|     38|1913-07-14|     Nebraska|
|     39|1924-10-01|      Georgia|
|     40|1911-02-06|     Illinois|
|     41|1924-06-12|Massachusetts|
|     42|1946-08-19|     Arkansas|
|     43|1946-07-06|  Connecticut|
|     44|1961-08-04|       Hawaii|
|     45|1946-06-14|     New York|
+-------+----------+-------------+

You can also use “WHERE” in place of “FILTER”; in the Dataset API, where is simply an alias for filter.

scala> df_pres.where($"pres_bs" === "New York")
  .select($"pres_name",
    $"pres_dob".alias("Date Of Birth"),
    $"pres_bs").show()

+--------------------+-------------+--------+
|           pres_name|Date Of Birth| pres_bs|
+--------------------+-------------+--------+
|    Martin Van Buren|   1782-12-05|New York|
|    Millard Fillmore|   1800-01-07|New York|
|  Theodore Roosevelt|   1858-10-27|New York|
|Franklin D. Roose...|   1882-01-30|New York|
|        Donald Trump|   1946-06-14|New York|
+--------------------+-------------+--------+

If you prefer not to build the condition from Column expressions, you can also write it as a single SQL expression string inside double quotes.

This is simply another way of writing the query above.

scala> df_pres.where("pres_bs = 'New York'")
  .select($"pres_name",
    $"pres_dob".alias("Date Of Birth"),
    $"pres_bs").show()
    
+--------------------+-------------+--------+
|           pres_name|Date Of Birth| pres_bs|
+--------------------+-------------+--------+
|    Martin Van Buren|   1782-12-05|New York|
|    Millard Fillmore|   1800-01-07|New York|
|  Theodore Roosevelt|   1858-10-27|New York|
|Franklin D. Roose...|   1882-01-30|New York|
|        Donald Trump|   1946-06-14|New York|
+--------------------+-------------+--------+

You can also specify multiple conditions in the WHERE string using the same approach:

scala> df_pres.where("pres_bs = 'New York' and pres_name='Martin Van Buren'")
  .select($"pres_name",
    $"pres_dob".alias("Date Of Birth"),
    $"pres_bs").show()
    
+----------------+-------------+--------+
|       pres_name|Date Of Birth| pres_bs|
+----------------+-------------+--------+
|Martin Van Buren|   1782-12-05|New York|
+----------------+-------------+--------+
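
Note that filter also accepts the same SQL expression strings as where, so the query above can equally be written as a filter call. A minimal sketch:

scala> df_pres
  .filter("pres_bs = 'New York' and pres_name = 'Martin Van Buren'")  // SQL-style string condition
  .select($"pres_name", $"pres_dob".alias("Date Of Birth"), $"pres_bs")
  .show()

This returns the same single row as the where version.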

If you wish to specify a NOT EQUAL TO condition, you can use the method below:

scala> df_pres.filter($"pres_bs" !== "New York")
  .select($"pres_name",
    $"pres_dob",
    $"pres_bs").show()
    
+--------------------+----------+--------------------+
|           pres_name|  pres_dob|             pres_bs|
+--------------------+----------+--------------------+
|   George Washington|1732-02-22|            Virginia|
|          John Adams|1735-10-30|       Massachusetts|
|    Thomas Jefferson|1743-04-13|            Virginia|
|       James Madison|1751-03-16|            Virginia|
|        James Monroe|1758-04-28|            Virginia|
|   John Quincy Adams|1767-07-11|       Massachusetts|
|      Andrew Jackson|1767-03-15|South/North Carolina|
|William Henry Har...|1773-02-09|            Virginia|
|          John Tyler|1790-03-29|            Virginia|
|       James K. Polk|1795-11-02|      North Carolina|
|      Zachary Taylor|1784-11-24|            Virginia|
|     Franklin Pierce|1804-11-23|       New Hampshire|
|      James Buchanan|1791-04-23|        Pennsylvania|
|     Abraham Lincoln|1809-02-12|            Kentucky|
|      Andrew Johnson|1808-12-29|      North Carolina|
|    Ulysses S. Grant|1822-04-27|                Ohio|
| Rutherford B. Hayes|1822-10-04|                Ohio|
|   James A. Garfield|1831-11-19|                Ohio|
|   Chester A. Arthur|1829-10-05|             Vermont|
|    Grover Cleveland|1837-03-18|          New Jersey|
+--------------------+----------+--------------------+
only showing top 20 rows
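
Note that since Spark 2.0 the !== operator is deprecated (it does not have the same precedence as ===); the recommended replacement is =!=. A minimal sketch of the same query with =!=:

scala> df_pres
  .filter($"pres_bs" =!= "New York")  // =!= replaces the deprecated !==
  .select($"pres_name", $"pres_dob", $"pres_bs")
  .show()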

If you prefer, you can also use the equalTo function for equality conditions; it is the method form of the === operator.

scala> df_pres.filter($"pres_bs".equalTo("New York"))
  .select($"pres_name",
    $"pres_dob",
    $"pres_bs").show()
                                         
+--------------------+----------+--------+
|           pres_name|  pres_dob| pres_bs|
+--------------------+----------+--------+
|    Martin Van Buren|1782-12-05|New York|
|    Millard Fillmore|1800-01-07|New York|
|  Theodore Roosevelt|1858-10-27|New York|
|Franklin D. Roose...|1882-01-30|New York|
|        Donald Trump|1946-06-14|New York|
+--------------------+----------+--------+

To check for a NOT EQUAL condition, you can use the notEqual function:

scala> df_pres.filter($"pres_bs".notEqual("New York"))
  .select($"pres_name",
    $"pres_dob",
    $"pres_bs").show()
                                          
+--------------------+----------+--------------------+
|           pres_name|  pres_dob|             pres_bs|
+--------------------+----------+--------------------+
|   George Washington|1732-02-22|            Virginia|
|          John Adams|1735-10-30|       Massachusetts|
|    Thomas Jefferson|1743-04-13|            Virginia|
|       James Madison|1751-03-16|            Virginia|
|        James Monroe|1758-04-28|            Virginia|
|   John Quincy Adams|1767-07-11|       Massachusetts|
|      Andrew Jackson|1767-03-15|South/North Carolina|
|William Henry Har...|1773-02-09|            Virginia|
|          John Tyler|1790-03-29|            Virginia|
|       James K. Polk|1795-11-02|      North Carolina|
|      Zachary Taylor|1784-11-24|            Virginia|
|     Franklin Pierce|1804-11-23|       New Hampshire|
|      James Buchanan|1791-04-23|        Pennsylvania|
|     Abraham Lincoln|1809-02-12|            Kentucky|
|      Andrew Johnson|1808-12-29|      North Carolina|
|    Ulysses S. Grant|1822-04-27|                Ohio|
| Rutherford B. Hayes|1822-10-04|                Ohio|
|   James A. Garfield|1831-11-19|                Ohio|
|   Chester A. Arthur|1829-10-05|             Vermont|
|    Grover Cleveland|1837-03-18|          New Jersey|
+--------------------+----------+--------------------+
only showing top 20 rows

From a performance perspective, it is highly recommended to apply FILTER as early as possible so that subsequent operations handle a smaller volume of data, as sketched below. In the next post, we will see how to specify IN and NOT IN conditions in FILTER.
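
As an illustration, the sketch below filters before a join rather than after it, so the join only processes the reduced set of rows. Here df_terms is a hypothetical DataFrame with a pres_id column, used purely for illustration (Catalyst can often push filters down automatically, but filtering early keeps the intent clear):

scala> val ny_pres = df_pres.filter($"pres_bs" === "New York")  // reduce the data set first

scala> ny_pres
  .join(df_terms, Seq("pres_id"))  // df_terms is hypothetical; the join now only sees New York presidents
  .select($"pres_name", $"pres_bs")
  .show()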


Looking for PySpark filter examples? Check this post.
