As the name suggests, FILTER is used in Spark SQL to filter out records as per the requirement. If you do not want complete data set and just wish to fetch few records which satisfy some condition then you can use FILTER function. It is equivalent to SQL “WHERE” clause and is more commonly used in Spark-SQL.

Let’s fetch all the presidents who were born in New York.
scala> df_pres.filter($"pres_bs" === "New York").select($"pres_name",$"pres_dob".alias("Date Of Birth"),$"pres_bs").show()
+--------------------+-------------+--------+
|           pres_name|Date Of Birth| pres_bs|
+--------------------+-------------+--------+
|    Martin Van Buren|   1782-12-05|New York|
|    Millard Fillmore|   1800-01-07|New York|
|  Theodore Roosevelt|   1858-10-27|New York|
|Franklin D. Roose...|   1882-01-30|New York|
|        Donald Trump|   1946-06-14|New York|
+--------------------+-------------+--------+
You can specify multiple conditions in filter using OR (||) or AND (&&)
scala> df_pres.filter($"pres_bs" === "New York" || $"pres_bs" === "Ohio").select($"pres_name",$"pres_dob".alias("Date Of Birth"),$"pres_bs").show()
+--------------------+-------------+--------+
|           pres_name|Date Of Birth| pres_bs|
+--------------------+-------------+--------+
|    Martin Van Buren|   1782-12-05|New York|
|    Millard Fillmore|   1800-01-07|New York|
|    Ulysses S. Grant|   1822-04-27|    Ohio|
| Rutherford B. Hayes|   1822-10-04|    Ohio|
|   James A. Garfield|   1831-11-19|    Ohio|
|   Benjamin Harrison|   1833-08-20|    Ohio|
|    William McKinley|   1843-01-29|    Ohio|
|  Theodore Roosevelt|   1858-10-27|New York|
| William Howard Taft|   1857-09-15|    Ohio|
|   Warren G. Harding|   1865-11-02|    Ohio|
|Franklin D. Roose...|   1882-01-30|New York|
|        Donald Trump|   1946-06-14|New York|
+--------------------+-------------+--------+
Multiple conditions in Filter using AND (&&)
scala> df_pres.filter($"pres_bs" === "New York" && $"pres_dob"> "1850-01-01").select($"pres_name",$"pres_dob".alias("Date Of Birth"),$"pres_bs").show()
+--------------------+-------------+--------+
|           pres_name|Date Of Birth| pres_bs|
+--------------------+-------------+--------+
|  Theodore Roosevelt|   1858-10-27|New York|
|Franklin D. Roose...|   1882-01-30|New York|
|        Donald Trump|   1946-06-14|New York|
+--------------------+-------------+--------+
If required, you can use ALIAS column names too in FILTER condition.

Let’s check the example below:

scala> df_pres.select($"pres_id",$"pres_dob".alias("DOB"),$"pres_bs").filter($"DOB">"1900-01-01").show()
+-------+----------+-------------+
|pres_id|       DOB|      pres_bs|
+-------+----------+-------------+
|     35|1917-05-29|Massachusetts|
|     36|1908-08-27|        Texas|
|     37|1913-01-09|   California|
|     38|1913-07-14|     Nebraska|
|     39|1924-10-01|      Georgia|
|     40|1911-02-06|     Illinois|
|     41|1924-06-12|Massachusetts|
|     42|1946-08-19|     Arkansas|
|     43|1946-07-06|  Connecticut|
|     44|1961-08-04|       Hawaii|
|     45|1946-06-14|     New York|
+-------+----------+-------------+
You can also use “WHERE” in place of “FILTER”
df_pres.where($"pres_bs" === "New York").select($"pres_name",$"pres_dob".alias("Date Of Birth"),$"pres_bs").show()
+--------------------+-------------+--------+
|           pres_name|Date Of Birth| pres_bs|
+--------------------+-------------+--------+
|    Martin Van Buren|   1782-12-05|New York|
|    Millard Fillmore|   1800-01-07|New York|
|  Theodore Roosevelt|   1858-10-27|New York|
|Franklin D. Roose...|   1882-01-30|New York|
|        Donald Trump|   1946-06-14|New York|
+--------------------+-------------+--------+

Also if you don’t like the manner in which we write where condition then you can also write it as single block inside double quotes.

Basically another way of writing above query.
df_pres.where("pres_bs = 'New York'").select($"pres_name",$"pres_dob".alias("Date Of Birth"),$"pres_bs").show()
+--------------------+-------------+--------+
|           pres_name|Date Of Birth| pres_bs|
+--------------------+-------------+--------+
|    Martin Van Buren|   1782-12-05|New York|
|    Millard Fillmore|   1800-01-07|New York|
|  Theodore Roosevelt|   1858-10-27|New York|
|Franklin D. Roose...|   1882-01-30|New York|
|        Donald Trump|   1946-06-14|New York|
+--------------------+-------------+--------+

You can also specify multiple conditions using this coding practice
df_pres.where("pres_bs = 'New York' and pres_name='Martin Van Buren'").select($"pres_name",$"pres_dob".alias("Date Of Birth"),$"pres_bs").show()
+----------------+-------------+--------+
|       pres_name|Date Of Birth| pres_bs|
+----------------+-------------+--------+
|Martin Van Buren|   1782-12-05|New York|
+----------------+-------------+--------+

If you wish to specify NOT EQUAL TOcondition then you can use below method:

scala> df_pres.filter($"pres_bs" !== "New York").select($"pres_name",$"pres_dob",$"pres_bs").show(45)
+--------------------+----------+--------------------+
|           pres_name|  pres_dob|             pres_bs|
+--------------------+----------+--------------------+
|   George Washington|1732-02-22|            Virginia|
|          John Adams|1735-10-30|       Massachusetts|
|    Thomas Jefferson|1743-04-13|            Virginia|
|       James Madison|1751-03-16|            Virginia|
|        James Monroe|1758-04-28|            Virginia|
|   John Quincy Adams|1767-07-11|       Massachusetts|
|      Andrew Jackson|1767-03-15|South/North Carolina|
|William Henry Har...|1773-02-09|            Virginia|
|          John Tyler|1790-03-29|            Virginia|
|       James K. Polk|1795-11-02|      North Carolina|
|      Zachary Taylor|1784-11-24|            Virginia|
|     Franklin Pierce|1804-11-23|       New Hampshire|
|      James Buchanan|1791-04-23|        Pennsylvania|
|     Abraham Lincoln|1809-02-12|            Kentucky|
|      Andrew Johnson|1808-12-29|      North Carolina|
|    Ulysses S. Grant|1822-04-27|                Ohio|
| Rutherford B. Hayes|1822-10-04|                Ohio|
|   James A. Garfield|1831-11-19|                Ohio|
|   Chester A. Arthur|1829-10-05|             Vermont|
|    Grover Cleveland|1837-03-18|          New Jersey|
+--------------------+----------+--------------------+
only showing top 20 rows

If you want you can use equalTo function too for equal conditions:

scala> df_pres.filter($"pres_bs".equalTo("New York")).select($"pres_name",$"pres_dob",$"pres_bs").show()
+--------------------+----------+--------+
|           pres_name|  pres_dob| pres_bs|
+--------------------+----------+--------+
|    Martin Van Buren|1782-12-05|New York|
|    Millard Fillmore|1800-01-07|New York|
|  Theodore Roosevelt|1858-10-27|New York|
|Franklin D. Roose...|1882-01-30|New York|
|        Donald Trump|1946-06-14|New York|
+--------------------+----------+--------+

If you want to check for NOT EQUAL condition then you can use notEqual function

scala> df_pres.filter($"pres_bs".notEqual("New York")).select($"pres_name",$"pres_dob",$"pres_bs").show()
+--------------------+----------+--------------------+
|           pres_name|  pres_dob|             pres_bs|
+--------------------+----------+--------------------+
|   George Washington|1732-02-22|            Virginia|
|          John Adams|1735-10-30|       Massachusetts|
|    Thomas Jefferson|1743-04-13|            Virginia|
|       James Madison|1751-03-16|            Virginia|
|        James Monroe|1758-04-28|            Virginia|
|   John Quincy Adams|1767-07-11|       Massachusetts|
|      Andrew Jackson|1767-03-15|South/North Carolina|
|William Henry Har...|1773-02-09|            Virginia|
|          John Tyler|1790-03-29|            Virginia|
|       James K. Polk|1795-11-02|      North Carolina|
|      Zachary Taylor|1784-11-24|            Virginia|
|     Franklin Pierce|1804-11-23|       New Hampshire|
|      James Buchanan|1791-04-23|        Pennsylvania|
|     Abraham Lincoln|1809-02-12|            Kentucky|
|      Andrew Johnson|1808-12-29|      North Carolina|
|    Ulysses S. Grant|1822-04-27|                Ohio|
| Rutherford B. Hayes|1822-10-04|                Ohio|
|   James A. Garfield|1831-11-19|                Ohio|
|   Chester A. Arthur|1829-10-05|             Vermont|
|    Grover Cleveland|1837-03-18|          New Jersey|
+--------------------+----------+--------------------+
only showing top 20 rows

From performance perspective, it is highly recommended to use FILTER at the beginning so that subsequent operations handle less volume of data. In the next post, we will see how to specify IN or NOT IN conditions in FILTER.

See how Spark Dataframe FILTER/WHERE works:

Leave a Reply

Your email address will not be published. Required fields are marked *