Spark Dataframe LIKE NOT LIKE RLIKE

LIKE condition is used in situation when you don’t know the exact value or you are looking for some specific pattern in the output. LIKE is similar as in SQL and can be used to specify any pattern in WHERE/FILTER or even in JOIN conditions.
Let’s see an example to find out all the president where name starts with James.

scala> df_pres.filter($"pres_name".like("James%")).select($"pres_name",$"pres_dob",$"pres_bs").show()
+-----------------+----------+--------------+
|        pres_name|  pres_dob|       pres_bs|
+-----------------+----------+--------------+
|    James Madison|1751-03-16|      Virginia|
|     James Monroe|1758-04-28|      Virginia|
|    James K. Polk|1795-11-02|North Carolina|
|   James Buchanan|1791-04-23|  Pennsylvania|
|James A. Garfield|1831-11-19|          Ohio|
+-----------------+----------+--------------+

If we want to use NOT LIKE then we will use negation in LIKE statement by adding “!” before columnname.

scala> df_pres.filter(!$"pres_name".like("James%")).select($"pres_name",$"pres_dob",$"pres_bs").show()
+--------------------+----------+--------------------+
|           pres_name|  pres_dob|             pres_bs|
+--------------------+----------+--------------------+
|   George Washington|1732-02-22|            Virginia|
|          John Adams|1735-10-30|       Massachusetts|
|    Thomas Jefferson|1743-04-13|            Virginia|
|   John Quincy Adams|1767-07-11|       Massachusetts|
|      Andrew Jackson|1767-03-15|South/North Carolina|
|    Martin Van Buren|1782-12-05|            New York|
|William Henry Har...|1773-02-09|            Virginia|
|          John Tyler|1790-03-29|            Virginia|
|      Zachary Taylor|1784-11-24|            Virginia|
|    Millard Fillmore|1800-01-07|            New York|
|     Franklin Pierce|1804-11-23|       New Hampshire|
|     Abraham Lincoln|1809-02-12|            Kentucky|
|      Andrew Johnson|1808-12-29|      North Carolina|
|    Ulysses S. Grant|1822-04-27|                Ohio|
| Rutherford B. Hayes|1822-10-04|                Ohio|
|   Chester A. Arthur|1829-10-05|             Vermont|
|    Grover Cleveland|1837-03-18|          New Jersey|
|   Benjamin Harrison|1833-08-20|                Ohio|
|    Grover Cleveland|1837-03-18|          New Jersey|
|    William McKinley|1843-01-29|                Ohio|
+--------------------+----------+--------------------+
only showing top 20 rows

If you want to mention several patterns then in place of LIKE, use RLIKE. RLIKE is regex like and can search for multiple patterns separated by a pipe symbol “|”. Let’s see an example where we want to fetch all president where name starts with either James or John.

scala> df_pres.filter($"pres_name".rlike("^(James|John)(.*)+$")).select($"pres_name",$"pres_dob",$"pres_bs").show()
+-----------------+----------+--------------+
|        pres_name|  pres_dob|       pres_bs|
+-----------------+----------+--------------+
|       John Adams|1735-10-30| Massachusetts|
|    James Madison|1751-03-16|      Virginia|
|     James Monroe|1758-04-28|      Virginia|
|John Quincy Adams|1767-07-11| Massachusetts|
|       John Tyler|1790-03-29|      Virginia|
|    James K. Polk|1795-11-02|North Carolina|
|   James Buchanan|1791-04-23|  Pennsylvania|
|James A. Garfield|1831-11-19|          Ohio|
|  John F. Kennedy|1917-05-29| Massachusetts|
+-----------------+----------+--------------+

If you want to use NOT LIKE and too with multiple patterns then use negation with RLIKE. Let’s see an example where we want to fetch all the president where name does NOT starts with James & John.

scala> df_pres.filter(!$"pres_name".rlike("^(James|John)(.*)+$")).select($"pres_name",$"pres_dob",$"pres_bs").show()
+--------------------+----------+--------------------+
|           pres_name|  pres_dob|             pres_bs|
+--------------------+----------+--------------------+
|   George Washington|1732-02-22|            Virginia|
|    Thomas Jefferson|1743-04-13|            Virginia|
|      Andrew Jackson|1767-03-15|South/North Carolina|
|    Martin Van Buren|1782-12-05|            New York|
|William Henry Har...|1773-02-09|            Virginia|
|      Zachary Taylor|1784-11-24|            Virginia|
|    Millard Fillmore|1800-01-07|            New York|
|     Franklin Pierce|1804-11-23|       New Hampshire|
|     Abraham Lincoln|1809-02-12|            Kentucky|
|      Andrew Johnson|1808-12-29|      North Carolina|
|    Ulysses S. Grant|1822-04-27|                Ohio|
| Rutherford B. Hayes|1822-10-04|                Ohio|
|   Chester A. Arthur|1829-10-05|             Vermont|
|    Grover Cleveland|1837-03-18|          New Jersey|
|   Benjamin Harrison|1833-08-20|                Ohio|
|    Grover Cleveland|1837-03-18|          New Jersey|
|    William McKinley|1843-01-29|                Ohio|
|  Theodore Roosevelt|1858-10-27|            New York|
| William Howard Taft|1857-09-15|                Ohio|
|      Woodrow Wilson|1856-12-28|            Virginia|
+--------------------+----------+--------------------+
only showing top 20 rows

So now you know how to use LIKE, NOT LIKE , RLIKE, NOT RLIKE. In the next post we will see how to use SQL CASE statement equivalent in Spark-SQL.

5 Replies to “Spark Dataframe LIKE NOT LIKE RLIKE”

  1. Hi, I am using spark 2.4 dataset. and the rlike is not working for me. Could you please suggest.
    Below is the snippet.

    Dataset getFull_Data3 =
    getFull_Data1.filter((getFull_Data1.col(!”EmailAddress”).rlike(” some pattern”)));

    1. Hi Samba
      Kindly confirm if “not working” means you are not getting proper result or are you getting any error. If yes, please share error description. And for not proper result, please share some sample pattern and values.

Leave a Reply

Your email address will not be published. Required fields are marked *