LIKE condition is used in situation when you don’t know the exact value or you are looking for some specific word pattern in the output. LIKE is similar as in SQL and can be used to specify any pattern in WHERE/FILTER or even in JOIN conditions.
Spark LIKE
Let’s see an example to find out all the president where name starts with James.
scala> df_pres.filter($"pres_name".like("James%")).select($"pres_name",$"pres_dob",$"pres_bs").show() +-----------------+----------+--------------+ | pres_name| pres_dob| pres_bs| +-----------------+----------+--------------+ | James Madison|1751-03-16| Virginia| | James Monroe|1758-04-28| Virginia| | James K. Polk|1795-11-02|North Carolina| | James Buchanan|1791-04-23| Pennsylvania| |James A. Garfield|1831-11-19| Ohio| +-----------------+----------+--------------+
Spark NOT LIKE
If we want to use NOT LIKE then we will use negation in LIKE statement by adding “!” before columnname.
scala> df_pres.filter(!$"pres_name".like("James%")).select($"pres_name",$"pres_dob",$"pres_bs").show() +--------------------+----------+--------------------+ | pres_name| pres_dob| pres_bs| +--------------------+----------+--------------------+ | George Washington|1732-02-22| Virginia| | John Adams|1735-10-30| Massachusetts| | Thomas Jefferson|1743-04-13| Virginia| | John Quincy Adams|1767-07-11| Massachusetts| | Andrew Jackson|1767-03-15|South/North Carolina| | Martin Van Buren|1782-12-05| New York| |William Henry Har...|1773-02-09| Virginia| | John Tyler|1790-03-29| Virginia| | Zachary Taylor|1784-11-24| Virginia| | Millard Fillmore|1800-01-07| New York| | Franklin Pierce|1804-11-23| New Hampshire| | Abraham Lincoln|1809-02-12| Kentucky| | Andrew Johnson|1808-12-29| North Carolina| | Ulysses S. Grant|1822-04-27| Ohio| | Rutherford B. Hayes|1822-10-04| Ohio| | Chester A. Arthur|1829-10-05| Vermont| | Grover Cleveland|1837-03-18| New Jersey| | Benjamin Harrison|1833-08-20| Ohio| | Grover Cleveland|1837-03-18| New Jersey| | William McKinley|1843-01-29| Ohio| +--------------------+----------+--------------------+ only showing top 20 rows
Spark RLIKE
If you want to mention several patterns then in place of LIKE, use RLIKE. RLIKE is regex like and can search for multiple patterns separated by a pipe symbol “|”. Let’s see an example where we want to fetch all president where name starts with either James or John.
scala> df_pres.filter($"pres_name".rlike("^(James|John)(.*)+$")).select($"pres_name",$"pres_dob",$"pres_bs").show() +-----------------+----------+--------------+ | pres_name| pres_dob| pres_bs| +-----------------+----------+--------------+ | John Adams|1735-10-30| Massachusetts| | James Madison|1751-03-16| Virginia| | James Monroe|1758-04-28| Virginia| |John Quincy Adams|1767-07-11| Massachusetts| | John Tyler|1790-03-29| Virginia| | James K. Polk|1795-11-02|North Carolina| | James Buchanan|1791-04-23| Pennsylvania| |James A. Garfield|1831-11-19| Ohio| | John F. Kennedy|1917-05-29| Massachusetts| +-----------------+----------+--------------+
Spark NOT RLIKE
If you want to use NOT LIKE and too with multiple patterns then use negation with RLIKE. Let’s see an example where we want to fetch all the president where name does NOT starts with James & John.
scala> df_pres.filter(!$"pres_name".rlike("^(James|John)(.*)+$")).select($"pres_name",$"pres_dob",$"pres_bs").show() +--------------------+----------+--------------------+ | pres_name| pres_dob| pres_bs| +--------------------+----------+--------------------+ | George Washington|1732-02-22| Virginia| | Thomas Jefferson|1743-04-13| Virginia| | Andrew Jackson|1767-03-15|South/North Carolina| | Martin Van Buren|1782-12-05| New York| |William Henry Har...|1773-02-09| Virginia| | Zachary Taylor|1784-11-24| Virginia| | Millard Fillmore|1800-01-07| New York| | Franklin Pierce|1804-11-23| New Hampshire| | Abraham Lincoln|1809-02-12| Kentucky| | Andrew Johnson|1808-12-29| North Carolina| | Ulysses S. Grant|1822-04-27| Ohio| | Rutherford B. Hayes|1822-10-04| Ohio| | Chester A. Arthur|1829-10-05| Vermont| | Grover Cleveland|1837-03-18| New Jersey| | Benjamin Harrison|1833-08-20| Ohio| | Grover Cleveland|1837-03-18| New Jersey| | William McKinley|1843-01-29| Ohio| | Theodore Roosevelt|1858-10-27| New York| | William Howard Taft|1857-09-15| Ohio| | Woodrow Wilson|1856-12-28| Virginia| +--------------------+----------+--------------------+ only showing top 20 rows
So now you know how to use LIKE, NOT LIKE , RLIKE, NOT RLIKE. In the next post we will see how to use SQL CASE statement equivalent in Spark-SQL.
Hi, I am using spark 2.4 dataset. and the rlike is not working for me. Could you please suggest.
Below is the snippet.
Dataset getFull_Data3 =
getFull_Data1.filter((getFull_Data1.col(!”EmailAddress”).rlike(” some pattern”)));
Hi Samba
Kindly confirm if “not working” means you are not getting proper result or are you getting any error. If yes, please share error description. And for not proper result, please share some sample pattern and values.
val likeResultDSL = newTxnsDF.filter(!col(“category”).rlike(“%Sports%”))
println(“\n============= LIKE OPERATION WITH DSL WAY =============”)
likeResultDSL.show(5)
This is really helpful. Thanks
Happy to help.
In Python:
Use (~(col(“some_field”))) for negation.
Thanks for sharing this. From this we can find out if String exists or not. Can you tell us how to replace string if it exists ?