Spark Dataframe – Distinct or Drop Duplicates

DISTINCT or dropDuplicates is used to remove duplicate rows from a Dataframe. A row consists of columns, so if you select only one column first, the output will be the unique values of that specific column. DISTINCT is very commonly used to find the possible values that exist in a Dataframe for a given column.
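Before the full example below, here is a minimal sketch of the difference between deduplicating whole rows and deduplicating a single column. The sample rows and the pres_name column are invented for illustration; in spark-shell the implicits needed for toDF and $ are already available.

scala> val sample = Seq(("George Washington", "Virginia"), ("Thomas Jefferson", "Virginia"), ("John Adams", "Massachusetts")).toDF("pres_name", "pres_bs")
scala> sample.distinct().count()                     // 3: every full row is unique
scala> sample.select($"pres_bs").distinct().count()  // 2: Virginia and Massachusetts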
Example:

scala> df_pres.select($"pres_bs").show(45)
+--------------------+
|             pres_bs|
+--------------------+
|            Virginia|
|       Massachusetts|
|            Virginia|
|            Virginia|
|            Virginia|
|       Massachusetts|
|South/North Carolina|
|            New York|
|            Virginia|
|            Virginia|
|      North Carolina|
|            Virginia|
|            New York|
|       New Hampshire|
|        Pennsylvania|
|            Kentucky|
|      North Carolina|
|                Ohio|
|                Ohio|
|                Ohio|
|             Vermont|
|          New Jersey|
|                Ohio|
|          New Jersey|
|                Ohio|
|            New York|
|                Ohio|
|            Virginia|
|                Ohio|
|             Vermont|
|                Iowa|
|            New York|
|            Missouri|
|               Texas|
|       Massachusetts|
|               Texas|
|          California|
|            Nebraska|
|             Georgia|
|            Illinois|
|       Massachusetts|
|            Arkansas|
|         Connecticut|
|              Hawaii|
|            New York|
+--------------------+

You can see that the output has many duplicate values. If you want only the unique values, use the distinct function.

scala> df_pres.select($"pres_bs").distinct().show(45)
+--------------------+
|             pres_bs|
+--------------------+
|            Virginia|
|       Massachusetts|
|South/North Carolina|
|            New York|
|      North Carolina|
|       New Hampshire|
|        Pennsylvania|
|            Kentucky|
|                Ohio|
|             Vermont|
|          New Jersey|
|                Iowa|
|            Missouri|
|               Texas|
|          California|
|            Nebraska|
|             Georgia|
|            Illinois|
|            Arkansas|
|         Connecticut|
|              Hawaii|
+--------------------+
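If you only need to know how many unique values there are rather than the values themselves, you can chain count after distinct. A quick sketch against the same df_pres dataframe:

scala> df_pres.select($"pres_bs").distinct().count()  // 21 for the data shown above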

You can also use dropDuplicates to get the same unique values.

scala> df_pres.select($"pres_bs").dropDuplicates().show(45)
+--------------------+
|             pres_bs|
+--------------------+
|            Virginia|
|       Massachusetts|
|South/North Carolina|
|            New York|
|      North Carolina|
|       New Hampshire|
|        Pennsylvania|
|            Kentucky|
|                Ohio|
|             Vermont|
|          New Jersey|
|                Iowa|
|            Missouri|
|               Texas|
|          California|
|            Nebraska|
|             Georgia|
|            Illinois|
|            Arkansas|
|         Connecticut|
|              Hawaii|
+--------------------+
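When you deduplicate whole rows, as above, distinct and dropDuplicates give the same result. The difference is that dropDuplicates can also take a subset of column names, keeping one full row (with all of its columns) for each distinct combination of those columns. A brief sketch; note that which row survives for each birth state is not guaranteed:

scala> df_pres.dropDuplicates("pres_bs").show(45)  // one full row per distinct pres_bs, all columns retained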
