Apache Spark

Spark Dataframe orderBy Sort

SORT is used to order resultset on the basis of values for any selected column. The syntax is to use sort function with column name inside it. We can also specify asending or descending order for sorting, default is ascending. In our dataframe, if we want to order the resultset on the basis of the state in which President was born then we will use below query:

scala> df_pres.select($"pres_id",$"pres_dob",$"pres_bs").sort($"pres_bs".asc).show()
+-------+----------+-------------+
|pres_id|  pres_dob|      pres_bs|
+-------+----------+-------------+
|     42|1946-08-19|     Arkansas|
|     37|1913-01-09|   California|
|     43|1946-07-06|  Connecticut|
|     39|1924-10-01|      Georgia|
|     44|1961-08-04|       Hawaii|
|     40|1911-02-06|     Illinois|
|     31|1874-08-10|         Iowa|
|     16|1809-02-12|     Kentucky|
|     41|1924-06-12|Massachusetts|
|      6|1767-07-11|Massachusetts|
|      2|1735-10-30|Massachusetts|
|     35|1917-05-29|Massachusetts|
|     33|1884-05-08|     Missouri|
|     38|1913-07-14|     Nebraska|
|     14|1804-11-23|New Hampshire|
|     22|1837-03-18|   New Jersey|
|     24|1837-03-18|   New Jersey|
|     32|1882-01-30|     New York|
|     26|1858-10-27|     New York|
|     13|1800-01-07|     New York|
+-------+----------+-------------+
only showing top 20 rows

If we want to SORT in descending order then we will use below query:

scala> df_pres.select($"pres_id",$"pres_dob",$"pres_bs").sort($"pres_bs".desc).show()
+-------+----------+--------------------+
|pres_id|  pres_dob|             pres_bs|
+-------+----------+--------------------+
|      1|1732-02-22|            Virginia|
|     10|1790-03-29|            Virginia|
|      4|1751-03-16|            Virginia|
|     12|1784-11-24|            Virginia|
|     28|1856-12-28|            Virginia|
|      3|1743-04-13|            Virginia|
|      5|1758-04-28|            Virginia|
|      9|1773-02-09|            Virginia|
|     21|1829-10-05|             Vermont|
|     30|1872-07-04|             Vermont|
|     36|1908-08-27|               Texas|
|     34|1890-10-14|               Texas|
|      7|1767-03-15|South/North Carolina|
|     15|1791-04-23|        Pennsylvania|
|     18|1822-04-27|                Ohio|
|     19|1822-10-04|                Ohio|
|     20|1831-11-19|                Ohio|
|     27|1857-09-15|                Ohio|
|     23|1833-08-20|                Ohio|
|     29|1865-11-02|                Ohio|
+-------+----------+--------------------+
only showing top 20 rows

If you want to specify SORTing on the basis of multiple columns then use below query:

scala> df_pres.select($"pres_id",$"pres_dob",$"pres_bs").sort($"pres_bs".desc,$"pres_dob".asc).show()
+-------+----------+--------------------+
|pres_id|  pres_dob|             pres_bs|
+-------+----------+--------------------+
|      1|1732-02-22|            Virginia|
|      3|1743-04-13|            Virginia|
|      4|1751-03-16|            Virginia|
|      5|1758-04-28|            Virginia|
|      9|1773-02-09|            Virginia|
|     12|1784-11-24|            Virginia|
|     10|1790-03-29|            Virginia|
|     28|1856-12-28|            Virginia|
|     21|1829-10-05|             Vermont|
|     30|1872-07-04|             Vermont|
|     34|1890-10-14|               Texas|
|     36|1908-08-27|               Texas|
|      7|1767-03-15|South/North Carolina|
|     15|1791-04-23|        Pennsylvania|
|     18|1822-04-27|                Ohio|
|     19|1822-10-04|                Ohio|
|     20|1831-11-19|                Ohio|
|     23|1833-08-20|                Ohio|
|     25|1843-01-29|                Ohio|
|     27|1857-09-15|                Ohio|
+-------+----------+--------------------+
only showing top 20 rows

You can also sort the result set on the basis of derived columns. For this example we will refer to previous post and will apply sort to the derived column.

scala>df_pres.select($"pres_name",$"pres_dob",$"pres_bs",when($"pres_bs"==="Virginia","VA").when($"pres_bs"==="Massachusetts","MA").when($"pres_bs"==="Ohio","OH").otherwise("Others").alias("state_abbr")).sort($"state_abbr".desc).show()
+--------------------+----------+--------------------+----------+
|           pres_name|  pres_dob|             pres_bs|state_abbr|
+--------------------+----------+--------------------+----------+
|   George Washington|1732-02-22|            Virginia|        VA|
|          John Tyler|1790-03-29|            Virginia|        VA|
|       James Madison|1751-03-16|            Virginia|        VA|
|      Zachary Taylor|1784-11-24|            Virginia|        VA|
|      Woodrow Wilson|1856-12-28|            Virginia|        VA|
|    Thomas Jefferson|1743-04-13|            Virginia|        VA|
|        James Monroe|1758-04-28|            Virginia|        VA|
|William Henry Har...|1773-02-09|            Virginia|        VA|
|  Theodore Roosevelt|1858-10-27|            New York|    Others|
|    Grover Cleveland|1837-03-18|          New Jersey|    Others|
|     Calvin Coolidge|1872-07-04|             Vermont|    Others|
|Franklin D. Roose...|1882-01-30|            New York|    Others|
|    Martin Van Buren|1782-12-05|            New York|    Others|
|      Andrew Jackson|1767-03-15|South/North Carolina|    Others|
|      Herbert Hoover|1874-08-10|                Iowa|    Others|
|     Abraham Lincoln|1809-02-12|            Kentucky|    Others|
|     Harry S. Truman|1884-05-08|            Missouri|    Others|
|Dwight D. Eisenhower|1890-10-14|               Texas|    Others|
|     Franklin Pierce|1804-11-23|       New Hampshire|    Others|
|   Chester A. Arthur|1829-10-05|             Vermont|    Others|
+--------------------+----------+--------------------+----------+
only showing top 20 rows

You can also apply sorting to the expression in SORT function. Let’s see the example below:

scala>df_pres.select($"pres_name",$"pres_dob",$"pres_bs").sort(when($"pres_bs"==="Virginia","VA").when($"pres_bs"==="Massachusetts","MA").when($"pres_bs"==="Ohio","OH").otherwise("Others").desc).show()
+--------------------+----------+--------------------+
|           pres_name|  pres_dob|             pres_bs|
+--------------------+----------+--------------------+
|   George Washington|1732-02-22|            Virginia|
|          John Tyler|1790-03-29|            Virginia|
|       James Madison|1751-03-16|            Virginia|
|      Zachary Taylor|1784-11-24|            Virginia|
|      Woodrow Wilson|1856-12-28|            Virginia|
|    Thomas Jefferson|1743-04-13|            Virginia|
|        James Monroe|1758-04-28|            Virginia|
|William Henry Har...|1773-02-09|            Virginia|
|  Theodore Roosevelt|1858-10-27|            New York|
|    Grover Cleveland|1837-03-18|          New Jersey|
|     Calvin Coolidge|1872-07-04|             Vermont|
|Franklin D. Roose...|1882-01-30|            New York|
|    Martin Van Buren|1782-12-05|            New York|
|      Andrew Jackson|1767-03-15|South/North Carolina|
|      Herbert Hoover|1874-08-10|                Iowa|
|     Abraham Lincoln|1809-02-12|            Kentucky|
|     Harry S. Truman|1884-05-08|            Missouri|
|Dwight D. Eisenhower|1890-10-14|               Texas|
|     Franklin Pierce|1804-11-23|       New Hampshire|
|   Chester A. Arthur|1829-10-05|             Vermont|
+--------------------+----------+--------------------+
only showing top 20 rows

In the above example, we have created an expression using WHEN and then applied sorting to it. In the resultset you can see that all states starting with “V” are not at top and the sorting is done on the basis of expression only.
In place of “sort” you can also use “orderBy” function too.

scala> df_pres.select($"pres_id",$"pres_dob",$"pres_bs").orderBy($"pres_bs".asc).show()
+-------+----------+-------------+
|pres_id|  pres_dob|      pres_bs|
+-------+----------+-------------+
|     42|1946-08-19|     Arkansas|
|     37|1913-01-09|   California|
|     43|1946-07-06|  Connecticut|
|     39|1924-10-01|      Georgia|
|     44|1961-08-04|       Hawaii|
|     40|1911-02-06|     Illinois|
|     31|1874-08-10|         Iowa|
|     16|1809-02-12|     Kentucky|
|     41|1924-06-12|Massachusetts|
|      6|1767-07-11|Massachusetts|
|      2|1735-10-30|Massachusetts|
|     35|1917-05-29|Massachusetts|
|     33|1884-05-08|     Missouri|
|     38|1913-07-14|     Nebraska|
|     14|1804-11-23|New Hampshire|
|     22|1837-03-18|   New Jersey|
|     24|1837-03-18|   New Jersey|
|     32|1882-01-30|     New York|
|     26|1858-10-27|     New York|
|     13|1800-01-07|     New York|
+-------+----------+-------------+
only showing top 20 rows

Leave a Reply