Spark Dataframe add multiple columns with value

You may need to add new columns in the existing SPARK dataframe as per the requirement. This new column can be initialized with a default value or you can assign some dynamic value to it depending on some logical conditions. Let’s see an example below to add 2 new columns with logical value and 1 column with default value.

scala> df_pres.select($"pres_name",$"pres_dob",$"pres_bs").show(false)
+----------------------+----------+--------------+
|pres_name             |pres_dob  |pres_bs       |
+----------------------+----------+--------------+
|George Washington     |1732-02-22|Virginia      |
|John Adams            |1735-10-30|Massachusetts |
|Thomas Jefferson      |1743-04-13|Virginia      |
|James Madison         |1751-03-16|Virginia      |
|James Monroe          |1758-04-28|Virginia      |
|John Quincy Adams     |1767-07-11|Massachusetts |
|Andrew Jackson        |1767-03-15|North Carolina|
|Martin Van Buren      |1782-12-05|New York      |
|William Henry Harrison|1773-02-09|Virginia      |
|John Tyler            |1790-03-29|Virginia      |
|James K. Polk         |1795-11-02|North Carolina|
|Zachary Taylor        |1784-11-24|Virginia      |
|Millard Fillmore      |1800-01-07|New York      |
|Franklin Pierce       |1804-11-23|New Hampshire |
|James Buchanan        |1791-04-23|Pennsylvania  |
|Abraham Lincoln       |1809-02-12|Kentucky      |
|Andrew Johnson        |1808-12-29|North Carolina|
|Ulysses S. Grant      |1822-04-27|Ohio          |
|Rutherford B. Hayes   |1822-10-04|Ohio          |
|James A. Garfield     |1831-11-19|Ohio          |
+----------------------+----------+--------------+
only showing top 20 rows

Let’s add 2 new columns to it. One for State Abbreviation and other for Century to which President was born. Also we will add 1 new column with default value using “lit” function.

scala> df_pres.select($"pres_name",$"pres_dob",$"pres_bs",when($"pres_bs"==="Virginia","VA").when($"pres_bs"==="Massachusetts","MA").when($"pres_bs"==="Ohio","OH").otherwise("Others").alias("state_abbr"), when($"pres_dob".between("1701-01-01","1800-12-31"),"18th Century").when($"pres_dob".between("1801-01-01","1900-12-31"),"19th Century").when($"pres_dob".between("1901-01-01","2000-12-31"),"20th Century").alias("Century"),lit("-1").alias("C3")).show(50)
+--------------------+----------+--------------+----------+------------+---+
|           pres_name|  pres_dob|       pres_bs|state_abbr|     Century| C3|
+--------------------+----------+--------------+----------+------------+---+
|   George Washington|1732-02-22|      Virginia|        VA|18th Century| -1|
|          John Adams|1735-10-30| Massachusetts|        MA|18th Century| -1|
|    Thomas Jefferson|1743-04-13|      Virginia|        VA|18th Century| -1|
|       James Madison|1751-03-16|      Virginia|        VA|18th Century| -1|
|        James Monroe|1758-04-28|      Virginia|        VA|18th Century| -1|
|   John Quincy Adams|1767-07-11| Massachusetts|        MA|18th Century| -1|
|      Andrew Jackson|1767-03-15|North Carolina|    Others|18th Century| -1|
|    Martin Van Buren|1782-12-05|      New York|    Others|18th Century| -1|
|William Henry Har...|1773-02-09|      Virginia|        VA|18th Century| -1|
|          John Tyler|1790-03-29|      Virginia|        VA|18th Century| -1|
|       James K. Polk|1795-11-02|North Carolina|    Others|18th Century| -1|
|      Zachary Taylor|1784-11-24|      Virginia|        VA|18th Century| -1|
|    Millard Fillmore|1800-01-07|      New York|    Others|18th Century| -1|
|     Franklin Pierce|1804-11-23| New Hampshire|    Others|19th Century| -1|
|      James Buchanan|1791-04-23|  Pennsylvania|    Others|18th Century| -1|
|     Abraham Lincoln|1809-02-12|      Kentucky|    Others|19th Century| -1|
|      Andrew Johnson|1808-12-29|North Carolina|    Others|19th Century| -1|
|    Ulysses S. Grant|1822-04-27|          Ohio|        OH|19th Century| -1|
| Rutherford B. Hayes|1822-10-04|          Ohio|        OH|19th Century| -1|
|   James A. Garfield|1831-11-19|          Ohio|        OH|19th Century| -1|
|   Chester A. Arthur|1829-10-05|       Vermont|    Others|19th Century| -1|
|    Grover Cleveland|1837-03-18|    New Jersey|    Others|19th Century| -1|
|   Benjamin Harrison|1833-08-20|          Ohio|        OH|19th Century| -1|
|    Grover Cleveland|1837-03-18|    New Jersey|    Others|19th Century| -1|
|    William McKinley|1843-01-29|          Ohio|        OH|19th Century| -1|
|  Theodore Roosevelt|1858-10-27|      New York|    Others|19th Century| -1|
| William Howard Taft|1857-09-15|          Ohio|        OH|19th Century| -1|
|      Woodrow Wilson|1856-12-28|      Virginia|        VA|19th Century| -1|
|   Warren G. Harding|1865-11-02|          Ohio|        OH|19th Century| -1|
|     Calvin Coolidge|1872-07-04|       Vermont|    Others|19th Century| -1|
|      Herbert Hoover|1874-08-10|          Iowa|    Others|19th Century| -1|
|Franklin D. Roose...|1882-01-30|      New York|    Others|19th Century| -1|
|     Harry S. Truman|1884-05-08|      Missouri|    Others|19th Century| -1|
|Dwight D. Eisenhower|1890-10-14|         Texas|    Others|19th Century| -1|
|     John F. Kennedy|1917-05-29| Massachusetts|        MA|20th Century| -1|
|   Lyndon B. Johnson|1908-08-27|         Texas|    Others|20th Century| -1|
|    Richard M. Nixon|1913-01-09|    California|    Others|20th Century| -1|
|      Gerald R. Ford|1913-07-14|      Nebraska|    Others|20th Century| -1|
|        Jimmy Carter|1924-10-01|       Georgia|    Others|20th Century| -1|
|       Ronald Reagan|1911-02-06|      Illinois|    Others|20th Century| -1|
|   George H. W. Bush|1924-06-12| Massachusetts|        MA|20th Century| -1|
|        Bill Clinton|1946-08-19|      Arkansas|    Others|20th Century| -1|
|      George W. Bush|1946-07-06|   Connecticut|    Others|20th Century| -1|
|        Barack Obama|1961-08-04|        Hawaii|    Others|20th Century| -1|
|        Donald Trump|1946-06-14|      New York|    Others|20th Century| -1|
+--------------------+----------+--------------+----------+------------+---+

We can also use withColumn method to add new columns in spark dataframe.

scala> df_pres.select($"pres_name",$"pres_dob",$"pres_bs").withColumn("state_abbr",when($"pres_bs"==="Virginia","VA").when($"pres_bs"==="Massachusetts","MA").when($"pres_bs"==="Ohio","OH").otherwise("Others")).show(false)
+----------------------+----------+--------------+----------+
|pres_name             |pres_dob  |pres_bs       |state_abbr|
+----------------------+----------+--------------+----------+
|George Washington     |1732-02-22|Virginia      |VA        |
|John Adams            |1735-10-30|Massachusetts |MA        |
|Thomas Jefferson      |1743-04-13|Virginia      |VA        |
|James Madison         |1751-03-16|Virginia      |VA        |
|James Monroe          |1758-04-28|Virginia      |VA        |
|John Quincy Adams     |1767-07-11|Massachusetts |MA        |
|Andrew Jackson        |1767-03-15|North Carolina|Others    |
|Martin Van Buren      |1782-12-05|New York      |Others    |
|William Henry Harrison|1773-02-09|Virginia      |VA        |
|John Tyler            |1790-03-29|Virginia      |VA        |
|James K. Polk         |1795-11-02|North Carolina|Others    |
|Zachary Taylor        |1784-11-24|Virginia      |VA        |
|Millard Fillmore      |1800-01-07|New York      |Others    |
|Franklin Pierce       |1804-11-23|New Hampshire |Others    |
|James Buchanan        |1791-04-23|Pennsylvania  |Others    |
|Abraham Lincoln       |1809-02-12|Kentucky      |Others    |
|Andrew Johnson        |1808-12-29|North Carolina|Others    |
|Ulysses S. Grant      |1822-04-27|Ohio          |OH        |
|Rutherford B. Hayes   |1822-10-04|Ohio          |OH        |
|James A. Garfield     |1831-11-19|Ohio          |OH        |
+----------------------+----------+--------------+----------+
only showing top 20 rows

In this post , we saw how to add new columns to spark dataframe. This new column can be with default value or some other values.

Leave a Reply