Apache Spark

Spark Dataframe add multiple columns with value

You may need to add new columns in the existing SPARK dataframe as per the requirement. This new column can be initialized with a default value or you can assign some dynamic value to it depending on some logical conditions. Let's see an example below to add 2 new columns with logical value and 1 column with default value.

scala> df_pres.select($"pres_name",$"pres_dob",$"pres_bs").show(false)
|pres_name             |pres_dob  |pres_bs       |
|George Washington     |1732-02-22|Virginia      |
|John Adams            |1735-10-30|Massachusetts |
|Thomas Jefferson      |1743-04-13|Virginia      |
|James Madison         |1751-03-16|Virginia      |
|James Monroe          |1758-04-28|Virginia      |
|John Quincy Adams     |1767-07-11|Massachusetts |
|Andrew Jackson        |1767-03-15|North Carolina|
|Martin Van Buren      |1782-12-05|New York      |
|William Henry Harrison|1773-02-09|Virginia      |
|John Tyler            |1790-03-29|Virginia      |
|James K. Polk         |1795-11-02|North Carolina|
|Zachary Taylor        |1784-11-24|Virginia      |
|Millard Fillmore      |1800-01-07|New York      |
|Franklin Pierce       |1804-11-23|New Hampshire |
|James Buchanan        |1791-04-23|Pennsylvania  |
|Abraham Lincoln       |1809-02-12|Kentucky      |
|Andrew Johnson        |1808-12-29|North Carolina|
|Ulysses S. Grant      |1822-04-27|Ohio          |
|Rutherford B. Hayes   |1822-10-04|Ohio          |
|James A. Garfield     |1831-11-19|Ohio          |
only showing top 20 rows

Let's add 2 new columns to it. One for State Abbreviation and other for Century to which President was born. Also we will add 1 new column with default value using "lit" function.

scala> df_pres.select($"pres_name",$"pres_dob",$"pres_bs",when($"pres_bs"==="Virginia","VA").when($"pres_bs"==="Massachusetts","MA").when($"pres_bs"==="Ohio","OH").otherwise("Others").alias("state_abbr"), when($"pres_dob".between("1701-01-01","1800-12-31"),"18th Century").when($"pres_dob".between("1801-01-01","1900-12-31"),"19th Century").when($"pres_dob".between("1901-01-01","2000-12-31"),"20th Century").alias("Century"),lit("-1").alias("C3")).show(50)
|           pres_name|  pres_dob|       pres_bs|state_abbr|     Century| C3|
|   George Washington|1732-02-22|      Virginia|        VA|18th Century| -1|
|          John Adams|1735-10-30| Massachusetts|        MA|18th Century| -1|
|    Thomas Jefferson|1743-04-13|      Virginia|        VA|18th Century| -1|
|       James Madison|1751-03-16|      Virginia|        VA|18th Century| -1|
|        James Monroe|1758-04-28|      Virginia|        VA|18th Century| -1|
|   John Quincy Adams|1767-07-11| Massachusetts|        MA|18th Century| -1|
|      Andrew Jackson|1767-03-15|North Carolina|    Others|18th Century| -1|
|    Martin Van Buren|1782-12-05|      New York|    Others|18th Century| -1|
|William Henry Har...|1773-02-09|      Virginia|        VA|18th Century| -1|
|          John Tyler|1790-03-29|      Virginia|        VA|18th Century| -1|
|       James K. Polk|1795-11-02|North Carolina|    Others|18th Century| -1|
|      Zachary Taylor|1784-11-24|      Virginia|        VA|18th Century| -1|
|    Millard Fillmore|1800-01-07|      New York|    Others|18th Century| -1|
|     Franklin Pierce|1804-11-23| New Hampshire|    Others|19th Century| -1|
|      James Buchanan|1791-04-23|  Pennsylvania|    Others|18th Century| -1|
|     Abraham Lincoln|1809-02-12|      Kentucky|    Others|19th Century| -1|
|      Andrew Johnson|1808-12-29|North Carolina|    Others|19th Century| -1|
|    Ulysses S. Grant|1822-04-27|          Ohio|        OH|19th Century| -1|
| Rutherford B. Hayes|1822-10-04|          Ohio|        OH|19th Century| -1|
|   James A. Garfield|1831-11-19|          Ohio|        OH|19th Century| -1|
|   Chester A. Arthur|1829-10-05|       Vermont|    Others|19th Century| -1|
|    Grover Cleveland|1837-03-18|    New Jersey|    Others|19th Century| -1|
|   Benjamin Harrison|1833-08-20|          Ohio|        OH|19th Century| -1|
|    Grover Cleveland|1837-03-18|    New Jersey|    Others|19th Century| -1|
|    William McKinley|1843-01-29|          Ohio|        OH|19th Century| -1|
|  Theodore Roosevelt|1858-10-27|      New York|    Others|19th Century| -1|
| William Howard Taft|1857-09-15|          Ohio|        OH|19th Century| -1|
|      Woodrow Wilson|1856-12-28|      Virginia|        VA|19th Century| -1|
|   Warren G. Harding|1865-11-02|          Ohio|        OH|19th Century| -1|
|     Calvin Coolidge|1872-07-04|       Vermont|    Others|19th Century| -1|
|      Herbert Hoover|1874-08-10|          Iowa|    Others|19th Century| -1|
|Franklin D. Roose...|1882-01-30|      New York|    Others|19th Century| -1|
|     Harry S. Truman|1884-05-08|      Missouri|    Others|19th Century| -1|
|Dwight D. Eisenhower|1890-10-14|         Texas|    Others|19th Century| -1|
|     John F. Kennedy|1917-05-29| Massachusetts|        MA|20th Century| -1|
|   Lyndon B. Johnson|1908-08-27|         Texas|    Others|20th Century| -1|
|    Richard M. Nixon|1913-01-09|    California|    Others|20th Century| -1|
|      Gerald R. Ford|1913-07-14|      Nebraska|    Others|20th Century| -1|
|        Jimmy Carter|1924-10-01|       Georgia|    Others|20th Century| -1|
|       Ronald Reagan|1911-02-06|      Illinois|    Others|20th Century| -1|
|   George H. W. Bush|1924-06-12| Massachusetts|        MA|20th Century| -1|
|        Bill Clinton|1946-08-19|      Arkansas|    Others|20th Century| -1|
|      George W. Bush|1946-07-06|   Connecticut|    Others|20th Century| -1|
|        Barack Obama|1961-08-04|        Hawaii|    Others|20th Century| -1|
|        Donald Trump|1946-06-14|      New York|    Others|20th Century| -1|

We can also use withColumn method to add new columns in spark dataframe.

scala> df_pres.select($"pres_name",$"pres_dob",$"pres_bs").withColumn("state_abbr",when($"pres_bs"==="Virginia","VA").when($"pres_bs"==="Massachusetts","MA").when($"pres_bs"==="Ohio","OH").otherwise("Others")).show(false)
|pres_name             |pres_dob  |pres_bs       |state_abbr|
|George Washington     |1732-02-22|Virginia      |VA        |
|John Adams            |1735-10-30|Massachusetts |MA        |
|Thomas Jefferson      |1743-04-13|Virginia      |VA        |
|James Madison         |1751-03-16|Virginia      |VA        |
|James Monroe          |1758-04-28|Virginia      |VA        |
|John Quincy Adams     |1767-07-11|Massachusetts |MA        |
|Andrew Jackson        |1767-03-15|North Carolina|Others    |
|Martin Van Buren      |1782-12-05|New York      |Others    |
|William Henry Harrison|1773-02-09|Virginia      |VA        |
|John Tyler            |1790-03-29|Virginia      |VA        |
|James K. Polk         |1795-11-02|North Carolina|Others    |
|Zachary Taylor        |1784-11-24|Virginia      |VA        |
|Millard Fillmore      |1800-01-07|New York      |Others    |
|Franklin Pierce       |1804-11-23|New Hampshire |Others    |
|James Buchanan        |1791-04-23|Pennsylvania  |Others    |
|Abraham Lincoln       |1809-02-12|Kentucky      |Others    |
|Andrew Johnson        |1808-12-29|North Carolina|Others    |
|Ulysses S. Grant      |1822-04-27|Ohio          |OH        |
|Rutherford B. Hayes   |1822-10-04|Ohio          |OH        |
|James A. Garfield     |1831-11-19|Ohio          |OH        |
only showing top 20 rows

In this post , we saw how to add new columns to spark dataframe. This new column can be with default value or some other values.

Leave a Comment

Your email address will not be published.