You may need to add new columns in the existing SPARK dataframe as per the requirement. This new column can be initialized with a default value or you can assign some dynamic value to it depending on some logical conditions. Let’s see an example below to add 2 new columns with logical value and 1 column with default value.
scala> df_pres.select($"pres_name",$"pres_dob",$"pres_bs").show(false)
+----------------------+----------+--------------+
|pres_name |pres_dob |pres_bs |
+----------------------+----------+--------------+
|George Washington |1732-02-22|Virginia |
|John Adams |1735-10-30|Massachusetts |
|Thomas Jefferson |1743-04-13|Virginia |
|James Madison |1751-03-16|Virginia |
|James Monroe |1758-04-28|Virginia |
|John Quincy Adams |1767-07-11|Massachusetts |
|Andrew Jackson |1767-03-15|North Carolina|
|Martin Van Buren |1782-12-05|New York |
|William Henry Harrison|1773-02-09|Virginia |
|John Tyler |1790-03-29|Virginia |
|James K. Polk |1795-11-02|North Carolina|
|Zachary Taylor |1784-11-24|Virginia |
|Millard Fillmore |1800-01-07|New York |
|Franklin Pierce |1804-11-23|New Hampshire |
|James Buchanan |1791-04-23|Pennsylvania |
|Abraham Lincoln |1809-02-12|Kentucky |
|Andrew Johnson |1808-12-29|North Carolina|
|Ulysses S. Grant |1822-04-27|Ohio |
|Rutherford B. Hayes |1822-10-04|Ohio |
|James A. Garfield |1831-11-19|Ohio |
+----------------------+----------+--------------+
only showing top 20 rows
Let’s add 2 new columns to it. One for State Abbreviation and other for Century to which President was born. Also we will add 1 new column with default value using “lit” function.
scala> df_pres.select($"pres_name",$"pres_dob",$"pres_bs",when($"pres_bs"==="Virginia","VA").when($"pres_bs"==="Massachusetts","MA").when($"pres_bs"==="Ohio","OH").otherwise("Others").alias("state_abbr"), when($"pres_dob".between("1701-01-01","1800-12-31"),"18th Century").when($"pres_dob".between("1801-01-01","1900-12-31"),"19th Century").when($"pres_dob".between("1901-01-01","2000-12-31"),"20th Century").alias("Century"),lit("-1").alias("C3")).show(50)
+--------------------+----------+--------------+----------+------------+---+
| pres_name| pres_dob| pres_bs|state_abbr| Century| C3|
+--------------------+----------+--------------+----------+------------+---+
| George Washington|1732-02-22| Virginia| VA|18th Century| -1|
| John Adams|1735-10-30| Massachusetts| MA|18th Century| -1|
| Thomas Jefferson|1743-04-13| Virginia| VA|18th Century| -1|
| James Madison|1751-03-16| Virginia| VA|18th Century| -1|
| James Monroe|1758-04-28| Virginia| VA|18th Century| -1|
| John Quincy Adams|1767-07-11| Massachusetts| MA|18th Century| -1|
| Andrew Jackson|1767-03-15|North Carolina| Others|18th Century| -1|
| Martin Van Buren|1782-12-05| New York| Others|18th Century| -1|
|William Henry Har...|1773-02-09| Virginia| VA|18th Century| -1|
| John Tyler|1790-03-29| Virginia| VA|18th Century| -1|
| James K. Polk|1795-11-02|North Carolina| Others|18th Century| -1|
| Zachary Taylor|1784-11-24| Virginia| VA|18th Century| -1|
| Millard Fillmore|1800-01-07| New York| Others|18th Century| -1|
| Franklin Pierce|1804-11-23| New Hampshire| Others|19th Century| -1|
| James Buchanan|1791-04-23| Pennsylvania| Others|18th Century| -1|
| Abraham Lincoln|1809-02-12| Kentucky| Others|19th Century| -1|
| Andrew Johnson|1808-12-29|North Carolina| Others|19th Century| -1|
| Ulysses S. Grant|1822-04-27| Ohio| OH|19th Century| -1|
| Rutherford B. Hayes|1822-10-04| Ohio| OH|19th Century| -1|
| James A. Garfield|1831-11-19| Ohio| OH|19th Century| -1|
| Chester A. Arthur|1829-10-05| Vermont| Others|19th Century| -1|
| Grover Cleveland|1837-03-18| New Jersey| Others|19th Century| -1|
| Benjamin Harrison|1833-08-20| Ohio| OH|19th Century| -1|
| Grover Cleveland|1837-03-18| New Jersey| Others|19th Century| -1|
| William McKinley|1843-01-29| Ohio| OH|19th Century| -1|
| Theodore Roosevelt|1858-10-27| New York| Others|19th Century| -1|
| William Howard Taft|1857-09-15| Ohio| OH|19th Century| -1|
| Woodrow Wilson|1856-12-28| Virginia| VA|19th Century| -1|
| Warren G. Harding|1865-11-02| Ohio| OH|19th Century| -1|
| Calvin Coolidge|1872-07-04| Vermont| Others|19th Century| -1|
| Herbert Hoover|1874-08-10| Iowa| Others|19th Century| -1|
|Franklin D. Roose...|1882-01-30| New York| Others|19th Century| -1|
| Harry S. Truman|1884-05-08| Missouri| Others|19th Century| -1|
|Dwight D. Eisenhower|1890-10-14| Texas| Others|19th Century| -1|
| John F. Kennedy|1917-05-29| Massachusetts| MA|20th Century| -1|
| Lyndon B. Johnson|1908-08-27| Texas| Others|20th Century| -1|
| Richard M. Nixon|1913-01-09| California| Others|20th Century| -1|
| Gerald R. Ford|1913-07-14| Nebraska| Others|20th Century| -1|
| Jimmy Carter|1924-10-01| Georgia| Others|20th Century| -1|
| Ronald Reagan|1911-02-06| Illinois| Others|20th Century| -1|
| George H. W. Bush|1924-06-12| Massachusetts| MA|20th Century| -1|
| Bill Clinton|1946-08-19| Arkansas| Others|20th Century| -1|
| George W. Bush|1946-07-06| Connecticut| Others|20th Century| -1|
| Barack Obama|1961-08-04| Hawaii| Others|20th Century| -1|
| Donald Trump|1946-06-14| New York| Others|20th Century| -1|
+--------------------+----------+--------------+----------+------------+---+
We can also use withColumn method to add new columns in spark dataframe.
scala> df_pres.select($"pres_name",$"pres_dob",$"pres_bs").withColumn("state_abbr",when($"pres_bs"==="Virginia","VA").when($"pres_bs"==="Massachusetts","MA").when($"pres_bs"==="Ohio","OH").otherwise("Others")).show(false)
+----------------------+----------+--------------+----------+
|pres_name |pres_dob |pres_bs |state_abbr|
+----------------------+----------+--------------+----------+
|George Washington |1732-02-22|Virginia |VA |
|John Adams |1735-10-30|Massachusetts |MA |
|Thomas Jefferson |1743-04-13|Virginia |VA |
|James Madison |1751-03-16|Virginia |VA |
|James Monroe |1758-04-28|Virginia |VA |
|John Quincy Adams |1767-07-11|Massachusetts |MA |
|Andrew Jackson |1767-03-15|North Carolina|Others |
|Martin Van Buren |1782-12-05|New York |Others |
|William Henry Harrison|1773-02-09|Virginia |VA |
|John Tyler |1790-03-29|Virginia |VA |
|James K. Polk |1795-11-02|North Carolina|Others |
|Zachary Taylor |1784-11-24|Virginia |VA |
|Millard Fillmore |1800-01-07|New York |Others |
|Franklin Pierce |1804-11-23|New Hampshire |Others |
|James Buchanan |1791-04-23|Pennsylvania |Others |
|Abraham Lincoln |1809-02-12|Kentucky |Others |
|Andrew Johnson |1808-12-29|North Carolina|Others |
|Ulysses S. Grant |1822-04-27|Ohio |OH |
|Rutherford B. Hayes |1822-10-04|Ohio |OH |
|James A. Garfield |1831-11-19|Ohio |OH |
+----------------------+----------+--------------+----------+
only showing top 20 rows
In this post , we saw how to add new columns to spark dataframe. This new column can be with default value or some other values.