UNION method is used to MERGE data from 2 dataframes into one. The dataframe must have identical schema. If you are from SQL background then please be very cautious while using UNION operator in SPARK dataframes. Unlike typical RDBMS, UNION in Spark does not remove duplicates from resultant dataframe. It simply MERGEs the data without removing any duplicates. UNION ALL is deprecated and it is recommended to use UNION only.
Let’s see one example to understand it more properly. I have 2 dataframes with 5 & 10 records respectively with first 5 common in both the dataframes.
First Dataframe with 5 records
scala> df_pres1.show() +-------+-----------------+----------+-------------------+-------------+----------+----------+ |pres_id| pres_name| pres_dob| pres_bp| pres_bs| pres_in| pres_out| +-------+-----------------+----------+-------------------+-------------+----------+----------+ | 1|George Washington|1732-02-22|Westmoreland County| Virginia|1789-04-30|1797-03-04| | 2| John Adams|1735-10-30| Braintree|Massachusetts|1797-03-04|1801-03-04| | 3| Thomas Jefferson|1743-04-13| Shadwell| Virginia|1801-03-04|1809-03-04| | 4| James Madison|1751-03-16| Port Conway| Virginia|1809-03-04|1817-03-04| | 5| James Monroe|1758-04-28| Monroe Hall| Virginia|1817-03-04|1825-03-04| +-------+-----------------+----------+-------------------+-------------+----------+----------+
Second Dataframe with 10 records
scala> df_pres2.show() +-------+--------------------+----------+-------------------+--------------+----------+----------+ |pres_id| pres_name| pres_dob| pres_bp| pres_bs| pres_in| pres_out| +-------+--------------------+----------+-------------------+--------------+----------+----------+ | 1| George Washington|1732-02-22|Westmoreland County| Virginia|1789-04-30|1797-03-04| | 2| John Adams|1735-10-30| Braintree| Massachusetts|1797-03-04|1801-03-04| | 3| Thomas Jefferson|1743-04-13| Shadwell| Virginia|1801-03-04|1809-03-04| | 4| James Madison|1751-03-16| Port Conway| Virginia|1809-03-04|1817-03-04| | 5| James Monroe|1758-04-28| Monroe Hall| Virginia|1817-03-04|1825-03-04| | 6| John Quincy Adams|1767-07-11| Braintree| Massachusetts|1825-03-04|1829-03-04| | 7| Andrew Jackson|1767-03-15| Waxhaws Region|North Carolina|1829-03-04|1837-03-04| | 8| Martin Van Buren|1782-12-05| Kinderhook| New York|1837-03-04|1841-03-04| | 9| William Henry Har…|1773-02-09|Charles City County| Virginia|1841-03-04|1841-04-04| | 10| John Tyler|1790-03-29|Charles City County| Virginia|1841-04-04|1845-03-04| +-------+--------------------+----------+-------------------+--------------+----------+----------+
Dataframe UNION in spark
scala> df_pres1.union(df_pres2).show() +-------+--------------------+----------+-------------------+--------------+----------+----------+ |pres_id| pres_name| pres_dob| pres_bp| pres_bs| pres_in| pres_out| +-------+--------------------+----------+-------------------+--------------+----------+----------+ | 1| George Washington|1732-02-22|Westmoreland County| Virginia|1789-04-30|1797-03-04| | 2| John Adams|1735-10-30| Braintree| Massachusetts|1797-03-04|1801-03-04| | 3| Thomas Jefferson|1743-04-13| Shadwell| Virginia|1801-03-04|1809-03-04| | 4| James Madison|1751-03-16| Port Conway| Virginia|1809-03-04|1817-03-04| | 5| James Monroe|1758-04-28| Monroe Hall| Virginia|1817-03-04|1825-03-04| | 1| George Washington|1732-02-22|Westmoreland County| Virginia|1789-04-30|1797-03-04| | 2| John Adams|1735-10-30| Braintree| Massachusetts|1797-03-04|1801-03-04| | 3| Thomas Jefferson|1743-04-13| Shadwell| Virginia|1801-03-04|1809-03-04| | 4| James Madison|1751-03-16| Port Conway| Virginia|1809-03-04|1817-03-04| | 5| James Monroe|1758-04-28| Monroe Hall| Virginia|1817-03-04|1825-03-04| | 6| John Quincy Adams|1767-07-11| Braintree| Massachusetts|1825-03-04|1829-03-04| | 7| Andrew Jackson|1767-03-15| Waxhaws Region|North Carolina|1829-03-04|1837-03-04| | 8| Martin Van Buren|1782-12-05| Kinderhook| New York|1837-03-04|1841-03-04| | 9| William Henry Har…|1773-02-09|Charles City County| Virginia|1841-03-04|1841-04-04| | 10| John Tyler|1790-03-29|Charles City County| Virginia|1841-04-04|1845-03-04| +-------+--------------------+----------+-------------------+--------------+----------+----------+
You can see in the above output that rows are duplicate and UNION is only MERGING the records without any de-duplication. If we need distinct records or similar functionality of SQL “UNION” then we should apply distinct method to UNION output.
Dataframe UNION with DISTINCT in spark
scala> df_pres1.union(df_pres2).distinct().show() +-------+--------------------+----------+-------------------+--------------+----------+----------+ |pres_id| pres_name| pres_dob| pres_bp| pres_bs| pres_in| pres_out| +-------+--------------------+----------+-------------------+--------------+----------+----------+ | 1| George Washington|1732-02-22|Westmoreland County| Virginia|1789-04-30|1797-03-04| | 2| John Adams|1735-10-30| Braintree| Massachusetts|1797-03-04|1801-03-04| | 6| John Quincy Adams|1767-07-11| Braintree| Massachusetts|1825-03-04|1829-03-04| | 7| Andrew Jackson|1767-03-15| Waxhaws Region|North Carolina|1829-03-04|1837-03-04| | 9| William Henry Har…|1773-02-09|Charles City County| Virginia|1841-03-04|1841-04-04| | 10| John Tyler|1790-03-29|Charles City County| Virginia|1841-04-04|1845-03-04| | 3| Thomas Jefferson|1743-04-13| Shadwell| Virginia|1801-03-04|1809-03-04| | 4| James Madison|1751-03-16| Port Conway| Virginia|1809-03-04|1817-03-04| | 5| James Monroe|1758-04-28| Monroe Hall| Virginia|1817-03-04|1825-03-04| | 8| Martin Van Buren|1782-12-05| Kinderhook| New York|1837-03-04|1841-03-04| +-------+--------------------+----------+-------------------+--------------+----------+----------+
Leave a comment if you need any help for UNION.