select DISTINCT in HIVE
DISTINCT keyword is used in SELECT statement in HIVE to fetch only unique rows. The row does not mean entire row in the table but it means “row” as per column listed in the SELECT statement. If the SELECT has 3 columns listed then SELECT DISTINCT will fetch unique row for those 3 column values only. Also it will not fetch DISTINCT value for 1 of the column. The row comprising of 3 columns will be UNIQUE, not 1, not 2 but all 3 columns. DISTINCT cannot be applied to individual column if multiple columns are listed in SELECT statement. It is applied to entire column list in SELECT statement.
Example:
select distinct pres_bs as pres_birth_state from usa_president; +-----------------------+--+ | pres_birth_state | +-----------------------+--+ | Arkansas | | California | | Connecticut | | Georgia | | Hawaii | | Illinois | | Iowa | | Kentucky | | Massachusetts | | Missouri | | Nebraska | | New Hampshire | | New Jersey | | New York | | North Carolina | | Ohio | | Pennsylvania | | South/North Carolina | | Texas | | Vermont | | Virginia | +-----------------------+--+
Since there are only 21 UNIQUE states out of 50 from which USA president has come from we see only 21 rows in the output. In order to get all columns as DISTINCT you can use DISTINCT *
select DISTINCT * from usa_president; +------------------------+--------------------------+-------------------------+------------------------+------------------------+------------------------+-------------------------+--+ | usa_president.pres_id | usa_president.pres_name | usa_president.pres_dob | usa_president.pres_bp | usa_president.pres_bs | usa_president.pres_in | usa_president.pres_out | +------------------------+--------------------------+-------------------------+------------------------+------------------------+------------------------+-------------------------+--+ | 1 | George Washington | 1732-02-22 | Westmoreland County | Virginia | 1789-04-30 | 1797-03-04 | | 2 | John Adams | 1735-10-30 | Braintree | Massachusetts | 1797-03-04 | 1801-03-04 | | 3 | Thomas Jefferson | 1743-04-13 | Shadwell | Virginia | 1801-03-04 | 1809-03-04 | | 4 | James Madison | 1751-03-16 | Port Conway | Virginia | 1809-03-04 | 1817-03-04 | | 5 | James Monroe | 1758-04-28 | Monroe Hall | Virginia | 1817-03-04 | 1825-03-04 | | 6 | John Quincy Adams | 1767-07-11 | Braintree | Massachusetts | 1825-03-04 | 1829-03-04 | | 7 | Andrew Jackson | 1767-03-15 | Waxhaws Region | South/North Carolina | 1829-03-04 | 1837-03-04 | | 8 | Martin Van Buren | 1782-12-05 | Kinderhook | New York | 1837-03-04 | 1841-03-04 | | 9 | William Henry Harrison | 1773-02-09 | Charles City County | Virginia | 1841-03-04 | 1841-04-04 | | 10 | John Tyler | 1790-03-29 | Charles City County | Virginia | 1841-04-04 | 1845-03-04 | | 11 | James K. Polk | 1795-11-02 | Pineville | North Carolina | 1845-03-04 | 1849-03-04 | | 12 | Zachary Taylor | 1784-11-24 | Barboursville | Virginia | 1849-03-04 | 1850-07-09 | | 13 | Millard Fillmore | 1800-01-07 | Summerhill | New York | 1850-07-09 | 1853-03-04 | | 14 | Franklin Pierce | 1804-11-23 | Hillsborough | New Hampshire | 1853-03-04 | 1857-03-04 | | 15 | James Buchanan | 1791-04-23 | Cove Gap | Pennsylvania | 1857-03-04 | 1861-03-04 | | 16 | Abraham Lincoln | 1809-02-12 | Sinking spring | Kentucky | 1861-03-04 | 1865-04-15 | | 17 | Andrew Johnson | 1808-12-29 | Raleigh | North Carolina | 1865-04-15 | 1869-03-04 | | 18 | Ulysses S. Grant | 1822-04-27 | Point Pleasant | Ohio | 1869-03-04 | 1877-03-04 | | 19 | Rutherford B. Hayes | 1822-10-04 | Delaware | Ohio | 1877-03-04 | 1881-03-04 | | 20 | James A. Garfield | 1831-11-19 | Moreland Hills | Ohio | 1881-03-04 | 1881-09-19 | | 21 | Chester A. Arthur | 1829-10-05 | Fairfield | Vermont | 1881-09-19 | 1885-03-04 | | 22 | Grover Cleveland | 1837-03-18 | Caldwell | New Jersey | 1885-03-04 | 1889-03-04 | | 23 | Benjamin Harrison | 1833-08-20 | North Bend | Ohio | 1889-03-04 | 1893-03-04 | | 24 | Grover Cleveland | 1837-03-18 | Caldwell | New Jersey | 1893-03-04 | 1897-03-04 | | 25 | William McKinley | 1843-01-29 | Niles | Ohio | 1897-03-04 | 1901-09-14 | | 26 | Theodore Roosevelt | 1858-10-27 | Manhattan | New York | 1901-09-14 | 1909-03-04 | | 27 | William Howard Taft | 1857-09-15 | Cincinnati | Ohio | 1909-03-04 | 1913-03-04 | | 28 | Woodrow Wilson | 1856-12-28 | Staunton | Virginia | 1913-03-04 | 1921-03-04 | | 29 | Warren G. Harding | 1865-11-02 | Blooming Grove | Ohio | 1921-03-04 | 1923-08-02 | | 30 | Calvin Coolidge | 1872-07-04 | Plymouth | Vermont | 1923-08-02 | 1929-03-04 | | 31 | Herbert Hoover | 1874-08-10 | West Branch | Iowa | 1929-03-04 | 1933-03-04 | | 32 | Franklin D. Roosevelt | 1882-01-30 | Hyde Park | New York | 1933-03-04 | 1945-04-12 | | 33 | Harry S. Truman | 1884-05-08 | Lamar | Missouri | 1945-04-12 | 1953-01-20 | | 34 | Dwight D. Eisenhower | 1890-10-14 | Denison | Texas | 1953-01-20 | 1961-01-20 | | 35 | John F. Kennedy | 1917-05-29 | Brookline | Massachusetts | 1961-01-20 | 1963-11-22 | | 36 | Lyndon B. Johnson | 1908-08-27 | Stonewall | Texas | 1963-11-22 | 1969-01-20 | | 37 | Richard M. Nixon | 1913-01-09 | Yorba Linda | California | 1969-01-20 | 1974-08-09 | | 38 | Gerald R. Ford | 1913-07-14 | Omaha | Nebraska | 1974-08-09 | 1977-01-20 | | 39 | Jimmy Carter | 1924-10-01 | Plains | Georgia | 1977-01-20 | 1981-01-20 | | 40 | Ronald Reagan | 1911-02-06 | Tampico | Illinois | 1981-01-20 | 1989-01-20 | | 41 | George H. W. Bush | 1924-06-12 | Milton | Massachusetts | 1989-01-20 | 1993-01-20 | | 42 | Bill Clinton | 1946-08-19 | Hope | Arkansas | 1993-01-20 | 2001-01-20 | | 43 | George W. Bush | 1946-07-06 | New Haven | Connecticut | 2001-01-20 | 2009-01-20 | | 44 | Barack Obama | 1961-08-04 | Honolulu | Hawaii | 2009-01-20 | 2017-01-20 | | 45 | Donald Trump | 1946-06-14 | Queens | New York | 2017-01-20 | NULL | +------------------------+--------------------------+-------------------------+------------------------+------------------------+------------------------+-------------------------+--+
Since all the rows are unique in entirety output is all the 45 rows from table.
DISTINCT can be very handy to identify unique records/values. However in HIVE it is considered as one of the most costly operation in HIVE QL(Query Language). It consumes high resources and may give memory related errors if data volume is too high. Hence it is advisable to restrict your output rows by using some filters if possible if you must use DISTINCT. We will talk about another way of finding unique records when we will talk about performance tuning in HIVE.