PySpark RDD operations – Map, Filter, SortBy, reduceByKey, Joins

In the last post, we discussed about basic operations on RDD in PySpark. In this post, we will see other common operations one can perform on RDD in PySpark. Let’s quickly see the syntax and examples for various RDD operations:

Read a file into RDD

>>> rdd_states = sc.textFile("/tmp/US_States_data.txt")
>>> rdd_states.take(2)
[u'stateid~state_name~state_abbr~state_capital~largest_city~population', u'1~Alabama~AL~Montgomery~Birmingham~4874747']

Convert record into LIST of elements

>>> rdd_states = rdd_states.map(lambda x:x.split("~"))
>>> rdd_states.take(2)
[[u'stateid', u'state_name', u'state_abbr', u'state_capital', u'largest_city', u'population'], [u'1', u'Alabama', u'AL', u'Montgomery', u'Birmingham', u'4874747']]

Remove the header data

>>> rdd_states = rdd_states.filter(lambda x:not(x[0].startswith("stateid")))
>>> rdd_states.take(2)
[[u'1', u'Alabama', u'AL', u'Montgomery', u'Birmingham', u'4874747'], [u'2', u'Alaska', u'AK', u'Juneau', u'Anchorage', u'739795']]

Check the count of records in RDD

>>> rdd_states.count()
50

Check the first element in RDD

>>> rdd_states.first()
[u'1', u'Alabama', u'AL', u'Montgomery', u'Birmingham', u'4874747']

Check the partitions for RDD

>>> rdd_states.getNumPartitions()
2

Use custom function in RDD operations

>>> def onlyUpper(st_name):
...     st_name_upper = st_name.upper()
...     return st_name_upper
...
>>> onlyUpper("hello")
'HELLO'

Apply custom function to RDD and see the result:

>>> rdd_states_upper=rdd_states.map(lambda x:onlyUpper(x[1]))
>>> rdd_states_upper.take(10)
[u'ALABAMA', u'ALASKA', u'ARIZONA', u'ARKANSAS', u'CALIFORNIA', u'COLORADO', u'CONNECTICUT', u'DELAWARE', u'FLORIDA', u'GEORGIA']

If you want to convert all the columns to UPPER case, create another function which accepts LIST and return LIST in uppercase for all elements.

>>> def onlyUpperList(lst_state):
...     rec = []
...     for y in lst_state:
...         rec.append(y.upper())
...     return rec
...
>>> onlyUpperList(["abc","def","pqr"])
['ABC', 'DEF', 'PQR']

Let’s apply the method to RDD and see the result:

>>> rdd_states_upper = rdd_states.map(onlyUpperList)
>>> rdd_states_upper.take(2)
[[u'1', u'ALABAMA', u'AL', u'MONTGOMERY', u'BIRMINGHAM', u'4874747'], [u'2', u'ALASKA', u'AK', u'JUNEAU', u'ANCHORAGE', u'739795']]

Filter the data in RDD to select states with population more than 5 Mn.

>>> rdd_states_population = rdd_states.filter(lambda x: int(x[5])>5000000)
>>> rdd_states_population.count()
23
>>> rdd_states_population.take(2)
[[u'3', u'Arizona', u'AZ', u'Phoenix', u'Phoenix', u'7016270'], [u'5', u'California', u'CA', u'Sacramento', u'Los Angeles', u'39536653']]

For next few operations , let’s create another RDD with above mentioned steps:

rdd_pres = sc.textFile("/tmp/US_President_data.txt")
rdd_pres = rdd_pres.map(lambda x:x.split("~"))
rdd_pres = rdd_pres.filter(lambda x:not(x[0].startswith("President_ID")))

for x in rdd_pres.collect():
    print (x)
	
[u'1', u'George Washington', u'1732-02-22', u'Westmoreland County', u'Virginia', u'1789-04-30', u'1797-03-04']
[u'2', u'John Adams', u'1735-10-30', u'Braintree', u'Massachusetts', u'1797-03-04', u'1801-03-04']
[u'3', u'Thomas Jefferson', u'1743-04-13', u'Shadwell', u'Virginia', u'1801-03-04', u'1809-03-04']
[u'4', u'James Madison', u'1751-03-16', u'Port Conway', u'Virginia', u'1809-03-04', u'1817-03-04']
[u'5', u'James Monroe', u'1758-04-28', u'Monroe Hall', u'Virginia', u'1817-03-04', u'1825-03-04']
[u'6', u'John Quincy Adams', u'1767-07-11', u'Braintree', u'Massachusetts', u'1825-03-04', u'1829-03-04']
[u'7', u'Andrew Jackson', u'1767-03-15', u'Waxhaws Region', u'North Carolina', u'1829-03-04', u'1837-03-04']
[u'8', u'Martin Van Buren', u'1782-12-05', u'Kinderhook', u'New York', u'1837-03-04', u'1841-03-04']
[u'9', u'William Henry Harrison', u'1773-02-09', u'Charles City County', u'Virginia', u'1841-03-04', u'1841-04-04']
[u'10', u'John Tyler', u'1790-03-29', u'Charles City County', u'Virginia', u'1841-04-04', u'1845-03-04']
[u'11', u'James K. Polk', u'1795-11-02', u'Pineville', u'North Carolina', u'1845-03-04', u'1849-03-04']
[u'12', u'Zachary Taylor', u'1784-11-24', u'Barboursville', u'Virginia', u'1849-03-04', u'1850-07-09']
[u'13', u'Millard Fillmore', u'1800-01-07', u'Summerhill', u'New York', u'1850-07-09', u'1853-03-04']
[u'14', u'Franklin Pierce', u'1804-11-23', u'Hillsborough', u'New Hampshire', u'1853-03-04', u'1857-03-04']
[u'15', u'James Buchanan', u'1791-04-23', u'Cove Gap', u'Pennsylvania', u'1857-03-04', u'1861-03-04']
[u'16', u'Abraham Lincoln', u'1809-02-12', u'Sinking spring', u'Kentucky', u'1861-03-04', u'1865-04-15']
[u'17', u'Andrew Johnson', u'1808-12-29', u'Raleigh', u'North Carolina', u'1865-04-15', u'1869-03-04']
[u'18', u'Ulysses S. Grant', u'1822-04-27', u'Point Pleasant', u'Ohio', u'1869-03-04', u'1877-03-04']
[u'19', u'Rutherford B. Hayes', u'1822-10-04', u'Delaware', u'Ohio', u'1877-03-04', u'1881-03-04']
[u'20', u'James A. Garfield', u'1831-11-19', u'Moreland Hills', u'Ohio', u'1881-03-04', u'1881-09-19']
[u'21', u'Chester A. Arthur', u'1829-10-05', u'Fairfield', u'Vermont', u'1881-09-19', u'1885-03-04']
[u'22', u'Grover Cleveland', u'1837-03-18', u'Caldwell', u'New Jersey', u'1885-03-04', u'1889-03-04']
[u'23', u'Benjamin Harrison', u'1833-08-20', u'North Bend', u'Ohio', u'1889-03-04', u'1893-03-04']
[u'24', u'Grover Cleveland', u'1837-03-18', u'Caldwell', u'New Jersey', u'1893-03-04', u'1897-03-04']
[u'25', u'William McKinley', u'1843-01-29', u'Niles', u'Ohio', u'1897-03-04', u'1901-09-14']
[u'26', u'Theodore Roosevelt', u'1858-10-27', u'Manhattan', u'New York', u'1901-09-14', u'1909-03-04']
[u'27', u'William Howard Taft', u'1857-09-15', u'Cincinnati', u'Ohio', u'1909-03-04', u'1913-03-04']
[u'28', u'Woodrow Wilson', u'1856-12-28', u'Staunton', u'Virginia', u'1913-03-04', u'1921-03-04']
[u'29', u'Warren G. Harding', u'1865-11-02', u'Blooming Grove', u'Ohio', u'1921-03-04', u'1923-08-02']
[u'30', u'Calvin Coolidge', u'1872-07-04', u'Plymouth', u'Vermont', u'1923-08-02', u'1929-03-04']
[u'31', u'Herbert Hoover', u'1874-08-10', u'West Branch', u'Iowa', u'1929-03-04', u'1933-03-04']
[u'32', u'Franklin D. Roosevelt', u'1882-01-30', u'Hyde Park', u'New York', u'1933-03-04', u'1945-04-12']
[u'33', u'Harry S. Truman', u'1884-05-08', u'Lamar', u'Missouri', u'1945-04-12', u'1953-01-20']
[u'34', u'Dwight D. Eisenhower', u'1890-10-14', u'Denison', u'Texas', u'1953-01-20', u'1961-01-20']
[u'35', u'John F. Kennedy', u'1917-05-29', u'Brookline', u'Massachusetts', u'1961-01-20', u'1963-11-22']
[u'36', u'Lyndon B. Johnson', u'1908-08-27', u'Stonewall', u'Texas', u'1963-11-22', u'1969-01-20']
[u'37', u'Richard M. Nixon', u'1913-01-09', u'Yorba Linda', u'California', u'1969-01-20', u'1974-08-09']
[u'38', u'Gerald R. Ford', u'1913-07-14', u'Omaha', u'Nebraska', u'1974-08-09', u'1977-01-20']
[u'39', u'Jimmy Carter', u'1924-10-01', u'Plains', u'Georgia', u'1977-01-20', u'1981-01-20']
[u'40', u'Ronald Reagan', u'1911-02-06', u'Tampico', u'Illinois', u'1981-01-20', u'1989-01-20']
[u'41', u'George H. W. Bush', u'1924-06-12', u'Milton', u'Massachusetts', u'1989-01-20', u'1993-01-20']
[u'42', u'Bill Clinton', u'1946-08-19', u'Hope', u'Arkansas', u'1993-01-20', u'2001-01-20']
[u'43', u'George W. Bush', u'1946-07-06', u'New Haven', u'Connecticut', u'2001-01-20', u'2009-01-20']
[u'44', u'Barack Obama', u'1961-08-04', u'Honolulu', u'Hawaii', u'2009-01-20', u'2017-01-20']
[u'45', u'Donald Trump', u'1946-06-14', u'Queens', u'New York', u'2017-01-20', u'']

Sort the RDD data on the basis of state name.

>>> rdd_pres_sort = rdd_pres.sortBy(keyfunc = lambda x: x[4])
>>> for x in rdd_pres_sort.collect():
...     print (x)
...
[u'42', u'Bill Clinton', u'1946-08-19', u'Hope', u'Arkansas', u'1993-01-20', u'2001-01-20']
[u'37', u'Richard M. Nixon', u'1913-01-09', u'Yorba Linda', u'California', u'1969-01-20', u'1974-08-09']
[u'43', u'George W. Bush', u'1946-07-06', u'New Haven', u'Connecticut', u'2001-01-20', u'2009-01-20']
[u'39', u'Jimmy Carter', u'1924-10-01', u'Plains', u'Georgia', u'1977-01-20', u'1981-01-20']
[u'44', u'Barack Obama', u'1961-08-04', u'Honolulu', u'Hawaii', u'2009-01-20', u'2017-01-20']
[u'40', u'Ronald Reagan', u'1911-02-06', u'Tampico', u'Illinois', u'1981-01-20', u'1989-01-20']
[u'31', u'Herbert Hoover', u'1874-08-10', u'West Branch', u'Iowa', u'1929-03-04', u'1933-03-04']
[u'16', u'Abraham Lincoln', u'1809-02-12', u'Sinking spring', u'Kentucky', u'1861-03-04', u'1865-04-15']
[u'2', u'John Adams', u'1735-10-30', u'Braintree', u'Massachusetts', u'1797-03-04', u'1801-03-04']
[u'6', u'John Quincy Adams', u'1767-07-11', u'Braintree', u'Massachusetts', u'1825-03-04', u'1829-03-04']
[u'35', u'John F. Kennedy', u'1917-05-29', u'Brookline', u'Massachusetts', u'1961-01-20', u'1963-11-22']
[u'41', u'George H. W. Bush', u'1924-06-12', u'Milton', u'Massachusetts', u'1989-01-20', u'1993-01-20']
[u'33', u'Harry S. Truman', u'1884-05-08', u'Lamar', u'Missouri', u'1945-04-12', u'1953-01-20']
[u'38', u'Gerald R. Ford', u'1913-07-14', u'Omaha', u'Nebraska', u'1974-08-09', u'1977-01-20']
[u'14', u'Franklin Pierce', u'1804-11-23', u'Hillsborough', u'New Hampshire', u'1853-03-04', u'1857-03-04']
[u'22', u'Grover Cleveland', u'1837-03-18', u'Caldwell', u'New Jersey', u'1885-03-04', u'1889-03-04']
[u'24', u'Grover Cleveland', u'1837-03-18', u'Caldwell', u'New Jersey', u'1893-03-04', u'1897-03-04']
[u'8', u'Martin Van Buren', u'1782-12-05', u'Kinderhook', u'New York', u'1837-03-04', u'1841-03-04']
[u'13', u'Millard Fillmore', u'1800-01-07', u'Summerhill', u'New York', u'1850-07-09', u'1853-03-04']
[u'26', u'Theodore Roosevelt', u'1858-10-27', u'Manhattan', u'New York', u'1901-09-14', u'1909-03-04']
[u'32', u'Franklin D. Roosevelt', u'1882-01-30', u'Hyde Park', u'New York', u'1933-03-04', u'1945-04-12']
[u'45', u'Donald Trump', u'1946-06-14', u'Queens', u'New York', u'2017-01-20', u'']
[u'7', u'Andrew Jackson', u'1767-03-15', u'Waxhaws Region', u'North Carolina', u'1829-03-04', u'1837-03-04']
[u'11', u'James K. Polk', u'1795-11-02', u'Pineville', u'North Carolina', u'1845-03-04', u'1849-03-04']
[u'17', u'Andrew Johnson', u'1808-12-29', u'Raleigh', u'North Carolina', u'1865-04-15', u'1869-03-04']
[u'18', u'Ulysses S. Grant', u'1822-04-27', u'Point Pleasant', u'Ohio', u'1869-03-04', u'1877-03-04']
[u'19', u'Rutherford B. Hayes', u'1822-10-04', u'Delaware', u'Ohio', u'1877-03-04', u'1881-03-04']
[u'20', u'James A. Garfield', u'1831-11-19', u'Moreland Hills', u'Ohio', u'1881-03-04', u'1881-09-19']
[u'23', u'Benjamin Harrison', u'1833-08-20', u'North Bend', u'Ohio', u'1889-03-04', u'1893-03-04']
[u'25', u'William McKinley', u'1843-01-29', u'Niles', u'Ohio', u'1897-03-04', u'1901-09-14']
[u'27', u'William Howard Taft', u'1857-09-15', u'Cincinnati', u'Ohio', u'1909-03-04', u'1913-03-04']
[u'29', u'Warren G. Harding', u'1865-11-02', u'Blooming Grove', u'Ohio', u'1921-03-04', u'1923-08-02']
[u'15', u'James Buchanan', u'1791-04-23', u'Cove Gap', u'Pennsylvania', u'1857-03-04', u'1861-03-04']
[u'34', u'Dwight D. Eisenhower', u'1890-10-14', u'Denison', u'Texas', u'1953-01-20', u'1961-01-20']
[u'36', u'Lyndon B. Johnson', u'1908-08-27', u'Stonewall', u'Texas', u'1963-11-22', u'1969-01-20']
[u'21', u'Chester A. Arthur', u'1829-10-05', u'Fairfield', u'Vermont', u'1881-09-19', u'1885-03-04']
[u'30', u'Calvin Coolidge', u'1872-07-04', u'Plymouth', u'Vermont', u'1923-08-02', u'1929-03-04']
[u'1', u'George Washington', u'1732-02-22', u'Westmoreland County', u'Virginia', u'1789-04-30', u'1797-03-04']
[u'3', u'Thomas Jefferson', u'1743-04-13', u'Shadwell', u'Virginia', u'1801-03-04', u'1809-03-04']
[u'4', u'James Madison', u'1751-03-16', u'Port Conway', u'Virginia', u'1809-03-04', u'1817-03-04']
[u'5', u'James Monroe', u'1758-04-28', u'Monroe Hall', u'Virginia', u'1817-03-04', u'1825-03-04']
[u'9', u'William Henry Harrison', u'1773-02-09', u'Charles City County', u'Virginia', u'1841-03-04', u'1841-04-04']
[u'10', u'John Tyler', u'1790-03-29', u'Charles City County', u'Virginia', u'1841-04-04', u'1845-03-04']
[u'12', u'Zachary Taylor', u'1784-11-24', u'Barboursville', u'Virginia', u'1849-03-04', u'1850-07-09']
[u'28', u'Woodrow Wilson', u'1856-12-28', u'Staunton', u'Virginia', u'1913-03-04', u'1921-03-04']

Similarly you can sort the data on the basis of President name, pass the respective position index in lambda function.

>>> rdd_pres_sort = rdd_pres.sortBy(keyfunc = lambda x: x[1])
>>> for x in rdd_pres_sort.collect():
...     print (x)
...
[u'16', u'Abraham Lincoln', u'1809-02-12', u'Sinking spring', u'Kentucky', u'1861-03-04', u'1865-04-15']
[u'7', u'Andrew Jackson', u'1767-03-15', u'Waxhaws Region', u'North Carolina', u'1829-03-04', u'1837-03-04']
[u'17', u'Andrew Johnson', u'1808-12-29', u'Raleigh', u'North Carolina', u'1865-04-15', u'1869-03-04']
[u'44', u'Barack Obama', u'1961-08-04', u'Honolulu', u'Hawaii', u'2009-01-20', u'2017-01-20']
[u'23', u'Benjamin Harrison', u'1833-08-20', u'North Bend', u'Ohio', u'1889-03-04', u'1893-03-04']
[u'42', u'Bill Clinton', u'1946-08-19', u'Hope', u'Arkansas', u'1993-01-20', u'2001-01-20']
[u'30', u'Calvin Coolidge', u'1872-07-04', u'Plymouth', u'Vermont', u'1923-08-02', u'1929-03-04']
[u'21', u'Chester A. Arthur', u'1829-10-05', u'Fairfield', u'Vermont', u'1881-09-19', u'1885-03-04']
[u'45', u'Donald Trump', u'1946-06-14', u'Queens', u'New York', u'2017-01-20', u'']
[u'34', u'Dwight D. Eisenhower', u'1890-10-14', u'Denison', u'Texas', u'1953-01-20', u'1961-01-20']
[u'32', u'Franklin D. Roosevelt', u'1882-01-30', u'Hyde Park', u'New York', u'1933-03-04', u'1945-04-12']
[u'14', u'Franklin Pierce', u'1804-11-23', u'Hillsborough', u'New Hampshire', u'1853-03-04', u'1857-03-04']
[u'41', u'George H. W. Bush', u'1924-06-12', u'Milton', u'Massachusetts', u'1989-01-20', u'1993-01-20']
[u'43', u'George W. Bush', u'1946-07-06', u'New Haven', u'Connecticut', u'2001-01-20', u'2009-01-20']
[u'1', u'George Washington', u'1732-02-22', u'Westmoreland County', u'Virginia', u'1789-04-30', u'1797-03-04']
[u'38', u'Gerald R. Ford', u'1913-07-14', u'Omaha', u'Nebraska', u'1974-08-09', u'1977-01-20']
[u'22', u'Grover Cleveland', u'1837-03-18', u'Caldwell', u'New Jersey', u'1885-03-04', u'1889-03-04']
[u'24', u'Grover Cleveland', u'1837-03-18', u'Caldwell', u'New Jersey', u'1893-03-04', u'1897-03-04']
[u'33', u'Harry S. Truman', u'1884-05-08', u'Lamar', u'Missouri', u'1945-04-12', u'1953-01-20']
[u'31', u'Herbert Hoover', u'1874-08-10', u'West Branch', u'Iowa', u'1929-03-04', u'1933-03-04']
[u'20', u'James A. Garfield', u'1831-11-19', u'Moreland Hills', u'Ohio', u'1881-03-04', u'1881-09-19']
[u'15', u'James Buchanan', u'1791-04-23', u'Cove Gap', u'Pennsylvania', u'1857-03-04', u'1861-03-04']
[u'11', u'James K. Polk', u'1795-11-02', u'Pineville', u'North Carolina', u'1845-03-04', u'1849-03-04']
[u'4', u'James Madison', u'1751-03-16', u'Port Conway', u'Virginia', u'1809-03-04', u'1817-03-04']
[u'5', u'James Monroe', u'1758-04-28', u'Monroe Hall', u'Virginia', u'1817-03-04', u'1825-03-04']
[u'39', u'Jimmy Carter', u'1924-10-01', u'Plains', u'Georgia', u'1977-01-20', u'1981-01-20']
[u'2', u'John Adams', u'1735-10-30', u'Braintree', u'Massachusetts', u'1797-03-04', u'1801-03-04']
[u'35', u'John F. Kennedy', u'1917-05-29', u'Brookline', u'Massachusetts', u'1961-01-20', u'1963-11-22']
[u'6', u'John Quincy Adams', u'1767-07-11', u'Braintree', u'Massachusetts', u'1825-03-04', u'1829-03-04']
[u'10', u'John Tyler', u'1790-03-29', u'Charles City County', u'Virginia', u'1841-04-04', u'1845-03-04']
[u'36', u'Lyndon B. Johnson', u'1908-08-27', u'Stonewall', u'Texas', u'1963-11-22', u'1969-01-20']
[u'8', u'Martin Van Buren', u'1782-12-05', u'Kinderhook', u'New York', u'1837-03-04', u'1841-03-04']
[u'13', u'Millard Fillmore', u'1800-01-07', u'Summerhill', u'New York', u'1850-07-09', u'1853-03-04']
[u'37', u'Richard M. Nixon', u'1913-01-09', u'Yorba Linda', u'California', u'1969-01-20', u'1974-08-09']
[u'40', u'Ronald Reagan', u'1911-02-06', u'Tampico', u'Illinois', u'1981-01-20', u'1989-01-20']
[u'19', u'Rutherford B. Hayes', u'1822-10-04', u'Delaware', u'Ohio', u'1877-03-04', u'1881-03-04']
[u'26', u'Theodore Roosevelt', u'1858-10-27', u'Manhattan', u'New York', u'1901-09-14', u'1909-03-04']
[u'3', u'Thomas Jefferson', u'1743-04-13', u'Shadwell', u'Virginia', u'1801-03-04', u'1809-03-04']
[u'18', u'Ulysses S. Grant', u'1822-04-27', u'Point Pleasant', u'Ohio', u'1869-03-04', u'1877-03-04']
[u'29', u'Warren G. Harding', u'1865-11-02', u'Blooming Grove', u'Ohio', u'1921-03-04', u'1923-08-02']
[u'9', u'William Henry Harrison', u'1773-02-09', u'Charles City County', u'Virginia', u'1841-03-04', u'1841-04-04']
[u'27', u'William Howard Taft', u'1857-09-15', u'Cincinnati', u'Ohio', u'1909-03-04', u'1913-03-04']
[u'25', u'William McKinley', u'1843-01-29', u'Niles', u'Ohio', u'1897-03-04', u'1901-09-14']
[u'28', u'Woodrow Wilson', u'1856-12-28', u'Staunton', u'Virginia', u'1913-03-04', u'1921-03-04']
[u'12', u'Zachary Taylor', u'1784-11-24', u'Barboursville', u'Virginia', u'1849-03-04', u'1850-07-09']

Sort in descending order, set ascending flag to FALSE in sortyBy function.

>>> rdd_pres_sort = rdd_pres.sortBy(keyfunc = lambda x: x[1],ascending=False)
>>> for x in rdd_pres_sort.collect():
...     print (x)
...
[u'12', u'Zachary Taylor', u'1784-11-24', u'Barboursville', u'Virginia', u'1849-03-04', u'1850-07-09']
[u'28', u'Woodrow Wilson', u'1856-12-28', u'Staunton', u'Virginia', u'1913-03-04', u'1921-03-04']
[u'25', u'William McKinley', u'1843-01-29', u'Niles', u'Ohio', u'1897-03-04', u'1901-09-14']
[u'27', u'William Howard Taft', u'1857-09-15', u'Cincinnati', u'Ohio', u'1909-03-04', u'1913-03-04']
[u'9', u'William Henry Harrison', u'1773-02-09', u'Charles City County', u'Virginia', u'1841-03-04', u'1841-04-04']
[u'29', u'Warren G. Harding', u'1865-11-02', u'Blooming Grove', u'Ohio', u'1921-03-04', u'1923-08-02']
[u'18', u'Ulysses S. Grant', u'1822-04-27', u'Point Pleasant', u'Ohio', u'1869-03-04', u'1877-03-04']
[u'3', u'Thomas Jefferson', u'1743-04-13', u'Shadwell', u'Virginia', u'1801-03-04', u'1809-03-04']
[u'26', u'Theodore Roosevelt', u'1858-10-27', u'Manhattan', u'New York', u'1901-09-14', u'1909-03-04']
[u'19', u'Rutherford B. Hayes', u'1822-10-04', u'Delaware', u'Ohio', u'1877-03-04', u'1881-03-04']
[u'40', u'Ronald Reagan', u'1911-02-06', u'Tampico', u'Illinois', u'1981-01-20', u'1989-01-20']
[u'37', u'Richard M. Nixon', u'1913-01-09', u'Yorba Linda', u'California', u'1969-01-20', u'1974-08-09']
[u'13', u'Millard Fillmore', u'1800-01-07', u'Summerhill', u'New York', u'1850-07-09', u'1853-03-04']
[u'8', u'Martin Van Buren', u'1782-12-05', u'Kinderhook', u'New York', u'1837-03-04', u'1841-03-04']
[u'36', u'Lyndon B. Johnson', u'1908-08-27', u'Stonewall', u'Texas', u'1963-11-22', u'1969-01-20']
[u'10', u'John Tyler', u'1790-03-29', u'Charles City County', u'Virginia', u'1841-04-04', u'1845-03-04']
[u'6', u'John Quincy Adams', u'1767-07-11', u'Braintree', u'Massachusetts', u'1825-03-04', u'1829-03-04']
[u'35', u'John F. Kennedy', u'1917-05-29', u'Brookline', u'Massachusetts', u'1961-01-20', u'1963-11-22']
[u'2', u'John Adams', u'1735-10-30', u'Braintree', u'Massachusetts', u'1797-03-04', u'1801-03-04']
[u'39', u'Jimmy Carter', u'1924-10-01', u'Plains', u'Georgia', u'1977-01-20', u'1981-01-20']
[u'5', u'James Monroe', u'1758-04-28', u'Monroe Hall', u'Virginia', u'1817-03-04', u'1825-03-04']
[u'4', u'James Madison', u'1751-03-16', u'Port Conway', u'Virginia', u'1809-03-04', u'1817-03-04']
[u'11', u'James K. Polk', u'1795-11-02', u'Pineville', u'North Carolina', u'1845-03-04', u'1849-03-04']
[u'15', u'James Buchanan', u'1791-04-23', u'Cove Gap', u'Pennsylvania', u'1857-03-04', u'1861-03-04']
[u'20', u'James A. Garfield', u'1831-11-19', u'Moreland Hills', u'Ohio', u'1881-03-04', u'1881-09-19']
[u'31', u'Herbert Hoover', u'1874-08-10', u'West Branch', u'Iowa', u'1929-03-04', u'1933-03-04']
[u'33', u'Harry S. Truman', u'1884-05-08', u'Lamar', u'Missouri', u'1945-04-12', u'1953-01-20']
[u'22', u'Grover Cleveland', u'1837-03-18', u'Caldwell', u'New Jersey', u'1885-03-04', u'1889-03-04']
[u'24', u'Grover Cleveland', u'1837-03-18', u'Caldwell', u'New Jersey', u'1893-03-04', u'1897-03-04']
[u'38', u'Gerald R. Ford', u'1913-07-14', u'Omaha', u'Nebraska', u'1974-08-09', u'1977-01-20']
[u'1', u'George Washington', u'1732-02-22', u'Westmoreland County', u'Virginia', u'1789-04-30', u'1797-03-04']
[u'43', u'George W. Bush', u'1946-07-06', u'New Haven', u'Connecticut', u'2001-01-20', u'2009-01-20']
[u'41', u'George H. W. Bush', u'1924-06-12', u'Milton', u'Massachusetts', u'1989-01-20', u'1993-01-20']
[u'14', u'Franklin Pierce', u'1804-11-23', u'Hillsborough', u'New Hampshire', u'1853-03-04', u'1857-03-04']
[u'32', u'Franklin D. Roosevelt', u'1882-01-30', u'Hyde Park', u'New York', u'1933-03-04', u'1945-04-12']
[u'34', u'Dwight D. Eisenhower', u'1890-10-14', u'Denison', u'Texas', u'1953-01-20', u'1961-01-20']
[u'45', u'Donald Trump', u'1946-06-14', u'Queens', u'New York', u'2017-01-20', u'']
[u'21', u'Chester A. Arthur', u'1829-10-05', u'Fairfield', u'Vermont', u'1881-09-19', u'1885-03-04']
[u'30', u'Calvin Coolidge', u'1872-07-04', u'Plymouth', u'Vermont', u'1923-08-02', u'1929-03-04']
[u'42', u'Bill Clinton', u'1946-08-19', u'Hope', u'Arkansas', u'1993-01-20', u'2001-01-20']
[u'23', u'Benjamin Harrison', u'1833-08-20', u'North Bend', u'Ohio', u'1889-03-04', u'1893-03-04']
[u'44', u'Barack Obama', u'1961-08-04', u'Honolulu', u'Hawaii', u'2009-01-20', u'2017-01-20']
[u'17', u'Andrew Johnson', u'1808-12-29', u'Raleigh', u'North Carolina', u'1865-04-15', u'1869-03-04']
[u'7', u'Andrew Jackson', u'1767-03-15', u'Waxhaws Region', u'North Carolina', u'1829-03-04', u'1837-03-04']
[u'16', u'Abraham Lincoln', u'1809-02-12', u'Sinking spring', u'Kentucky', u'1861-03-04', u'1865-04-15']

Aggregate the values to identify number of president from each state

>>> rdd_pres_agg = rdd_pres.map(lambda x: [x[4],1])
>>> for i in rdd_pres_agg.collect():
...     print (i)
...
[u'Virginia', 1]
[u'Massachusetts', 1]
[u'Virginia', 1]
[u'Virginia', 1]
[u'Virginia', 1]
[u'Massachusetts', 1]
[u'North Carolina', 1]
[u'New York', 1]
[u'Virginia', 1]
[u'Virginia', 1]
[u'North Carolina', 1]
[u'Virginia', 1]
[u'New York', 1]
[u'New Hampshire', 1]
[u'Pennsylvania', 1]
[u'Kentucky', 1]
[u'North Carolina', 1]
[u'Ohio', 1]
[u'Ohio', 1]
[u'Ohio', 1]
[u'Vermont', 1]
[u'New Jersey', 1]
[u'Ohio', 1]
[u'New Jersey', 1]
[u'Ohio', 1]
[u'New York', 1]
[u'Ohio', 1]
[u'Virginia', 1]
[u'Ohio', 1]
[u'Vermont', 1]
[u'Iowa', 1]
[u'New York', 1]
[u'Missouri', 1]
[u'Texas', 1]
[u'Massachusetts', 1]
[u'Texas', 1]
[u'California', 1]
[u'Nebraska', 1]
[u'Georgia', 1]
[u'Illinois', 1]
[u'Massachusetts', 1]
[u'Arkansas', 1]
[u'Connecticut', 1]
[u'Hawaii', 1]
[u'New York', 1]
>>> from operator import add
>>> rdd_pres_agg = rdd_pres_agg.reduceByKey(add)
>>> for i in rdd_pres_agg.collect():
...     print (i)
...
(u'Vermont', 2)
(u'Iowa', 1)
(u'Pennsylvania', 1)
(u'North Carolina', 3)
(u'New Jersey', 2)
(u'Connecticut', 1)
(u'New Hampshire', 1)
(u'Kentucky', 1)
(u'Arkansas', 1)
(u'California', 1)
(u'Texas', 2)
(u'Georgia', 1)
(u'Hawaii', 1)
(u'Nebraska', 1)
(u'Virginia', 8)
(u'New York', 5)
(u'Massachusetts', 4)
(u'Missouri', 1)
(u'Ohio', 7)
(u'Illinois', 1)

Sort the aggregation output in ascending order

>>> for i in rdd_pres_agg.sortBy(keyfunc = lambda x:x[1]).collect():
...     print (i)
...
(u'Iowa', 1)
(u'Pennsylvania', 1)
(u'Connecticut', 1)
(u'New Hampshire', 1)
(u'Kentucky', 1)
(u'Arkansas', 1)
(u'California', 1)
(u'Georgia', 1)
(u'Hawaii', 1)
(u'Nebraska', 1)
(u'Missouri', 1)
(u'Illinois', 1)
(u'Vermont', 2)
(u'New Jersey', 2)
(u'Texas', 2)
(u'North Carolina', 3)
(u'Massachusetts', 4)
(u'New York', 5)
(u'Ohio', 7)
(u'Virginia', 8)

Sort the aggregation in descending order

>>> for i in rdd_pres_agg.sortBy(keyfunc = lambda x:x[1],ascending=False).collect():
...     print (i)
...
(u'Virginia', 8)
(u'Ohio', 7)
(u'New York', 5)
(u'Massachusetts', 4)
(u'North Carolina', 3)
(u'Vermont', 2)
(u'New Jersey', 2)
(u'Texas', 2)
(u'Iowa', 1)
(u'Pennsylvania', 1)
(u'Connecticut', 1)
(u'New Hampshire', 1)
(u'Kentucky', 1)
(u'Arkansas', 1)
(u'California', 1)
(u'Georgia', 1)
(u'Hawaii', 1)
(u'Nebraska', 1)
(u'Missouri', 1)
(u'Illinois', 1)

Now, let’s look into how to perform JOINs using RDD in PySpark. Before that we will introduce one more concept here of Paired RDDs. Paired RDDs are RDD with key-value information. So we will create one “column” as Key and others as values. We will be joining RDDS on the basis of keys and will see the result.

>>> rdd_state_pair = rdd_states.map(lambda x:[x[1],[x[2],x[3]]])
>>> rdd_state_pair.keys().collect()
[u'Alabama', u'Alaska', u'Arizona', u'Arkansas', u'California', u'Colorado', u'Connecticut', u'Delaware', u'Florida', u'Georgia', u'Hawaii', u'Idaho', u'Illinois', u'Indiana', u'Iowa', u'Kansas', u'Kentucky', u'Louisiana', u'Maine', u'Maryland', u'Massachusetts', u'Michigan', u'Minnesota', u'Mississippi', u'Missouri', u'Montana', u'Nebraska', u'Nevada', u'New Hampshire', u'New Jersey', u'New Mexico', u'New York', u'North Carolina', u' North Dakota', u'Ohio', u'Oklahoma', u'Oregon', u'Pennsylvania', u'Rhode Island', u'South Carolina', u'South Dakota', u'Tennessee', u'Texas', u'Utah', u'Vermont', u'Virginia', u'Washington', u'West Virginia', u'Wisconsin', u'Wyoming']
>>> rdd_state_pair.values().collect()
[[u'AL', u'Montgomery'], [u'AK', u'Juneau'], [u'AZ', u'Phoenix'], [u'AR', u'Little Rock'], [u'CA', u'Sacramento'], [u'CO', u'Denver'], [u'CT', u'Hartford'], [u'DE', u'Dover'], [u'FL', u'Tallahassee'], [u'GA', u'Atlanta'], [u'HI', u'Honolulu'], [u'ID', u'Boise'], [u'IL', u'Springfield'], [u'IN', u'Indianapolis'], [u'IA', u'Des Moines'], [u'KS', u'Topeka'], [u'KY', u'Frankfort'], [u'LA', u'Baton Rouge'], [u'ME', u'Augusta'], [u'MD', u'Annapolis'], [u'MA', u'Boston'], [u'MI', u'Lansing'], [u'MN', u'St. Paul'], [u'MS', u'Jackson'], [u'MO', u'Jefferson City'], [u'MT', u'Helena'], [u'NE', u'Lincoln'], [u'NV', u'Carson City'], [u'NH', u'Concord'], [u'NJ', u'Trenton'], [u'NM', u'Santa Fe'], [u'NY', u'Albany'], [u'NC', u'Raleigh'], [u'ND', u'Bismarck'], [u'OH', u'Columbus'], [u'OK', u'Oklahoma City'], [u'OR', u'Salem'], [u'PA', u'Harrisburg'], [u'RI', u'Providence'], [u'SC', u'Columbia'], [u'SD', u'Pierre'], [u'TN', u'Nashville'], [u'TX', u'Austin'], [u'UT', u'Salt Lake City'], [u'VT', u'Montpelier'], [u'VA', u'Richmond'], [u'WA', u'Olympia'], [u'WV', u'Charleston'], [u'WI', u'Madison'], [u'WY', u'Cheyenne']]

>>> rdd_pres_pair = rdd_pres.map(lambda x:[x[4],[x[1],x[3]]])
>>> rdd_pres_pair.keys().collect()
[u'Virginia', u'Massachusetts', u'Virginia', u'Virginia', u'Virginia', u'Massachusetts', u'North Carolina', u'New York', u'Virginia', u'Virginia', u'North Carolina', u'Virginia', u'New York', u'New Hampshire', u'Pennsylvania', u'Kentucky', u'North Carolina', u'Ohio', u'Ohio', u'Ohio', u'Vermont', u'New Jersey', u'Ohio', u'New Jersey', u'Ohio', u'New York', u'Ohio', u'Virginia', u'Ohio', u'Vermont', u'Iowa', u'New York', u'Missouri', u'Texas', u'Massachusetts', u'Texas', u'California', u'Nebraska', u'Georgia', u'Illinois', u'Massachusetts', u'Arkansas', u'Connecticut', u'Hawaii', u'New York']
>>> rdd_pres_pair.values().collect()
[[u'George Washington', u'Westmoreland County'], [u'John Adams', u'Braintree'], [u'Thomas Jefferson', u'Shadwell'], [u'James Madison', u'Port Conway'], [u'James Monroe', u'Monroe Hall'], [u'John Quincy Adams', u'Braintree'], [u'Andrew Jackson', u'Waxhaws Region'], [u'Martin Van Buren', u'Kinderhook'], [u'William Henry Harrison', u'Charles City County'], [u'John Tyler', u'Charles City County'], [u'James K. Polk', u'Pineville'], [u'Zachary Taylor', u'Barboursville'], [u'Millard Fillmore', u'Summerhill'], [u'Franklin Pierce', u'Hillsborough'], [u'James Buchanan', u'Cove Gap'], [u'Abraham Lincoln', u'Sinking spring'], [u'Andrew Johnson', u'Raleigh'], [u'Ulysses S. Grant', u'Point Pleasant'], [u'Rutherford B. Hayes', u'Delaware'], [u'James A. Garfield', u'Moreland Hills'], [u'Chester A. Arthur', u'Fairfield'], [u'Grover Cleveland', u'Caldwell'], [u'Benjamin Harrison', u'North Bend'], [u'Grover Cleveland', u'Caldwell'], [u'William McKinley', u'Niles'], [u'Theodore Roosevelt', u'Manhattan'], [u'William Howard Taft', u'Cincinnati'], [u'Woodrow Wilson', u'Staunton'], [u'Warren G. Harding', u'Blooming Grove'], [u'Calvin Coolidge', u'Plymouth'], [u'Herbert Hoover', u'West Branch'], [u'Franklin D. Roosevelt', u'Hyde Park'], [u'Harry S. Truman', u'Lamar'], [u'Dwight D. Eisenhower', u'Denison'], [u'John F. Kennedy', u'Brookline'], [u'Lyndon B. Johnson', u'Stonewall'], [u'Richard M. Nixon', u'Yorba Linda'], [u'Gerald R. Ford', u'Omaha'], [u'Jimmy Carter', u'Plains'], [u'Ronald Reagan', u'Tampico'], [u'George H. W. Bush', u'Milton'], [u'Bill Clinton', u'Hope'], [u'George W. Bush', u'New Haven'], [u'Barack Obama', u'Honolulu'], [u'Donald Trump', u'Queens']]

JOINs – Inner/Left/Right/Full

INNER JOIN: returns the common rows in both the RDDs. Since we have only 20 states from which president has come. Output will have rows with those 20 states only i.e. 45 rows in output.

>>> rdd_inner = rdd_state_pair.join(rdd_pres_pair)
>>> rdd_inner.count()
45
>>> for i in rdd_inner.collect():
...     print (i)
...
(u'California', ([u'CA', u'Sacramento'], [u'Richard M. Nixon', u'Yorba Linda']))
(u'Pennsylvania', ([u'PA', u'Harrisburg'], [u'James Buchanan', u'Cove Gap']))
(u'New Hampshire', ([u'NH', u'Concord'], [u'Franklin Pierce', u'Hillsborough']))
(u'New Jersey', ([u'NJ', u'Trenton'], [u'Grover Cleveland', u'Caldwell']))
(u'New Jersey', ([u'NJ', u'Trenton'], [u'Grover Cleveland', u'Caldwell']))
(u'Arkansas', ([u'AR', u'Little Rock'], [u'Bill Clinton', u'Hope']))
(u'Iowa', ([u'IA', u'Des Moines'], [u'Herbert Hoover', u'West Branch']))
(u'Virginia', ([u'VA', u'Richmond'], [u'George Washington', u'Westmoreland County']))
(u'Virginia', ([u'VA', u'Richmond'], [u'Thomas Jefferson', u'Shadwell']))
(u'Virginia', ([u'VA', u'Richmond'], [u'James Madison', u'Port Conway']))
(u'Virginia', ([u'VA', u'Richmond'], [u'James Monroe', u'Monroe Hall']))
(u'Virginia', ([u'VA', u'Richmond'], [u'William Henry Harrison', u'Charles City County']))
(u'Virginia', ([u'VA', u'Richmond'], [u'John Tyler', u'Charles City County']))
(u'Virginia', ([u'VA', u'Richmond'], [u'Zachary Taylor', u'Barboursville']))
(u'Virginia', ([u'VA', u'Richmond'], [u'Woodrow Wilson', u'Staunton']))
(u'Hawaii', ([u'HI', u'Honolulu'], [u'Barack Obama', u'Honolulu']))
(u'Massachusetts', ([u'MA', u'Boston'], [u'John Adams', u'Braintree']))
(u'Massachusetts', ([u'MA', u'Boston'], [u'John Quincy Adams', u'Braintree']))
(u'Massachusetts', ([u'MA', u'Boston'], [u'John F. Kennedy', u'Brookline']))
(u'Massachusetts', ([u'MA', u'Boston'], [u'George H. W. Bush', u'Milton']))
(u'Ohio', ([u'OH', u'Columbus'], [u'Ulysses S. Grant', u'Point Pleasant']))
(u'Ohio', ([u'OH', u'Columbus'], [u'Rutherford B. Hayes', u'Delaware']))
(u'Ohio', ([u'OH', u'Columbus'], [u'James A. Garfield', u'Moreland Hills']))
(u'Ohio', ([u'OH', u'Columbus'], [u'Benjamin Harrison', u'North Bend']))
(u'Ohio', ([u'OH', u'Columbus'], [u'William McKinley', u'Niles']))
(u'Ohio', ([u'OH', u'Columbus'], [u'William Howard Taft', u'Cincinnati']))
(u'Ohio', ([u'OH', u'Columbus'], [u'Warren G. Harding', u'Blooming Grove']))
(u'Vermont', ([u'VT', u'Montpelier'], [u'Chester A. Arthur', u'Fairfield']))
(u'Vermont', ([u'VT', u'Montpelier'], [u'Calvin Coolidge', u'Plymouth']))
(u'Kentucky', ([u'KY', u'Frankfort'], [u'Abraham Lincoln', u'Sinking spring']))
(u'North Carolina', ([u'NC', u'Raleigh'], [u'Andrew Jackson', u'Waxhaws Region']))
(u'North Carolina', ([u'NC', u'Raleigh'], [u'James K. Polk', u'Pineville']))
(u'North Carolina', ([u'NC', u'Raleigh'], [u'Andrew Johnson', u'Raleigh']))
(u'Connecticut', ([u'CT', u'Hartford'], [u'George W. Bush', u'New Haven']))
(u'Texas', ([u'TX', u'Austin'], [u'Dwight D. Eisenhower', u'Denison']))
(u'Texas', ([u'TX', u'Austin'], [u'Lyndon B. Johnson', u'Stonewall']))
(u'Georgia', ([u'GA', u'Atlanta'], [u'Jimmy Carter', u'Plains']))
(u'New York', ([u'NY', u'Albany'], [u'Martin Van Buren', u'Kinderhook']))
(u'New York', ([u'NY', u'Albany'], [u'Millard Fillmore', u'Summerhill']))
(u'New York', ([u'NY', u'Albany'], [u'Theodore Roosevelt', u'Manhattan']))
(u'New York', ([u'NY', u'Albany'], [u'Franklin D. Roosevelt', u'Hyde Park']))
(u'New York', ([u'NY', u'Albany'], [u'Donald Trump', u'Queens']))
(u'Nebraska', ([u'NE', u'Lincoln'], [u'Gerald R. Ford', u'Omaha']))
(u'Illinois', ([u'IL', u'Springfield'], [u'Ronald Reagan', u'Tampico']))
(u'Missouri', ([u'MO', u'Jefferson City'], [u'Harry S. Truman', u'Lamar']))

LEFT OUTER JOIN: return all the records from left and matching from right side RDD. All the 50 records will come from left-side RDD. Also some states have one-to-many mapping possible as few president have come from same state, we may have multiple occurences of such states in output. Thereby increasing the expected number of output rows. Also when there is no match from right-side RDD, “None” will be returned.This is equivalent to NULL.

>>> rdd_left = rdd_state_pair.leftOuterJoin(rdd_pres_pair)
>>> rdd_left.count()
75
>>> for i in rdd_left.collect():
...     print(i)
...
(u' North Dakota', ([u'ND', u'Bismarck'], None))
(u'Wisconsin', ([u'WI', u'Madison'], None))
(u'California', ([u'CA', u'Sacramento'], [u'Richard M. Nixon', u'Yorba Linda']))
(u'Pennsylvania', ([u'PA', u'Harrisburg'], [u'James Buchanan', u'Cove Gap']))
(u'Utah', ([u'UT', u'Salt Lake City'], None))
(u'New Hampshire', ([u'NH', u'Concord'], [u'Franklin Pierce', u'Hillsborough']))
(u'Florida', ([u'FL', u'Tallahassee'], None))
(u'New Jersey', ([u'NJ', u'Trenton'], [u'Grover Cleveland', u'Caldwell']))
(u'New Jersey', ([u'NJ', u'Trenton'], [u'Grover Cleveland', u'Caldwell']))
(u'Arkansas', ([u'AR', u'Little Rock'], [u'Bill Clinton', u'Hope']))
(u'South Carolina', ([u'SC', u'Columbia'], None))
(u'Iowa', ([u'IA', u'Des Moines'], [u'Herbert Hoover', u'West Branch']))
(u'Louisiana', ([u'LA', u'Baton Rouge'], None))
(u'Washington', ([u'WA', u'Olympia'], None))
(u'Colorado', ([u'CO', u'Denver'], None))
(u'Virginia', ([u'VA', u'Richmond'], [u'George Washington', u'Westmoreland County']))
(u'Virginia', ([u'VA', u'Richmond'], [u'Thomas Jefferson', u'Shadwell']))
(u'Virginia', ([u'VA', u'Richmond'], [u'James Madison', u'Port Conway']))
(u'Virginia', ([u'VA', u'Richmond'], [u'James Monroe', u'Monroe Hall']))
(u'Virginia', ([u'VA', u'Richmond'], [u'William Henry Harrison', u'Charles City County']))
(u'Virginia', ([u'VA', u'Richmond'], [u'John Tyler', u'Charles City County']))
(u'Virginia', ([u'VA', u'Richmond'], [u'Zachary Taylor', u'Barboursville']))
(u'Virginia', ([u'VA', u'Richmond'], [u'Woodrow Wilson', u'Staunton']))
(u'Hawaii', ([u'HI', u'Honolulu'], [u'Barack Obama', u'Honolulu']))
(u'Alaska', ([u'AK', u'Juneau'], None))
(u'Minnesota', ([u'MN', u'St. Paul'], None))
(u'Massachusetts', ([u'MA', u'Boston'], [u'John Adams', u'Braintree']))
(u'Massachusetts', ([u'MA', u'Boston'], [u'John Quincy Adams', u'Braintree']))
(u'Massachusetts', ([u'MA', u'Boston'], [u'John F. Kennedy', u'Brookline']))
(u'Massachusetts', ([u'MA', u'Boston'], [u'George H. W. Bush', u'Milton']))
(u'Ohio', ([u'OH', u'Columbus'], [u'Ulysses S. Grant', u'Point Pleasant']))
(u'Ohio', ([u'OH', u'Columbus'], [u'Rutherford B. Hayes', u'Delaware']))
(u'Ohio', ([u'OH', u'Columbus'], [u'James A. Garfield', u'Moreland Hills']))
(u'Ohio', ([u'OH', u'Columbus'], [u'Benjamin Harrison', u'North Bend']))
(u'Ohio', ([u'OH', u'Columbus'], [u'William McKinley', u'Niles']))
(u'Ohio', ([u'OH', u'Columbus'], [u'William Howard Taft', u'Cincinnati']))
(u'Ohio', ([u'OH', u'Columbus'], [u'Warren G. Harding', u'Blooming Grove']))
(u'Indiana', ([u'IN', u'Indianapolis'], None))
(u'Maine', ([u'ME', u'Augusta'], None))
(u'New Mexico', ([u'NM', u'Santa Fe'], None))
(u'Mississippi', ([u'MS', u'Jackson'], None))
(u'Oregon', ([u'OR', u'Salem'], None))
(u'Michigan', ([u'MI', u'Lansing'], None))
(u'Vermont', ([u'VT', u'Montpelier'], [u'Chester A. Arthur', u'Fairfield']))
(u'Vermont', ([u'VT', u'Montpelier'], [u'Calvin Coolidge', u'Plymouth']))
(u'Kentucky', ([u'KY', u'Frankfort'], [u'Abraham Lincoln', u'Sinking spring']))
(u'Oklahoma', ([u'OK', u'Oklahoma City'], None))
(u'North Carolina', ([u'NC', u'Raleigh'], [u'Andrew Jackson', u'Waxhaws Region']))
(u'North Carolina', ([u'NC', u'Raleigh'], [u'James K. Polk', u'Pineville']))
(u'North Carolina', ([u'NC', u'Raleigh'], [u'Andrew Johnson', u'Raleigh']))
(u'Connecticut', ([u'CT', u'Hartford'], [u'George W. Bush', u'New Haven']))
(u'Texas', ([u'TX', u'Austin'], [u'Dwight D. Eisenhower', u'Denison']))
(u'Texas', ([u'TX', u'Austin'], [u'Lyndon B. Johnson', u'Stonewall']))
(u'Maryland', ([u'MD', u'Annapolis'], None))
(u'Alabama', ([u'AL', u'Montgomery'], None))
(u'Idaho', ([u'ID', u'Boise'], None))
(u'Wyoming', ([u'WY', u'Cheyenne'], None))
(u'Tennessee', ([u'TN', u'Nashville'], None))
(u'Georgia', ([u'GA', u'Atlanta'], [u'Jimmy Carter', u'Plains']))
(u'New York', ([u'NY', u'Albany'], [u'Martin Van Buren', u'Kinderhook']))
(u'New York', ([u'NY', u'Albany'], [u'Millard Fillmore', u'Summerhill']))
(u'New York', ([u'NY', u'Albany'], [u'Theodore Roosevelt', u'Manhattan']))
(u'New York', ([u'NY', u'Albany'], [u'Franklin D. Roosevelt', u'Hyde Park']))
(u'New York', ([u'NY', u'Albany'], [u'Donald Trump', u'Queens']))
(u'West Virginia', ([u'WV', u'Charleston'], None))
(u'Kansas', ([u'KS', u'Topeka'], None))
(u'Nebraska', ([u'NE', u'Lincoln'], [u'Gerald R. Ford', u'Omaha']))
(u'Illinois', ([u'IL', u'Springfield'], [u'Ronald Reagan', u'Tampico']))
(u'Arizona', ([u'AZ', u'Phoenix'], None))
(u'Montana', ([u'MT', u'Helena'], None))
(u'Nevada', ([u'NV', u'Carson City'], None))
(u'Missouri', ([u'MO', u'Jefferson City'], [u'Harry S. Truman', u'Lamar']))
(u'Delaware', ([u'DE', u'Dover'], None))
(u'South Dakota', ([u'SD', u'Pierre'], None))
(u'Rhode Island', ([u'RI', u'Providence'], None))

RIGHT OUTER JOIN: returns all the records from the right side RDD and matching records from left-side RDD. If not match then None is returned. Since in this case, all the president have come from some state we will not see any “None” values. i.e. it’a a 100% match all the time.

>>> rdd_right = rdd_state_pair.rightOuterJoin(rdd_pres_pair)
>>> rdd_right.count()
45
>>> for i in rdd_right.collect():
...     print (i)
...
(u'California', ([u'CA', u'Sacramento'], [u'Richard M. Nixon', u'Yorba Linda']))
(u'Pennsylvania', ([u'PA', u'Harrisburg'], [u'James Buchanan', u'Cove Gap']))
(u'New Hampshire', ([u'NH', u'Concord'], [u'Franklin Pierce', u'Hillsborough']))
(u'New Jersey', ([u'NJ', u'Trenton'], [u'Grover Cleveland', u'Caldwell']))
(u'New Jersey', ([u'NJ', u'Trenton'], [u'Grover Cleveland', u'Caldwell']))
(u'Arkansas', ([u'AR', u'Little Rock'], [u'Bill Clinton', u'Hope']))
(u'Iowa', ([u'IA', u'Des Moines'], [u'Herbert Hoover', u'West Branch']))
(u'Virginia', ([u'VA', u'Richmond'], [u'George Washington', u'Westmoreland County']))
(u'Virginia', ([u'VA', u'Richmond'], [u'Thomas Jefferson', u'Shadwell']))
(u'Virginia', ([u'VA', u'Richmond'], [u'James Madison', u'Port Conway']))
(u'Virginia', ([u'VA', u'Richmond'], [u'James Monroe', u'Monroe Hall']))
(u'Virginia', ([u'VA', u'Richmond'], [u'William Henry Harrison', u'Charles City County']))
(u'Virginia', ([u'VA', u'Richmond'], [u'John Tyler', u'Charles City County']))
(u'Virginia', ([u'VA', u'Richmond'], [u'Zachary Taylor', u'Barboursville']))
(u'Virginia', ([u'VA', u'Richmond'], [u'Woodrow Wilson', u'Staunton']))
(u'Hawaii', ([u'HI', u'Honolulu'], [u'Barack Obama', u'Honolulu']))
(u'Massachusetts', ([u'MA', u'Boston'], [u'John Adams', u'Braintree']))
(u'Massachusetts', ([u'MA', u'Boston'], [u'John Quincy Adams', u'Braintree']))
(u'Massachusetts', ([u'MA', u'Boston'], [u'John F. Kennedy', u'Brookline']))
(u'Massachusetts', ([u'MA', u'Boston'], [u'George H. W. Bush', u'Milton']))
(u'Ohio', ([u'OH', u'Columbus'], [u'Ulysses S. Grant', u'Point Pleasant']))
(u'Ohio', ([u'OH', u'Columbus'], [u'Rutherford B. Hayes', u'Delaware']))
(u'Ohio', ([u'OH', u'Columbus'], [u'James A. Garfield', u'Moreland Hills']))
(u'Ohio', ([u'OH', u'Columbus'], [u'Benjamin Harrison', u'North Bend']))
(u'Ohio', ([u'OH', u'Columbus'], [u'William McKinley', u'Niles']))
(u'Ohio', ([u'OH', u'Columbus'], [u'William Howard Taft', u'Cincinnati']))
(u'Ohio', ([u'OH', u'Columbus'], [u'Warren G. Harding', u'Blooming Grove']))
(u'Vermont', ([u'VT', u'Montpelier'], [u'Chester A. Arthur', u'Fairfield']))
(u'Vermont', ([u'VT', u'Montpelier'], [u'Calvin Coolidge', u'Plymouth']))
(u'Kentucky', ([u'KY', u'Frankfort'], [u'Abraham Lincoln', u'Sinking spring']))
(u'North Carolina', ([u'NC', u'Raleigh'], [u'Andrew Jackson', u'Waxhaws Region']))
(u'North Carolina', ([u'NC', u'Raleigh'], [u'James K. Polk', u'Pineville']))
(u'North Carolina', ([u'NC', u'Raleigh'], [u'Andrew Johnson', u'Raleigh']))
(u'Connecticut', ([u'CT', u'Hartford'], [u'George W. Bush', u'New Haven']))
(u'Texas', ([u'TX', u'Austin'], [u'Dwight D. Eisenhower', u'Denison']))
(u'Texas', ([u'TX', u'Austin'], [u'Lyndon B. Johnson', u'Stonewall']))
(u'Georgia', ([u'GA', u'Atlanta'], [u'Jimmy Carter', u'Plains']))
(u'New York', ([u'NY', u'Albany'], [u'Martin Van Buren', u'Kinderhook']))
(u'New York', ([u'NY', u'Albany'], [u'Millard Fillmore', u'Summerhill']))
(u'New York', ([u'NY', u'Albany'], [u'Theodore Roosevelt', u'Manhattan']))
(u'New York', ([u'NY', u'Albany'], [u'Franklin D. Roosevelt', u'Hyde Park']))
(u'New York', ([u'NY', u'Albany'], [u'Donald Trump', u'Queens']))
(u'Nebraska', ([u'NE', u'Lincoln'], [u'Gerald R. Ford', u'Omaha']))
(u'Illinois', ([u'IL', u'Springfield'], [u'Ronald Reagan', u'Tampico']))
(u'Missouri', ([u'MO', u'Jefferson City'], [u'Harry S. Truman', u'Lamar']))

Full Outer Join: returns all the matching records from both the RDDs and remaining records from left & right RDD. When no match exists, None is returned.

>>> rdd_full = rdd_state_pair.fullOuterJoin(rdd_pres_pair)
>>> rdd_full.count()
75
>>> for i in rdd_full.collect():
...     print (i)
...
(u' North Dakota', ([u'ND', u'Bismarck'], None))
(u'Wisconsin', ([u'WI', u'Madison'], None))
(u'California', ([u'CA', u'Sacramento'], [u'Richard M. Nixon', u'Yorba Linda']))
(u'Pennsylvania', ([u'PA', u'Harrisburg'], [u'James Buchanan', u'Cove Gap']))
(u'Utah', ([u'UT', u'Salt Lake City'], None))
(u'New Hampshire', ([u'NH', u'Concord'], [u'Franklin Pierce', u'Hillsborough']))
(u'Florida', ([u'FL', u'Tallahassee'], None))
(u'New Jersey', ([u'NJ', u'Trenton'], [u'Grover Cleveland', u'Caldwell']))
(u'New Jersey', ([u'NJ', u'Trenton'], [u'Grover Cleveland', u'Caldwell']))
(u'Arkansas', ([u'AR', u'Little Rock'], [u'Bill Clinton', u'Hope']))
(u'South Carolina', ([u'SC', u'Columbia'], None))
(u'Iowa', ([u'IA', u'Des Moines'], [u'Herbert Hoover', u'West Branch']))
(u'Louisiana', ([u'LA', u'Baton Rouge'], None))
(u'Washington', ([u'WA', u'Olympia'], None))
(u'Colorado', ([u'CO', u'Denver'], None))
(u'Virginia', ([u'VA', u'Richmond'], [u'George Washington', u'Westmoreland County']))
(u'Virginia', ([u'VA', u'Richmond'], [u'Thomas Jefferson', u'Shadwell']))
(u'Virginia', ([u'VA', u'Richmond'], [u'James Madison', u'Port Conway']))
(u'Virginia', ([u'VA', u'Richmond'], [u'James Monroe', u'Monroe Hall']))
(u'Virginia', ([u'VA', u'Richmond'], [u'William Henry Harrison', u'Charles City County']))
(u'Virginia', ([u'VA', u'Richmond'], [u'John Tyler', u'Charles City County']))
(u'Virginia', ([u'VA', u'Richmond'], [u'Zachary Taylor', u'Barboursville']))
(u'Virginia', ([u'VA', u'Richmond'], [u'Woodrow Wilson', u'Staunton']))
(u'Hawaii', ([u'HI', u'Honolulu'], [u'Barack Obama', u'Honolulu']))
(u'Alaska', ([u'AK', u'Juneau'], None))
(u'Minnesota', ([u'MN', u'St. Paul'], None))
(u'Massachusetts', ([u'MA', u'Boston'], [u'John Adams', u'Braintree']))
(u'Massachusetts', ([u'MA', u'Boston'], [u'John Quincy Adams', u'Braintree']))
(u'Massachusetts', ([u'MA', u'Boston'], [u'John F. Kennedy', u'Brookline']))
(u'Massachusetts', ([u'MA', u'Boston'], [u'George H. W. Bush', u'Milton']))
(u'Ohio', ([u'OH', u'Columbus'], [u'Ulysses S. Grant', u'Point Pleasant']))
(u'Ohio', ([u'OH', u'Columbus'], [u'Rutherford B. Hayes', u'Delaware']))
(u'Ohio', ([u'OH', u'Columbus'], [u'James A. Garfield', u'Moreland Hills']))
(u'Ohio', ([u'OH', u'Columbus'], [u'Benjamin Harrison', u'North Bend']))
(u'Ohio', ([u'OH', u'Columbus'], [u'William McKinley', u'Niles']))
(u'Ohio', ([u'OH', u'Columbus'], [u'William Howard Taft', u'Cincinnati']))
(u'Ohio', ([u'OH', u'Columbus'], [u'Warren G. Harding', u'Blooming Grove']))
(u'Indiana', ([u'IN', u'Indianapolis'], None))
(u'Maine', ([u'ME', u'Augusta'], None))
(u'New Mexico', ([u'NM', u'Santa Fe'], None))
(u'Mississippi', ([u'MS', u'Jackson'], None))
(u'Oregon', ([u'OR', u'Salem'], None))
(u'Michigan', ([u'MI', u'Lansing'], None))
(u'Vermont', ([u'VT', u'Montpelier'], [u'Chester A. Arthur', u'Fairfield']))
(u'Vermont', ([u'VT', u'Montpelier'], [u'Calvin Coolidge', u'Plymouth']))
(u'Kentucky', ([u'KY', u'Frankfort'], [u'Abraham Lincoln', u'Sinking spring']))
(u'Oklahoma', ([u'OK', u'Oklahoma City'], None))
(u'North Carolina', ([u'NC', u'Raleigh'], [u'Andrew Jackson', u'Waxhaws Region']))
(u'North Carolina', ([u'NC', u'Raleigh'], [u'James K. Polk', u'Pineville']))
(u'North Carolina', ([u'NC', u'Raleigh'], [u'Andrew Johnson', u'Raleigh']))
(u'Connecticut', ([u'CT', u'Hartford'], [u'George W. Bush', u'New Haven']))
(u'Texas', ([u'TX', u'Austin'], [u'Dwight D. Eisenhower', u'Denison']))
(u'Texas', ([u'TX', u'Austin'], [u'Lyndon B. Johnson', u'Stonewall']))
(u'Maryland', ([u'MD', u'Annapolis'], None))
(u'Alabama', ([u'AL', u'Montgomery'], None))
(u'Idaho', ([u'ID', u'Boise'], None))
(u'Wyoming', ([u'WY', u'Cheyenne'], None))
(u'Tennessee', ([u'TN', u'Nashville'], None))
(u'Georgia', ([u'GA', u'Atlanta'], [u'Jimmy Carter', u'Plains']))
(u'New York', ([u'NY', u'Albany'], [u'Martin Van Buren', u'Kinderhook']))
(u'New York', ([u'NY', u'Albany'], [u'Millard Fillmore', u'Summerhill']))
(u'New York', ([u'NY', u'Albany'], [u'Theodore Roosevelt', u'Manhattan']))
(u'New York', ([u'NY', u'Albany'], [u'Franklin D. Roosevelt', u'Hyde Park']))
(u'New York', ([u'NY', u'Albany'], [u'Donald Trump', u'Queens']))
(u'West Virginia', ([u'WV', u'Charleston'], None))
(u'Kansas', ([u'KS', u'Topeka'], None))
(u'Nebraska', ([u'NE', u'Lincoln'], [u'Gerald R. Ford', u'Omaha']))
(u'Illinois', ([u'IL', u'Springfield'], [u'Ronald Reagan', u'Tampico']))
(u'Arizona', ([u'AZ', u'Phoenix'], None))
(u'Montana', ([u'MT', u'Helena'], None))
(u'Nevada', ([u'NV', u'Carson City'], None))
(u'Missouri', ([u'MO', u'Jefferson City'], [u'Harry S. Truman', u'Lamar']))
(u'Delaware', ([u'DE', u'Dover'], None))
(u'South Dakota', ([u'SD', u'Pierre'], None))
(u'Rhode Island', ([u'RI', u'Providence'], None))

I have tried to cover common operations possible on RDD and it shall cover most of the scenarios. If there is anything else you think I shall cover, feel free to leave a comment.

Leave a Reply

Your email address will not be published. Required fields are marked *