PySpark

Basic RDD operations in PySpark

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. In other words, we can say it is the most common structure that holds data in Spark. RDD is distributed, immutable , fault tolerant, optimized for in-memory computation. Let’s see some basic example of RDD in pyspark.

Load file into RDD. Path should be HDFS path and not local.

>>> rdd_usa = sc.textFile("/tmp/US_President_data.txt")

Check count of records in RDD:

>>> rdd_usa.count()
46

Check sample records from RDD:

>>> rdd_usa.take(2)
[u'President_ID~President_Name~Pres_DOB~President_Birthplace~President_StateColony~Pres_In~President_Outoffice', u'1~George Washington~1732-02-22~Westmoreland County~Virginia~1789-04-30~1797-03-04']

Traverse each record in RDD. You cannot directly iterate through records in RDD. To bring all the records to DRIVER you can use collect() action.

>>> for i in rdd_usa.collect():
...     print (i)
...
President_ID~President_Name~Pres_DOB~President_Birthplace~President_StateColony~Pres_In~President_Outoffice
1~George Washington~1732-02-22~Westmoreland County~Virginia~1789-04-30~1797-03-04
2~John Adams~1735-10-30~Braintree~Massachusetts~1797-03-04~1801-03-04
3~Thomas Jefferson~1743-04-13~Shadwell~Virginia~1801-03-04~1809-03-04
4~James Madison~1751-03-16~Port Conway~Virginia~1809-03-04~1817-03-04
5~James Monroe~1758-04-28~Monroe Hall~Virginia~1817-03-04~1825-03-04
6~John Quincy Adams~1767-07-11~Braintree~Massachusetts~1825-03-04~1829-03-04
7~Andrew Jackson~1767-03-15~Waxhaws Region~North Carolina~1829-03-04~1837-03-04
8~Martin Van Buren~1782-12-05~Kinderhook~New York~1837-03-04~1841-03-04
9~William Henry Harrison~1773-02-09~Charles City County~Virginia~1841-03-04~1841-04-04
10~John Tyler~1790-03-29~Charles City County~Virginia~1841-04-04~1845-03-04
11~James K. Polk~1795-11-02~Pineville~North Carolina~1845-03-04~1849-03-04
12~Zachary Taylor~1784-11-24~Barboursville~Virginia~1849-03-04~1850-07-09
13~Millard Fillmore~1800-01-07~Summerhill~New York~1850-07-09~1853-03-04
14~Franklin Pierce~1804-11-23~Hillsborough~New Hampshire~1853-03-04~1857-03-04
15~James Buchanan~1791-04-23~Cove Gap~Pennsylvania~1857-03-04~1861-03-04
16~Abraham Lincoln~1809-02-12~Sinking spring~Kentucky~1861-03-04~1865-04-15
17~Andrew Johnson~1808-12-29~Raleigh~North Carolina~1865-04-15~1869-03-04
18~Ulysses S. Grant~1822-04-27~Point Pleasant~Ohio~1869-03-04~1877-03-04
19~Rutherford B. Hayes~1822-10-04~Delaware~Ohio~1877-03-04~1881-03-04
20~James A. Garfield~1831-11-19~Moreland Hills~Ohio~1881-03-04~1881-09-19
21~Chester A. Arthur~1829-10-05~Fairfield~Vermont~1881-09-19~1885-03-04
22~Grover Cleveland~1837-03-18~Caldwell~New Jersey~1885-03-04~1889-03-04
23~Benjamin Harrison~1833-08-20~North Bend~Ohio~1889-03-04~1893-03-04
24~Grover Cleveland~1837-03-18~Caldwell~New Jersey~1893-03-04~1897-03-04
25~William McKinley~1843-01-29~Niles~Ohio~1897-03-04~1901-09-14
26~Theodore Roosevelt~1858-10-27~Manhattan~New York~1901-09-14~1909-03-04
27~William Howard Taft~1857-09-15~Cincinnati~Ohio~1909-03-04~1913-03-04
28~Woodrow Wilson~1856-12-28~Staunton~Virginia~1913-03-04~1921-03-04
29~Warren G. Harding~1865-11-02~Blooming Grove~Ohio~1921-03-04~1923-08-02
30~Calvin Coolidge~1872-07-04~Plymouth~Vermont~1923-08-02~1929-03-04
31~Herbert Hoover~1874-08-10~West Branch~Iowa~1929-03-04~1933-03-04
32~Franklin D. Roosevelt~1882-01-30~Hyde Park~New York~1933-03-04~1945-04-12
33~Harry S. Truman~1884-05-08~Lamar~Missouri~1945-04-12~1953-01-20
34~Dwight D. Eisenhower~1890-10-14~Denison~Texas~1953-01-20~1961-01-20
35~John F. Kennedy~1917-05-29~Brookline~Massachusetts~1961-01-20~1963-11-22
36~Lyndon B. Johnson~1908-08-27~Stonewall~Texas~1963-11-22~1969-01-20
37~Richard M. Nixon~1913-01-09~Yorba Linda~California~1969-01-20~1974-08-09
38~Gerald R. Ford~1913-07-14~Omaha~Nebraska~1974-08-09~1977-01-20
39~Jimmy Carter~1924-10-01~Plains~Georgia~1977-01-20~1981-01-20
40~Ronald Reagan~1911-02-06~Tampico~Illinois~1981-01-20~1989-01-20
41~George H. W. Bush~1924-06-12~Milton~Massachusetts~1989-01-20~1993-01-20
42~Bill Clinton~1946-08-19~Hope~Arkansas~1993-01-20~2001-01-20
43~George W. Bush~1946-07-06~New Haven~Connecticut~2001-01-20~2009-01-20
44~Barack Obama~1961-08-04~Honolulu~Hawaii~2009-01-20~2017-01-20
45~Donald Trump~1946-06-14~Queens~New York~2017-01-20~

Apply Map function to convert all columns to upper case

>>> rdd_pres = rdd_usa.map(lambda x: x.upper())
>>> rdd_pres.take(2)
[u'PRESIDENT_ID~PRESIDENT_NAME~PRES_DOB~PRESIDENT_BIRTHPLACE~PRESIDENT_STATECOLONY~PRES_IN~PRESIDENT_OUTOFFICE', u'1~GEORGE WASHINGTON~1732-02-22~WESTMORELAND COUNTY~VIRGINIA~1789-04-30~1797-03-04']

Remove the header or FILTER any record

>>> rdd_pres = rdd_pres.filter(lambda x: not(x.startswith("PRESIDENT_ID")))
>>> rdd_pres.take(2)
[u'1~GEORGE WASHINGTON~1732-02-22~WESTMORELAND COUNTY~VIRGINIA~1789-04-30~1797-03-04', u'2~JOHN ADAMS~1735-10-30~BRAINTREE~MASSACHUSETTS~1797-03-04~1801-03-04']
>>> rdd_pres.count()
45

Transform RDD by specifying delimiter present in record

>>> rdd_pres=rdd_pres.map(lambda x:x.split("~"))
>>> rdd_pres.take(2)
[[u'1', u'GEORGE WASHINGTON', u'1732-02-22', u'WESTMORELAND COUNTY', u'VIRGINIA', u'1789-04-30', u'1797-03-04'], [u'2', u'JOHN ADAMS', u'1735-10-30', u'BRAINTREE', u'MASSACHUSETTS', u'1797-03-04', u'1801-03-04']]

2 thoughts on “Basic RDD operations in PySpark”

Leave a Reply