One of the most common operation in any DATA Analytics environment is to generate sequences. There are multiple ways of generating SEQUENCE numbers however I find zipWithIndex as the best one in terms of simplicity and performance combined. Especially when requirement is to generate consecutive numbers without any gap. Below is the detailed code which shall help in generating surrogate keys/natural keys/sequence numbers.

Step 1: Create a dataframe with all the required columns from the table.

df_0= sqlContext.sql(“select pres_name,pres_dob,pres_bp,pres_bs,pres_in,pres_out from usa_prez”)
df_0.printSchema()

df_0.printSchema()
root
|– pres_name: string (nullable = true)
|– pres_dob: date (nullable = true)
|– pres_bp: string (nullable = true)
|– pres_bs: string (nullable = true)
|– pres_in: date (nullable = true)
|– pres_out: date (nullable = true)

Step 2: Add a new column for sequence and set the datatype for it.Presently we are using INT, if you want bigger number just change lit(1) to lit(long(1)). Also import lit method from sql package

from pyspark.sql.functions import lit
df_0_schema = df_0.withColumn(“pres_id”, lit(1))
df_0_schema.printSchema()

root
|– pres_name: string (nullable = true)
|– pres_dob: date (nullable = true)
|– pres_bp: string (nullable = true)
|– pres_bs: string (nullable = true)
|– pres_in: date (nullable = true)
|– pres_out: date (nullable = true)
|– pres_id: integer (nullable = false)

Step 3: zipWithIndex is method for Resilient Distributed Dataset (RDD). So we have to convert existing Dataframe into RDD. Since zipWithIndex start indices value from 0 and we want to start from 1, we have added 1 to “[rowId+1]”. Replace 1 with your offset value if any. Also we have to add newly generated number to existing row list. Hence used lambda function.

rdd_1 = df_0.rdd.zipWithIndex().map(lambda (row,rowId): ( list(row) + [rowId+1]))

Step 4: Convert rdd back to dataframe. Since Dataframes have schema information associated with it we will impose a structure on our data calculated in Step 2.

df_1 = sqlContext.createDataFrame(rdd_1, schema=df_0_schema.schema)
df_1.printSchema()

Step 5: Check the data in dataframe. Also you can convert it into temp table if you want to use sqlContext

df_1.show(50)
df_1.registerTempTable(“usa_prez_tmp”)
sqlContext.sql(“select * from usa_prez_tmp”).show(50)

Keeping everything together; below is the complete code:

from pyspark.sql.functions import lit
df_0= sqlContext.sql(“select pres_name,pres_dob,pres_bp,pres_bs,pres_in,pres_out from usa_prez”)
df_0.printSchema()
df_0_schema = df_0.withColumn(“pres_id”, lit(1))
df_0_schema.printSchema()
rdd_1 = df_0.rdd.zipWithIndex().map(lambda (row,rowId): ( list(row) + [rowId+1]))
df_1 = sqlContext.createDataFrame(rdd_1, schema=df_0_schema.schema)
df_1.printSchema()
df_1.show(50)
df_1.registerTempTable(“usa_prez_tmp”)
sqlContext.sql(“select * from usa_prez_tmp”).show(50)

+——————–+———-+——————-+——————–+———-+———-+——-+
| pres_name| pres_dob| pres_bp| pres_bs| pres_in| pres_out|pres_id|
+——————–+———-+——————-+——————–+———-+———-+——-+
| George Washington|1732-02-22|Westmoreland County| Virginia|1789-04-30|1797-03-04| 1|
| John Adams|1735-10-30| Braintree| Massachusetts|1797-03-04|1801-03-04| 2|
| Thomas Jefferson|1743-04-13| Shadwell| Virginia|1801-03-04|1809-03-04| 3|
| James Madison|1751-03-16| Port Conway| Virginia|1809-03-04|1817-03-04| 4|
| James Monroe|1758-04-28| Monroe Hall| Virginia|1817-03-04|1825-03-04| 5|
| John Quincy Adams|1767-07-11| Braintree| Massachusetts|1825-03-04|1829-03-04| 6|
| Andrew Jackson|1767-03-15| Waxhaws Region|South/North Carolina|1829-03-04|1837-03-04| 7|
| Martin Van Buren|1782-12-05| Kinderhook| New York|1837-03-04|1841-03-04| 8|
|William Henry Har…|1773-02-09|Charles City County| Virginia|1841-03-04|1841-04-04| 9|
| John Tyler|1790-03-29|Charles City County| Virginia|1841-04-04|1845-03-04| 10|
| James K. Polk|1795-11-02| Pineville| North Carolina|1845-03-04|1849-03-04| 11|
| Zachary Taylor|1784-11-24| Barboursville| Virginia|1849-03-04|1850-07-09| 12|
| Millard Fillmore|1800-01-07| Summerhill| New York|1850-07-09|1853-03-04| 13|
| Franklin Pierce|1804-11-23| Hillsborough| New Hampshire|1853-03-04|1857-03-04| 14|
| James Buchanan|1791-04-23| Cove Gap| Pennsylvania|1857-03-04|1861-03-04| 15|
| Abraham Lincoln|1809-02-12| Sinking spring| Kentucky|1861-03-04|1865-04-15| 16|
| Andrew Johnson|1808-12-29| Raleigh| North Carolina|1865-04-15|1869-03-04| 17|
| Ulysses S. Grant|1822-04-27| Point Pleasant| Ohio|1869-03-04|1877-03-04| 18|
| Rutherford B. Hayes|1822-10-04| Delaware| Ohio|1877-03-04|1881-03-04| 19|
| James A. Garfield|1831-11-19| Moreland Hills| Ohio|1881-03-04|1881-09-19| 20|
| Chester A. Arthur|1829-10-05| Fairfield| Vermont|1881-09-19|1885-03-04| 21|
| Grover Cleveland|1837-03-18| Caldwell| New Jersey|1885-03-04|1889-03-04| 22|
| Benjamin Harrison|1833-08-20| North Bend| Ohio|1889-03-04|1893-03-04| 23|
| Grover Cleveland|1837-03-18| Caldwell| New Jersey|1893-03-04|1897-03-04| 24|
| William McKinley|1843-01-29| Niles| Ohio|1897-03-04|1901-09-14| 25|
| Theodore Roosevelt|1858-10-27| Manhattan| New York|1901-09-14|1909-03-04| 26|
| William Howard Taft|1857-09-15| Cincinnati| Ohio|1909-03-04|1913-03-04| 27|
| Woodrow Wilson|1856-12-28| Staunton| Virginia|1913-03-04|1921-03-04| 28|
| Warren G. Harding|1865-11-02| Blooming Grove| Ohio|1921-03-04|1923-08-02| 29|
| Calvin Coolidge|1872-07-04| Plymouth| Vermont|1923-08-02|1929-03-04| 30|
| Herbert Hoover|1874-08-10| West Branch| Iowa|1929-03-04|1933-03-04| 31|
|Franklin D. Roose…|1882-01-30| Hyde Park| New York|1933-03-04|1945-04-12| 32|
| Harry S. Truman|1884-05-08| Lamar| Missouri|1945-04-12|1953-01-20| 33|
|Dwight D. Eisenhower|1890-10-14| Denison| Texas|1953-01-20|1961-01-20| 34|
| John F. Kennedy|1917-05-29| Brookline| Massachusetts|1961-01-20|1963-11-22| 35|
| Lyndon B. Johnson|1908-08-27| Stonewall| Texas|1963-11-22|1969-01-20| 36|
| Richard M. Nixon|1913-01-09| Yorba Linda| California|1969-01-20|1974-08-09| 37|
| Gerald R. Ford|1913-07-14| Omaha| Nebraska|1974-08-09|1977-01-20| 38|
| Jimmy Carter|1924-10-01| Plains| Georgia|1977-01-20|1981-01-20| 39|
| Ronald Reagan|1911-02-06| Tampico| Illinois|1981-01-20|1989-01-20| 40|
| George H. W. Bush|1924-06-12| Milton| Massachusetts|1989-01-20|1993-01-20| 41|
| Bill Clinton|1946-08-19| Hope| Arkansas|1993-01-20|2001-01-20| 42|
| George W. Bush|1946-07-06| New Haven| Connecticut|2001-01-20|2009-01-20| 43|
| Barack Obama|1961-08-04| Honolulu| Hawaii|2009-01-20|2017-01-20| 44|
| Donald Trump|1946-06-14| Queens| New York|2017-01-20| null| 45|
+——————–+———-+——————-+——————–+———-+———-+——-+

Check the last column “pres_id”. It is sequence number generated. Hope this helps. Feel free to leave comment if any query/feedback.

2 Comments

  1. Nice representation….

    one question overhere…. how can we pass a parameter to generate the sequence from that particular point. is there any way to achieve.

    Also this code is belongs to python2.7… is there any way to run through python 3.6?


Leave a Reply

Your email address will not be published. Required fields are marked *