Apache Spark

Spark Dataframe Update Column Value

We all know that UPDATING column value in a table is a pain in HIVE or SPARK SQL especially if you are dealing with non-ACID tables. However in Dataframe you can easily update column values. You can update a dataframe column value with value from another dataframe. For this purpose, we have to use JOINS between 2 dataframe and then pick the updated value from another dataframe.

In the example below we will update “pres_bs” column in dataframe from complete StateName to State Abbreviation.

Original Dataframe

scala> df_pres.show(5)
 +-------+--------------------+----------+-------------------+--------------+----------+----------+
 |pres_id|           pres_name|  pres_dob|            pres_bp|       pres_bs|   pres_in|  pres_out|
 +-------+--------------------+----------+-------------------+--------------+----------+----------+
 |      1|   George Washington|1732-02-22|Westmoreland County|      Virginia|1789-04-30|1797-03-04|
 |      2|          John Adams|1735-10-30|          Braintree| Massachusetts|1797-03-04|1801-03-04|
 |      3|    Thomas Jefferson|1743-04-13|           Shadwell|      Virginia|1801-03-04|1809-03-04|
 |      4|       James Madison|1751-03-16|        Port Conway|      Virginia|1809-03-04|1817-03-04|
 |      5|        James Monroe|1758-04-28|        Monroe Hall|      Virginia|1817-03-04|1825-03-04|
 +-------+--------------------+----------+-------------------+--------------+----------+----------+

Look-Up Dataframe

scala> df_states.show(5)
 +--------+--------------+----------+--------------+---------------+----------------+
 |state_id|    state_name|state_abbr| state_capital|state_larg_city|state_population|
 +--------+--------------+----------+--------------+---------------+----------------+
 |       1|       Alabama|        AL|    Montgomery|     Birmingham|         4874747|
 |       2|        Alaska|        AK|        Juneau|      Anchorage|          739795|
 |       3|       Arizona|        AZ|       Phoenix|        Phoenix|         7016270|
 |       4|      Arkansas|        AR|   Little Rock|    Little Rock|         3004279|
 |       5|    California|        CA|    Sacramento|    Los Angeles|        39536653|
 +--------+--------------+----------+--------------+---------------+----------------+

Join 2 DATAFRAME and Update the value

Now we will Join Original Dataframe with Lookup Dataframe and if State Name exists then we will update it with State Abbr. Let’s see it below:

val upd_df_pres = df_pres
  .as("tb1")
  .join(df_states.as("tb2"), $"tb1.pres_bs" === $"tb2.state_name", "leftouter")
  .withColumn(
    "pres_bs",
    when($"tb1.pres_bs" === $"tb2.state_name", $"tb2.state_abbr")
      .otherwise($"tb1.pres_bs")
  )
  .select(
    $"pres_id",
    $"pres_name",
    $"pres_dob",
    $"pres_bp",
    $"pres_bs",
    $"pres_in",
    $"pres_out"
  )
scala> upd_df_pres.show(5)
 +-------+--------------------+----------+-------------------+-------+----------+----------+
 |pres_id|           pres_name|  pres_dob|            pres_bp|pres_bs|   pres_in|  pres_out|
 +-------+--------------------+----------+-------------------+-------+----------+----------+
 |      1|   George Washington|1732-02-22|Westmoreland County|     VA|1789-04-30|1797-03-04|
 |      2|          John Adams|1735-10-30|          Braintree|     MA|1797-03-04|1801-03-04|
 |      3|    Thomas Jefferson|1743-04-13|           Shadwell|     VA|1801-03-04|1809-03-04|
 |      4|       James Madison|1751-03-16|        Port Conway|     VA|1809-03-04|1817-03-04|
 |      5|        James Monroe|1758-04-28|        Monroe Hall|     VA|1817-03-04|1825-03-04|
 +-------+--------------------+----------+-------------------+-------+----------+----------+

You can see State Name is updated with short name like Virginia changed to VA.

Leave a Reply