Spark Dataframe – Explode

In Spark, the explode function converts a single column that holds a collection of values into multiple rows, one row per element. I recently worked on a task to convert a COBOL VSAM file, a format that often has nested columns defined in it. In Spark, my requirement was to turn a single column value (an array of values) into multiple rows. Let's look at an example to understand it better:

Create a sample DataFrame in which one column is of type ARRAY:

scala> val df_vsam = Seq((1,"abc",Array("p","q","r")),(2,"def",Array("x","y","z"))).toDF("id","col1","col2")
df_vsam: org.apache.spark.sql.DataFrame = [id: int, col1: string ... 1 more field]

scala> df_vsam.printSchema()
root
 |-- id: integer (nullable = false)
 |-- col1: string (nullable = true)
 |-- col2: array (nullable = true)
 |    |-- element: string (containsNull = true)

scala> df_vsam.show()
+---+----+---------+
| id|col1|     col2|
+---+----+---------+
|  1| abc|[p, q, r]|
|  2| def|[x, y, z]|
+---+----+---------+

Now run the explode function to turn each element of col2 into its own row. The values of id and col1 are repeated for every element:

scala> df_vsam.withColumn("col2",explode($"col2")).show()
+---+----+----+
| id|col1|col2|
+---+----+----+
|  1| abc|   p|
|  1| abc|   q|
|  1| abc|   r|
|  2| def|   x|
|  2| def|   y|
|  2| def|   z|
+---+----+----+
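
The session above runs in spark-shell, which imports spark.implicits._ and org.apache.spark.sql.functions._ automatically. Outside the shell those imports have to be written out. Below is a minimal sketch of the same example as a standalone program; the object name ExplodeExample, the helper buildExploded, the application name and the local[*] master are assumptions made for illustration, not part of the original task.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.explode

// Hypothetical standalone version of the spark-shell session above.
object ExplodeExample {

  // Builds the sample DataFrame and explodes the array column col2.
  def buildExploded(spark: SparkSession): DataFrame = {
    import spark.implicits._ // needed for toDF and the $"col" syntax

    val dfVsam = Seq(
      (1, "abc", Array("p", "q", "r")),
      (2, "def", Array("x", "y", "z"))
    ).toDF("id", "col1", "col2")

    // One output row per array element; id and col1 are repeated.
    dfVsam.withColumn("col2", explode($"col2"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explode-example") // assumed name
      .master("local[*]")         // assumed: run locally
      .getOrCreate()

    buildExploded(spark).show()
    spark.stop()
  }
}
```

Using withColumn with the existing column name replaces the array column in place, so the output keeps the same three column names as the input.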

So, using the explode function, you can split one array column into multiple rows, with the values of the other columns repeated on each row.
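
One caveat to keep in mind: explode produces no output row when the array is null or empty, so such input rows disappear from the result. Spark 2.2 and later also provide explode_outer, which keeps those rows and puts a null in the exploded column. The sketch below compares the two on a small sample; the object name ExplodeOuterExample, the helper rowCounts and the sample data are hypothetical and only meant to illustrate the difference.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, explode_outer}

// Hypothetical comparison of explode and explode_outer.
object ExplodeOuterExample {

  // Returns (row count after explode, row count after explode_outer).
  def rowCounts(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // id 1 has two elements, id 2 an empty array, id 3 a null array.
    val df = Seq[(Int, String, Option[Seq[String]])](
      (1, "abc", Some(Seq("p", "q"))),
      (2, "def", Some(Seq.empty[String])),
      (3, "ghi", None)
    ).toDF("id", "col1", "col2")

    // explode drops the rows for id 2 and id 3.
    val inner = df.withColumn("col2", explode($"col2"))
    // explode_outer keeps them, with col2 set to null.
    val outer = df.withColumn("col2", explode_outer($"col2"))

    (inner.count(), outer.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explode-outer-example") // assumed name
      .master("local[*]")               // assumed: run locally
      .getOrCreate()

    val (innerCount, outerCount) = rowCounts(spark)
    println(s"explode: $innerCount rows, explode_outer: $outerCount rows")
    spark.stop()
  }
}
```

If every input record must survive the transformation, as is often the case when converting a source file, explode_outer is usually the safer choice.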
