Spark DataFrame – Explode

In Spark, the “explode” function converts a single column value that holds an array into multiple rows, one row per array element. I was recently working on a task to convert a COBOL VSAM file, a format that often has nested columns defined in it. My requirement in Spark was to convert a single column value (an array of values) into multiple rows. Let’s look at an example to understand it better:

Create a sample DataFrame with one column of type ARRAY:

scala> val df_vsam = Seq((1,"abc",Array("p","q","r")),(2,"def",Array("x","y","z"))).toDF("id","col1","col2")
df_vsam: org.apache.spark.sql.DataFrame = [id: int, col1: string ... 1 more field]

scala> df_vsam.printSchema()
root
 |-- id: integer (nullable = false)
 |-- col1: string (nullable = true)
 |-- col2: array (nullable = true)
 |    |-- element: string (containsNull = true)

scala> df_vsam.show()
+---+----+---------+
| id|col1|     col2|
+---+----+---------+
|  1| abc|[p, q, r]|
|  2| def|[x, y, z]|
+---+----+---------+

Now run the explode function to split each value in col2 into a new row:

scala> df_vsam.withColumn("col2",explode($"col2")).show()
+---+----+----+
| id|col1|col2|
+---+----+----+
|  1| abc|   p|
|  1| abc|   q|
|  1| abc|   r|
|  2| def|   x|
|  2| def|   y|
|  2| def|   z|
+---+----+----+
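
As a rough mental model (a sketch in plain Scala collections, not how Spark implements it), explode behaves like a flatMap over each row's array: every element yields one output row, and the other columns are copied onto it. The Record case class and the variable names below are made up for illustration.

```scala
// Plain-Scala model of explode; no Spark required.
// Record is a hypothetical stand-in for one row of df_vsam.
final case class Record(id: Int, col1: String, col2: Seq[String])

val records = Seq(
  Record(1, "abc", Seq("p", "q", "r")),
  Record(2, "def", Seq("x", "y", "z"))
)

// One output tuple per array element; id and col1 are repeated on each.
val exploded: Seq[(Int, String, String)] =
  records.flatMap(r => r.col2.map(v => (r.id, r.col1, v)))

// Prints six tuples, from (1,abc,p) through (2,def,z).
exploded.foreach(println)
```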

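So, using the explode function, you can split a single array column value into multiple rows. The values of the other columns (id and col1 here) are repeated on each new row.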
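
One thing to watch for: explode produces no output row when the array is null or empty, so those input rows disappear from the result. If they need to be kept, explode_outer emits a single row with null instead, and posexplode additionally returns each element's position in the array. The following is a minimal sketch, assuming Spark 2.2 or later (where explode_outer is available) and a local session; the sample data and the column names pos and value are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, explode_outer, posexplode}

// In spark-shell a session already exists and getOrCreate() simply returns it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("explode-variants")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: one normal array, one empty array, one null.
val rows: Seq[(Int, String, Seq[String])] = Seq(
  (1, "abc", Seq("p", "q")),
  (2, "def", Seq.empty[String]),
  (3, "ghi", null)
)
val df = rows.toDF("id", "col1", "col2")

// explode: ids 2 and 3 vanish, leaving 2 rows (both for id 1).
val exploded = df.withColumn("col2", explode($"col2"))

// explode_outer: ids 2 and 3 are kept with col2 = null, giving 4 rows.
val explodedOuter = df.withColumn("col2", explode_outer($"col2"))

// posexplode yields two columns, so it goes in select() with two aliases.
val withPos = df.select($"id", posexplode($"col2").as(Seq("pos", "value")))

exploded.show()
explodedOuter.show()
withPos.show()
```

Because posexplode returns two columns, it cannot be passed to withColumn the way explode can; using select with a pair of aliases is one way around that.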

