Spark Dataframe concatenate strings

Spark's concatenate functions merge two or more strings into one. In many scenarios you may want to concatenate multiple strings into one; for example, you may want to concatenate a customer's "FIRST NAME" and "LAST NAME" to show the "FULL NAME". In a Spark SQL DataFrame, we can use the concat function to join multiple strings into one.

Spark Concat Function

The concat function in Spark merges or combines two or more strings into one string.

scala> df_pres.select(concat($"pres_id",$"pres_name")).show(5)
+-------------------------+
|concat(pres_id,pres_name)|
+-------------------------+
|       1George Washington|
|              2John Adams|
|        3Thomas Jefferson|
|           4James Madison|
|            5James Monroe|
+-------------------------+

In the above output, we were able to join two columns into one. However, the result is a little hard to read. Most of the time, we want a delimiter to distinguish the first string from the second. To introduce a delimiter between the strings, we use the concat_ws function, whose first parameter is the delimiter. It can be a single-character or multi-character delimiter.
Let's look at an example where we use "-" as the delimiter while concatenating two columns.

Spark Concat_WS Function

The concat_ws function in Spark merges or combines two or more strings into one string, with a separator joining the individual strings. Pass the desired delimiter as the first argument to concat_ws.

scala> df_pres.select(concat_ws("-",$"pres_id",$"pres_name")).show(5)
+------------------------------+
|concat_ws(-,pres_id,pres_name)|
+------------------------------+
|           1-George Washington|
|                  2-John Adams|
|            3-Thomas Jefferson|
|               4-James Madison|
|                5-James Monroe|
+------------------------------+

So we use concat to merge multiple strings into one, and concat_ws to merge multiple strings into one with a delimiter. I prefer concat_ws most of the time, as it gives the flexibility to combine two or more strings with a delimiter. And if you don't want a delimiter, just pass an empty string ("") as the separator and it behaves like concat.
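One practical difference between the two is worth noting: concat returns NULL if any of its input columns is NULL, while concat_ws simply skips NULL values. A minimal sketch in spark-shell, using a hypothetical two-row dataset to illustrate:

```scala
import org.apache.spark.sql.functions.{concat, concat_ws}
// in a standalone app you would also need: import spark.implicits._

// second row has a null in column y
val df = Seq(("a", Some("b")), ("c", None)).toDF("x", "y")

df.select(
  concat($"x", $"y").as("concat_op"),      // null when y is null
  concat_ws("-", $"x", $"y").as("ws_op")   // null values are skipped
).show()
```

For the second row, concat_op comes out as null while ws_op is just "c", so concat_ws is usually the safer choice when some columns may contain nulls.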

Let's see a few more examples to further strengthen our understanding of the Spark concat functions.

Create a dataframe

import org.apache.spark.sql.functions.{col, concat_ws, lit}
// if you are running in PySpark then use the import statement below
// from pyspark.sql.functions import col, concat_ws, lit
val df_shoes = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Shoes/").limit(50)
df_shoes.printSchema()
root
 |-- marketplace: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: string (nullable = true)
 |-- product_title: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)
 |-- verified_purchase: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: date (nullable = true)
 |-- year: integer (nullable = true)

Spark concatenate two columns

You can combine two columns in Spark using the concat_ws function, passing an empty string ("") as the first parameter.

scala> df_shoes.select(col("marketplace"),col("review_headline"),concat_ws("",col("marketplace"),col("review_headline")).as("concat_op")).show(5)
+-----------+--------------------+--------------------+
|marketplace|     review_headline|           concat_op|
+-----------+--------------------+--------------------+
|         US|BEST WALLET EVER....|USBEST WALLET EVE...|
|         US|         Love these!|       USLove these!|
|         US|          Five Stars|        USFive Stars|
|         UK|I am a 4 and a ha...|UKI am a 4 and a ...|
|         US|Happy Granddaughter-|USHappy Granddaug...|
+-----------+--------------------+--------------------+

Spark concatenate two columns with space

You can combine two columns in Spark using the concat_ws function, passing a space (" ") as the first parameter to be used as the delimiter.

scala> df_shoes.select(col("marketplace"),col("review_headline"),concat_ws(" ",col("marketplace"),col("review_headline")).as("concat_op")).show(5)
+-----------+--------------------+--------------------+
|marketplace|     review_headline|           concat_op|
+-----------+--------------------+--------------------+
|         US|BEST WALLET EVER....|US BEST WALLET EV...|
|         US|         Love these!|      US Love these!|
|         US|          Five Stars|       US Five Stars|
|         UK|I am a 4 and a ha...|UK I am a 4 and a...|
|         US|Happy Granddaughter-|US Happy Granddau...|
+-----------+--------------------+--------------------+
only showing top 5 rows

Spark concatenate two columns with separator

You can combine two columns in Spark using the concat_ws function, passing the desired separator as the first parameter. In the example below, we have used "~" as the separator.

scala> df_shoes.select(col("marketplace"),col("review_headline"),concat_ws("~",col("marketplace"),col("review_headline")).as("concat_op")).show(5)
+-----------+--------------------+--------------------+
|marketplace|     review_headline|           concat_op|
+-----------+--------------------+--------------------+
|         US|BEST WALLET EVER....|US~BEST WALLET EV...|
|         US|         Love these!|      US~Love these!|
|         US|          Five Stars|       US~Five Stars|
|         UK|I am a 4 and a ha...|UK~I am a 4 and a...|
|         US|Happy Granddaughter-|US~Happy Granddau...|
+-----------+--------------------+--------------------+
only showing top 5 rows

Spark concatenate two columns with different datatype

So far we have concatenated two string columns. We can also combine columns of different datatypes using the concat functions in Spark. In the example below, we have combined a string column with a date column to create a single string.

scala> df_shoes.select(col("marketplace"),col("review_date"),concat_ws("~",col("marketplace"),col("review_date")).as("concat_op")).show(5)
+-----------+-----------+-------------+
|marketplace|review_date|    concat_op|
+-----------+-----------+-------------+
|         US| 2014-11-17|US~2014-11-17|
|         US| 2015-05-03|US~2015-05-03|
|         US| 2014-11-17|US~2014-11-17|
|         UK| 2015-05-03|UK~2015-05-03|
|         US| 2014-11-17|US~2014-11-17|
+-----------+-----------+-------------+
only showing top 5 rows

Spark concatenate string to column

You can also add a "static" or "fixed" string value to an existing column in Spark. In the example below, we have prepended the value "I-Love-Spark" to the marketplace column using the concat_ws function. You can add more values or columns to concat_ws, separated by commas.

scala> df_shoes.select(col("marketplace"),concat_ws("~",lit("I-Love-Spark"),col("marketplace")).as("concat_op")).show(5)
+-----------+---------------+
|marketplace|      concat_op|
+-----------+---------------+
|         US|I-Love-Spark~US|
|         US|I-Love-Spark~US|
|         US|I-Love-Spark~US|
|         US|I-Love-Spark~US|
|         US|I-Love-Spark~US|
+-----------+---------------+
only showing top 5 rows

Hopefully the above examples make it clear how the concat functions work in Spark. The same approach works in PySpark too.
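For reference, here is a sketch of one of the examples above in PySpark, assuming an active SparkSession named spark and the same df_shoes DataFrame loaded as shown earlier:

```python
from pyspark.sql.functions import col, concat_ws, lit

# prepend a fixed string to the marketplace column, joined with "~"
df_shoes.select(
    col("marketplace"),
    concat_ws("~", lit("I-Love-Spark"), col("marketplace")).alias("concat_op")
).show(5)
```

The function names and behavior are identical to the Scala API; only the column syntax (col("...") instead of $"...") and .alias instead of .as differ.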
