PySpark-How to Generate MD5 of entire row with columns

I was recently working on a project to migrate some records from on-premises data warehouse to S3. The requirement was also to run MD5 check on each row between Source & Target to gain confidence if the data moved is accurate. In this post I will share the method in which MD5 for each row in dataframe can be generated.

I will create a dummy dataframe with 3 columns and 4 rows. Now my requirement is to generate MD5 for each row. Below is the sample code extract in PySpark.

columns = ["emp_id","emp_name","emp_dept"]
data = [("1", "Mark","Admin"), ("2", "Roger","HR"), ("3","Wanda", "Technology"),("4","Vision","Data Analytics")]
rdd = spark.sparkContext.parallelize(data)
df_employee = rdd.toDF(columns)

for i in df_employee.columns:

print (col_list)

from pyspark.sql.functions import md5, concat_ws
df_employee = df_employee.withColumn("md5", md5(concat_ws("", *col_list)))

Output will look like below:

|emp_id|emp_name|emp_dept      |md5                             |
|1     |Mark    |Admin         |a9bd622a5a7c17233df735117f27d30b|
|2     |Roger   |HR            |9c8cf64baa414f4327340cfcaf5da4b4|
|3     |Wanda   |Technology    |ff31440088895102bfb805c5a57b21fe|
|4     |Vision  |Data Analytics|ba6788d1e13c6e98cb8f8814ff533828|

You can also use hash-128, hash-256 to generate unique value for each.

Watch the below video to see the tutorial for this post.

Leave a Reply

Your email address will not be published. Required fields are marked *