I was recently working on a project to migrate some records from on-premises data warehouse to S3. The requirement was also to run MD5 check on each row between Source & Target to gain confidence if the data moved is accurate. In this post I will share the method in which MD5 for each row in dataframe can be generated.
I will create a dummy dataframe with 3 columns and 4 rows. Now my requirement is to generate MD5 for each row. Below is the sample code extract in PySpark.
columns = ["emp_id","emp_name","emp_dept"] data = [("1", "Mark","Admin"), ("2", "Roger","HR"), ("3","Wanda", "Technology"),("4","Vision","Data Analytics")] rdd = spark.sparkContext.parallelize(data) df_employee = rdd.toDF(columns) df_employee.printSchema() df_employee.show(truncate=False) col_list=[] for i in df_employee.columns: col_list.append(i) print (col_list) from pyspark.sql.functions import md5, concat_ws df_employee = df_employee.withColumn("md5", md5(concat_ws("", *col_list))) df_employee.show(truncate=False)
Output will look like below:
+------+--------+--------------+--------------------------------+ |emp_id|emp_name|emp_dept |md5 | +------+--------+--------------+--------------------------------+ |1 |Mark |Admin |a9bd622a5a7c17233df735117f27d30b| |2 |Roger |HR |9c8cf64baa414f4327340cfcaf5da4b4| |3 |Wanda |Technology |ff31440088895102bfb805c5a57b21fe| |4 |Vision |Data Analytics|ba6788d1e13c6e98cb8f8814ff533828| +------+--------+--------------+--------------------------------+
You can also use hash-128, hash-256 to generate unique value for each.
Watch the below video to see the tutorial for this post.
4 thoughts on “PySpark-How to Generate MD5 of entire row with columns”
can you please make the video available to learn
Done. Please check now.
What about null values and Date time datatypes.
Tried below with DATE & NULL values: