I was recently working on a task where I had to read more than a terabyte of data spread across multiple Parquet files. Also some… Read More »Spark Case Study – optimise executor memory and cores per executor
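A minimal PySpark sketch of the kind of executor sizing the case study deals with; the memory, core, and instance values (and the input path) below are illustrative assumptions, not the figures from the post.

```python
from pyspark.sql import SparkSession

# Assumed executor sizing values, purely for illustration.
spark = (
    SparkSession.builder
    .appName("parquet-read-tuning")
    .config("spark.executor.memory", "8g")      # memory per executor (assumed)
    .config("spark.executor.cores", "5")        # cores per executor (assumed)
    .config("spark.executor.instances", "20")   # number of executors (assumed)
    .getOrCreate()
)

# Hypothetical input location spread across many Parquet files.
df = spark.read.parquet("s3://my-bucket/large-dataset/")
print(df.count())
```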
I was recently working on a project to migrate some records from an on-premises data warehouse to S3. The requirement was also to run an MD5 check… Read More »PySpark-How to Generate MD5 of entire row with columns
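A small sketch of one common way to hash an entire row in PySpark: concatenate every column with a delimiter and apply md5. The sample dataframe and the "|" delimiter are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import md5, concat_ws

spark = SparkSession.builder.appName("row-md5").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", "2020-01-01"), (2, "bob", "2020-01-02")],
    ["id", "name", "load_date"],
)

# Concatenate all columns with a delimiter, then hash the result.
# Note: concat_ws silently skips NULLs, so consider coalescing columns
# to a sentinel value first if NULL placement must affect the hash.
df_with_hash = df.withColumn("row_md5", md5(concat_ws("|", *df.columns)))
df_with_hash.show(truncate=False)
```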
While working with Spark, I very often hear clients or my own team complain that a single Spark job is taking all the resources. So… Read More »Spark single application consumes all resources – Good or Bad for your cluster ?
Spark is a distributed data processing engine that relies heavily on the memory available for computation. Also, if you have worked on Spark, then you must… Read More »Spark Performance Tuning with help of Spark UI
If you migrate from one RDBMS platform to another, one technical challenge you may face is different decimal rounding on the two platforms. I was… Read More »Problem with Decimal Rounding & solution
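A quick illustration of why the same value can round differently on two platforms: one engine may round half-to-even ("banker's" rounding) while another rounds half-up. The example below uses Python's decimal module to show the difference.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

value = Decimal("2.5")

# Half-even rounding -- the default in Python and some engines.
print(value.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))  # 2

# Half-up rounding -- what many RDBMS platforms apply.
print(value.quantize(Decimal("1"), rounding=ROUND_HALF_UP))    # 3
```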
Recently, I was working on a project where the ETL requirement was to keep a daily snapshot of the table. It was 15+ years of data… Read More »Never run INSERT OVERWRITE again – try Hadoop Distcp
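As a rough sketch, a snapshot directory can be copied with Hadoop distcp instead of re-running INSERT OVERWRITE; here it is invoked from Python. The paths and partition names below are assumptions for illustration, not the exact commands from the project.

```python
import subprocess

# Hypothetical snapshot directories: copy yesterday's partition to today's
# instead of rewriting the data with INSERT OVERWRITE.
cmd = [
    "hadoop", "distcp",
    "-overwrite",
    "hdfs:///warehouse/sales/snapshot_date=2020-01-01",
    "hdfs:///warehouse/sales/snapshot_date=2020-01-02",
]
subprocess.run(cmd, check=True)
```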
You may need to add new columns to an existing Spark dataframe as per the requirement. These new columns can be initialized with a default… Read More »Spark Dataframe add multiple columns with value
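A minimal sketch of adding several columns with default values using withColumn and lit; the column names and default values below are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("add-columns").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Hypothetical new columns and their default values.
new_columns = {"load_date": "2020-01-01", "source_system": "legacy_dw", "is_active": "Y"}

# Add each new column initialized with its default value.
for column_name, default_value in new_columns.items():
    df = df.withColumn(column_name, lit(default_value))

df.show()
```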
Adding a row number to a Spark dataframe is a very common requirement, especially if you are working on ELT in Spark. You can use the monotonically_increasing_id method to generate… Read More »Spark Dataframe – monotonically_increasing_id
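A short sketch of generating row ids with monotonically_increasing_id on a toy dataframe; note that the generated ids are unique and increasing, but not consecutive across partitions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.appName("row-ids").getOrCreate()

df = spark.createDataFrame([("alice",), ("bob",), ("carol",)], ["name"])

# Assigns a unique 64-bit id per row; ids increase within each partition
# but are not guaranteed to be consecutive across the dataframe.
df_with_id = df.withColumn("row_id", monotonically_increasing_id())
df_with_id.show()
```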