I was recently working on a task where I had to read more than a terabyte of data spread across multiple Parquet files. Also some… Read More »Spark Case Study – optimise executor memory and cores per executor
This is the second part of the PySpark Tutorial series. In this post, we will talk about: fetching unique values from a dataframe in PySpark, using Filter… Read More »PySpark Tutorial – Distinct, Filter, Sort on Dataframe
Introduction PySpark is becoming the obvious choice for enterprises when it comes to moving to Spark. As per my understanding, this is primarily for… Read More »PySpark Tutorial – Introduction, Read CSV, Columns
I was recently working on a project to migrate some records from an on-premises data warehouse to S3. The requirement was also to run an MD5 check… Read More »PySpark-How to Generate MD5 of entire row with columns
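The idea behind a row-level MD5 check can be sketched without a Spark cluster: concatenate all column values with a separator and hash the result. This is a minimal stdlib illustration of the concept; the helper name `row_md5`, the separator, and the sample row are assumptions for illustration, not the post's actual code. (In PySpark itself the analogous route uses the built-in `md5` and `concat_ws` functions, though note that Spark's `concat_ws` skips NULLs, so null handling needs care.)

```python
import hashlib

def row_md5(row, sep="||"):
    # Hypothetical helper: join every column value with a separator,
    # treating None as an empty string, then MD5 the joined string.
    joined = sep.join("" if v is None else str(v) for v in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Assumed sample row: (id, name, load_date)
row = ("101", "Alice", "2020-01-01")
print(row_md5(row))  # a stable 32-char hex digest for this row
```

Because the digest is deterministic, the same row hashed on the source warehouse and on S3 should match, which is the basis of the comparison.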
While working with Spark, I have heard it many times: a client or my team "complains" that a single Spark job is taking all the resources. So… Read More »Spark single application consumes all resources – Good or Bad for your cluster?
Spark is a distributed data processing engine that relies heavily on the memory available for computation. Also, if you have worked on Spark, then you must… Read More »Spark Performance Tuning with help of Spark UI
In PySpark, you can run dataframe commands, or if you are comfortable with SQL, you can run SQL queries too. In this post, we… Read More »PySpark – Convert SQL queries to Dataframe
If you migrate from one RDBMS platform to another, one technical challenge you may face is different decimal rounding on the two platforms. I was… Read More »Problem with Decimal Rounding & solution
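The rounding mismatch described above is easy to reproduce with Python's stdlib `decimal` module: many databases round halves away from zero ("half up"), while IEEE 754 and Python's built-in `round()` default to banker's rounding ("half even"). A minimal sketch of the difference, with an assumed sample value:

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

value = Decimal("2.5")

# Half-up: 2.5 rounds away from zero to 3 (common RDBMS behaviour)
half_up = value.quantize(Decimal("1"), rounding=ROUND_HALF_UP)

# Half-even (banker's rounding): 2.5 rounds to the even neighbour, 2
half_even = value.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)

print(half_up, half_even)  # 3 2
```

If the source and target platforms disagree on this rule, identical input data can produce rows that fail a reconciliation check, which is why the rounding mode has to be pinned down explicitly during migration.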