SQL & Hadoop

SQL on Hadoop with Hive, Spark & PySpark on EMR & AWS Glue
PySpark

How to convert SQL Queries into PySpark

PySpark / Raj

In the previous post, we saw many common conversions from SQL to DataFrame in PySpark. In this post, we will look at the strategy you can follow to convert a typical SQL query into DataFrame code in PySpark. If you have not read the previous post, I strongly recommend doing so, as we will refer to […]
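
As a taste of the approach, here is a minimal sketch of one such conversion; the sales table, columns, and S3 path are made up purely for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sql-to-pyspark").getOrCreate()
    sales = spark.read.parquet("s3://my-bucket/sales/")  # placeholder path

    # SQL: SELECT region, SUM(amount) AS total FROM sales
    #      WHERE year = 2024 GROUP BY region ORDER BY total DESC
    result = (
        sales.filter(F.col("year") == 2024)
             .groupBy("region")
             .agg(F.sum("amount").alias("total"))
             .orderBy(F.col("total").desc())
    )
    result.show()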


PySpark

PySpark Read Write Parquet Files

PySpark / Raj

In this post, we will see how you can read Parquet files using PySpark, and we will also cover common options and challenges you must consider while reading or writing Parquet files. What is the Parquet file format? Parquet is a columnar file format that is becoming very popular because of the optimisations it brings…
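
As a quick preview, a minimal read/write round trip might look like the sketch below; the S3 paths and the "year" partition column are placeholders, not taken from the post:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

    # Read a directory of Parquet files (Spark picks up all part files inside)
    df = spark.read.parquet("s3://my-bucket/input/")

    # Write the result back out, partitioned by a column, replacing old output
    df.write.mode("overwrite").partitionBy("year").parquet("s3://my-bucket/output/")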


PySpark

Rename Column Name case in Dataframe

PySpark / Raj

Requirement: change column names to upper case or lower case in PySpark. The post first creates a dummy DataFrame, then converts its column names to uppercase: you can call the "withColumnRenamed" function in a for loop to switch every column of a PySpark DataFrame to uppercase using the "upper" function. Converting column names to lowercase works the same way; you can…
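
The loop described in the excerpt can be sketched in a few lines; the dummy DataFrame below simply stands in for your own data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rename-case").getOrCreate()

    # Dummy DataFrame for illustration
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # Rename every column to uppercase with withColumnRenamed in a for loop
    for name in df.columns:
        df = df.withColumnRenamed(name, name.upper())

    df.printSchema()  # columns are now ID, VALUE
    # For lowercase, use name.lower() in the same loop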


PySpark

Spark Case Study – optimise executor memory and cores per executor

Apache Spark / Raj

I was recently working on a task where I had to read more than a terabyte of data spread across multiple Parquet files, with some filters applied to that data to get the required result set. I did a small test where I ran the same Spark read command, with a filter condition, multiple times.
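
The knobs such a test varies are the standard executor settings. A minimal sketch follows, with hypothetical sizing values that you would tune for your own nodes and workload rather than copy:

    from pyspark.sql import SparkSession

    # Hypothetical sizing purely for illustration; the case study is about
    # measuring which combination works best for your cluster and data.
    spark = (
        SparkSession.builder
        .appName("executor-tuning")
        .config("spark.executor.memory", "8g")
        .config("spark.executor.cores", "4")
        .config("spark.executor.instances", "10")
        .getOrCreate()
    )

    df = spark.read.parquet("s3://my-bucket/big-dataset/")        # placeholder path
    print(df.filter(df["event_date"] >= "2024-01-01").count())    # hypothetical filter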


Apache Hadoop

Namenode is in safe mode – Hadoop

Amazon EMR / Raj

The most common reason for the NameNode to go into safe mode is under-replicated blocks. This is generally caused by storage issues on HDFS, or by jobs such as Spark applications being aborted suddenly, leaving behind temp files that are under-replicated. If your NameNode is in safe mode, then your Hadoop cluster is in read-only mode…
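
For reference, the stock HDFS admin commands let you check safe mode status and, once you are sure the file system is healthy, leave it; a sketch, to be run against your own cluster with care:

    hdfs dfsadmin -safemode get     # check whether the NameNode is in safe mode
    hdfs fsck /                     # report missing and under-replicated blocks
    hdfs dfsadmin -safemode leave   # force the NameNode out of safe mode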



Topics

  • Amazon EMR
  • Apache HIVE
  • Apache Spark
  • AWS Glue
  • PySpark
  • SQL on Hadoop

Recent Posts

  • AWS Glue create dynamic frame
  • AWS Glue read files from S3
  • How to check Spark run logs in EMR
  • PySpark apply function to column
  • Run Spark Job in existing EMR using AIRFLOW


Join the discussion

  1. Ramkumar on Spark Performance Tuning with help of Spark UI (February 3, 2025)

    Great. Keep writing more articles.

  2. Raj on Free Online SQL to PySpark Converter (August 9, 2022)

    Thank you for sharing this. I will give it a try as well.

  3. John K-W on Free Online SQL to PySpark Converter (August 8, 2022)

    Might be interesting to add a PySpark dialect to SQLglot https://github.com/tobymao/sqlglot https://github.com/tobymao/sqlglot/tree/main/sqlglot/dialects

  4. Meena M on Spark Dataframe WHEN case (July 28, 2022)

    try something like df.withColumn("type", when(col("flag1"), lit("type_1")).when(~col("flag1") & (col("flag2") | col("flag3") | col("flag4") | col("flag5")), lit("type2")).otherwise(lit("other")))

  5. tagu on Free Online SQL to PySpark Converter (July 20, 2022)

    It would be great if you could add a link to the converter. It helps the community for anyone starting…

Copyright © 2025 SQL & Hadoop