About

Introduction

I am Nitin ‘Raj’ Srivastava, founder and author of SQLandHadoop.com. This website will help you understand the basics of ETL on Hadoop, primarily using Hive and Spark. The focus is on the SQL side, because I want to share my experience as an ETL developer and data engineer. If you already know SQL and just want to see how the SQL flavours available in Hive and Spark differ, you will find the website helpful too.

I have worked on various projects migrating existing data warehouses to Hadoop, including Teradata to Hive, Teradata to Spark, Netezza to Spark, and Informatica/Netezza to Spark, and I share what I learned from them on this website. I am well versed in the challenges of migration projects: data migration, ETL job migration, data validation, performance tuning, user acceptance, handling consumption queries, etc.

I would like to take this opportunity to answer a few FAQs:

How often do you update your website?

Whenever I find something interesting in my day-to-day work that I feel others may also want to learn, I put it on the blog. I also revisit old posts once I have gained more knowledge that I can add to them.

How can I subscribe to your website’s content?

You can subscribe to our blog via the subscription widget on the sidebar/footer. Just enter your email and we will inform you whenever we post a new blog.

How about Hive/Spark Online Training?

If you are looking for a specific topic, feel free to give us a shout with the topic details, and we will try to cover it in future posts. However, if you wish to cover several topics and feel that training would be more beneficial than just reading a blog, you may contact us too. I am always happy to help you.

Why should you reach out to me?

If the blog you have read here isn’t enough.
If you have a specific blog post request.
If you want us to conduct Hive/Spark interviews on your behalf. We give exhaustive interview feedback for each candidate.
If you want us to create a Hive/Spark assessment for your organisation. It could be objective or subjective.
If you want us to help you crack Hive/Spark interviews. Our mock interviews cover various Spark/Hive-specific questions and SQL questions to make you feel more confident.
If you are looking for a Hive/Spark freelancer who can assist you with various ETL activities.

How can I contact you?

Contacting us is easy. Just leave a comment below and we will get back to you in no time.

Connect on LinkedIn

Kranti

December 4, 2021 at 3:41 am

Please share the “shoes” Parquet file mentioned here (https://sqlandhadoop.com/pyspark-filter-25-examples-to-teach-you-everything/)

Raj
December 6, 2021 at 10:45 am

Hi Kranti

You can use the path below:
s3://amazon-reviews-pds/parquet/product_category=Shoes/

It is mentioned at the start of the post as well.
df_shoes = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Shoes/")

Best,
Raj

Raj

November 16, 2021 at 1:54 am

Hello Nitin,
This conversion utility is an excellent help for newbies. I wish it were more advanced and handled joins, etc. Is there another version around?

Regards.
Raj

Saravanan

September 5, 2021 at 5:30 am

Hi Nitin,
We currently use the Spark SQL shell to submit our SQL in client mode, but we realised that it is not suitable for production jobs, so we are thinking of switching from client to cluster mode using PySpark.
Since all our scripts are in Spark SQL, can you help me with how to convert them to PySpark with minimal code changes?

Can we write a wrapper script to pass our SQL? I would be happy to get in touch with you for your advice on this.

sherly

July 16, 2021 at 4:45 am

Good. A question: which page do you recommend for understanding how to convert BTEQ from Teradata to PySpark?

Ash

July 10, 2021 at 2:14 am

I have the following Hive SQL. I need the calculation to 5 decimal places.

select Acctnumber, (1/total_Orders) as order_ratio
from customer.Orders ;

I tried using format_number((1/total_Orders),5) as order_ratio, but it gives me the following error. I can make any changes to the setup of the program.

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

vivek

May 13, 2021 at 10:31 pm

Hi Nitin,

Are you providing training?

Raj
May 25, 2021 at 3:59 pm

Not at the moment, and not in the near future.
But you can mention any specific topic you want me to cover in the next few blog posts.

March 21, 2021 at 11:04 am

Hi Nitin,

Where can I download the data file to practice your scenarios?

Raj
May 6, 2021 at 3:29 pm

Hi Vivek
We have used the USA president data CSV file for most of the examples in the posts. To download the file click here

Best
Nitin
