Introduction
I am Nitin ‘Raj’ Srivastava, founder and author of SQLandHadoop.com. This website will help you understand the basics of ETL on Hadoop, primarily using Hive and Spark. The focus is on the SQL side, because I want to share my experience as an ETL developer and data engineer. If you already know SQL and just want to see how the SQL flavours available in Hive and Spark differ, you will also find the website helpful.
I have worked on various projects migrating existing data warehouses to Hadoop, including Teradata to Hive, Teradata to Spark, Netezza to Spark, and Informatica/Netezza to Spark, and I share what I learned from them on this website. I am well versed in the challenges of migration projects: data migration, ETL job migration, data validation, performance tuning, user acceptance, handling of consumption queries, and so on.
I would like to take this opportunity to answer a few FAQs:
How often do you update your website?
Whenever I find something interesting in my day-to-day work that I feel others may also want to learn, I put it on the blog. I also revisit old posts once I have gained more knowledge that I can add to them.
How can I subscribe to your website’s content?
You can subscribe to our blog via the subscription widget in the sidebar/footer. Just enter your email address and we will notify you whenever we publish a new post.
How about Hive/Spark Online Training?
If you are looking for a specific topic, feel free to give us a shout with the topic details and we will try to cover it in future posts. However, if you wish to cover several topics and feel that training would be more beneficial than just reading a blog, you may contact us as well. I am always happy to help.
Why should you reach out to me?
- If the posts you have read here aren’t enough.
- If you have a specific blog post request.
- If you want us to conduct Hive/Spark interviews on your behalf. We give exhaustive interview feedback for each candidate.
- If you want us to create a Hive/Spark assessment for your organisation. This could be objective or subjective.
- If you want us to help you crack Hive/Spark interviews. Our mock interviews cover various Spark/Hive-specific questions and SQL questions to make you feel more confident.
- If you are looking for a Hive/Spark freelancer who can assist you with various ETL activities.
How can I contact you?
Contacting us is easy. Just leave a comment below and we will get back to you in no time.
Please share the “shoes” parquet file mentioned here (https://sqlandhadoop.com/pyspark-filter-25-examples-to-teach-you-everything/)
Hi Kranti
You can use the path below:
s3://amazon-reviews-pds/parquet/product_category=Shoes/
It is mentioned at the start of the post as well.
df_shoes = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Shoes/")
Best,
Raj
Hello Nitin,
This conversion utility is an excellent help for newbies. I wish it were more advanced and covered joins, etc. Is there another version around?
Regards.
Raj
Hi Nitin,
We are currently using the Spark SQL shell to submit our SQL in client mode, but we realised that it is not suitable for production jobs, so we are thinking of switching from client to cluster mode using PySpark.
Since all our scripts are in Spark SQL, can you help me convert them to PySpark with minimal code changes?
Can we write a wrapper script to pass our SQL to? I would be happy to get in touch with you for your advice on this.
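One way to picture the wrapper idea in this question is a small PySpark driver that reads an existing SQL file and runs each statement through spark.sql(), so the SQL itself stays unchanged. This is only a minimal sketch under assumptions: the file name run_sql.py and the helpers split_statements and run_sql_file are made up for illustration, and the semicolon splitting is deliberately naive.

```python
import sys
from pathlib import Path
from typing import List


def split_statements(sql_text: str) -> List[str]:
    """Split a SQL script into statements on semicolons.

    Naive on purpose: it drops full-line `--` comments and does NOT handle
    semicolons inside string literals or inline comments.
    """
    lines = [ln for ln in sql_text.splitlines() if not ln.strip().startswith("--")]
    return [s.strip() for s in "\n".join(lines).split(";") if s.strip()]


def run_sql_file(spark, path: str) -> None:
    """Run every statement in the file through spark.sql(), in order."""
    for stmt in split_statements(Path(path).read_text()):
        # DDL/DML statements execute eagerly; a bare SELECT only builds a
        # lazy DataFrame and would need an action such as .show().
        spark.sql(stmt)


# Runs only when exactly one script path is passed on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    # Imported here so the helpers above can be reused without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("sql-wrapper")      # hypothetical application name
        .enableHiveSupport()         # assumes the scripts use Hive tables
        .getOrCreate()
    )
    try:
        run_sql_file(spark, sys.argv[1])
    finally:
        spark.stop()
```

On YARN, a submission could look something like `spark-submit --deploy-mode cluster --files my_job.sql run_sql.py my_job.sql`, where my_job.sql is a placeholder name; whether the relative path resolves on the driver depends on the cluster setup, so treat that command as an assumption to verify.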
Hello, a question: which page do you recommend for understanding how to convert BTEQ from Teradata to PySpark?
I have the following Hive SQL. I need to get the calculation to 5 decimal places.
select Acctnumber, (1/total_Orders) as order_ratio
from customer.Orders ;
I tried using format_number((1/total_Orders),5) as order_ratio, but it is giving me
the following error. I can make any changes to the setup of the program
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
Hi Nitin ,
Are you providing training?
Not at the moment, and not in the near future.
But you can mention any specific topic you want me to cover in the next few blog posts.
Hi Nitin,
Where can I download the data file to practise your scenarios?
Hi Vivek
We have used the USA president data CSV file for most of the examples in the posts. To download the file, click here.
Best
Nitin