Welcome to my website. I am Nitin Srivastava, a Data Engineer by profession with 14+ years of professional experience. I have worked with multiple enterprises, using various technologies to support their Data Analytics requirements.
As a Data Engineer, my primary skill has always been SQL. So when I started working on Hadoop projects, I was excited to explore the different SQL options available on the platform. I worked a lot with Apache Hive & Apache Spark.
In the early days of Hadoop, enterprises invested heavily in on-premises Hadoop infrastructure. So I got the opportunity to work on the Hortonworks, Cloudera & MapR distributions.
From all that experience, enterprises realised that Apache Spark was the best bet, and it turned out to be the best thing to come out of that era. Spark is now widely used by enterprises for a range of data analytics requirements.
After a few years, I got the opportunity to work on Apache Spark/Hive on the AWS platform, primarily leveraging AWS Glue & Amazon EMR.
Get started on Apache Spark with these free resources
SQL to PySpark Converter
Do you want to convert SQL into PySpark DataFrame code?
I created this utility as a weekend project. It can convert basic SQL queries into PySpark code.
I have shared the code used for the project, and you are free to use it and customise it as per your requirements.
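To give a flavour of what such a conversion involves, here is a minimal sketch. It is not the code behind the utility; `sql_to_pyspark` is a hypothetical helper that only handles a plain `SELECT ... FROM ... [WHERE ...]` and returns the equivalent PySpark code as a string. It assumes a `spark` session exists wherever the generated code is run.

```python
import re

# Hypothetical pattern for illustration: matches only
# SELECT <cols> FROM <table> [WHERE <condition>]
_SELECT_RE = re.compile(
    r"^\s*SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\w+)"
    r"(?:\s+WHERE\s+(?P<where>.+?))?\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def sql_to_pyspark(sql: str) -> str:
    """Return PySpark DataFrame code (as text) for a very simple SELECT."""
    match = _SELECT_RE.match(sql)
    if match is None:
        raise ValueError("only SELECT ... FROM ... [WHERE ...] is supported")

    table = match.group("table")
    where = match.group("where")
    # Naive split: expressions containing commas, e.g. concat(a, b), are not handled.
    cols = [c.strip() for c in match.group("cols").split(",")]

    code = f'spark.table("{table}")'
    if where:
        # The condition is passed through as a SQL expression string.
        # Conditions containing double quotes are not handled by this sketch.
        code += f'.filter("{where.strip()}")'
    if cols != ["*"]:
        col_args = ", ".join(f'"{c}"' for c in cols)
        code += f".selectExpr({col_args})"
    return code


if __name__ == "__main__":
    print(sql_to_pyspark("SELECT id, name FROM employees WHERE salary > 1000"))
```

The filter is applied before the projection so that the `WHERE` clause can still reference columns that are not in the select list. Joins, grouping and ordering would each need their own handling.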
Spark Memory Configuration Generator
I created this utility while I was learning about Spark memory management and how to optimise it.
Try it to generate an optimised Spark memory configuration for your Spark application.
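As a rough illustration of the kind of calculation involved, here is a minimal sketch based on a common rule of thumb, not the logic of the utility itself. The function name and the assumptions (reserve 1 core and 1 GB per node for the OS, about 5 cores per executor, keep one executor slot free for the driver or application master) are mine; the 10% / 384 MB overhead mirrors Spark's default for `spark.executor.memoryOverhead`.

```python
def suggest_spark_memory_config(nodes, cores_per_node, memory_gb_per_node,
                                cores_per_executor=5):
    """Suggest executor settings from cluster size (rule-of-thumb sketch)."""
    # Assumption: leave 1 core and 1 GB per node for the OS and daemons.
    usable_cores = cores_per_node - 1
    usable_memory_gb = memory_gb_per_node - 1

    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("not enough cores per node for one executor")

    # Assumption: keep one executor slot free for the driver / application master.
    total_executors = max(1, executors_per_node * nodes - 1)

    # Each executor container holds heap + overhead, with overhead ~10% of heap.
    container_gb = usable_memory_gb / executors_per_node
    heap_gb = int(container_gb / 1.10)
    if heap_gb < 1:
        raise ValueError("not enough memory per node for one executor")
    overhead_mb = max(384, int(heap_gb * 1024 * 0.10))

    return {
        "spark.executor.instances": str(total_executors),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{heap_gb}g",
        "spark.executor.memoryOverhead": f"{overhead_mb}m",
    }


if __name__ == "__main__":
    # Example cluster: 10 nodes, 16 cores and 64 GB each.
    for key, value in suggest_spark_memory_config(10, 16, 64).items():
        print(f"--conf {key}={value}")
```

Treat the numbers as a starting point only; the right values depend on the workload, the cluster manager and whether dynamic allocation is enabled.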
SQL JDBC Connection String Generator
Do you connect Spark to different RDBMS via JDBC?
Then this utility will help you quickly generate a Spark JDBC connection string for importing & exporting data.
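For reference, here is a minimal sketch of what generating such a connection string might look like. The helper name `build_jdbc_options` and the two database templates are assumptions for illustration; the option keys (`url`, `dbtable`, `user`, `password`, `driver`) are the standard Spark JDBC data source options.

```python
# Assumed URL templates, driver classes and default ports for two common databases.
_JDBC_TEMPLATES = {
    "postgresql": ("jdbc:postgresql://{host}:{port}/{database}",
                   "org.postgresql.Driver", 5432),
    "mysql": ("jdbc:mysql://{host}:{port}/{database}",
              "com.mysql.cj.jdbc.Driver", 3306),
}


def build_jdbc_options(db_type, host, database, table, user, password, port=None):
    """Build the option dict used by Spark's JDBC reader and writer."""
    try:
        url_template, driver, default_port = _JDBC_TEMPLATES[db_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported database type: {db_type}") from None

    url = url_template.format(host=host, port=port or default_port,
                              database=database)
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": driver,
    }


if __name__ == "__main__":
    # Placeholder host, database and credentials for illustration.
    options = build_jdbc_options("postgresql", "dbhost", "sales",
                                 "public.orders", "etl_user", "secret")
    print(options["url"])
    # With a SparkSession and the JDBC driver jar on the classpath:
    #   df = spark.read.format("jdbc").options(**options).load()          # import
    #   df.write.format("jdbc").options(**options).mode("append").save()  # export
```

In a real job the password would normally come from a secrets store rather than being hard-coded.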
PySpark Cheat Sheet
Check my blog post list
On this website I share my experience with SQL on the “Hadoop” platform. I write posts about Apache Hive, Apache Spark, PySpark, Amazon EMR & AWS Glue.
Apache Hive Basics
- hive sql tutorial
- hive variables
- hive partition
- hive select query
- hive distinct
- hive where
- hive subquery example
- hive between
- bucketized tables do not support
Apache Hive Date/Timestamp
Apache Hive Table Design
Apache Spark Basics
- spark_major_version
- spark.sql.optimizer.maxiterations
- spark recursive query
- spark sql round
- spark performance tuning
- spark dynamicallocation enabled
- spark executor cores
- spark configuration
- spark insert overwrite
Apache Spark Dataframe
- spark select
- spark alias
- spark dataframe filter
- spark isin
- spark rlike
- spark case when
- spark dataframe orderby
- spark replace
- spark concat
- spark drop duplicates
- spark join
- spark update column value
- spark aggregate functions
- spark union
- spark column to list
- spark show
- spark explode
- spark dataframe null value
- monotonically_increasing_id
- spark repartition
- spark add multiple columns
Apache Spark JDBC
PySpark Basics
- first pyspark script
- pyspark script
- zipwithindex
- convert from teradata to pyspark
- pyspark rdd operations
- pyspark map_filter
- pyspark md5 hash
- pyspark read csv
- pyspark read parquet
- sql to pyspark converter – Concept
- sql to dataframe conversion – Manual
- sql to pyspark converter – Automation
PySpark Dataframe
- pyspark distinct
- pyspark lowercase
- pyspark filter
- pyspark cheat sheet
- pyspark format number
- pyspark apply function to column
PySpark Date