SQL & Hadoop

SQL on Hadoop with Hive, Spark & PySpark on EMR & AWS Glue

AWS Glue

AWS Glue create dynamic frame

AWS Glue / Raj

We can create an AWS Glue dynamic frame from data stored in S3 or from tables that exist in the Glue Data Catalog. In addition, we can create dynamic frames using custom connections. In this post, we will create a new Glue job that reads from S3 and a Glue catalog table to create a new AWS Glue […]
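A minimal sketch of the two creation paths described above, runnable only inside a Glue job environment; the database, table, and bucket names are placeholders, not real resources:

```python
# Options for reading CSV files straight from S3 (placeholder bucket/path).
connection_options = {"paths": ["s3://my-bucket/input/"], "recurse": True}
format_options = {"withHeader": True, "separator": ","}

try:
    # awsglue ships with the Glue runtime; it is not available on PyPI.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # 1) DynamicFrame from a table already registered in the Glue Data Catalog.
    dyf_catalog = glue_context.create_dynamic_frame.from_catalog(
        database="my_database", table_name="my_table")

    # 2) DynamicFrame directly from files in S3, without a catalog table.
    dyf_s3 = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options=connection_options,
        format="csv",
        format_options=format_options)
except ImportError:
    pass  # outside a Glue job, only the option dicts above are defined
```

The same `GlueContext` also exposes `create_dynamic_frame.from_options` with JDBC-style `connection_type` values for custom connections.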


AWS Glue

AWS Glue read files from S3

AWS Glue / Raj

You can use an AWS Glue crawler to read files from S3 and create corresponding tables in the Glue Data Catalog. In this tutorial we will read a few files present in S3 and create corresponding tables in the AWS Glue catalog. We will use a Glue crawler to infer the S3 file schema and create the tables. Check the […]
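As a hedged sketch of the crawler step, here is the Glue API call via boto3; the crawler name, IAM role ARN, database, and S3 path are all placeholders, and actually running the function requires AWS credentials:

```python
# Crawler definition: point a crawler at an S3 path so it infers the file
# schema and registers a table in the Glue Data Catalog (placeholder values).
crawler_config = {
    "Name": "s3-files-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "my_database",
    "Targets": {"S3Targets": [{"Path": "s3://my-bucket/input/"}]},
}

try:
    import boto3  # calls below need AWS credentials to actually run

    def create_and_start_crawler(config):
        glue = boto3.client("glue")
        glue.create_crawler(**config)            # register the crawler
        glue.start_crawler(Name=config["Name"])  # run it; tables appear in the catalog
except ImportError:
    pass  # boto3 not installed; only the config dict above is defined
```

Once the crawler finishes, the inferred table is queryable from Athena or readable as a Glue dynamic frame via `from_catalog`.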


Apache Spark

How to check Spark run logs in EMR

Amazon EMR / Raj

Situation: someone on my team ran a Spark application in EMR and the job failed. The user is new to EMR and does not know how to check the Spark logs. He has asked me to debug it and find the error. The only information I have is the YARN application_id. In this […]
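Starting from only the application_id, two typical places to look, sketched below; the cluster id and log bucket are placeholders, and the S3 layout shown is the usual EMR log-aggregation layout configured at cluster creation:

```python
# Given only a YARN application id, two ways to locate the Spark logs on EMR.
application_id = "application_1673000000000_0001"

# 1) While the cluster is still running, on the master node:
yarn_cmd = f"yarn logs -applicationId {application_id}"

# 2) After the cluster terminates, aggregated container logs are pushed to
#    the cluster's S3 log URI (placeholder bucket and cluster id):
cluster_id = "j-ABCDEFGHIJKLM"
log_bucket = "s3://my-emr-logs"
s3_log_path = f"{log_bucket}/{cluster_id}/containers/{application_id}/"

print(yarn_cmd)
print(s3_log_path)
```

The driver's stderr under that containers path is usually where the failing stack trace lives.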


PySpark

PySpark apply function to column

PySpark / Raj

In PySpark you apply a function to a column of a DataFrame to get the desired transformation as output. In this post, we will see two of the most common ways of applying a function to a column in PySpark: first, applying Spark's built-in functions to a column, and second, applying a user-defined custom function to columns in a DataFrame. PySpark apply Spark […]
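The two ways contrasted above can be sketched as follows; the DataFrame, column names, and `shout_fn` helper are illustrative only, and running the Spark part requires pyspark plus a local JVM:

```python
# Plain-Python transformation that we will also wrap as a Spark UDF below.
def shout_fn(s):
    return s.upper() + "!"

try:
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.master("local[1]").appName("apply-fn").getOrCreate()
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

    # 1) Built-in column function: runs inside the JVM, usually the fastest option.
    df = df.withColumn("name_upper", F.upper(F.col("name")))

    # 2) User-defined function: arbitrary Python, but pays a per-row
    #    serialization cost between the JVM and the Python workers.
    shout = F.udf(shout_fn, StringType())
    df = df.withColumn("name_shout", shout(F.col("name")))
    df.show()
except ImportError:
    pass  # pyspark not installed; shout_fn above still works standalone
```

Preferring built-in functions over UDFs is the usual performance advice, since UDFs are opaque to the Catalyst optimizer.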


Run Spark applications using Airflow

Run Spark Job in existing EMR using AIRFLOW

Amazon EMR / Raj

In this post, we will see how you can run a Spark application on an existing EMR cluster using Apache Airflow. The most basic way of scheduling jobs in EMR is crontab, but if you have worked with crontab you know how much of a pain it is to manage and secure. I will not talk in depth […]
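A hedged sketch of the Airflow pieces involved: a spark-submit step added to an already-running EMR cluster, then a sensor that waits for it. The cluster id, DAG id, and script path are placeholders, and the operators come from the Amazon provider package:

```python
# EMR step definition: run spark-submit through command-runner.jar
# (placeholder S3 script path).
SPARK_STEP = [{
    "Name": "run-spark-app",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "--deploy-mode", "cluster",
                 "s3://my-bucket/scripts/my_app.py"],
    },
}]

try:
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor
    import pendulum

    with DAG("spark_on_existing_emr",
             start_date=pendulum.datetime(2024, 1, 1),
             schedule=None, catchup=False) as dag:
        add_step = EmrAddStepsOperator(
            task_id="add_step",
            job_flow_id="j-ABCDEFGHIJKLM",  # existing cluster id (placeholder)
            steps=SPARK_STEP)
        wait = EmrStepSensor(
            task_id="wait_for_step",
            job_flow_id="j-ABCDEFGHIJKLM",
            step_id="{{ task_instance.xcom_pull('add_step')[0] }}")
        add_step >> wait
except ImportError:
    pass  # requires apache-airflow and the amazon provider package
```

Unlike crontab, the DAG gives you retries, alerting, and a visible run history for free.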



Topics

  • Amazon EMR
  • Apache HIVE
  • Apache Spark
  • AWS Glue
  • PySpark
  • SQL on Hadoop

Recent Posts

  • AWS Glue create dynamic frame
  • AWS Glue read files from S3
  • How to check Spark run logs in EMR
  • PySpark apply function to column
  • Run Spark Job in existing EMR using AIRFLOW

Join the discussion

  1. Ramkumar on Spark Performance Tuning with help of Spark UI, February 3, 2025

    Great. Keep writing more articles.

  2. Raj on Free Online SQL to PySpark Converter, August 9, 2022

    Thank you for sharing this. I will give it a try as well.

  3. John K-W on Free Online SQL to PySpark Converter, August 8, 2022

    Might be interesting to add a PySpark dialect to SQLglot https://github.com/tobymao/sqlglot https://github.com/tobymao/sqlglot/tree/main/sqlglot/dialects

  4. Meena M on Spark Dataframe WHEN case, July 28, 2022

    try something like df.withColumn("type", when(col("flag1"), lit("type_1")).when(!col("flag1") && (col("flag2") || col("flag3") || col("flag4") || col("flag5")), lit("type2")).otherwise(lit("other")))

  5. tagu on Free Online SQL to PySpark Converter, July 20, 2022

    It will be great if you can have a link to the convertor. It helps the community for anyone starting…

Copyright © 2025 SQL & Hadoop