AWS Glue

What is AWS Glue?

AWS Glue is a fully managed AWS service used primarily for ETL (extract, transform, load) processes. You use it to apply technical and business transformations to your data. AWS Glue is serverless, so you don’t have to worry about maintaining servers or other operational overhead such as installing updates and security patches.

What kind of jobs can I create in AWS Glue?

AWS Glue presently supports the Spark engine (batch and streaming) and Python shell for building jobs. You can develop Spark jobs in Python or Scala.

If you don’t have a preference, I recommend picking Python over Scala. The main reason is the boto3 library (the AWS SDK for Python), which you can easily use in Python applications to call other AWS services.
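To give a rough idea of why boto3 is convenient, here is a minimal sketch that starts a Glue job run and polls until it finishes. The function name `run_glue_job`, the job name and the `--input_path` argument are hypothetical; `start_job_run` and `get_job_run` are calls on the boto3 Glue client.

```python
import time

# States after which a Glue job run will not change any more.
FINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_glue_job(glue_client, job_name, arguments=None, poll_seconds=30):
    """Start a Glue job run and wait until it reaches a final state.

    glue_client is expected to be boto3.client("glue").
    Returns a (run_id, final_state) tuple.
    """
    response = glue_client.start_job_run(JobName=job_name, Arguments=arguments or {})
    run_id = response["JobRunId"]

    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state in FINAL_STATES:
            return run_id, state
        time.sleep(poll_seconds)


# Usage (needs boto3, AWS credentials and an existing job; names are made up):
#   import boto3
#   run_glue_job(boto3.client("glue"), "my-etl-job",
#                {"--input_path": "s3://my-bucket/raw/"})
```

The client is passed in as a parameter so the same function can be tried out with a stub before pointing it at a real account.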

How is AWS Glue used for ETL jobs?

AWS Glue is used to build ETL pipelines on Spark. You can use the Spark SQL API or the DataFrame API to develop ETL jobs. You can also use Python shell jobs for any pre-processing or post-processing of data.
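As an illustration, the sketch below writes the same small transformation once with the DataFrame API and once with Spark SQL. The `orders` data set, its `country` and `amount` columns, and the S3 paths are assumptions made up for this example; they are not part of any real pipeline.

```python
# Spark SQL version of the transformation (table and columns are hypothetical).
ORDERS_SQL = """
SELECT country, SUM(amount) AS total_amount
FROM orders
WHERE amount > 0
GROUP BY country
"""


def transform_with_dataframe_api(orders):
    """Keep orders with a positive amount and total them per country."""
    return (
        orders.filter("amount > 0")
        .groupBy("country")
        .agg({"amount": "sum"})
        # The dict-style agg names the column "sum(amount)"; give it a clean name.
        .withColumnRenamed("sum(amount)", "total_amount")
    )


def transform_with_spark_sql(spark, orders):
    """Same result, expressed as a SQL query over a temporary view."""
    orders.createOrReplaceTempView("orders")
    return spark.sql(ORDERS_SQL)


def run_job(input_path, output_path):
    """Read CSV from S3, transform it and write Parquet back to S3."""
    # Imported here because pyspark is only available where Spark is installed,
    # for example inside a Glue Spark job.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orders-etl").getOrCreate()
    orders = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv(input_path)
    )
    transform_with_dataframe_api(orders).write.mode("overwrite").parquet(output_path)


# Usage inside a Glue Spark job (paths are made up):
#   run_job("s3://my-bucket/raw/orders/", "s3://my-bucket/curated/orders_by_country/")
```

Which of the two styles you pick is mostly a matter of taste; both run on the same Spark engine.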

Other AWS services also integrate with AWS Glue very easily.

What is AWS Glue Catalog?

The AWS Glue catalog (officially the AWS Glue Data Catalog) stores metadata. It is a repository holding the details of all the databases and tables defined in Glue. The Glue catalog can also be used by other services such as Amazon Athena, Amazon EMR and Amazon Redshift Spectrum.
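To see what kind of metadata the catalog holds, the following sketch lists the tables of one catalog database together with their columns. The function name and the database name `sales_db` are hypothetical; the `get_tables` paginator is part of the boto3 Glue client.

```python
def list_catalog_tables(glue_client, database_name):
    """Return {table_name: [(column_name, column_type), ...]} for one database.

    glue_client is expected to be boto3.client("glue").
    """
    tables = {}
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            # Column definitions live under the table's StorageDescriptor.
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            tables[table["Name"]] = [(c["Name"], c.get("Type", "")) for c in columns]
    return tables


# Usage (needs boto3 and AWS credentials; the database name is made up):
#   import boto3
#   for name, columns in list_catalog_tables(boto3.client("glue"), "sales_db").items():
#       print(name, columns)
```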

What is AWS Glue crawler?

An AWS Glue crawler retrieves metadata from the source data and creates tables in the AWS Glue catalog. This metadata includes column names, column data types, data size, creation date, partitions and other related information.

You can think of an AWS Glue crawler as a program that goes to the data location and fetches the relevant information to create tables in the Glue catalog. You don’t have to write any code. Only some basic information, such as source and target details, is required.

You can also create tables manually, but that is time-consuming and error-prone.
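Crawlers are usually set up in the console, but the source and target details mentioned above can also be supplied through boto3. The sketch below assumes an S3 source; the crawler name, IAM role ARN, database name and S3 path are placeholders you would replace with your own.

```python
def crawler_request(name, role_arn, database_name, s3_path):
    """Build the parameters for the Glue create_crawler call.

    s3_path is the source to crawl; database_name is the catalog database
    (the target) in which the crawler creates tables.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database_name,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue_client, name, role_arn, database_name, s3_path):
    """Create the crawler and start its first run."""
    request = crawler_request(name, role_arn, database_name, s3_path)
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=name)
    return request


# Usage (needs boto3 and AWS credentials; every value below is made up):
#   import boto3
#   create_and_start_crawler(
#       boto3.client("glue"),
#       name="orders-crawler",
#       role_arn="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
#       database_name="sales_db",
#       s3_path="s3://my-bucket/raw/orders/",
#   )
```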

What is the AWS Glue architecture?

AWS Glue is serverless, so the underlying architecture matters much less to users than it does for services like Amazon Redshift. Users only have to select the number and type of workers (nodes) to be used for a Glue job.
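In the Glue API, the node count and node type correspond to the `NumberOfWorkers` and `WorkerType` settings of a job. The sketch below builds the parameters for a Spark job definition; the job name, role ARN, script location and the defaults chosen here (`G.1X`, 2 workers, Glue version `4.0`) are only example values.

```python
def spark_job_request(name, role_arn, script_location,
                      worker_type="G.1X", number_of_workers=2):
    """Build the parameters for the Glue create_job call for a Spark ETL job."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # "glueetl" marks a Spark ETL job
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
    }


# Usage (needs boto3 and AWS credentials; every value below is made up):
#   import boto3
#   boto3.client("glue").create_job(**spark_job_request(
#       "orders-etl",
#       "arn:aws:iam::123456789012:role/MyGlueJobRole",
#       "s3://my-bucket/scripts/orders_etl.py",
#   ))
```

These two settings are essentially all the capacity planning a Glue job asks of you; the service provisions and releases the machines itself.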

AWS Glue does have distinct components, though. The main ones are:

  • Catalog
    • Databases / Tables
    • Crawlers / Classifiers
    • Connections
  • ETL
    • Jobs
    • Triggers
    • Workflows
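To show how the ETL components fit together, here is a minimal sketch that creates a workflow and attaches a scheduled trigger that starts an existing job. The workflow, trigger and job names and the schedule are hypothetical; `create_workflow` and `create_trigger` are calls on the boto3 Glue client.

```python
def scheduled_trigger_request(trigger_name, workflow_name, job_name, cron_expression):
    """Build the parameters for a scheduled trigger that starts one job."""
    return {
        "Name": trigger_name,
        "WorkflowName": workflow_name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def create_scheduled_workflow(glue_client, workflow_name, trigger_name,
                              job_name, cron_expression):
    """Create a workflow, then a trigger inside it that runs the job on a schedule."""
    glue_client.create_workflow(Name=workflow_name)
    request = scheduled_trigger_request(
        trigger_name, workflow_name, job_name, cron_expression
    )
    glue_client.create_trigger(**request)
    return request


# Usage (needs boto3 and AWS credentials; names are made up).
# "0 2 * * ? *" means every day at 02:00 UTC.
#   import boto3
#   create_scheduled_workflow(boto3.client("glue"), "nightly-orders",
#                             "nightly-orders-start", "orders-etl", "0 2 * * ? *")
```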

AWS Glue Tutorial for Beginners

In this beginner-level AWS Glue tutorial series, I have limited the tutorials to AWS Glue and Amazon S3 only. That does not mean Glue’s capabilities are limited to S3; I have restricted the scope to keep things simple.