AWS Glue create dynamic frame

We can create an AWS Glue dynamic frame from data stored in S3 or from tables that exist in the Glue catalog. Dynamic frames can also be created through custom connections. In this post, we will create a new Glue job that reads S3 data and a Glue catalog table to create two new AWS Glue dynamic frames.

AWS Glue create dynamic frame from S3

In the AWS Glue console, click the Jobs link in the left panel.
Click the “Add Job” button.
In the window that opens, enter a job name and select the role we created in the previous tutorial.
Select Spark as the Type and choose the “new script” option.
Now open the Security section and reduce the number of workers from 10 to 3.

[Screenshot: AWS Glue create new Job]

Click Next at the bottom.
The next screen is for connections. In a new AWS Glue account, it will be blank.
Click the “Save job and edit script” button.
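
The console steps above can also be expressed in code. The following is a minimal sketch, assuming the boto3 library and valid AWS credentials are available; the job name, role name, script location, worker type, and Glue version are hypothetical placeholders, not values from this tutorial. The definition is built as a plain dictionary so it can be inspected before anything is sent to AWS.

```python
def build_job_definition(name, role, script_location, workers=3):
    """Return keyword arguments for the boto3 Glue client's create_job call."""
    return {
        "Name": name,
        "Role": role,
        "Command": {
            "Name": "glueetl",  # Spark job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",   # assumption: pick the version your account uses
        "WorkerType": "G.1X",   # assumption: the tutorial does not name a worker type
        "NumberOfWorkers": workers,
    }


def create_job(definition):
    """Send the definition to AWS Glue (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    return boto3.client("glue").create_job(**definition)


# Hypothetical values for illustration only.
job_definition = build_job_definition(
    name="create-dynamic-frame-demo",
    role="example-glue-role",
    script_location="s3://example-bucket/scripts/create_dynamic_frame.py",
)
print(job_definition["NumberOfWorkers"])
```

Calling `create_job(job_definition)` would then register the job; it is left uncalled here because it needs a real role and script location.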

You have now created an empty job. Let’s write some Glue code to create a dynamic frame from S3 data.

Copy the code below. You can add this setup code to almost any Glue Spark job.

# common imports and GlueContext setup for Glue Spark jobs

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())

Now, to create a dynamic frame from S3, use the code below.

# creating dynamic frame from S3 data

dyn_frame_s3 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://<bucket name>/data/sales/"],
        "inferSchema": "true",
    },
    format="csv",
    format_options={"separator": "\t"},
    transformation_ctx="",
)

print(dyn_frame_s3.count())
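
The S3 read above hard-codes the bucket, path, and separator. As a rough sketch, the same arguments could be built by a small helper so that one job script can read different prefixes. The helper name `s3_csv_read_args` and the bucket name are hypothetical; the option keys are copied from the snippet above.

```python
def s3_csv_read_args(bucket, prefix, separator="\t"):
    """Build keyword arguments for glueContext.create_dynamic_frame_from_options."""
    return {
        "connection_type": "s3",
        "connection_options": {
            "paths": [f"s3://{bucket}/{prefix}"],
            "inferSchema": "true",  # kept as in the tutorial's snippet
        },
        "format": "csv",
        "format_options": {"separator": separator},
        "transformation_ctx": "",
    }


# Hypothetical bucket name for illustration only.
read_args = s3_csv_read_args("example-bucket", "data/sales/")
print(read_args["connection_options"]["paths"])

# Inside a Glue job, where glueContext exists, the arguments would be used as:
# dyn_frame_s3 = glueContext.create_dynamic_frame_from_options(**read_args)
```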

AWS Glue create dynamic frame from Glue catalog table

We will use one of the tables we created in the previous tutorial.

# creating dynamic frame from Glue catalog table

dyn_frame_catalog = glueContext.create_dynamic_frame_from_catalog(
    database="db_readfile",
    table_name="sales",
    transformation_ctx="",
)

print(dyn_frame_catalog.count())

Complete Glue job code

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())

# creating dynamic frame from S3 data

dyn_frame_s3 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://<bucket name>/data/sales/"],
        "inferSchema": "true",
    },
    format="csv",
    format_options={"separator": "\t"},
    transformation_ctx="",
)

print(dyn_frame_s3.count())

# creating dynamic frame from Glue catalog table

dyn_frame_catalog = glueContext.create_dynamic_frame_from_catalog(
    database="db_readfile",
    table_name="sales",
    transformation_ctx="",
)

print(dyn_frame_catalog.count())

Save the Glue job and click Run Job.
You can check from the print output that the counts match in both cases.
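
Instead of comparing the two printed numbers by eye, the job could fail when they differ. This is a small sketch with a hypothetical helper name, `check_counts_match`; the literal counts below are placeholders, not results from the tutorial's data.

```python
def check_counts_match(s3_count, catalog_count):
    """Return the shared row count, or raise if the two sources disagree."""
    if s3_count != catalog_count:
        raise ValueError(
            f"Row counts differ: S3={s3_count}, catalog={catalog_count}"
        )
    return s3_count


# Inside the Glue job this would be called as:
# check_counts_match(dyn_frame_s3.count(), dyn_frame_catalog.count())
print(check_counts_match(100, 100))  # placeholder counts
```

Raising an exception makes the job run show as failed, which is easier to notice than a mismatch buried in the logs.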