We can create an AWS Glue dynamic frame from data present in S3 or from tables that exist in the Glue catalog. In addition, we can create dynamic frames using custom connections. In this post, we will create a new Glue job that reads both S3 data and a Glue catalog table to create AWS Glue dynamic frames.
AWS Glue create dynamic frame from S3
In the AWS Glue console, click on the Jobs link in the left panel.
Click on the “Add Job” button.
A new window will open. Fill in the name and select the role we created in the previous tutorial.
Select the Type as Spark and choose the “new script” option.
Now click on the Security section and reduce the number of workers from 10 to 3.
Click on Next at the bottom.
The next screen is for connections. If this is a new AWS Glue account, the list will be blank.
Click on the “Save job and edit script” button.
Now you have created an empty job. Let’s write some Glue code to create a dynamic frame from S3 data.
Copy the code below. This boilerplate can be added to almost all Glue Spark jobs.
# creating dynamic frame from S3 data & Glue catalog table
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())
Now, to create a dynamic frame from S3, use the code below.
# creating dynamic frame from S3 data
dyn_frame_s3 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://<bucket name>/data/sales/"],
        "inferSchema": "true"
    },
    format="csv",
    format_options={"separator": "\t"},
    transformation_ctx="")

print(dyn_frame_s3.count())
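The format_options above tell Glue that the sales files are tab-separated. As a quick illustration of what that setting means, here is a plain-Python sketch (no Glue required) that parses a tab-separated sample the same way and counts the data rows; the sample records are made up for illustration only:

```python
import csv
import io

# made-up tab-separated sample standing in for the files under data/sales/
sample = "date\tregion\tamount\n2020-01-01\teast\t100\n2020-01-02\twest\t250\n"

# delimiter="\t" plays the same role as Glue's "separator": "\t"
reader = csv.reader(io.StringIO(sample), delimiter="\t")
header = next(reader)   # first row holds the column names
rows = list(reader)     # remaining rows are data records

print(header)     # ['date', 'region', 'amount']
print(len(rows))  # 2
```

If the separator did not match the file contents, each row would be parsed as a single wide column, which is a common cause of unexpected schemas in Glue.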
AWS Glue create dynamic frame from Glue catalog table
We will use one of the tables we created in the previous tutorial.
# creating dynamic frame from Glue catalog table
dyn_frame_catalog = glueContext.create_dynamic_frame_from_catalog(
    database="db_readfile",
    table_name="sales",
    transformation_ctx="")

print(dyn_frame_catalog.count())
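Notice that, unlike the S3 read, the catalog read takes no path, format, or separator: Glue looks those up from the metadata the crawler stored for db_readfile/sales. Conceptually it behaves like a dictionary lookup from (database, table) to storage details; the toy sketch below illustrates the idea (the dictionary and the resolve_table helper are hypothetical stand-ins, not the real catalog API):

```python
# toy stand-in for the Glue Data Catalog: database -> table -> metadata
catalog = {
    "db_readfile": {
        "sales": {
            "location": "s3://<bucket name>/data/sales/",
            "format": "csv",
            "separator": "\t",
        }
    }
}

def resolve_table(database, table_name):
    """Look up where a table's data lives and how to parse it."""
    return catalog[database][table_name]

meta = resolve_table("db_readfile", "sales")
print(meta["location"])  # s3://<bucket name>/data/sales/
print(meta["format"])    # csv
```

This is why both reads in this post can return the same data: the catalog entry simply points back at the same S3 location we read directly above.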
Complete Glue job code
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())

# creating dynamic frame from S3 data
dyn_frame_s3 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://<bucket name>/data/sales/"],
        "inferSchema": "true"
    },
    format="csv",
    format_options={"separator": "\t"},
    transformation_ctx="")

print(dyn_frame_s3.count())

# creating dynamic frame from Glue catalog table
dyn_frame_catalog = glueContext.create_dynamic_frame_from_catalog(
    database="db_readfile",
    table_name="sales",
    transformation_ctx="")

print(dyn_frame_catalog.count())
Save the Glue job and click on “Run Job”.
You can check from the print output that the counts match in both cases.