You can use an AWS Glue crawler to read files from S3 and create the corresponding tables in the Glue catalog. In this tutorial we will read a few files stored in S3 and let a Glue crawler infer their schema and create one table per folder in the AWS Glue catalog.
Check the files on S3
Go to the Amazon S3 path that holds the files. In this tutorial I have 3 folders, each containing a text file.
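If you prefer to check the folders from a script instead of the console, the sketch below lists the immediate "folders" under a prefix. It assumes a boto3 S3 client is passed in; the function name list_folders is my own, and the bucket and prefix are placeholders for your own values.

```python
def list_folders(s3_client, bucket, prefix):
    """Return the immediate "folder" prefixes under s3://bucket/prefix."""
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    folders = []
    # Delimiter="/" makes S3 group keys by their next path segment.
    kwargs = {"Bucket": bucket, "Prefix": prefix, "Delimiter": "/"}
    while True:
        response = s3_client.list_objects_v2(**kwargs)
        folders.extend(item["Prefix"] for item in response.get("CommonPrefixes", []))
        if not response.get("IsTruncated"):
            return folders
        # More results are available: continue from where S3 stopped.
        kwargs["ContinuationToken"] = response["NextContinuationToken"]
```

With boto3 installed and credentials configured, you would call it as list_folders(boto3.client("s3"), "<bucket-name>", "data") and expect one entry per folder.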
Create a Glue crawler to read files from S3
Go to the AWS Glue console and click on Crawlers in the left pane.
- In Crawler info, give the name “readS3file” and click on Next.
- Select Data stores as the crawler source type and choose to crawl all folders. Click on Next.
- In Data store, select S3 and in the include path give the S3 path where all 3 folders exist, e.g. s3://<bucket-name>/data. Click on Next.
- Select No for Add another data store and click on Next.
- If you have not created a role for Glue yet, it will ask you to create one. Select Create an IAM role and type “readS3” in the text box. This will create a new IAM role named “AWSGlueServiceRole-readS3”. Click on Next.
- For Frequency, select Run on demand. Click on Next.
- Select an existing database or give the name of a new one. Click on Add database, type “db_readfile” as the name of the new database and create it. If the database drop-down does not refresh automatically, click on Back and then on Next again.
- Review all the info and click on Finish.
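The console steps above can also be sketched with boto3. This is only an illustration under a few assumptions: the IAM role “AWSGlueServiceRole-readS3” already exists, the database “db_readfile” does not exist yet (create_database raises an error if it does), and the helper names are my own. Leaving out the Schedule parameter corresponds to the "run on demand" choice.

```python
CRAWLER_NAME = "readS3file"
DATABASE_NAME = "db_readfile"
ROLE_NAME = "AWSGlueServiceRole-readS3"  # assumed to exist already


def build_crawler_request(s3_path):
    """Build the keyword arguments for glue.create_crawler."""
    return {
        "Name": CRAWLER_NAME,
        "Role": ROLE_NAME,
        "DatabaseName": DATABASE_NAME,
        # One S3 data store; all folders below the path are crawled.
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # No "Schedule" key: the crawler only runs when started explicitly.
    }


def create_crawler(glue_client, s3_path):
    """Create the target database and the crawler, return the request used."""
    glue_client.create_database(DatabaseInput={"Name": DATABASE_NAME})
    request = build_crawler_request(s3_path)
    glue_client.create_crawler(**request)
    return request
```

A call would look like create_crawler(boto3.client("glue"), "s3://<bucket-name>/data").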
You have now created your first AWS Glue crawler. The next step is to run it.
How to run the AWS Glue crawler?
Select the crawler you just created and click on the “Run crawler” button.
The crawler status will change from READY -> STARTING -> RUNNING -> STOPPING -> READY.
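The same run can be triggered and watched from code. The sketch below assumes a boto3 Glue client; the function name and the polling defaults are my own choices. As far as I know, the API reports only the states READY, RUNNING and STOPPING, so the STARTING state shown in the console may never appear here.

```python
import time


def run_crawler_and_wait(glue_client, name, poll_seconds=15, timeout_seconds=1800):
    """Start the crawler and poll until it is READY again.

    Returns the distinct states observed, in order.
    """
    glue_client.start_crawler(Name=name)
    deadline = time.monotonic() + timeout_seconds
    states = []
    while time.monotonic() < deadline:
        # Wait before each poll so the run has time to leave READY.
        time.sleep(poll_seconds)
        state = glue_client.get_crawler(Name=name)["Crawler"]["State"]
        if not states or states[-1] != state:
            states.append(state)
        if state == "READY":
            return states
    raise TimeoutError(f"crawler {name!r} did not finish in {timeout_seconds}s")
```

For the crawler from this tutorial the call would be run_crawler_and_wait(boto3.client("glue"), "readS3file").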
You can also view the logs by clicking on the Logs hyperlink, which opens the CloudWatch management console in a new tab. This is really helpful for finding the root cause and error description if the crawler fails.
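If you want the same information without opening the console, here is a rough sketch. It assumes that get_crawler returns a LastCrawl entry carrying the status, error message and CloudWatch log location of the most recent run, and that a boto3 CloudWatch Logs client is used to read the events. Both helper names are hypothetical.

```python
def last_crawl_summary(glue_client, name):
    """Summarise the most recent run of a crawler."""
    last = glue_client.get_crawler(Name=name)["Crawler"].get("LastCrawl", {})
    return {
        "status": last.get("Status"),
        "error": last.get("ErrorMessage"),
        "log_group": last.get("LogGroup"),
        "log_stream": last.get("LogStream"),
    }


def fetch_crawler_log_messages(logs_client, log_group, log_stream, limit=50):
    """Return the first `limit` log messages of the given log stream."""
    response = logs_client.get_log_events(
        logGroupName=log_group,
        logStreamName=log_stream,
        limit=limit,
        startFromHead=True,  # oldest events first
    )
    return [event["message"] for event in response.get("events", [])]
```

You would pass boto3.client("glue") to the first function and boto3.client("logs") to the second, feeding the log group and stream from the summary into the log lookup.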
How to check AWS Glue crawler tables?
You can see that the crawler added 3 tables to the AWS Glue catalog. Let’s verify the tables and preview the data. Go to Databases and select the database you created. Click on View tables. You should see 3 tables, one for each of the 3 folders in the source S3 path.
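The same check can be done programmatically. A minimal sketch, assuming a boto3 Glue client and the database name used in this tutorial; list_table_names is my own helper name.

```python
def list_table_names(glue_client, database):
    """Return the names of all tables in a Glue catalog database."""
    names = []
    kwargs = {"DatabaseName": database}
    while True:
        response = glue_client.get_tables(**kwargs)
        names.extend(table["Name"] for table in response.get("TableList", []))
        token = response.get("NextToken")
        if not token:
            return names
        # Follow pagination until the service stops returning a token.
        kwargs["NextToken"] = token
```

Calling list_table_names(boto3.client("glue"), "db_readfile") after the crawler run should return one name per folder.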
How to preview the data of an AWS Glue table?
Go to Tables in the Data Catalog section. Select the table whose data you want to preview.
From the Action drop-down list, click on View data. This will open the Amazon Athena page.
If this is the first time you are using Athena, you must specify an S3 bucket where Athena stores its query results.
Go to the Settings tab and click on the Manage button.
Select an existing bucket or create a new bucket to be used by Athena.
Now go to the Editor tab and run the sample query.
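The preview can also be run from a script. The sketch below assumes a boto3 Athena client and an S3 location for query results that you supply, the same setting the console asks for. The function name, the table name and the polling interval are placeholders of mine.

```python
import time


def preview_table(athena_client, database, table, output_location,
                  limit=10, poll_seconds=1):
    """Run SELECT * ... LIMIT n in Athena and return the rows as lists of strings."""
    query = f'SELECT * FROM "{database}"."{table}" LIMIT {int(limit)}'
    execution_id = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query reaches a final state.
    while True:
        status = athena_client.get_query_execution(
            QueryExecutionId=execution_id
        )["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"query {status['State']}: {reason}")

    rows = athena_client.get_query_results(
        QueryExecutionId=execution_id
    )["ResultSet"]["Rows"]
    # For a SELECT, the first row normally holds the column names.
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]
```

An example call is preview_table(boto3.client("athena"), "db_readfile", "<table-name>", "s3://<bucket-name>/athena-results/").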
How to check the table structure in AWS Glue?
Select the table from the Tables menu and this time click on View details in the Action drop-down list.
Now click on the “Edit schema” button in the top right section.
Change the column names as required and click on the Save button once done.
Preview the data again in Amazon Athena to confirm the new column names.
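Renaming columns can be scripted as well. The sketch below assumes a boto3 Glue client and works by reading the table definition, renaming columns in a copy, and writing it back with update_table. Because update_table rejects the read-only fields that get_table returns (such as the creation time), only a conservative subset of fields is copied. The helper names and the column names in the usage note are illustrative.

```python
import copy

# Conservative subset of fields that are valid in an update_table TableInput.
TABLE_INPUT_KEYS = ("Name", "Description", "Owner", "Retention",
                    "StorageDescriptor", "PartitionKeys", "TableType", "Parameters")


def build_renamed_table_input(table, renames):
    """Return a TableInput dict with columns renamed; `table` is left unchanged."""
    table_input = {key: copy.deepcopy(table[key])
                   for key in TABLE_INPUT_KEYS if key in table}
    columns = table_input.get("StorageDescriptor", {}).get("Columns", [])
    for column in columns:
        column["Name"] = renames.get(column["Name"], column["Name"])
    return table_input


def rename_columns(glue_client, database, table_name, renames):
    """Fetch a table, rename its columns and save the new definition."""
    table = glue_client.get_table(DatabaseName=database, Name=table_name)["Table"]
    table_input = build_renamed_table_input(table, renames)
    glue_client.update_table(DatabaseName=database, TableInput=table_input)
    return table_input
```

For example, rename_columns(boto3.client("glue"), "db_readfile", "<table-name>", {"col0": "id", "col1": "name"}) would rename two columns and leave the others as they are.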
What are classifiers in an AWS Glue crawler?
AWS Glue crawlers use classifiers to determine the schema and metadata of the source data. By default, crawlers use the built-in classifiers, so you don’t have to specify anything explicitly.
When a file is too complex for the built-in classifiers to read properly, you can create a custom classifier and attach it to the crawler.
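As an illustration of a custom classifier, here is a sketch for a pipe-delimited file. It assumes a boto3 Glue client; the classifier name “pipe-delimited-csv”, the delimiter and the column names are hypothetical and not part of this tutorial's data.

```python
def build_csv_classifier(name, delimiter="|", header=None):
    """Build the CsvClassifier argument for glue.create_classifier."""
    classifier = {"Name": name, "Delimiter": delimiter, "QuoteSymbol": '"'}
    if header:
        # Column names are supplied explicitly.
        classifier["ContainsHeader"] = "PRESENT"
        classifier["Header"] = list(header)
    else:
        # Let the crawler decide whether the first row is a header.
        classifier["ContainsHeader"] = "UNKNOWN"
    return classifier


def attach_csv_classifier(glue_client, crawler_name, classifier):
    """Create the classifier and make the crawler use it."""
    glue_client.create_classifier(CsvClassifier=classifier)
    glue_client.update_crawler(Name=crawler_name, Classifiers=[classifier["Name"]])
```

A possible call is attach_csv_classifier(boto3.client("glue"), "readS3file", build_csv_classifier("pipe-delimited-csv")). Run the crawler again afterwards so it can classify the files with the custom classifier.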
Conclusion
In this tutorial you saw how to read files from S3 and create tables in the Glue catalog. To preview the data, you use Amazon Athena. You can also edit the schema directly from the AWS Glue console.
The Glue crawler can identify most common file formats, including CSV, TSV, plain text and Parquet.
You can download the data files used in this tutorial from this link.