

Introduction to AWS Glue: A Cloud ETL Tool


AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries.

AWS Glue Console

AWS Glue works well with structured and semi-structured data. It is serverless, so there is no infrastructure to set up or manage. It also introduces the dynamic frame (DynamicFrame), a data structure for ETL scripts in which each record is self-describing, so no schema is required up front.

Classifiers

A classifier reads the data and, if it recognizes the format, determines the schema for that file type.

Connections

A connection stores the metadata (such as the URL and credentials) that Glue needs to connect to a data source.

Crawlers

A crawler scans a data store (for example, an S3 folder), runs classifiers against it to identify the format and schema of the source files, and records the resulting tables in the Data Catalog. If the crawler runs more than once, it looks for new or changed files or tables in your data store.
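As a rough illustration of how a classifier and a crawler fit together, the sketch below builds request parameters for the boto3 Glue client calls `create_classifier`, `create_crawler`, and `start_crawler`. The classifier name, IAM role, database, and S3 path are hypothetical placeholders, and actually sending the requests assumes boto3 is installed and AWS credentials are configured.

```python
def csv_classifier_request(name, columns):
    """Parameters for glue.create_classifier: a custom CSV classifier."""
    return {
        "CsvClassifier": {
            "Name": name,
            "Delimiter": ",",
            "ContainsHeader": "PRESENT",  # first row holds the column names
            "Header": list(columns),
        }
    }


def crawler_request(name, role_arn, database, s3_path, classifiers=()):
    """Parameters for glue.create_crawler: scan a single S3 folder."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Classifiers": list(classifiers),  # custom classifiers to try
    }


def create_and_start_crawler(classifier_req, crawler_req):
    """Send the requests to AWS Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builders above work without boto3

    glue = boto3.client("glue")
    glue.create_classifier(**classifier_req)
    glue.create_crawler(**crawler_req)
    glue.start_crawler(Name=crawler_req["Name"])


# Hypothetical example values, for illustration only
classifier_req = csv_classifier_request("orders-csv", ["order_id", "amount"])
crawler_req = crawler_request(
    "orders-crawler",
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "sales_db",
    "s3://example-bucket/orders/",
    classifiers=["orders-csv"],
)
```

Calling `create_and_start_crawler(classifier_req, crawler_req)` would register both resources and start the first crawl. Keeping the request builders separate from the AWS calls makes them easy to check locally.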

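Data Catalog

The Data Catalog is a central metadata repository in AWS. It stores table definitions (schema, format, and location), and Glue jobs use those definitions to find and read the underlying data.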
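To make the idea of catalog metadata concrete, here is a minimal sketch that pulls the data location and column schema out of a table definition. It assumes the response shape returned by the boto3 Glue client's `get_table` call; the database, table, and sample values are hypothetical.

```python
def summarize_table(response):
    """Extract the location and column schema from a get_table response."""
    table = response["Table"]
    descriptor = table["StorageDescriptor"]
    columns = [(col["Name"], col["Type"]) for col in descriptor["Columns"]]
    return {
        "name": table["Name"],
        "location": descriptor["Location"],
        "columns": columns,
    }


def fetch_table(database, table_name):
    """Look a table up in the Data Catalog (needs boto3 and AWS credentials)."""
    import boto3  # imported here so summarize_table works without boto3

    glue = boto3.client("glue")
    return glue.get_table(DatabaseName=database, Name=table_name)


# Hand-written sample in the shape of a get_table response (hypothetical values)
sample_response = {
    "Table": {
        "Name": "orders",
        "DatabaseName": "sales_db",
        "StorageDescriptor": {
            "Location": "s3://example-bucket/orders/",
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "amount", "Type": "double"},
            ],
        },
    }
}

summary = summarize_table(sample_response)
```

The catalog entry holds only the description of the table; the rows themselves stay in the data store that the location points to.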

AWS Glue Architecture

CloudWatch

Amazon CloudWatch is the monitoring service provided by AWS. Glue sends its logs and metrics there, so you can keep track of job activity.

Workflow

A workflow runs Glue artifacts, such as crawlers and jobs, in the order defined by the user.
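One way to express such an ordering, sketched under assumptions, is a chain of triggers: the first starts on demand, and each later one fires when the previous job succeeds. The helper below builds parameters for the boto3 Glue client's `create_trigger` call; the workflow and job names are hypothetical, and the jobs are assumed to exist already.

```python
def sequential_triggers(workflow_name, job_names):
    """Build create_trigger parameters that run jobs one after another."""
    triggers = []
    for index, job_name in enumerate(job_names):
        trigger = {
            "Name": f"{workflow_name}-step-{index + 1}",
            "WorkflowName": workflow_name,
            "Actions": [{"JobName": job_name}],
        }
        if index == 0:
            # The first step starts when the workflow run is started.
            trigger["Type"] = "ON_DEMAND"
        else:
            # Later steps wait for the previous job to succeed.
            trigger["Type"] = "CONDITIONAL"
            trigger["StartOnCreation"] = True
            trigger["Predicate"] = {
                "Conditions": [
                    {
                        "LogicalOperator": "EQUALS",
                        "JobName": job_names[index - 1],
                        "State": "SUCCEEDED",
                    }
                ]
            }
        triggers.append(trigger)
    return triggers


def create_workflow(workflow_name, job_names):
    """Create the workflow and its triggers (needs boto3 and AWS credentials)."""
    import boto3  # imported here so sequential_triggers works without boto3

    glue = boto3.client("glue")
    glue.create_workflow(Name=workflow_name)
    for trigger in sequential_triggers(workflow_name, job_names):
        glue.create_trigger(**trigger)


# Hypothetical workflow with two existing jobs
triggers = sequential_triggers("nightly-load", ["extract-orders", "load-orders"])
```

After `create_workflow("nightly-load", [...])`, a run could be started with the client's `start_workflow_run(Name="nightly-load")`.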

Glue Jobs

An AWS Glue job is a script that connects to the source data, processes it, and writes it to the target. Job scripts are written in Python or Scala. AWS Glue can write output files in several data formats, including JSON, CSV, ORC (Optimized Row Columnar), Apache Parquet, and Apache Avro.
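A minimal sketch of such a script in Python is shown below. It reads a Data Catalog table into a DynamicFrame, applies a per-record transform with `Map`, and writes Parquet to S3. The database, table, column names, and output path are hypothetical, and the `awsglue` modules are available only inside the Glue Spark runtime, so they are imported inside `run_job`.

```python
import sys


def clean_record(record):
    """Transform one record: tidy the customer name and add a derived total."""
    record["customer"] = record["customer"].strip().title()
    record["total"] = record["quantity"] * record["unit_price"]
    return record


def run_job():
    """Read from the Data Catalog, transform, and write Parquet to S3."""
    # These modules exist only inside the AWS Glue Spark runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import Map
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Source: a table registered in the Data Catalog (hypothetical names)
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="orders"
    )

    # Process: apply clean_record to every record of the DynamicFrame
    cleaned = Map.apply(frame=orders, f=clean_record)

    # Target: Parquet files under a hypothetical S3 prefix
    glue_context.write_dynamic_frame.from_options(
        frame=cleaned,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/orders-clean/"},
        format="parquet",
    )
    job.commit()


# Glue passes --JOB_NAME on the command line; this guard keeps the file
# importable elsewhere, for example to unit-test clean_record locally.
if __name__ == "__main__" and "--JOB_NAME" in sys.argv:
    run_job()
```

Keeping the transform in a plain function means the record logic can be checked on an ordinary dictionary without a Spark environment.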

There are three types of jobs in Glue: Spark, Streaming ETL, and Python shell.

    • Spark: AWS Glue manages Spark jobs, which run in an Apache Spark environment. In Spark jobs, the script can be provided in one of the following ways:
      • Using a built-in script generated by AWS Glue
      • Uploading an existing script that was written locally
      • Writing or editing the Spark script in the AWS Glue console
    • Streaming ETL: A streaming ETL job is similar to a Spark job, except that it performs ETL on data streams. It uses the Apache Spark Structured Streaming framework. Some Spark job features are not available to streaming ETL jobs.
    • Python shell: A Python shell job runs Python scripts as a shell and supports a Python version that depends on the AWS Glue version you are using. You can use these jobs to schedule and run tasks that don’t require an Apache Spark environment.

To create a Glue job, we need to define some parameters:

    • Job Name
    • Source data
    • Worker type
    • Number of workers
    • Glue version (which determines the Spark version)
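The parameters above map roughly onto the boto3 Glue client's `create_job` call. The sketch below builds that request for each job type. The job name, role, script location, and the `--source_path` argument are hypothetical, and the default Glue version and worker type are only example values. Note that `create_job` has no dedicated "source data" field: the source is defined in the script or passed to it as a job argument.

```python
# Command names AWS Glue uses for the three job types
JOB_COMMANDS = {
    "spark": "glueetl",
    "streaming": "gluestreaming",
    "python_shell": "pythonshell",
}


def job_request(name, role_arn, script_location, job_type="spark",
                glue_version="4.0", worker_type="G.1X", number_of_workers=2,
                default_arguments=None):
    """Build parameters for glue.create_job."""
    request = {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": JOB_COMMANDS[job_type],
            "ScriptLocation": script_location,  # S3 path of the job script
            "PythonVersion": "3",
        },
        "DefaultArguments": dict(default_arguments or {}),
    }
    if job_type == "python_shell":
        # Python shell jobs are sized in DPUs rather than workers.
        request["MaxCapacity"] = 0.0625
    else:
        request["GlueVersion"] = glue_version  # determines the Spark version
        request["WorkerType"] = worker_type
        request["NumberOfWorkers"] = number_of_workers
    return request


def create_job(request):
    """Send the request to AWS Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported here so job_request works without boto3

    return boto3.client("glue").create_job(**request)


# Hypothetical Spark job that receives its source location as an argument
spark_job = job_request(
    "orders-etl",
    "arn:aws:iam::123456789012:role/GlueJobRole",
    "s3://example-bucket/scripts/orders_etl.py",
    default_arguments={"--source_path": "s3://example-bucket/orders/"},
)
shell_job = job_request(
    "cleanup-task",
    "arn:aws:iam::123456789012:role/GlueJobRole",
    "s3://example-bucket/scripts/cleanup.py",
    job_type="python_shell",
)
```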

For more information, see the AWS Glue documentation.

Happy coding!


 


Akshay Mokalwar

Akshay Mokalwar works as an ETL Developer at Perficient in the Nagpur GDC, India. Akshay is passionate about exploring new technologies. He works with technologies such as PySpark, Python, Django, HTML, CSS, Bootstrap, JavaScript, web development, Big Data, Databricks, Machine Learning, SQL, and AWS services.
