AWS Glue: How it Works? Serverless Data Integration


SVCIT Editorial Apr 22, 2021


What is AWS Glue?

AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize data, clean it, enrich it, and move it reliably between various data stores and data streams. AWS Glue's design is well suited to working with semi-structured data. Here we discuss how AWS Glue works for enterprise data maintenance.

When should we use AWS Glue?

We can use AWS Glue to organize, cleanse, validate, and format data for storage in a data warehouse or data lake. It also lets us transform AWS Cloud data and move it into our data lake. We can also load data from disparate static or streaming data sources into our data warehouse or data lake for regular reporting and analysis.

By storing data in a data warehouse or data lake, we integrate information from different parts of our business and provide a shared data source for decision-making and analysis.

Data Sources that AWS Glue Supports

AWS Glue supports the following data stores:

Amazon S3
Amazon Relational Database Service (Amazon RDS)
Third-party JDBC accessible databases
Amazon DynamoDB

Data Streams Supported by AWS Glue

Amazon Kinesis data streams
Apache Kafka data streams

AWS Glue Environment

AWS Glue calls API operations to transform our data, create runtime logs, store users' job logic, and create notifications that help users monitor their job runs.

Users can define AWS Glue jobs to accomplish the work required to extract, transform, and load data from a data source to a data target. For data store sources, the user defines a crawler to populate the AWS Glue Data Catalog with metadata table definitions.
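As a minimal sketch of defining a crawler for an S3 data store, the snippet below builds the request for the Glue `CreateCrawler` API and, separately, sends it with boto3. The crawler name, IAM role ARN, database name, and S3 path are hypothetical placeholders, and the AWS calls assume boto3 is installed and credentials are configured.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Return keyword arguments for the Glue CreateCrawler API."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Create the crawler and start it (requires AWS credentials)."""
    # boto3 is imported lazily so the builder above works without AWS access.
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical names, for illustration only.
crawler_request = build_crawler_request(
    name="sales-csv-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)
```

Calling `create_and_start_crawler(crawler_request)` would register the crawler and start a run; once it finishes, the inferred table definitions should appear in the `sales_db` database of the Data Catalog.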

AWS Glue is faster, cheaper, and easier to use. Migrating to AWS Glue is reported to be 10x faster, and because it is serverless, users do not need to worry about provisioning any cluster or server.

AWS Glue Usage

To build a data warehouse to organize, cleanse, validate, and format data.
To run serverless queries against the user's Amazon S3 data lake.
To create event-driven ETL pipelines.
To understand data assets.
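One common way to make an ETL pipeline event-driven is to let an AWS Lambda function start a Glue job whenever a new object lands in S3. The sketch below assumes a Glue job named `example-etl-job` (a hypothetical name) that reads a custom `--input_path` argument, which is also an assumption of this example rather than a built-in Glue parameter.

```python
from urllib.parse import unquote_plus


def job_arguments_from_s3_event(event):
    """Turn the first record of an S3 event notification into Glue job arguments."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["object"]["key"])
    return {"--input_path": f"s3://{bucket}/{key}"}


def lambda_handler(event, context):
    """Lambda entry point: start the (hypothetical) Glue job for the new object."""
    import boto3  # available in the Lambda Python runtime

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName="example-etl-job",
        Arguments=job_arguments_from_s3_event(event),
    )
    return response["JobRunId"]


# A trimmed-down S3 event, for illustration only.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "sales/2021/data+file.csv"}}}
    ]
}
job_args = job_arguments_from_s3_event(sample_event)
```

Keeping the argument-building logic separate from the boto3 call makes it possible to check the handler locally without touching AWS.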

AWS Glue Benefits for Enterprise

Cost-Effective
Less Hassle
Easy Management
Superior Functionality

Glue Data Catalog

AWS Glue has a Data Catalog that holds all the metadata in the form of databases and tables.

AWS Glue Crawler

The crawler connects to a particular service to retrieve data; the service can be Amazon S3, Amazon RDS, Amazon Redshift, Amazon DynamoDB, or any other JDBC connection. As its name suggests, the crawler crawls through the data. For example:

Suppose an enterprise stores its data in a CSV file in S3 with around 100 million rows. The crawler infers the file's schema, creates the table definitions, and stores them in the Data Catalog. A query service such as Amazon Athena can then use the Data Catalog to run the organization's SQL queries against the data in S3 and perform data analysis.
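To make the idea of schema inference concrete, here is a toy illustration in plain Python. It is not how Glue's built-in classifiers are implemented; it only shows the general principle of sampling rows and picking the narrowest type that fits each column. The function names and the sample data are made up for this example.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type name that fits every non-empty value."""
    def fits(cast):
        try:
            for value in values:
                if value != "":
                    cast(value)
            return True
        except ValueError:
            return False

    if fits(int):
        return "bigint"
    if fits(float):
        return "double"
    return "string"


def infer_schema(csv_text):
    """Return a list of column definitions inferred from CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data))  # transpose rows into columns
    return [{"Name": name, "Type": infer_type(col)}
            for name, col in zip(header, columns)]


sample = "order_id,amount,country\n1,19.99,US\n2,5,DE\n"
schema = infer_schema(sample)
for column in schema:
    print(column["Name"], column["Type"])
```

For the sample above, `order_id` is inferred as `bigint`, `amount` as `double`, and `country` as `string`. A real crawler works on samples of much larger files and handles many more formats and types.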

The AWS Glue Data Catalog can act as a centralized metadata repository. The catalog is not a database holding the data itself; it stores only table metadata such as table names, column names, and data types. This metadata is used to create tables in Amazon Athena. With Athena, users can run SQL queries to perform data analysis on their organizational data.
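A query against a cataloged table can be submitted to Athena with boto3, roughly as sketched below. The database, table, and results bucket are hypothetical, and the call assumes that a crawler has already created the table in the Data Catalog.

```python
def build_athena_query(database, table, output_location, limit=10):
    """Return keyword arguments for Athena's StartQueryExecution API."""
    return {
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query and return its execution ID (requires AWS credentials)."""
    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only.
query_request = build_athena_query(
    database="sales_db",
    table="sales",
    output_location="s3://example-bucket/athena-results/",
)
```

Athena runs queries asynchronously, so the returned execution ID would then be polled (for example with `get_query_execution`) until the query completes and its results are written to the output location.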

Glue ETL Jobs

Extract, Transform and Load
Leverage Apache Spark
Can be authored using Python or Scala
Serverless
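A Glue ETL job authored in Python typically follows the shape sketched below: read a table from the Data Catalog, apply a mapping, and write the result to a target. The database, table, column names, and output path are hypothetical. The `awsglue` and `pyspark` modules are only available inside the Glue job runtime, so they are imported inside `main()` and the call to `main()` is left commented out here.

```python
import sys

# (source column, source type, target column, target type); hypothetical columns.
MAPPINGS = [
    ("order_id", "long", "order_id", "long"),
    ("amount", "double", "amount_usd", "double"),
    ("country", "string", "country_code", "string"),
]


def main():
    # These imports only resolve inside the AWS Glue job runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Extract: read the table that a crawler registered in the Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="sales"
    )

    # Transform: rename and cast columns.
    mapped = ApplyMapping.apply(frame=source, mappings=MAPPINGS)

    # Load: write the result to S3 as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/sales/"},
        format="parquet",
    )
    job.commit()


# main()  # uncomment when the script runs as a Glue job
```

Because the job is serverless, the script itself contains no cluster setup; capacity is chosen in the job configuration rather than in code.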

AWS Glue Components

Extract, Transform and Load

Serverless Execution
Uses Apache Spark / Python shell
Interactive Development & Auto-generate ETL code

Glue Data Catalog

Apache Hive metastore compatible
Many integrated analytic services

Crawlers

Load and maintain the Data Catalog
Infer metadata schema, table structure
Supports schema evolution

Workflow Management

Orchestrate triggers, crawlers, and jobs
Build and monitor complex flows
Reliable execution
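As a rough sketch of how a workflow can chain a crawler and a job, the snippet below builds a conditional trigger that starts a job once a crawler run has succeeded. The workflow, crawler, and job names are hypothetical, and the sketch assumes the crawler and the job already exist and that boto3 and AWS credentials are available for the actual calls.

```python
def build_conditional_trigger(workflow, crawler, job):
    """Return keyword arguments for the Glue CreateTrigger API."""
    return {
        "Name": f"{workflow}-run-{job}",
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "CrawlerName": crawler,
                    "CrawlState": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job}],
        "StartOnCreation": True,
    }


def create_workflow_with_trigger(trigger_request):
    """Create the workflow and attach the trigger (requires AWS credentials)."""
    import boto3

    glue = boto3.client("glue")
    glue.create_workflow(Name=trigger_request["WorkflowName"])
    glue.create_trigger(**trigger_request)


# Hypothetical names, for illustration only.
trigger_request = build_conditional_trigger(
    workflow="sales-pipeline",
    crawler="sales-csv-crawler",
    job="example-etl-job",
)
```

A complete workflow would usually also include a starting trigger (on demand or scheduled) that launches the crawler; it is omitted here to keep the example short.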

Author: SVCIT Editorial
Copyright Silicon Valley Cloud IT, LLC.