Akhil Newton
- May 10, 2023

AWS GLUE DATABREW

In today's data-driven world, efficient and streamlined data preparation is essential, and AWS Glue DataBrew gives businesses a practical way to transform their data. AWS Glue DataBrew is a visual data preparation tool from Amazon Web Services (AWS) that enables users to clean and normalize data without requiring extensive coding or technical expertise. It allows users to visually explore and transform their data using a simple point-and-click interface.

The tool also provides data profiling features, which can help users identify issues in their data and suggest appropriate data transformations.

FEATURES

Dataset: With DataBrew, users can connect to multiple data sources and generate datasets that include the necessary information. The datasets can be used for data preparation tasks such as cleaning, normalization, and transformation.

Profiling the data: DataBrew provides data profiling capabilities that help users identify issues with their data, such as missing or inconsistent values. The profiling features also provide suggestions for data transformations that can help address these issues. Profiling produces summary statistics for each column in the dataset, including counts, maximum value, standard deviation, range, and mode.

Projects and Recipes: Users can utilize DataBrew to group their datasets and transformations into projects, simplifying the management and teamwork involved in data preparation assignments. Within a project, users can create and manage recipes, which are sets of transformations applied to a dataset.

Jobs: DataBrew helps users to automate their data preparation tasks using jobs. A job is a scheduled or on-demand execution of a recipe on a specified dataset.

Fully managed ETL service: AWS Glue DataBrew is a fully managed ETL service that makes it easy for users to extract, transform, and load data from various sources. It eliminates the need for users to manage infrastructure or perform manual updates, making it a cost-effective solution for data preparation.

Serverless: DataBrew runs on a fully managed, auto-scaling Spark environment, which makes it a serverless solution. This means that users do not need to provision or manage any servers, and they only pay for the resources that they consume.

Simple and cost-effective: DataBrew has a simple, point-and-click interface that makes it easy for users to prepare their data without requiring extensive coding or technical expertise. It also has a pay-as-you-go pricing model that allows users to pay only for the resources that they consume, making it a cost-effective solution.
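
To make the profiling feature above more concrete, here is a minimal Python sketch of the kind of per-column statistics it describes (counts, missing values, maximum, deviation, range, mode). This is not DataBrew's implementation or API; the function name `profile_column` and the sample credit score values are hypothetical and exist only for illustration.

```python
import statistics
from collections import Counter


def profile_column(values):
    """Summarize one column; None marks a missing value (illustrative only)."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        # Most frequent value among the non-missing entries.
        "mode": Counter(present).most_common(1)[0][0] if present else None,
    }
    numeric = [v for v in present if isinstance(v, (int, float))]
    # Numeric statistics only make sense if every present value is a number.
    if numeric and len(numeric) == len(present):
        summary["min"] = min(numeric)
        summary["max"] = max(numeric)
        summary["range"] = max(numeric) - min(numeric)
        summary["std_dev"] = statistics.pstdev(numeric)  # population std dev
    return summary


# Hypothetical sample column with one missing value.
credit_scores = [700, 720, None, 700, 680]
print(profile_column(credit_scores))
```

A summary like this is what makes issues such as missing values or an unexpectedly wide range visible before any transformation is applied.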

NEED FOR AWS DATABREW

Hand-coding ETL jobs can be time-consuming, error-prone, and difficult to maintain over time. It can be challenging to scale hand-coded ETL jobs to handle large volumes of data or to incorporate new data sources.

AWS Glue DataBrew addresses these challenges by providing a fully managed, serverless data preparation tool that simplifies the process of cleaning and transforming data. This can help organizations to reduce the time and resources required to prepare data for analysis, improve the quality and accuracy of their data, and derive insights that can drive business growth.

PROS AND CONS OF HAND CODING

Hand-coding ETL jobs can provide greater flexibility and control over the data preparation process. Developers can use their preferred programming languages and development tools to build custom data pipelines, and they can perform unit testing to ensure that the code is working as expected. Hand-coded ETL jobs can be optimized for specific use cases, which can lead to better performance and efficiency.

However, there are also several drawbacks to using hand-coded methods for ETL. For example, hand-coded ETL jobs can be brittle and error-prone, particularly when dealing with large or complex datasets. It can be challenging to identify and fix errors in the code, and updates or changes to the code can be time-consuming and laborious.

Another significant issue with hand-coded ETL jobs is the hardware management overhead. Developers must manage the infrastructure required to run the ETL jobs, including servers, storage, and networking resources. This can be a significant burden, particularly for smaller organizations with limited IT resources.

RECIPES AND JOBS

A recipe is an ordered set of data transformations that a user applies to clean, reshape, or engineer features from a dataset. A recipe consists of a sequence of recipe steps, each of which performs a specific transformation operation on the data.

DataBrew provides a wide range of recipe steps, such as filtering rows, renaming columns, and transforming data types.
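
The idea of a recipe as an ordered sequence of steps can be sketched in plain Python. This is a conceptual model only, not DataBrew's recipe format or API: the helper names (`filter_rows`, `rename_column`, `change_type`, `apply_recipe`) and the loan columns are hypothetical, chosen to mirror the three step types mentioned above.

```python
def filter_rows(predicate):
    """Step factory: keep only the rows for which predicate(row) is true."""
    def step(rows):
        return [row for row in rows if predicate(row)]
    return step


def rename_column(old, new):
    """Step factory: rename one column, leaving the others untouched."""
    def step(rows):
        return [{(new if k == old else k): v for k, v in row.items()} for row in rows]
    return step


def change_type(column, cast):
    """Step factory: convert one column's values with the given cast function."""
    def step(rows):
        return [{**row, column: cast(row[column])} for row in rows]
    return step


def apply_recipe(recipe, rows):
    """Apply each step in order; the output of one step feeds the next."""
    for step in recipe:
        rows = step(rows)
    return rows


# A hypothetical three-step recipe: drop empty amounts, rename, cast to int.
recipe = [
    filter_rows(lambda row: row["loan_amt"] != ""),
    rename_column("loan_amt", "loan_amount"),
    change_type("loan_amount", int),
]

rows = [
    {"loan_id": "A1", "loan_amt": "250000"},
    {"loan_id": "A2", "loan_amt": ""},
]
print(apply_recipe(recipe, rows))
```

Order matters here: the cast to `int` would fail on the empty string if the filter step did not run first, which is why a recipe is a sequence rather than an unordered set.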

Recipes in DataBrew are designed to be reusable and shareable across different projects and datasets. Once a recipe is created, it can be versioned, and multiple versions of the recipe can coexist. This can be useful for tracking changes to the recipe over time and ensuring that everyone is using the most up-to-date version of the recipe. Recipes can be shared between people and multiple accounts.

Running a job in DataBrew applies the recipe created by the user to the whole dataset. DataBrew jobs are designed to be scalable and efficient, and they can be run on a schedule or on-demand. When a job is executed, DataBrew automatically provisions the necessary resources and applies the specified recipe steps to the dataset.
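
For readers who prefer to script this instead of using the console, the sketch below shows how a recipe job might be defined and started with the boto3 `databrew` client. It is a hedged outline, not the setup used in our project: every name (job, dataset, recipe, role ARN, bucket) is a hypothetical placeholder, and the parameter names reflect our understanding of `create_recipe_job` and `start_job_run`, so check them against the current boto3 documentation before relying on them.

```python
def build_recipe_job_request(job_name, dataset_name, recipe_name, recipe_version,
                             role_arn, bucket, key_prefix, max_nodes=5):
    """Build the keyword arguments for a recipe job (no AWS call is made here)."""
    return {
        "Name": job_name,
        "DatasetName": dataset_name,
        "RecipeReference": {"Name": recipe_name, "RecipeVersion": recipe_version},
        "RoleArn": role_arn,
        "MaxCapacity": max_nodes,  # upper bound on the number of nodes
        "Outputs": [
            {"Location": {"Bucket": bucket, "Key": key_prefix}, "Format": "PARQUET"}
        ],
    }


def create_and_start_job(request):
    """Create the job and start one on-demand run (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("databrew")
    client.create_recipe_job(**request)
    return client.start_job_run(Name=request["Name"])["RunId"]


# All values below are hypothetical placeholders.
request = build_recipe_job_request(
    job_name="example-prep-job",
    dataset_name="example-dataset",
    recipe_name="example-recipe",
    recipe_version="1.0",
    role_arn="arn:aws:iam::123456789012:role/ExampleDataBrewRole",
    bucket="example-output-bucket",
    key_prefix="prepared/",
)
print(request)
```

Keeping the request builder separate from the AWS call makes the job definition easy to review and test without credentials; `create_and_start_job(request)` is the only part that talks to AWS.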

OUR USE CASE: PREPARING THE FREDDIE MAC DATASET FOR PREPAYMENT MODELLING

The Freddie Mac dataset is a vast collection of mortgage-related data that enables analysis and research into the United States housing market.

The dataset contained around 49.9 million mortgages originated between January 1999 and September 2021, split into two parts: the origination dataset and the performance dataset.

Working with large datasets can be quite challenging, especially when it comes to storing, optimizing, and transforming the data. In this project, we had to work with a dataset that was approximately 40 GB in size. To tackle this challenge, we used AWS Glue Data Catalog to store the dataset as tables and AWS Glue DataBrew to prepare it.

The UI-based approach offered by AWS Glue DataBrew was incredibly convenient for working with such a large dataset. The visual feedback provided by the preview feature helped us track the impact of each operation on the data, making it easier to optimize and transform the data as needed. With over 40 million rows of data, we were able to work seamlessly on DataBrew, without any need to worry about hardware requirements.

One of the best things about DataBrew is that it requires no programming experience. This made it easy for us to get started with the tool, and we were able to complete our work quickly and efficiently. Our client was extremely happy with the results, and we were able to deliver high-quality work in a fraction of the time it would have taken using traditional programming methods.

PRICING

Interactive project sessions in AWS Glue DataBrew cost $1 per 30-minute session. Jobs are priced by the number of nodes used, with a default of 5 nodes. Each node provides 4 virtual CPUs and 16 GB of memory and costs $0.48 per node-hour, billed on a per-minute basis. There are no upfront costs and no resources to manage.
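
The arithmetic behind these prices can be sketched as follows. The rates are the ones quoted above, with the $0.48 figure read as a per-node hourly price. The rounding rules (a started 30-minute session counts as a full session, and job time is rounded up to the next whole minute) are our assumptions, and the function names are hypothetical; actual bills may differ by region and over time.

```python
import math

SESSION_PRICE_USD = 1.00    # per 30-minute interactive session (from the text)
NODE_HOUR_PRICE_USD = 0.48  # per node per hour (from the text)
DEFAULT_NODES = 5           # default job capacity (from the text)


def session_cost(minutes):
    """Cost of interactive project work, assuming partial sessions round up."""
    sessions = math.ceil(minutes / 30)
    return sessions * SESSION_PRICE_USD


def job_cost(minutes, nodes=DEFAULT_NODES):
    """Cost of one job run, assuming billing rounds up to whole minutes."""
    billed_minutes = math.ceil(minutes)
    return round(nodes * NODE_HOUR_PRICE_USD * billed_minutes / 60, 2)


print(session_cost(90))  # 90 minutes of interactive work = three sessions
print(job_cost(20))      # a 20-minute run on the default 5 nodes
```

Under these assumptions, job cost grows linearly with both run time and node count, so a job that finishes twice as fast on twice as many nodes costs about the same.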

CONCLUSION

AWS Glue DataBrew is a powerful tool that can be used by businesses to prepare data for analysis and other tasks. It allows for the refining, transforming, and feature engineering of large datasets in a quick and efficient manner. Our team is here to help you make the most of this powerful tool and to provide any assistance you may need in utilizing AWS Glue DataBrew for your business needs. Whether you're looking to optimize your data, transform it into new formats, or perform other data preparation tasks, we have the expertise and experience to help. Don't hesitate to reach out to us for more information or to get started with AWS Glue DataBrew today.
