Live TechTalk

Join experts from Google Cloud:  How to Scale Data Integration to and from Google BigQuery: Thursday, May 30th, 2PM EST/ 11AM PST

Register

Data Engineering: Automation of Common Tasks

Data engineering automation is the process of bringing automation to the otherwise manual steps involved in constructing the data pipelines used to extract, transform, load, analyze, and monitor data. Data pipeline creation is a multi-step process: It begins with data collection and extraction, then continues with transformations like cleansing, modeling, and masking as well as loading data for analysis, machine learning, and reporting. 

Enterprises seek speed in decision-making, which is driven by data value. However, delays or long sync times can be detrimental to utilizing data for time-sensitive use cases, such as competitive analysis and customer-focused marketing. Furthermore, errors in construction or delays in processing can cause data to be outdated or obsolete by the time it arrives at its destination. Enterprises can expect better time to value by automating pipeline construction and management; they can also expect a decrease in expenses, errors, and delays throughout the process.

This article explains how automation applies to various stages of the data pipeline and recommends best practices for implementing data engineering automation.

Common data engineering tasks

There are a number of common tasks that data engineers perform while building and maintaining data pipelines and systems. These tasks fall into four main categories: extraction, transformation, loading, and monitoring. Transformation itself encompasses three subcategories: data cleansing, data masking/sanitization, and data modeling. We focus on each task and present ideas and examples illustrating how automation and orchestration can help accomplish that particular task efficiently and reliably.

The table below summarizes these task groups and subgroups.

Data pipeline task type How automation can help
Extraction Data management tools periodically check a data source’s status, automatically triggering a data extraction process using established connectors and returning standard output formats to make subsequent steps more efficient.

Auto-generated connectors are a key innovation in this area because a lack of established connectors has been a common roadblock.

Loading Automation can help load extracted data into target data stores for downstream consumers in a consistent, reliable way by developing data products that are capable of handling data with different formats and velocities. 
Transformation Cleansing Automatic identification and correction (or removal) of incorrect or inconsistent data makes the data more reliable for downstream consumers.
Masking and sanitization Malicious code injected into data due to vulnerabilities in front-end systems or mistakenly entered sensitive data like credit card numbers can be automatically removed before causing security and privacy issues. Additionally, masking can help ensure that personal information is not shared with company personnel who shouldn’t have visibility.
Modeling Modeling automation includes the automated creation of data products. Entire databases and tables, where key business objects are modeled, can be automatically modeled with consistent data types and naming conventions using script templates.
Monitoring The increasing number and complexity of data pipelines leads to a requirement for constant and automated monitoring as well as immediate notification through various channels such as emails, Slack, etc..

Of course, this is not an exhaustive list of a data engineer’s daily tasks. There are many other areas with tasks that engineers are sometimes expected to complete, such as addressing schema drift (the evolution of the data schema over time), but the same principles apply to them as well. The overall goal is to remove the need to do tasks manually in favor of building or buying solutions that perform them automatically. Automating these tasks reduces errors and enables organizations to scale their data engineering processes.

Having a flexible and extensible framework while building your data automation solution is particularly helpful in reducing cost and effort in the longer term. 

Automation in common data engineering Tasks

Data extraction

How do you know your data is ready at the source to begin extraction? These are some commonly used methods:

  • Manually refreshing a shared network or SFTP folder to see if the file you are looking for has arrived
  • Running SQL queries and checking if they return anything for a given period
  • Receiving a message, email, notification, or call from a colleague or system
  • Receiving data from a continuous stream

Automatic triggers

In general, every time you must rely on manual work or notifications to start your pipeline—such as getting an email or message from someone telling you that data is ready to be extracted—there is an opportunity to automate. One way to automate the process is with automatic triggers, which are predefined conditions or events that initiate a pipeline when met. These triggers eliminate the need for manual intervention for notifications, allowing the pipeline to be automatically started when data is ready to be extracted.

For example, automatic triggers are helpful when you want to schedule the execution of your pipelines based on specific events, such as the arrival of new data, the completion of a previous activity, or a specific time or interval. 

However, automatic triggers may not be suitable for scenarios that require complex dependencies or fine-grained control over the execution of tasks in the pipeline. Also, they can lead to issues such as resource contention or high costs if not adequately managed, mainly when multiple data pipelines are triggered simultaneously or when triggers initiate pipelines too frequently.

On the other hand, cutting-edge platforms such as Nexla have capabilities that extend beyond rule-based automation. The system is ready to identify and respond to anomalies. For instance, it is well-equipped to detect a sudden surge in requests or any activity that deviates from the norm. Not only does it monitor these irregularities, Nexla also leverages machine learning algorithms to understand, adapt, and make informed recommendations based on these unexpected occurrences. This dual capability of Nexla—implementing automatic triggers and detecting anomalies—ensures a comprehensive, intelligent, and efficient data engineering solution.

Control jobs

Another way to automate a data pipeline is by using control jobs: specialized tasks within a data pipeline or workflow management system that help manage and orchestrate the execution of other jobs or tasks. They control the pipeline’s flow by ensuring that specific conditions are met before allowing other tasks to proceed. Control jobs can include monitoring data availability, checking the success or failure of previous tasks, or enforcing dependencies between tasks. 

Control jobs are useful when you need to ensure that certain conditions are met before executing subsequent tasks. For example, you may need to wait for data to be available in a specific location or for an external API to return the expected results. By using control jobs, you can create more efficient and automated data pipelines. Implementing control jobs enhances the efficiency of data pipelines, taking automation a notch above what is conventionally possible. 

While automation inherently minimizes the requirement for manual involvement, Nexla’s features transcend this standard offering. Nexla fundamentally bolsters the reliability of the overall process, providing a superior level of stability and trustworthiness. In Nexla’s framework, automation is not merely about replacing manual tasks but also fortifying the dependability of the data pipeline. This makes for a robust data management solution that promises consistency and reduces the risk of disruptions.

Note that control jobs can add complexity to the pipeline, requiring careful configuration and monitoring to ensure that they function as intended. Additionally, they can increase the overall execution time of the pipeline since tasks may need to wait for the control job to be complete or for specific conditions to be met.

What is the impact of GenAI on Data
Engineering?

WATCH EXPERT PANEL

Choosing between automatic triggers and control jobs

This choice depends on your data pipeline’s specific scenario and requirements. For instance, imagine that you have an ecommerce website and need to process and analyze sales data daily to generate reports for the management team. In this situation, data is generated continuously as customers make purchases, and the analysis should be performed at the end of each day.

Using automatic triggers would be beneficial here because they can be set to initiate the data pipeline at a specific time (e.g., midnight) or when a particular condition is met (e.g., the daily sales data is ready). This eliminates the need for manual intervention and ensures that the analysis is performed promptly and consistently daily without delays.

Now consider a complex data pipeline where data is sourced from multiple systems, and each system generates data at different intervals. In this scenario, tasks within the pipeline have complex dependencies on data availability, and the pipeline should only proceed when all required data sets are available.

In this case, control jobs can manage the dependencies between tasks. For example, control jobs can be configured to check for data availability at specific intervals and trigger the downstream functions only when the required data sets are ready. This strategy embodies both control and checks within the pipeline execution. It guarantees meticulous control over the sequence of operations, ensuring that tasks are launched only upon fulfilling their dependencies. 

This twin approach acts as a double safeguard, substantially reducing the risk of errors and enhancing the workflow’s overall reliability. It’s not just about controlling the execution but also about validating the readiness and correctness of each step, creating a system that embodies both precision and accuracy.

It is crucial to carefully assess the pros and cons of each approach to determine the most suitable solution for your particular use case. In some cases, you may need to combine both control jobs and automatic triggers to achieve the desired level of automation and control in your data pipeline.

Data loading

A common issue when reading data from a source is having many different types of data. Some are traditional databases that return tables, some are streaming sources, and others are REST APIs that return data in JSON files or using other machine-readable formats. Occasionally, you may need to read data from log files that could be in a custom file format. In an ideal scenario, the output would be in a standard format that you can feed into downstream tasks; the same holds for target systems where you load the data.

One solution to this problem is using a polyglot approach that can handle data with different formats and velocities. A mix of data tools allows you to read data from various sources, transform it into a standardized format, and then load it into the target system. However this can require significant time and resources, which may not be feasible for all organizations.

The question is: Should you build source and target system connectors and standardize the data formats yourself? 

There are solutions in the market that already solve this engineering problem, so you don’t have to design and develop a solution on your own. For example, an open-source framework like Singer generalizes all data sources under the simple concepts of “taps” and “targets” to help streamline the process. However, its implementation requires time, resources, and technical knowhow that you may not have available.

An ideal solution lets your data pipeline tool standardize data formats for you. For example, the Python Prefect library has Results as first-class objects, Azure Data Factory has Datasets as one of its main components, and Nexla has Nexsets as the building block for modern data architectures. 

There is no need to invent and maintain a novel implementation when many tools already provide a solution.

Is your Data Integration ready to be Metadata-driven?

Download Free Guide

Data cleansing

The value of the output generated by any data system is directly correlated with the quality of the data that it ingests: “Garbage in, garbage out,” as the saying goes. Data cleansing (or cleaning) ensures that any data that has any of the following issues in the table below is captured and immediately addressed before propagating further downstream.

Data quality issue Example
Inaccuracy Date of birth differs from the source-of-truth record.
Incompleteness Critical data is missing from the record (e.g., title, last name, or email).
Inconsistency Multiple social security numbers exist for the same individual.
Invalidity Entries are not true, e.g., “France” is not a state of the United States.
Redundancy Duplicated columns have been sourced from the same data.
Obsolescence Data was delayed and is no longer relevant.
Anomalous values Certain data values are far outside an average or a range. e.g., an order amount is five times larger than the average.

What the business considers a data quality issue can be hard to define or implement by a data engineering team, but it’s generally possible to start with the following high-level steps:

  1. Create validation checks along the lines of these examples:
    1. Statistical or expected ranges for numerical values in a database column
    2. The presence of a null value in a column (or the maximum allowed null rate) 
    3. An allowed set of values inside a low-cardinality column (e.g., US states)
    4. Duplication of data
  2. Document the business decisions to apply for each case, e.g., “replace null values with the mean of the column.” The alternative is to flag them for later inspection.
  3. Develop SQL or programming scripts for each source data set. Apply the rule and save the cleansed results when there is a match.

When it comes to automating data validation and profiling, there are many tools available in the market that can assist data engineers in performing these tasks. One such tool is Great Expectations, a Python framework that attempts automated profiling with a certain degree of success. However, it requires learning a lot of new concepts—like “data context,” “expectation suites,” and “checkpoints”—to perform a simple validation check. 

If your data pipeline tool provides an easy-to-use feature for automated data validation, use it. If not, you can add simple inspection tasks to your pipelines and trigger them automatically when the upstream job completes. The method for adding simple inspection tasks and triggering them automatically will also vary based on the tool being used. Some tools may allow for automated data validation through scripting, while others may have a built-in feature for it.

While automated data validation and simple inspection tasks can help ensure data quality, there are also some challenges to consider. For example, setting up and maintaining these tasks can require additional time and resources. Additionally, it can be difficult to identify all the potential issues that might arise in complex data pipelines. However, by utilizing a platform with built-in quality checks, such as Nexla, you can streamline this process and overcome some of these challenges.

The simplest approach is to rely on a platform such as Nexla that features built-in quality checks, such as monitoring the percentage of change in data volume or the delay in loading data from a particular source. You can learn more about these features here

Data masking and sanitization

Data engineers usually work with second-hand data. This is because the (initial) data collection happens at the application or device level, and the people who are involved are usually software engineers developing applications and database administrators managing relational databases. In an ideal world, data engineers will take care of sensitive data and malicious scripts (like Little Bobby Tables) before they creep into primary data storage systems. When data engineers find themselves in situations where data is not sanitized or needs to be shared within the company, they should develop automated scripts to deal with it before that data goes further downstream.

Before we delve into data sanitization categories and review some examples, it’s worth noting that the terms “cleansing” and “sanitization” are often used interchangeably but have different connotations. Cleansing relates more to correcting errors in the data, whereas sanitization is primarily about hiding sensitive or malicious data matching specific patterns or characteristics.

Guide to Metadata-Driven Integration

FREE DOWNLOAD

Learn how to overcome constraints in the evolving data integration landscape

Shift data architecture fundamentals to a metadata-driven design

Implement metadata in your data flows to deliver data at time-of-use

Data requiring sanitization can be organized into two broad categories:

  • The first is data that we did not intend to collect but users somehow entered it into a free-form field or in a similar way. This type of data should be removed, just like maliciously injected code. 
  • The second is data that we need but that we should mask or anonymize after running certain aggregations or doing other processing.

There may be other needs for data sanitization, such as profanity filters. Ultimately, the goal is to detect the presence of data requiring sanitization automatically and take appropriate actions to remove, replace, mask, or anonymize it. 

To give you a concrete sense of how this process can work in practice, let’s walk through a specific example. We’ll explore how to detect potential credit card numbers within a dataset and replace them with “X” placeholders, using both Snowflake SQL and Python.

It’s important to note that this is just a basic illustration. In a real-world application, you would likely want to use a specialized library designed to handle various credit card formats for more robust sanitization.

To use the code below, we can set up a job in our orchestration tool to automatically run this SQL query or Python script on the source data as soon as we read it and have it ready for downstream processing. Tools like dbt can help reduce code complexity by chaining multiple transformation steps. 

SQL solution

Note that the REGEXP_REPLACE function may not be available in all SQL dialects.

WITH sanitization_test AS (
    SELECT 'My card number is 1234-5678-9012-3456.' AS cc
)
SELECT REGEXP_REPLACE(cc, '(\\d{4}-){3}\\d{4}', 'X') AS customer_comment_sanitized
FROM sanitization_test

This returns:

CUSTOMER_COMMENT_SANITIZED
My card number is X.

Python solution (using pandas library)

import re
import pandas as pd

def sanitize_comment(comment: str) -> str:
    """Replaces potential credit card numbers with X."""
    return re.sub(r'(\d{4}-){3}\d{4}', 'X', comment)

df = pd.read_csv('data/customer_comments.csv')
df['comment'] = df['comment'].apply(sanitize_comment)
df.to_csv('data/customer_comments_sanitized.csv', index=False)

Data modeling

Data modeling is the process of creating a conceptual representation of data and its relationships, often in the form of a schema. Traditionally, data modeling has been one of those “one and done” tasks, so one might ask what there is to automate in the first place. The answer relates mainly to agile models, like Data Vault 2.0, where you have a lot of standard-type tables (hubs, links, and satellites) that are updated frequently to meet changing business demands. Tools like dbtvault are built on top of dbt to automatically generate and populate all the model tables from the metadata (table names and mapping details). 

Replicating a production database to be used for development or user-acceptance testing purposes is very similar to data modeling in that you create related tables from scratch and populate them with a (masked) subset of production data. Whenever we identify a repeating pattern, we can usually find a way to automate it, and it seems like there is a pattern here too. In this case, tools like Terraform can come to the rescue. Terraform is an infrastructure as code (IaC) solution that can automate the creation and management of various systems, including databases using declarative templates. 

Data modeling process workflow (source)

Data modeling process workflow (source)

Monitoring

All automated systems require constant monitoring to ensure that they continue to function without errors. As builders and maintainers of such systems, data engineers should not need to manually check execution logs and target systems to see if a task has succeeded. Every task in the pipeline should have standard notification options for when they fail or are delayed long enough to miss their SLAs. 

Workflow orchestration tools like Apache Airflow have built-in support for logging and monitoring. They can automatically trigger notification actions using execution, success, failure, retry, or SLA miss callbacks. If one needs to develop a similar automation feature for a custom data pipeline tool, this is a simple yet powerful way to do it. Nexla includes monitoring as a standard integrated feature of its data engineering platforms, which can be enabled and configured easily.

Automation best practices

When it comes to data pipelines, automation is crucial to achieving scalability. Be sure to check your pipeline tool for available automation features and utilize as many as possible. Consider a platform like Nexla that allows for the extension of core automation functionality without the need for extensive development. Additionally, seek out tips and tricks from user groups, and don’t hesitate to reach out to the vendor’s support team to request any necessary features that are not already available.

With automation, data engineering tasks that were previously done manually can be executed much faster and at a larger scale. Let’s explore another scenario to see how this works.

If you had to manually extract data from a source system, transform it into a standard format, and then load it into a target system, this process could take hours or even days. With automation, this process can be done in a matter of minutes or seconds, depending on the complexity of the task. This allows you to process more data in less time, so you can make better decisions and stay ahead of the competition.

Identify automation features

The first and most important of the best practices is to check which automation features your data pipeline tool already supports:

  • Make use of as many automation features that you have at your disposal.
  • Select a platform like Nexla that allows extension of its core automation functionality without the need for hard coding.
  • Join data engineering communities online to find tips and tricks. 
  • Reach out to the vendor’s product management organization via its support team and ask for any feature you need if it is not present.

Identify automatable tasks

Look for any task you manually perform daily to see if you can automate it. For example, if you need to send an email to inform analysts that their dashboards have been updated with the most recent data, automate it. See if your data pipeline tool supports on-success actions, and let it send the email on your behalf. If it doesn’t, write a simple emailing script and let your orchestration tool run it after the last data-loading job succeeds.

Match tasks and tools

Use the best tool for a particular automation need, starting small and building on top of it as needed. We have listed many open-source tools above for different needs. Having said that, consistency is critical for automation, so reuse the same solution wherever possible. Having to use different methods, tools, or frameworks for automating processes can turn into a manual task itself and become a huge time sink. 

Use comprehensive tools when possible

Finally, note that while quick and custom solutions may save the day, they can be costlier to maintain over the longer term. Comprehensive data pipeline tools and platforms with significant support for automation, such as Nexla, are preferred over custom-built solutions.

Empowering Data Engineering Teams

Free Strategy
Session

Platform

Data Extraction

Data Warehousing

No-Code Automation

Auto-Generated Connectors

Metadata-driven

Multi-Speed Data Integration

Informatica

Fivetran

Nexla

Conclusion

Data engineers face many challenges, such as dealing with large volumes of data, maintaining data quality, ensuring data security and privacy, handling complex data formats and structures, and keeping up with changing business needs and demands. Automation can help overcome these challenges by increasing efficiency, reducing errors, and enabling faster delivery of insights.

Data engineers perform many tasks and handle various applications and systems. Some of the common tasks involve moving data from one system to another; checking data quality; running multiple transformations on the data to cleanse, sanitize, and improve it; monitoring the status of pipelines; and keeping the development, test, and production environment warehouses synced. Automating these tasks has become necessary to meet ever-increasing business needs and demands.

Effective automation is accomplished when it is available at all stages of the data journey. Wherever it is missing, the data engineer must fill the gap by manually performing the task or developing a custom solution. Over time, these custom solutions may need to be built with different methods and frameworks, making maintaining them and ensuring mutual compatibility a task in itself. Investing in a comprehensive data engineering automation platform can yield substantial cost savings. In addition, it helps agile and data-driven organizations of all sizes stay ahead of their competitors by turning data into insights as fast as possible via automation.

Like this article?

Subscribe to our LinkedIn Newsletter to receive more educational content

Subscribe now