Automated Data Integration: Concepts & Strategies
Data integration refers to a collection of business and technical processes that combine data from disparate sources to generate valuable, meaningful, and reusable data sets. While there are various data integration methods, each operating at a different layer of the data stack, this article focuses on the most modern iteration: data is loaded into a single central repository—a data warehouse—after undergoing stages such as cleansing, transformation, and data quality checks.
According to a 2020 market study conducted by Dresner Advisory Services, data integration is critical to the ongoing operations of over 80% of enterprise business leaders. Two-thirds (67%) of enterprises rely on data integration to support analytics and BI platforms, while 65% of organizations prefer cloud platforms or hybrid cloud data integration solutions.
Automation plays an essential part in every stage of a data integration pipeline. This article will cover five data integration concepts, discuss associated best practices, and present the benefits of automation.
Summary of key automated data integration concepts
There is growing demand to integrate data from various sources in organizations of every size, from small and medium-sized enterprises (SMEs) to large corporations. The main reason is that data is created everywhere, in every action we take. From the turns and stops a delivery truck makes in traffic to the number of times we move while sleeping, we create data all the time. While this has always been the case, we now have the technology to capture and analyze a variety of textual, geospatial, audio, and video data. Data-driven organizations want to tap into this data as quickly as possible, no matter how large it is or where it is stored. Data integration tools and platforms have emerged to meet this demand. As the size and complexity of integrations have increased, automation has become a necessity rather than a nice-to-have feature.
| Data Integration Concept | How Automation Helps |
| --- | --- |
| ETL / ELT | Automation ensures timely and robust data transfers among multiple sources. |
| Unified Data Model | Disparate data is integrated into a single coherent model or into data sets that are continuously updated via automation. |
| Data Quality | Invalid, inaccurate, or inconsistent data is detected, and appropriate actions are taken automatically. |
| Schema Drift | The integration becomes flexible; changes to the source and target data model are handled automatically. |
| Monitoring | Failures and delays are noticed immediately. Automatic retries resolve temporary issues such as network glitches. |
Automated Data Integration
ETL / ELT
ETL stands for extract, transform, and load; ELT stands for extract, load, and transform. Both are processes used to move data from one system to another, and the main difference between them is where and when the transformation step occurs.
Traditionally, ETL has been the preferred method as it ensures that the loaded data is already cleansed, integrated, of high quality, and ready to use. The most common ETL use cases are data warehousing, business intelligence (BI) applications and cloud migrations. While the ETL approach is best suited for such use cases, the major downside is that when the business needs a slightly different version of the loaded data model, another ETL pipeline must be built, and the data needs to go through most of the same transformation steps.
With the introduction of fully managed and highly scalable data warehouses such as Snowflake and BigQuery and data lakehouses such as Databricks, the trend has shifted toward ELT. In this approach, the data is extracted from various sources and loaded into the warehouse. Following that, transformation steps, which are usually code-based, clean and integrate data to create a flexible and unified data model. This model is then used to build semantic views tailored to answer specific business questions.
Figure: ETL vs. ELT (source)
Regardless of where the transformation occurs, extracting and loading data remains a challenge of its own. Every data source provides one or more ways of reading data from it and writing data to it. These interfaces are accessed through connectors, which can be developed using the technologies and protocols supported by the data source (HTTP, TCP, JDBC, ODBC, etc.) or wrappers around them in high-level programming languages like Python or Java. For every unique data source, data engineers or developers need to build custom connectors in order to interact with that source. Developing custom connectors and keeping them up to date and performant often turns into a time sink. Going with an open-source integration solution is an option, but care should be taken to ensure that managing that solution does not pull focus away from business needs. The best practice is to go with a fully managed data integration platform.
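To make the connector-building effort concrete, below is a minimal sketch of a hand-written connector, assuming a hypothetical REST endpoint (`https://api.example.com/v1/orders`) as the source and a SQLite database standing in for a warehouse driver. The endpoint, table, and column names are illustrative and not taken from any specific product.

```python
import requests
import sqlite3  # stand-in for a warehouse driver such as a JDBC/ODBC wrapper

API_URL = "https://api.example.com/v1/orders"  # hypothetical source endpoint
API_TOKEN = "..."                              # credentials management omitted

def extract_orders(updated_since: str) -> list[dict]:
    """Pull records from the source system, following pagination."""
    records, page = [], 1
    while True:
        resp = requests.get(
            API_URL,
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            params={"updated_since": updated_since, "page": page},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("results", [])
        if not batch:
            return records
        records.extend(batch)
        page += 1

def load_orders(records: list[dict]) -> None:
    """Write extracted records into a landing table before transformation."""
    conn = sqlite3.connect("warehouse.db")  # replace with the warehouse connection
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders (id INTEGER, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO raw_orders (id, customer_id, amount) VALUES (?, ?, ?)",
        [(r["id"], r["customer_id"], r["amount"]) for r in records],
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load_orders(extract_orders(updated_since="2023-01-01"))
```

Even a toy connector like this has to handle pagination, authentication, and schema assumptions; multiplying that maintenance across dozens of sources is where the time sink comes from.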
While an integration solution is necessary, it is not sufficient by itself. A higher return on investment is achieved when the integration solution is complemented with automation and orchestration. Automation enables building low-cost, high-speed data integration pipelines, helps them scale, and reduces the time and effort required from data engineers. Orchestration is the process of managing and coordinating the different components of a data pipeline, which allows highly efficient dependency management. It is best when the same data integration platform provides all of these features so that they work smoothly with each other.
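As a rough illustration of what orchestration means at its simplest, the sketch below runs hypothetical pipeline steps in dependency order using only Python's standard library. Real orchestrators or managed integration platforms add scheduling, retries, parallelism, and monitoring on top of this basic idea.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical pipeline steps; each would be a real task in practice.
def extract():        print("extract raw data")
def load():           print("load into warehouse")
def quality_checks(): print("run data quality checks")
def transform():      print("build unified model")

# Each task maps to the set of tasks it depends on.
dag = {
    load: {extract},
    quality_checks: {load},
    transform: {quality_checks},
}

# The orchestrator resolves the dependency order and runs each task once.
for task in TopologicalSorter(dag).static_order():
    task()
```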
Unified Data Model
A unified data model combines data from multiple sources into a single, coherent, centralized structure. This model can be used to support various business processes and applications; it also improves data quality and reduces data redundancy.
Ideally, the unified data model is created within the data warehouse. As discussed in the previous section, this approach makes it possible to extend and update the model as business needs change. However, data models that provide the same utility can also be built within a data integration platform via reusable data sets. This approach is more suitable for use cases where individual data sets are sufficient for the business case.
These data sets are well-organized building blocks generated from raw data. They can be chained and joined to build complex pipelines, reused in other pipelines, and kept working when schemas change. Some examples include:
- Azure Data Factory Datasets
- Nexla Nexsets
- dbt Models (although they usually live in a data warehouse)
Once again, regardless of where these data sets live, automation is crucial for generating and updating them. Automation ensures that the data sets are up to date, reliable, and ready to use.
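As a simple illustration of what a unified, reusable data set looks like in code, the sketch below merges two hypothetical raw extracts (a CRM export and a billing export) into a single customer data set using pandas. On the platforms listed above, this logic would live in a managed, automatically refreshed data set or model rather than a one-off script, and all column names here are made up for the example.

```python
import pandas as pd

# Hypothetical raw extracts landed by the integration pipeline.
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "segment": ["smb", "enterprise", "smb"],
})
billing = pd.DataFrame({
    "cust_id": [1, 2, 4],          # same entity, different key name
    "mrr_usd": [120.0, 950.0, 80.0],
})

# Unified customer data set: one row per customer, consistent column names.
unified_customers = crm.merge(
    billing.rename(columns={"cust_id": "customer_id"}),
    on="customer_id",
    how="outer",
)

print(unified_customers)
```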
Data Quality
Data quality refers to the overall accuracy, completeness, consistency, timeliness, and validity of data. Ensuring high data quality is crucial for making sound decisions. Some examples of data quality checks include checking the null rate of a column and validating the minimum and maximum values of numerical fields. Data integration tools and platforms provide many built-in statistical and rule-based checks; they also allow adding custom checks and taking appropriate actions automatically.
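As a minimal sketch of the two example checks mentioned above, the snippet below computes a null rate and validates numeric bounds with pandas. The column names and thresholds are assumptions for illustration; a data integration platform would run equivalent checks declaratively.

```python
import pandas as pd

def check_null_rate(df: pd.DataFrame, column: str, max_rate: float) -> None:
    """Fail the step if the share of nulls in a column is too high."""
    rate = df[column].isna().mean()
    if rate > max_rate:
        raise ValueError(f"{column}: null rate {rate:.1%} exceeds {max_rate:.1%}")

def check_range(df: pd.DataFrame, column: str, lo: float, hi: float) -> None:
    """Fail if any value falls outside the expected bounds."""
    bad = df[(df[column] < lo) | (df[column] > hi)]
    if not bad.empty:
        raise ValueError(f"{column}: {len(bad)} values outside [{lo}, {hi}]")

# Example usage against a hypothetical orders extract.
orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [19.9, 250.0, None]})
check_range(orders, "amount", lo=0, hi=10_000)
check_null_rate(orders, "amount", max_rate=0.05)  # raises: ~33% of values are null
```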
Data quality checks can happen before or after extracting, loading, or transforming data. For example, in some scenarios, it makes more sense to check the validity of the data at the source by probing it with lightweight queries or inspecting its metadata before extraction. In other scenarios, where extensive checks are needed before processing begins, the data quality checks should happen in the data warehouse after the loading step. The best practice is to perform these checks automatically and as early as possible, and certainly before any point past which low-quality data should not be allowed to flow.
Conducting multiple data quality checks across the various stages of a pipeline and analyzing the outcomes can be a daunting task. Forgetting to run certain checks, or calculating their results incorrectly, can let low-quality data slip through and potentially lead to incorrect business decisions. Automation mitigates these risks by ensuring that all necessary checks are performed on the data before it is processed downstream.
Schema Drift
Schema drift describes the gradual changes that occur to the structure of a database table over time. These include added, removed, or renamed columns; changes to column data types or lengths; or the reordering of columns. If not handled properly, these drifts can cause data pipelines to fail.
There are two main techniques to allow schema drift to occur without breaking the data pipeline:
- Using schemaless pipelines
- Implementing late binding of columns
In a schemaless pipeline, the core principle is not to define a schema for the source, the target, or both. For example, when extracting data from a table using SQL, instead of listing each column name in a particular order, as below:
```sql
SELECT student_id, name, date_of_birth FROM physics_dept.students WHERE enrollment_year = '2022';
```
You would write:
```sql
SELECT * FROM physics_dept.students WHERE enrollment_year = '2022';
```
The first version will fail when any column of interest is renamed or removed. It will also not be aware of new columns added to the table, and the query might cause issues downstream if any of the columns change. The “*” in the second version is a special SQL character that means “all columns” regardless of their name, number, or type, thus allowing all sorts of schema drift in the source. This version will continue to extract data and let the downstream consumers deal with the changes.
Schemaless queries work in tandem with late binding, which means that column mapping from the source to the target is deferred until absolutely needed. The data pipeline is built in a way that flexibly handles changes in the source, and in everything between the source and target, until the data reaches a target that uses a rigid schema.
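To make late binding concrete, here is a minimal sketch that assumes records travel through the pipeline as dictionaries, as they would when extracted with `SELECT *` or landed as JSON. The source-to-target mapping, including the renamed `full_name` column, is hypothetical.

```python
# Records flow through the pipeline as plain dictionaries, so upstream
# renames or additions do not break intermediate steps.
raw_records = [
    {"student_id": 1, "name": "Ada", "date_of_birth": "2004-05-01", "advisor": "Dr. Kim"},
    {"student_id": 2, "full_name": "Grace", "date_of_birth": "2003-11-12"},  # renamed column
]

# Late binding: the source-to-target mapping is applied only at the load step.
TARGET_MAPPING = {
    "student_id": ["student_id"],
    "name": ["name", "full_name"],     # accept either source column name
    "date_of_birth": ["date_of_birth"],
}

def bind_to_target(record: dict) -> dict:
    """Map whatever columns arrived onto the rigid target schema."""
    return {
        target_col: next((record[c] for c in candidates if c in record), None)
        for target_col, candidates in TARGET_MAPPING.items()
    }

for rec in raw_records:
    print(bind_to_target(rec))  # unknown columns (e.g., 'advisor') are simply ignored
```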
The techniques are simple, but implementing them in every data pipeline can be a tedious, time-consuming, and error-prone task. This is where automation helps tremendously: Data integration tools that support schema drift have built-in functionality to define the level of flexibility with just a few clicks.
Monitoring
During data pipeline design and development, data engineers usually work in an interactive environment where they receive immediate feedback from the source and target systems when something goes wrong. They can check any error message, fix the underlying issue, and proceed with the rest of the work. However, after pipelines are deployed, the automation solution runs them on a schedule, so errors are no longer immediately visible.
This is where monitoring and notification features become essential for any automated data integration solution. Continuous monitoring for delays and failures ensures that data integration pipelines work smoothly without anyone having to regularly check the health of the systems.
There are a few best practices to follow when it comes to monitoring data integration pipelines:
- Collect data at every stage of the pipeline. Example metrics include job start time and total execution time (a minimal instrumentation sketch follows this list).
- Monitor for anomalies and errors. A symptom of an unhealthy system is jobs drifting significantly from their regular start times or taking longer than usual. Anomalies and temporary errors should be investigated even if the jobs eventually succeed.
- Use tools to visualize the pipeline. Visually seeing the upstream and downstream dependencies helps put the necessary checks in the right places.
- Automate data quality checks. We have already covered the importance of automated data quality checks above. Complementing data quality assurance with continuous monitoring allows us to “set and forget” pipelines so we can focus on other work.
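The sketch below shows one way to instrument a pipeline step, as referenced in the first bullet: it records start time and duration and automatically retries transient failures. The job and function names are illustrative, and in practice the measurements would be shipped to a monitoring system rather than written to a log.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_with_monitoring(job, name: str, retries: int = 3, backoff_s: float = 5.0):
    """Record start time and duration for a job, retrying transient failures."""
    for attempt in range(1, retries + 1):
        started = time.time()
        log.info("%s started (attempt %d) at %s", name, attempt, time.ctime(started))
        try:
            result = job()
            log.info("%s succeeded in %.1fs", name, time.time() - started)
            return result
        except ConnectionError as exc:  # e.g., a network glitch
            log.warning("%s failed in %.1fs: %s", name, time.time() - started, exc)
            if attempt == retries:
                raise  # escalate so an alert or notification fires
            time.sleep(backoff_s)

# Hypothetical usage:
run_with_monitoring(lambda: print("extracting..."), name="extract_orders")
```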
Empowering Data Engineering Teams
| Platform | Data Extraction | Data Warehousing | No-Code Automation | Auto-Generated Connectors | Metadata-driven | Multi-Speed Data Integration |
| --- | --- | --- | --- | --- | --- | --- |
| Informatica | + | + | - | - | - | - |
| Fivetran | + | + | + | - | - | - |
| Nexla | + | + | + | + | + | + |
Conclusion
Data integration is vital to most organizations’ operational and analytical data workloads. There is a growing demand for building integrations among multiple data sources. ETL, and more recently ELT, are the most common techniques for loading data to a central repository such as a data warehouse or lakehouse. Automating data pipelines ensures that data arrives on time and without errors.
Data is more valuable when it is ready to use for analytics. The goal of transformation is to process the data so that it is in the most suitable form to be used for answering business questions. Unified data models and reusable data sets are the most common forms of storing analytics-ready data. The benefit of automation here is that these data sets and models are continuously kept up to date.
Data quality checks can happen at any stage of the data integration pipeline—usually, the sooner, the better. They ensure that data is accurate, reliable, and consistent. Automation helps perform these data quality checks over large sets of data efficiently.
Schema drift refers to the gradual changes that naturally occur to the structure of a data source over time. These changes come in the form of added, removed, or renamed columns, changes to column data types or lengths, or reordering of columns. Left unchecked, these changes can cause significant disruptions to data integration pipelines. Building schemaless pipelines and delaying column mappings until the very end of the pipeline simplifies dealing with schema drift. Data integration platforms that support these features and automate the process can save a great deal of time.
Automating data integration pipelines reduces costs, improves data delivery speeds, and increases overall efficiency. As more and more processes are automated, continuous monitoring becomes essential. We need to be aware of failures and delays as soon as they occur. When monitoring itself is automated, the loop is complete, allowing data professionals to focus on solving more challenging business problems.