Data Pipeline Tools
A data pipeline is a series of mostly sequential steps that extract data from one or more systems, optionally applying multi-step transformations such as filtering, aggregation, and merging. The pipeline then takes some action with the data: generating a report, updating a dashboard, calling an API, or, most often, loading the data into a system, which can be the source system itself. Data pipeline tools are specialized applications that help build and automate these extract, transform, and load (ETL) tasks, among others.
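The basic shape of such a pipeline can be sketched in a few lines of Python; the data, the filter threshold, and the dictionary-based "warehouse" here are purely illustrative:

```python
# A minimal, illustrative ETL pipeline: extract rows, filter and
# aggregate them, then "load" the result into a target store.
# All data and names here are hypothetical.

def extract():
    # In a real pipeline this would query an API or a database.
    return [
        {"region": "EU", "amount": 120},
        {"region": "US", "amount": 80},
        {"region": "EU", "amount": 50},
    ]

def transform(rows):
    # Filter out small amounts, then aggregate per region.
    totals = {}
    for row in (r for r in rows if r["amount"] >= 60):
        totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
    return totals

def load(totals, target):
    # Loading here is just an upsert into an in-memory dict.
    target.update(totals)

warehouse = {}
load(transform(extract()), warehouse)
print(warehouse)  # {'EU': 120, 'US': 80}
```

A real pipeline replaces each step with connectors and a scheduler, but the serial extract-transform-load shape stays the same.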
First-generation data warehousing tools collected data from multiple legacy systems and loaded it into a centralized warehouse, where data was integrated, historical, and always up to date. As the volume, velocity, and variety of data kept increasing, second-generation big data tools addressed the added complexity with distributed solutions like Hadoop. Data warehouses evolved as well, and denormalized forms like the star schema became the norm. Data pipeline tools evolved in parallel, and ELT-based approaches rose in popularity.
With the structure of data changing more frequently, transformations becoming more diverse, and a rapid increase in both the sources and consumers of data, centralized data warehouse solutions are having a hard time keeping up. Data lakehouses emerged as a hybrid solution, combining the atomicity, consistency, isolation, and durability (ACID) transactions of warehouses with the scale, cost-efficiency, and flexibility of data lakes. As an alternative, the Data Mesh framework, in which data is treated as a product owned and managed by domain teams, is also gaining traction. As this field continues to grow and expand, traditional data pipeline tools and methods keep evolving to meet its demands.
Before we move on to discuss the essential features of data pipeline tools, one final point to consider is whether you should build your data pipeline tool yourself or buy one of the many readily available solutions. The answer depends on your circumstances, which may vary significantly from one organization to another. For many common cases, though, an existing solution is sufficient and can be a huge time and cost saver.
Features of Data Pipeline Tools
There are a number of essential features associated with data pipeline tools. Below, we have described the ones that are the most relevant for many common use cases.
|Feature|Description|
|---|---|
|Data source integrations|Providing connections to various data sources and combining data from multiple sources into a unified view.|
|Automation|Programmatic authoring and management of data pipelines.|
|Auditability|Keeping track of pipeline modifications and executions.|
|Scalability|Ability to run and manage an increased number of pipelines to deal with an increased volume of data.|
|Scheduling, orchestration, and monitoring|Scheduling, orchestration, and monitoring of pipelines and sending notifications when they fail.|
|Transformations|Modifying data by joining, filtering, enriching, or aggregating it before loading it to its final destination.|
Data Source Integrations
Every data pipeline tool out there boasts about the number of source systems it can integrate. For example, Nexla lists over 130 connectors, with the option for users to create new connectors through configuration, and Azure Data Factory lists more than 100. Some vendors have stopped counting altogether because new storage solutions and applications emerge every day.
Tools like Singer take this one step further by enabling the creation of custom data “taps” (sources) and “targets” (destinations) using a standard method. In doing so, they solve the compatibility issue and provide virtual integration between any two data systems.
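A minimal tap following the Singer message shape (SCHEMA, RECORD, and STATE messages emitted as JSON lines) might look like this sketch; the stream name and rows are hypothetical, and a real tap would implement the full Singer specification:

```python
import json

# A toy Singer-style "tap": it emits SCHEMA, RECORD, and STATE
# messages as JSON lines, the standard format any Singer "target"
# can consume. The stream name and rows are made up for illustration.

def run_tap(rows):
    messages = [
        {"type": "SCHEMA", "stream": "users",
         "schema": {"properties": {"id": {"type": "integer"}}},
         "key_properties": ["id"]},
    ]
    for row in rows:
        messages.append({"type": "RECORD", "stream": "users", "record": row})
    # STATE lets a target resume from the last processed row.
    messages.append({"type": "STATE",
                     "value": {"users": {"last_id": rows[-1]["id"]}}})
    return [json.dumps(m) for m in messages]

for line in run_tap([{"id": 1}, {"id": 2}]):
    print(line)
```

Because every tap speaks the same line-oriented protocol, any tap can be piped into any target, which is what makes the "any source to any destination" claim work.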
In practice, most organizations need to regularly push/pull data to/from only a few unique source systems. Though this may change in the future, the current status quo makes this feature a relatively low priority.
Numbers aside, before deciding on a data pipeline solution, make sure it supports (or can be extended to support) your current and upcoming data sources and can integrate them into a unified view.
Automation

This feature refers to the ability to create and manage data pipelines through the data pipeline tool’s own API or any supported programming language. The alternative is UI-based point-and-click, drag-and-drop, or low-code/no-code solutions, which require crafting pipelines manually.
Automation is a deal-maker or deal-breaker for some organizations depending on their talent pool and engineering culture (or lack thereof). If your organization already follows some type of “[technical term] as code” approach, it will be in favor of data pipeline tools that can be used and managed purely via code. This powerful feature can be the biggest enabler of scalability. On the other hand, it can be the biggest source of frustration when debugging if the people involved are not following CI/CD best practices.
Apache Airflow is one of the most popular tools with this feature, though it definitely wasn’t the first. Azkaban and Apache Oozie had similar features, but they depended on configuration files and the file system rather than just code.
Prefect is another contender in this area, as its developers created a framework from the ground up, fixing many of the drawbacks of Apache Airflow, especially its rigid directed acyclic graph (DAG) API and the separation of DAG creation and scheduling.
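The pipeline-as-code idea can be illustrated without any particular framework: tasks and their dependencies are plain data structures that can be reviewed, versioned, and generated programmatically. This sketch only mimics the shape of DAG-based tools such as Airflow and does not use their actual APIs:

```python
# Pipelines as code: the DAG is ordinary data (a dict of callables
# plus a dict of dependencies), so it can live in version control
# and be generated programmatically. Task names are illustrative.

def run_dag(tasks, deps):
    """Run callables in `tasks`, respecting `deps` (task -> upstream set)."""
    done, order = set(), []
    while len(done) < len(tasks):
        # Pick every task whose upstream dependencies are all satisfied.
        ready = [t for t in tasks if t not in done and deps.get(t, set()) <= done]
        if not ready:
            raise ValueError("cycle detected in DAG")
        for t in ready:
            tasks[t]()
            done.add(t)
            order.append(t)
    return order

log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
deps = {"transform": {"extract"}, "load": {"transform"}}
print(run_dag(tasks, deps))  # ['extract', 'transform', 'load']
```

Real tools add scheduling, retries, and distributed execution on top, but the core benefit is the same: the pipeline definition is reviewable, diffable code.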
Auditability

Every now and then, there will be a need to know who modified which data pipeline, when, how, and why. Whether this is part of a regular auditing process or is needed to document the post-mortem of an incident, having access to unmodified logs of all actions taken on a data pipeline during a period of time can be a big help.
Data pipeline tools keep a record of the execution status of scheduled pipelines; any failures and automated retries should be recorded as well. This feature is particularly useful for discovering unknown side effects of running the same pipeline multiple times. Even a perfectly idempotent pipeline (one that has no further impact when executed more than once) may still cause a change in the state of a system it interacts with, so it is important to know which pipeline was executed when and how.
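To make the idempotency point concrete, here is a sketch, with made-up table and pipeline names, of a keyed upsert that leaves the target unchanged on re-runs while still appending every execution to a run log:

```python
import datetime

# An idempotent load (keyed upsert) still deserves an execution log:
# re-running it leaves the target unchanged, but each run is recorded
# so auditors can see which pipeline ran when. Names are illustrative.

target = {}    # destination "table", keyed by id
run_log = []   # one entry per execution

def load(rows, pipeline="daily_users"):
    for row in rows:
        target[row["id"]] = row  # upsert: same key, same final state
    run_log.append({
        "pipeline": pipeline,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "rows": len(rows),
    })

rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
load(rows)
load(rows)  # second run: target unchanged, but the log shows two runs
print(len(target), len(run_log))  # 2 2
```

The data is the same after one run or two, yet the log answers the audit question of which pipeline executed when and how often.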
Changes to production data pipelines should ideally go through a well-defined review process. If version control software, such as Git, is part of that review process—meaning changes to the pipeline definition are tracked over time—this provides perfect auditability.
This feature is particularly prominent in code-based data pipeline tools. Since everything is defined via code or config files, data engineers can use any source control tools in their organizations to track changes in their pipelines and revert to a particular version easily if necessary.
Azure Data Factory, despite being a mostly UI-based data pipeline tool, can keep track of changes if source control integration is enabled.
A simple development cycle where changes to the data pipeline are tracked
Scalability

Depending on how they were built and what value they provide, data pipelines can grow very large very quickly. Scalability often refers to the ability of a data pipeline tool to process greater amounts of data. To do this, the data pipeline tool should either support a distributed architecture where tasks are run in parallel or have elastic compute resources that can scale on demand.
While scaling up the resources is the primary expectation when looking at the scalability features of a pipeline tool, the ability to deal with the increased complexity of managing more tasks is also needed. For example, imagine a simple pipeline built to collect some marketing data, perform some joins and transformations, and finally refresh a dashboard by pushing the latest data to a table in an analytical warehouse. Now imagine this simple experimental pipeline turning out to be very valuable. As a result, you—the data engineer—are asked to refactor or copy this pipeline (in its entirety or some of its steps) so that it can provide the same value for hundreds of other historical, current, or future marketing campaigns.

A modern data pipeline tool should help you build tasks that can scale quickly and easily. One common way of scaling pipelines is to reuse the same components or tasks wherever possible and focus on the dynamic parts. The ability to define variables in tasks, retrieve those variables dynamically from any source, and then repeat dynamic tasks with those variables provides significant scalability.
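One way to picture this dynamic-task pattern, with hypothetical campaign names and a stand-in refresh function:

```python
# Scaling by parameterization: one task template, many dynamically
# generated instances. The campaign names and the refresh logic are
# hypothetical placeholders for real extract/transform/push steps.

def refresh_dashboard(campaign, results):
    # Stand-in for the full pipeline run for one campaign.
    results[campaign] = f"refreshed:{campaign}"

def build_tasks(campaigns):
    # The variable list could come from a config file, a database
    # query, or an API call, so new campaigns need no new code.
    return [lambda c=c: refresh_dashboard(c, results) for c in campaigns]

results = {}
for task in build_tasks(["spring_sale", "black_friday", "newsletter"]):
    task()
print(sorted(results))  # ['black_friday', 'newsletter', 'spring_sale']
```

Adding the hundredth campaign means adding one entry to the variable source, not copying a pipeline a hundredth time.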
Scheduling, Orchestration, and Monitoring
Designing and building a data pipeline is not the end; we also need to schedule, orchestrate, and monitor our pipelines. Scheduling is only relevant for batch pipelines because the data in streaming pipelines is immediately consumed as it arrives. However, orchestration and monitoring are needed for all types of pipelines.
Scheduling defines when a pipeline runs. During the Hadoop era, this was usually described in terms of days, but with more demand for near real-time analytics, the frequency is now hours if not minutes.
Orchestration defines the dependencies between pipelines and starts them when all the dependencies are satisfied. Data pipeline tools with automatic dependency resolution can be really helpful to make sure your pipeline has everything it needs by the time it runs. This is usually possible if all pipelines are using the same tool; otherwise, an external orchestration table (example below) might be needed to keep track of which tool ran which pipeline when.
A sample table for orchestrating tasks managed by disparate systems
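Assuming an orchestration table with pipeline, tool, and status columns like the one described above, a dependency check against it could be sketched as follows (all tool and pipeline names are made up):

```python
# A minimal external "orchestration table": each row records which
# tool ran which pipeline and whether it finished. A downstream
# pipeline starts only when all of its dependencies show "success".
# Tool and pipeline names are invented for illustration.

orchestration_table = [
    {"pipeline": "ingest_orders",  "tool": "Fivetran", "status": "success"},
    {"pipeline": "clean_orders",   "tool": "dbt",      "status": "success"},
    {"pipeline": "refresh_report", "tool": "Airflow",  "status": "pending"},
]

def ready_to_run(pipeline, deps):
    """True if every dependency of `pipeline` has already succeeded."""
    status = {row["pipeline"]: row["status"] for row in orchestration_table}
    return all(status.get(d) == "success" for d in deps.get(pipeline, []))

deps = {"refresh_report": ["ingest_orders", "clean_orders"]}
print(ready_to_run("refresh_report", deps))  # True
```

In practice the table would live in a shared database, and each tool would update its own rows, giving every pipeline a single place to check its upstream state.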
Monitoring is needed to constantly check the health of the pipelines, including failure rates, average runtime, and number of SLA misses. Some type of notification capability is often present along with monitoring. When a pipeline fails or misses its SLA, the responsible members are notified via email, SMS, phone call or other channels.
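A toy version of such a monitoring check, with hypothetical SLA thresholds and run records:

```python
# A sketch of SLA monitoring: scan recent runs, flag pipelines that
# failed or ran longer than their SLA, and collect the notifications
# that would be sent. Thresholds and run data are hypothetical.

SLA_MINUTES = {"hourly_sync": 10, "daily_report": 60}

runs = [
    {"pipeline": "hourly_sync",  "minutes": 12, "failed": False},
    {"pipeline": "daily_report", "minutes": 45, "failed": True},
]

def check_runs(runs):
    alerts = []
    for run in runs:
        if run["failed"]:
            alerts.append(f"{run['pipeline']}: run failed")
        elif run["minutes"] > SLA_MINUTES[run["pipeline"]]:
            alerts.append(f"{run['pipeline']}: SLA missed")
    return alerts  # in practice, sent via email, SMS, or a pager service

print(check_runs(runs))  # ['hourly_sync: SLA missed', 'daily_report: run failed']
```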
A data pipeline tool may provide all, some, or none of these auxiliary features. When a feature is not present in the data pipeline tool but the organization has dedicated applications or services instead, the data pipeline tool should be able to make use of them.
Transformations

Cloud-based warehouses reduce the time and cost required to set up a warehouse from scratch. This trend caused a shift from ETL-based pipelines to ELT-based ones: warehouses became the de facto place where the heavy lifting (transformations) happens. This tremendously helped the scalability of data pipelines because you no longer needed to scale the infrastructure used for your pipelines yourself.
On the other hand, warehouses mostly require some variant of SQL to transform data. A lack of SQL talent or other resources might make it really hard to move all transformations to a SQL-based warehouse. For example, translating a for loop that iterates over data in an existing pipeline to a SQL statement that uses cursors (to imitate looping behavior) might not be a feasible solution. Therefore, having a data pipeline tool that provides powerful transformation functions, such as applying conditional logic or adjusting a nested JSON object, will be a huge help.
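As an illustration, here is the kind of transformation that is awkward to express in SQL but straightforward in code: flattening a nested JSON record and applying conditional logic. The record shape and the tier rule are invented:

```python
# Flatten a nested JSON-like record into flat columns, then apply
# conditional enrichment. In SQL, this would need JSON-path
# extraction plus a CASE expression; in code it is a few lines.
# The record shape and the "tier" rule are hypothetical.

def flatten(record, prefix=""):
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "_"))  # recurse into nesting
        else:
            flat[name] = value
    return flat

def transform(record):
    flat = flatten(record)
    # Conditional logic that would be a CASE expression in SQL.
    flat["tier"] = "high" if flat.get("stats_spend", 0) > 100 else "low"
    return flat

record = {"id": 7, "stats": {"spend": 150, "clicks": 9}}
print(transform(record))
# {'id': 7, 'stats_spend': 150, 'stats_clicks': 9, 'tier': 'high'}
```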
Some data pipeline tools, such as Fivetran, focus on extraction and loading and leave transformations to other tools. Meanwhile, tools like dbt are purely concerned with transformations.
Conclusion
Selecting the right data pipeline tool can be a daunting task. There are many tools out there, each covering different parts of the modern data stack. The best approach is to focus on the ones that are most relevant to your business use cases and immediate needs.
There are many data systems and applications; consequently, many data pipeline tools provide over a hundred built-in connectors. Make sure your tool of choice supports your source and target systems as well as the different authentication methods.
Not all data pipeline tools provide a programmatic interface to build and run pipelines via code. In addition, not all data pipeline tools can be extended with custom integrations and features by building upon the core functionality. Automation can be a great time saver for very little effort, so look for this feature if you intend to manage every aspect of your pipelines via code.
Auditability can be built on top of any data pipeline tool to a certain degree, but it is best if the tool provides this feature out of the box. It is even better if the changes are tracked via a source control application like git.
Point-and-click tools greatly simplify building data pipelines, but they may not be as scalable as the tools that prefer the pipeline-as-code approach. If near-term growth is expected, it is best to invest in tools that can scale along with your business.
Scheduling, orchestration, and monitoring (SOM) are all broad topics, and there are dedicated tools that provide these services in a tool-agnostic manner. If you are looking for a “batteries-included” data pipeline tool, look for these features. Alternatively, if you are planning to integrate your pipeline tool into an existing SOM application, then make sure they are compatible.
With the transition to ELT-based pipelines, transformations have become a less sought-after feature. If you can’t make use of ELT and are looking for transformations beyond basic aggregations and joins, a data pipeline tool with advanced transformation capabilities is what you need.