Data Ops is a new methodology that is rapidly growing in popularity. It aims to apply the best practices of lean manufacturing, agile development, and process orientation to data science and analytics. To that end, Data Ops has many similarities to DevOps. Beyond the scope of DevOps, Data Ops includes components specific to working with data, namely data integration across domains, data quality, and data governance.

Data Ops promises to increase the reliability of data extraction and transformation, simplify its presentation, and make it accessible to all stakeholders within an organization and beyond. Adopting Data Ops reduces the cost of developing and maintaining data-driven applications and increases the return on investment of data projects.

In this article, we will review the key components of implementing Data Ops in an enterprise and illustrate their use with examples.

Summary Table

| Concept or Practice | Description |
| --- | --- |
| Data Pipelines | Sequential processes for data treatment and transformation, with the purpose of making it consumable. |
| Data as a Product or Data Product | Ready-for-consumption datasets that are discoverable, addressable, trustworthy, reliable, interoperable, self-describable, and secure. |
| Data Mesh | Data products from different domains that are connected through a universal interoperability/semantic layer make up a data mesh. |
| Data Fabric | The core principle of Data Fabric is to continuously analyze metadata to bring intelligence and automation to data operations. |
| Data Quality Process | Monitoring data using different tools, the establishment of preproduction and production testing, and automated control of data going into production. |
| Semantic Layer | A logical layer in which data is clarified in a unique, unambiguous way that can be understood by all people within an enterprise. |
| Data Governance | Data monitoring, access, and control. |
| Infrastructure as Code | Containerization of staging/validation/production, source and version control, and parameterization. |

Data Ops: Concepts and Practice

Data Ops achieves its goals by reducing repetition and by automating and standardizing processes.

Traditional data pipeline development is repetitive: developers, data scientists, and analysts create separate pipelines to access often similar data sources. In the end, there are several nearly identical data pipelines, each maintained within its respective project or task. Data Ops replaces these disparate pipelines with data products that are easily consumable by multiple projects and tasks. Together, these data products, connected through a semantic interoperability layer, create a data mesh.

Traditional data pipelines are often developed in an ad hoc manner to serve a particular purpose. While it is faster to start this way, pipelines often break down due to unexpected data being introduced into the source or changes in the source. Finding errors often requires considerable effort, while downstream products suffer downtime and cost the company its reputation and customers. Data Ops promotes data products that are version-controlled with automatic continuous integration and continuous delivery (CI/CD) and testing as part of the build. 
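
As a rough illustration of testing as part of the build, the sketch below pairs a small transformation with unit tests that a CI job could run before a data product is released. The function `normalize_country`, its rules, and `run_build_checks` are hypothetical names chosen for this example, not part of any specific Data Ops tool.

```python
import unittest


def normalize_country(code: str) -> str:
    """Hypothetical transformation step: clean a two-letter country code."""
    cleaned = code.strip().upper()
    if len(cleaned) != 2 or not cleaned.isalpha():
        # Fail loudly on unexpected source data instead of passing it downstream.
        raise ValueError(f"unexpected country code: {code!r}")
    return cleaned


class NormalizeCountryTest(unittest.TestCase):
    def test_trims_and_uppercases(self):
        self.assertEqual(normalize_country(" us "), "US")

    def test_rejects_unexpected_input(self):
        with self.assertRaises(ValueError):
            normalize_country("United States")


def run_build_checks() -> bool:
    """Run the test suite the way a CI step might; True means the build may proceed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeCountryTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

In a real setup the CI system would typically invoke the test runner directly; wrapping it in a function here only makes the pass/fail gate explicit.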

By creating a universal semantic layer, Data Ops ensures that data format is standardized and clear to all stakeholders. This standardization includes using metadata that follows the same naming conventions. Although the metadata tags can be different in different domains, some commonly used examples are data location, provenance, creation date, input conditions, and usage.
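
One possible way to enforce shared metadata naming is to define the tags once and check every dataset against them. The sketch below assumes the five tags listed above as field names; the class `DatasetMetadata` and the helper `missing_tags` are illustrative and would differ between domains.

```python
from dataclasses import dataclass, asdict
from datetime import date

# Hypothetical tag names, taken from the examples in the text above.
REQUIRED_TAGS = ("location", "provenance", "creation_date", "input_conditions", "usage")


@dataclass(frozen=True)
class DatasetMetadata:
    location: str
    provenance: str
    creation_date: date
    input_conditions: str
    usage: str


def missing_tags(raw: dict) -> list[str]:
    """Return the required tags that are absent or empty in a raw metadata dict."""
    return [tag for tag in REQUIRED_TAGS if not raw.get(tag)]
```

For example, `missing_tags({"location": "warehouse/customers"})` lists the four other tags, which a publishing step could treat as a reason to reject the dataset.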

Data Pipelines

Following the terminology and principles of lean manufacturing, data pipelines in Data Ops are organized in a data factory. A data factory adds functionality on top of the pipelines themselves: continuous monitoring of quality and efficiency, and built-in transparency. While a traditional data pipeline consists of extract-transform-load / extract-load-transform (ETL/ELT) processes that deliver data to the point of consumption, a data factory is a collaboration between data creators and data consumers. The result of this collaboration is a process that takes into account the interests of the different stakeholders, the specifics of the business process, the constant evolution of requirements and technology, and quality control. This process is supported by CI/CD, change management and version control, quality control, and tracking tools.

Let’s look at an example. Most companies have customer data, which is traditionally processed in many ways, depending on who needs it and for what purpose. Analysts have a Power BI or Tableau pipeline to create brand personas from existing customer demographics and survey results. Financial analysts pull data into spreadsheets to calculate costs, revenues, profits, and more by department or for the entire company. The ML team creates a pipeline that integrates the data with a GIS dataset, census data, and socioeconomic data to predict churn, and eventually sends the enriched data back to the analysts for additional marketing analysis.

If the source data changes—for instance, if the company moves from SAP to Oracle—the analysts, data scientists, and developers need to redo the ETL/ELT pipelines and retest their dashboards, spreadsheets, and applications. These changes increase the risk of error and potential downtime. Lack of cooperation and communication between data producers and data consumers often leads to disruption in downstream services.

The data factory approach tries to solve these issues by orchestrating data in data products in collaboration with data creators and data consumers. These parties agree on changes and universally acceptable semantics. Automated quality control prevents the breaking of downstream applications, and data governance ensures easy data access and monitoring of the impact of changes.
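
A minimal sketch of how an agreed contract can shield consumers from a source migration: each source gets its own adapter, and the customer data product always exposes the same fields. The raw field names for both systems are invented for illustration and do not reflect real SAP or Oracle schemas.

```python
from typing import Callable

# Contract agreed between data creators and consumers (hypothetical fields).
CONTRACT_FIELDS = ("customer_id", "name", "country")


def from_legacy_erp(row: dict) -> dict:
    """Adapter for the old source system (made-up column names)."""
    return {"customer_id": row["CUST_NO"], "name": row["CUST_NAME"], "country": row["CTRY"]}


def from_new_erp(row: dict) -> dict:
    """Adapter for the new source system (made-up column names)."""
    return {"customer_id": row["id"], "name": row["full_name"], "country": row["country_code"]}


def build_customer_product(rows: list[dict], adapter: Callable[[dict], dict]) -> list[dict]:
    """Apply an adapter and verify every record matches the contract."""
    records = [adapter(row) for row in rows]
    for record in records:
        if tuple(record) != CONTRACT_FIELDS:
            raise ValueError(f"record violates contract: {sorted(record)}")
    return records
```

With this arrangement, switching sources means writing one new adapter; dashboards, spreadsheets, and ML pipelines that read the product keep working unchanged.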

Data as a Product

Data as a product, or a data product for the purposes of this article, is a dataset with a logical layer that makes the data ready for consumption by end-users from different domains and with differing levels of technical and data knowledge. The attributes that the logical layer implements are: discoverability, addressability, trustworthiness, reliability, interoperability, self-describing design, and security. 

To better understand the concept of a data product, we can look at an example of an API that returns stock market data. This API looks very similar to a data product because of the following attributes:

  • A dataset
  • Discoverable: One can find it through Google search
  • Addressable: One can find the API at its address
  • Trustworthy: It’s backed by a reputable company
  • Reliable: Data is correct, and uptime is close to 100%
  • Interoperable: It can be easily integrated with analytical tools or custom-built software
  • Self-Describing: The documentation is part of the product
  • Secure: Authentication, authorization, and secure data transfer are part of the product

However, a stock market API lacks some of the abstraction (the logical layer) that a data product needs in an environment where a wide variety of APIs and other data systems are in play.

While an API interface is one of the most common in the public space, data products can be addressed through a file interface or a data table interface. A data product can also offer several interfaces at the same time.
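
To make the idea of several interfaces concrete, here is a small sketch of a data product that serves the same rows as a table and as a CSV file, and describes itself. The class `DataProduct` and its method names are assumptions made for this example.

```python
import csv
import io
import json
from dataclasses import dataclass


@dataclass
class DataProduct:
    name: str
    description: str
    columns: list[str]
    rows: list[dict]

    def describe(self) -> str:
        """Self-describing: documentation travels with the product."""
        return json.dumps(
            {"name": self.name, "description": self.description, "columns": self.columns}
        )

    def as_table(self) -> list[tuple]:
        """Data table interface: rows as tuples in column order."""
        return [tuple(row[col] for col in self.columns) for row in self.rows]

    def as_csv(self) -> str:
        """File interface: the same rows serialized as CSV text."""
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=self.columns, lineterminator="\n")
        writer.writeheader()
        writer.writerows(self.rows)
        return buffer.getvalue()
```

An API interface could be added the same way, as one more method or a thin web handler over `rows`, without changing the underlying dataset.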

The convenience of data products resulted in the development of various tools that offer the conversion of data into data products. For example, GraphQL makes it easy to add a searchable API layer with data product attributes to virtually any data source. Similarly, cloud-based platforms such as AWS SageMaker or Azure ML Studio facilitate publishing ML-enhanced data as searchable endpoints.

Data Mesh

The combination of data products from different domains connected through a universal semantic layer constitutes a data mesh. A data mesh is a distributed data architectural framework based on the following principles: data separated by knowledge domains, data represented by distributed data products, federated data governance, and self-service architecture. Its objective is to provide a view into the available data to extract business value. As such, it follows the goals previously implemented by data warehouses and data lakes but allows for the extraction of value from an ever-changing and growing data domain. To account for the developing data domain, data products within the data mesh implement data quality processes and data governance, and the semantic layer clarifies the data to ensure that its meaning is clear to all stakeholders. 
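
The principles above can be sketched as a tiny catalog in which each domain registers and owns its products, while consumers discover and address them through one shared entry point. `MeshCatalog` and the `domain/name` address format are hypothetical; real data mesh platforms are considerably richer.

```python
from collections import defaultdict


class MeshCatalog:
    """Toy registry: products are grouped by domain and owned by that domain."""

    def __init__(self) -> None:
        # domain -> {product name -> owning team}
        self._products: dict[str, dict[str, str]] = defaultdict(dict)

    def register(self, domain: str, name: str, owner: str) -> None:
        if name in self._products[domain]:
            raise ValueError(f"{domain}/{name} is already registered")
        self._products[domain][name] = owner

    def address(self, domain: str, name: str) -> str:
        """Addressable: every product has a stable, predictable address."""
        if name not in self._products.get(domain, {}):
            raise KeyError(f"unknown data product: {domain}/{name}")
        return f"{domain}/{name}"

    def discover(self, keyword: str) -> list[str]:
        """Self-service discovery across all domains."""
        return sorted(
            f"{domain}/{name}"
            for domain, products in self._products.items()
            for name in products
            if keyword in name
        )
```

Keeping ownership per domain while sharing only the catalog mirrors the split between federated governance and a common, self-service layer.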

An example of a data mesh is shown below.

Data mesh diagram (Source). 

Data Fabric

Distributed data access can also be organized using a data fabric, which is an architectural framework whose goal is to enable data access across different platforms and technologies. While the data fabric aims to achieve many of the same goals as the data mesh, it approaches the task from a technological perspective and focuses less on adjusting organizational practices. The underlying principle of data fabric is to use technology to enable the automation of data operations and the ongoing analysis of metadata. Data fabric and data mesh are not substitutes for each other; they are different approaches to solving the data problems of a modern enterprise. Data fabric can also enable and facilitate the implementation of data mesh in an enterprise.

Data Quality Process

From lean manufacturing comes an emphasis on data quality. The data quality process in Data Ops is built into the data factory. Automated tests and statistical benchmarks control the data entering the production analysis stage, checking that it is free of mechanical errors and conforms to business logic rules. This control brings two major improvements: production uptime approaches 100%, and the effort spent correcting broken data feeds in production approaches zero. Data engineers can work on improvements and corrections and be sure that the changes will not interrupt the work of analysts, data scientists, and other data consumers.
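
A quality gate of this kind might combine row-level business rules with a statistical benchmark, as in the sketch below. The field names, the 20% tolerance, and the function `check_batch` are assumptions for illustration; production systems would usually rely on a dedicated testing library.

```python
from statistics import mean


def check_batch(rows: list[dict], baseline_mean: float, tolerance: float = 0.2) -> list[str]:
    """Return failure messages; an empty list means the batch may enter production."""
    failures = []
    for i, row in enumerate(rows):
        # Mechanical error: a required key is missing.
        if row.get("customer_id") is None:
            failures.append(f"row {i}: missing customer_id")
        # Business rule: amounts must be non-negative numbers.
        amount = row.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            failures.append(f"row {i}: amount must be a non-negative number")

    # Statistical benchmark: the batch mean should stay near the historical mean.
    amounts = [r["amount"] for r in rows if isinstance(r.get("amount"), (int, float))]
    if amounts and abs(mean(amounts) - baseline_mean) > tolerance * baseline_mean:
        failures.append("batch mean deviates from the benchmark")
    return failures
```

A pipeline step could call `check_batch` before loading and stop the release when the returned list is not empty, so that downstream consumers keep using the last good version.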

Data quality is also the result of working on a virtual infrastructure where each developer gets a replica of the production environment where they can work on their problems without the risk of interrupting the work of others.

Data quality can be implemented with a variety of tools. One example is Great Expectations, a library that offers data testing, profiling, and documentation. Fake-input generation libraries, such as libFuzzer or Faker, are used to test the pipeline against a wide range of inputs. In ML-augmented data products, model performance and quality can be monitored using TensorBoard or the experiment infrastructure of AWS SageMaker and Azure ML. These quality monitoring tools plug into the data process, often through dedicated callbacks, which extract and analyze the data and make it available for further review.

Semantic Layer

Data products are particularly useful to consumers due to the implementation of a semantic layer, which is a layer of abstraction between a data source and the end-user. End-users of data may not have the knowledge or time to process raw data feeds, even if the data comes from data products. Data products target many customers at once, so there is a need to simplify, filter, and transform the output of data products into something the end-user understands without spending a lot of time editing the data. Implementing a semantic layer requires collaboration between data producers and different types of data consumers to ensure the adoption of an unambiguous taxonomy.

While many analytics tools allow a semantic layer to be built within the tool’s infrastructure (e.g., Tableau Server, PowerBI), the Data Ops approach is to build this layer outside of an analytics tool. This makes the layer accessible not only to analysts working within the analytics tool but also to a broader audience of users, such as data scientists and business professionals. 
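
As a simplified illustration of a tool-agnostic semantic layer, the sketch below maps cryptic source fields to agreed business terms and decodes their values, so every consumer sees the same meaning. The raw field names, the status code `"A"`, and the `TAXONOMY` mapping are invented for this example.

```python
from datetime import date

# Hypothetical taxonomy: business term -> (raw source field, decoder).
TAXONOMY = {
    "customer_id": ("cust_no", str),
    "signup_date": ("crt_dt", date.fromisoformat),
    "is_active": ("stat_cd", lambda code: code == "A"),
}


def to_business_view(raw: dict) -> dict:
    """Translate one raw record into the shared, unambiguous vocabulary."""
    return {term: decode(raw[field]) for term, (field, decode) in TAXONOMY.items()}
```

Because the mapping lives outside any single analytics tool, a dashboard, a notebook, and a spreadsheet export can all call the same translation and agree on what an active customer is.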

A layer that is tool-agnostic is known as a universal semantic layer; Nexla is an example of a tool for creating such an abstract layer. This tool provides a drag-and-drop interface that allows a user to take data from a variety of sources (data lakes, APIs, files, etc.) and publish the transformed datasets to a multitude of destinations (files, data tables, emails, services, etc.).

Universal semantic layer

Data Governance

Data Ops data governance is federated, meaning that each data domain can apply rules, rights, and responsibilities to data products within the domain. Data governance includes tools that allow sharing data products, monitoring changes within the pipeline, and collecting stats. These tools ensure transparency in each processing step and help maintain the pipeline and fix errors.

An additional aspect of data governance is control, which includes benchmarking data flows, finding bottlenecks, and fixing performance errors. These measures can be very granular and enabled at each step of the process: data extraction, load, and transformation. The historical tracking of performance enables finding potential problems by using statistical estimates of what is normal for a particular dataset. Additionally, data governance tools allow the setting up of alerts and notifications based on abnormalities entering the data flow. Where necessary, data engineers and other users can write custom data handling scripts in various programming languages and integrate them with the benchmarking tools.
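
One simple way to estimate what is normal for a dataset is a z-score against its own history, sketched below for a metric such as daily row count. The threshold of 3 standard deviations and the function name `is_abnormal` are assumptions; governance tools may use more robust statistics.

```python
from statistics import mean, stdev


def is_abnormal(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag the latest value if it is far from the historical mean."""
    if len(history) < 2:
        return False  # not enough history to estimate what is normal
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu  # history is constant, so any change is unusual
    return abs(latest - mu) / sigma > z_threshold
```

With a history of `[1000, 1020, 980, 1010, 990]` rows per day, a new load of 1005 rows passes, while a load of 400 rows would be flagged and could trigger an alert or notification.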

Data governance tools also enable granular and easy access management. New users can be added and removed by data product.
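
Access management by data product can be pictured as a per-product set of users, as in this minimal sketch. `AccessRegistry` is a hypothetical stand-in for whatever access control the governance tooling actually provides.

```python
class AccessRegistry:
    """Toy access control list keyed by data product name."""

    def __init__(self) -> None:
        self._grants: dict[str, set[str]] = {}

    def grant(self, product: str, user: str) -> None:
        self._grants.setdefault(product, set()).add(user)

    def revoke(self, product: str, user: str) -> None:
        # discard() is a no-op if the user never had access
        self._grants.get(product, set()).discard(user)

    def can_read(self, product: str, user: str) -> bool:
        return user in self._grants.get(product, set())
```

Granting access to one product says nothing about any other product, which keeps permissions granular and aligned with domain ownership.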

Data governance diagram

Infrastructure as Code

Infrastructure as Code (IaC) is the main commonality between Data Ops and DevOps. 

IaC allows each developer to get a virtual replica of the production environment where they can work on their problems without the risk of interrupting the work of others. As seen from the discussion above, IaC is the technology that supports the data quality process. The production environment is also containerized and can be recreated by simply running a script, which will create a virtual environment, check out the relevant code, set environment variables, restore a copy of the data, and set up rights and access controls.

Environment creation scripts are also version-controlled, which allows reverting to an earlier version at any time. 
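
The parameterization idea can be sketched as a single environment definition from which per-developer replicas are derived by overriding a few fields. The `EnvironmentSpec` fields and the `render` helper are hypothetical; an actual setup would hand the rendered definition to a provisioning tool.

```python
import json
from dataclasses import dataclass, asdict, replace


@dataclass(frozen=True)
class EnvironmentSpec:
    name: str
    code_version: str    # e.g. a tag or commit in source control
    data_snapshot: str   # identifier of the data copy to restore
    env_vars: dict
    readers: tuple       # users granted access to this environment


def render(base: EnvironmentSpec, **overrides) -> str:
    """Derive an environment from a base spec and serialize it deterministically."""
    spec = replace(base, **overrides)
    return json.dumps(asdict(spec), sort_keys=True, indent=2)
```

Because the output is deterministic text, it can be committed alongside the creation scripts, and reverting an environment amounts to checking out an earlier version of the file.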

There are plenty of examples of IaC in modern cloud environments, such as AWS and Azure, as well as within enterprises. For example, the AWS CloudFormation service allows the creation of a complete network of services (virtual machines, network routers, databases, etc.), spread across multiple regions if necessary. Similar functionality is provided by airbnb.io, Pulumi, Kubernetes services on Azure cloud, and others.

AWS CloudFormation designer (Source)

Next Steps

Given the benefits of Data Ops, more and more organizations are looking to implement the practice. Where should they start?

Implementing Data Ops is a strategic goal. Like any strategy, implementing Data Ops requires understanding one’s environment and implementation options. It’s important to remember that there is no one-size-fits-all solution and that acquiring an off-the-shelf solution may be unnecessary or even harmful if it doesn’t meet the organization’s needs.

The following steps can serve as a guide, but not a prescription, for organizations looking to implement Data Ops:

  1. Understand your environment: Identify the data stakeholders, identify the human interactions that take place, understand the needs and pain points of the stakeholders, and take note of the current and expected data flow patterns.
  2. Define your process: Document desired data flows, existing and potential bottlenecks, and required monitoring, access, and security parameters. 
  3. Prepare your tools: Purchase, develop or adapt existing tools to enable the processes defined in step 2 as well as source control, CI/CD, and IaC.
  4. Empower your staff: Assess skill gaps, create a training plan, and hire the right people to make the transformation successful.
  5. Implement: Develop data pipelines and set up testing and monitoring in collaboration with all stakeholders.
  6. Evaluate: Continually review your Data Ops performance, communicate achievements and challenges, and seek feedback.

Conclusion

Data Ops promises to truly harness the value contained in data. Through automation and quality controls, Data Ops enables rapid and meaningful experimentation and change management in response to new trends and emerging data streams. Through distributed data governance, Data Ops enables cooperation and the sharing and extraction of information previously hidden by layers of processing complexity. Data Ops standardization enables the emergence of multiple tools that reduce the complexity of various steps in the data analysis process; these tools can be applied almost universally without requiring custom development. Data Ops makes data accessible to business analysts, data scientists, and even the public, all without the risk of data corruption, unauthorized access, or business interruption due to performance overhead from additional queries.
