The Science of Practical Data Fabric – Part 2

In the previous blog we discussed how data fabric accelerates data adoption through accurate and reliable automated integrations, allowing business users to consume data with confidence and enabling less-skilled citizen developers to become more involved in the integration and modeling process. In this blog, we will discuss metadata intelligence and introduce a use case to explain how data fabric can help you streamline your data engineering practices with autoscaling pipelines.

Before we talk about how to autoscale pipelines, we need to understand the secret sauce that powers it: metadata intelligence.

What is Metadata, and What is Metadata Intelligence?

A metadata-driven data fabric has significant potential to reduce data management effort across design, deployment, and operations. According to Gartner analyst Mark Beyer, metadata is “data about data, data systems, data processing”: the what, when, where, who, and how of data are all metadata.

The data fabric listens to, learns from, and acts on metadata. However, metadata is scattered everywhere. Traditionally, people thought that metadata came only from database systems, but it also exists in query logs, data schemas, and many other sources. When you apply continuous analytics over existing, discoverable, and inferred metadata assets, you get metadata intelligence.

In short, the metadata intelligence layer observes data at the record level, infers metadata from the data, and combines that with system metadata to build a deep understanding of the data. This layer enriches the semantics of the underlying data, which lets the data fabric generate alerts and recommendations that people and systems can act on. As a result, it improves data accuracy and usability for data consumers.
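To make "inferring metadata from data" concrete, here is a minimal Python sketch that scans records and derives per-field metadata: the value types observed, the share of null values, and how often the field appears. The function name `infer_schema`, the sample records, and the three metrics are our own illustrative assumptions, not Nexla's implementation.

```python
from collections import Counter, defaultdict


def infer_schema(records):
    """Infer per-field metadata from a list of dict records (illustrative only)."""
    type_counts = defaultdict(Counter)  # field -> Counter of observed type names
    null_counts = Counter()             # field -> number of None values
    for record in records:
        for field, value in record.items():
            if value is None:
                null_counts[field] += 1
            else:
                type_counts[field][type(value).__name__] += 1

    total = len(records)
    schema = {}
    for field in sorted(set(type_counts) | set(null_counts)):
        present = sum(type_counts[field].values()) + null_counts[field]
        schema[field] = {
            "types": sorted(type_counts[field]),               # non-null types seen
            "null_rate": null_counts[field] / present if present else 0.0,
            "coverage": present / total if total else 0.0,     # share of records with the field
        }
    return schema


# Hypothetical inventory records with a mixed-type field and an optional field.
sample_records = [
    {"sku": "A1", "price": 3.5, "in_stock": True},
    {"sku": "B2", "price": None, "in_stock": False},
    {"sku": "C3", "price": 4, "promo": "2-for-1"},
]
sample_schema = infer_schema(sample_records)
```

In this sketch, a field such as `price` that shows up as both `float` and `int`, or a field such as `promo` that appears in only some records, is exactly the kind of signal a fabric could turn into an alert or a recommendation.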

Now let’s discuss how metadata intelligence helps customers scale their pipelines.

Customer case study: Scaling pipelines

One of the major challenges data engineers face is scaling pipelines. A data pipeline may originally be built to process hundreds of millions of records, but over time that need may grow to a billion, or even 10 billion, records. Scaling to this level is not as simple as adding parallelism: if the pipeline was not designed to support load increases, managing the growth becomes extremely complex and, in some cases, impossible.

This is where metadata intelligence plays a role.

Let’s take a look at how data fabric, powered by metadata intelligence, aids in scaling pipelines through a delivery company use case.

Delivery companies such as DoorDash, Instacart, and Glovo have become the new norm (see diagram above). As end users, we open an app, look at what’s available, purchase items, and have them delivered to us. According to Statista, more than 161 million people used meal delivery apps in 2022, and 116 million used grocery delivery apps.

These companies face three main challenges:

  • Streamlining data exchange
  • Keeping up with increases in data volume
  • Keeping up with the data user experience

Scaling Pipelines Challenge #1: The Need to Streamline Data Exchange

There is a lot of data exchange happening behind the scenes to get a package or meal delivered. The delivery companies need to ensure that they get data from different stores, such as CVS, Walgreens, or Safeway, to fully and accurately understand what products are available in which store.

Here are some of the types of data needed for this exchange: 

  • Store inventories from multiple stores and franchises, including location data to confirm that stock is within the delivery radius
  • Product assortments per store, since different stores of the same merchant carry different sets of products
  • Current promotions or specials for each location
  • Product availability, which can change over the course of a business day
  • Store hours per location and proximity to the user

These datasets can come from different sources for each provider, and the feeds can arrive in different ways, such as APIs, files, or custom formats. With different schemas coming in, it is essential to unify all the data into a single structure that is compatible with the delivery company’s logistics and/or ecommerce application. Data engineers address this by creating a standardized data format and a consumable construct that connects to all existing applications.

How did they do it? 

Answer: with data fabric and data products. 

We will talk in detail about data products in the next blog, but for now, just think of them as a construct that makes it easy for any data user to consume data they can trust, in any format they choose.

Solution

Data fabric helps the delivery system connect easily to these data feeds. The fabric interprets the data coming through those feeds and structures it into a semantic layer, which Eckerson calls an abstracted data object. This semantic layer makes it easier for both the data engineer and the data user (e.g., a customer account manager) to understand the data regardless of source.

The data fabric “normalizes” the data into a consistent form. Whether it is a JSON feed coming through an API or a flat CSV file coming from FTP, the user can understand and use the data easily. In this case, Nexla’s universal connectors made it easy for data users to connect, transform, and deliver data from any source to any destination through low-code/no-code interfaces.
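As a rough illustration of what normalizing feeds into a consistent form involves, the Python sketch below reads one JSON feed and one CSV feed and emits records in a single unified shape. The field names, the two mapping tables, and the helper functions are hypothetical and chosen only for this example; a data fabric would derive such mappings from metadata rather than hard-code them.

```python
import csv
import io
import json

# Hypothetical per-feed mappings: source field name -> unified field name.
JSON_FIELD_MAP = {"storeId": "store_id", "itemSku": "sku", "qty": "quantity"}
CSV_FIELD_MAP = {"STORE": "store_id", "SKU": "sku", "ON_HAND": "quantity"}


def normalize(row, field_map):
    """Rename source fields to the unified names and coerce value types."""
    unified = {target: row[source] for source, target in field_map.items()}
    unified["store_id"] = str(unified["store_id"])  # IDs are kept as strings
    unified["quantity"] = int(unified["quantity"])  # CSV values arrive as text
    return unified


def from_json_feed(text):
    """Parse a JSON array of objects into unified records."""
    return [normalize(row, JSON_FIELD_MAP) for row in json.loads(text)]


def from_csv_feed(text):
    """Parse CSV text with a header row into unified records."""
    reader = csv.DictReader(io.StringIO(text))
    return [normalize(row, CSV_FIELD_MAP) for row in reader]


# Two made-up feeds describing inventory in different shapes.
json_feed = '[{"storeId": 101, "itemSku": "A1", "qty": 12}]'
csv_feed = "STORE,SKU,ON_HAND\n202,B2,7\n"

unified_records = from_json_feed(json_feed) + from_csv_feed(csv_feed)
```

Once both feeds share the same field names and types, downstream applications only need to understand one structure, no matter how many providers are added.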

Scaling Pipelines Challenge #2: Increase in Data Volume

Another challenge is the ability to handle increased data volume as new merchants onboard. This happens rapidly and constantly with delivery companies. With a single acquisition, delivery services could easily gain thousands of stores, and all their relevant data, at once. Imagine a data engineer trying to onboard 10,000+ CVS stores nationwide or thousands of different local stores! 

It is not just a matter of simple connectivity. Each location’s data must also update constantly, especially for delivery apps. The data engineer needs to keep inventory current both within and across days to ensure the best user and application experience.

Solution

When inventory updates have to happen constantly and data keeps increasing, automation becomes essential. Automation helps with autoscaling pipelines.  

So how do we bring in automation? Answer: with data fabric.

Metadata intelligence helps you quickly map vendor data to the delivery company’s data platform. It accelerates this process by automatically creating a logical representation of the data.
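One simple way to picture automated mapping is to suggest, for each vendor field, the closest-named field in the platform schema. The Python sketch below uses string similarity from the standard library's `difflib` as a stand-in heuristic. It is an assumption made for illustration, not a description of how Nexla's metadata intelligence works, and the field names and the `suggest_mapping` helper are hypothetical.

```python
import difflib
import re


def _canonical(name):
    """Lowercase a field name and drop separators so 'Store-ID' matches 'store_id'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def suggest_mapping(vendor_fields, platform_fields, cutoff=0.6):
    """Suggest a platform field for each vendor field, or None if nothing is close."""
    canonical_targets = {_canonical(f): f for f in platform_fields}
    mapping = {}
    for field in vendor_fields:
        matches = difflib.get_close_matches(
            _canonical(field), list(canonical_targets), n=1, cutoff=cutoff
        )
        mapping[field] = canonical_targets[matches[0]] if matches else None
    return mapping


# Hypothetical platform schema and an incoming vendor feed's field names.
platform_schema = ["store_id", "sku", "quantity", "price"]
vendor_feed_fields = ["Store-ID", "SKU", "Quantity", "unit_price", "aisle"]

suggested = suggest_mapping(vendor_feed_fields, platform_schema)
```

Fields with no close match, such as `aisle` here, map to `None` and can be flagged for a person to review, so a new vendor mostly needs confirmation rather than hand-built mappings.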

This empowers data users to use Nexla’s low-code/no-code interface to work with schemas (metadata), create pipelines and data flows, and automatically scale to keep up with the data generated. Even as the amount of data increases significantly, automated metadata intelligence makes it easy to use the specified datasets in ETL, ELT, and/or reverse ETL data pipelines at scale. The user-friendly interface keeps the learning curve small, so users can get started quickly.

Conclusion

In this blog, we’ve discussed how intelligent metadata enables autoscaling of pipelines in a data fabric architecture. Between this and Part 1, where we went in depth into how data fabric accelerates data adoption, we’ve covered the basics of data fabric. In the next part, we will continue our discussion of the delivery company use case and show how data products help with the end-user experience. You can view the recording of the webinar by the author of Data Fabric: The Next Step in the Evolution of Data Architectures here.
