Data Mesh: Tutorial and Best Practices
Data lakes and data warehouses have been the go-to solutions for data storage and analysis for many years. Data lakes are often criticized for their lack of structure and the difficulty of data retrieval. Data warehouses have limitations in scalability; as the size of the database increases, so does the complexity of the queries. They are also slow to update, and changes in the underlying database structure require significant time and effort.
Both require a small army of dedicated data engineers and database developers who understand the data being ingested and stored. This team has to act as the “guardian and keeper” of the organization’s data. They get called upon whenever new data needs to be acquired, cleaned, or stored; when existing data needs to be maintained; when new applications are built on top of the data; or when any part of the organization requests access to a data set. That’s a hefty task for a small team of people.
It is little wonder that many organizations struggle with their data management. Few people outside the data engineering team know their company’s data, its location, or its meaning. This is a real problem for businesses. Valuable data stays undiscovered, unexplored, and untapped. Organizations are less likely to discover new insights with fewer eyes on the data. When data is trapped in silos encouraged by data lakes and warehouses, it is even harder to connect the dots between data points and make data-driven decisions.
In this article, you will learn how data mesh architecture solves these problems by helping create an ecosystem of data products and services. You will be introduced to the core principles of data mesh, what they are and why we need them, and how the data mesh architecture can help organizations develop data-driven products, services, and insights while ensuring data is handled responsibly.
Summary of key differences between traditional data storage and data mesh
Aspect | Data lakes and data warehouses | Data mesh |
---|---|---|
Data Quality Management | Typically the responsibility of a centralized data team or data steward. This team ensures the data’s accuracy, completeness, and consistency in the data lake or warehouse. | Shared responsibility of all teams that use the data. Each team is responsible for ensuring the quality of the data they use and for sharing their data in a way that allows other teams to trust and use it. |
Access to Data | Often limited to a select group of data analysts or data scientists. This can be due to technical limitations or data governance policies restricting access to specific data types. | Decentralized and democratized. All teams have access to the data they need to do their work, and there is a culture of transparency and collaboration around data. |
Ownership of Data | Centralized and siloed. For example, a marketing team might own the data related to customer interactions, while a sales team might own data related to sales transactions. | Decentralized and shared. Each team owns the data they produce and is responsible for making it available to other teams. There is a culture of collaboration and shared ownership of data across the organization. |
Understanding Data | Can be challenging due to the lack of context and metadata associated with the data. It can also be challenging to trace the lineage of the data and understand how it has been transformed over time. | Understanding data is a priority. Data is accompanied by rich metadata and context, with a focus on data lineage and traceability. This makes it easier for teams to understand and trust the data they are using. |
Data as a Product | Data is often treated as an input for analysis and decision-making. It is not necessarily seen as a product in and of itself. | Data is treated as a product that can be created, shared, and even monetized. Teams are encouraged to think about using data to create value for their customers or users. |
Dark Data | Dark data refers to data that is collected but not used. In traditional data storage, dark data is often left unanalyzed and can be difficult to access. | Dark data is actively sought out and used in a data mesh. There is a culture of experimentation and continuous learning, and teams are encouraged to try new things and see what works. This helps ensure that all data is used to its full potential. |
Data Silos | May suffer from data silos, where different teams or departments have their own separate data stores, and there is little integration or sharing of data. | There is a focus on breaking down data silos and promoting collaboration and integration across teams. Data is shared and used to promote transparency and cross-functional collaboration. |
Data Governance Responsibility | Typically, the responsibility of a centralized data team or data steward. This team sets policies and standards for collecting, storing, and using data. | Decentralized and shared across teams. Each team is responsible for adhering to data governance policies and ensuring the quality and integrity of the data they use and produce. There is a focus on transparency and collaboration around data governance. This can be difficult in practice, as it is impacted by regulations, company size, and the need to balance risks and agility. |
Data Production and Consumption | Data producers and consumers are tightly coupled and have to work together directly (imagine emails, meetings, and discussions between producer and consumer). | Separates data producers from consumers, with a focus on packaging data into a product and publishing it for consumers. This is key to decentralization. |
Self-service Platform Design | Relies on a centralized data team or data scientists to provide access to data and build reports and analytics. This can create bottlenecks and delays in getting access to data and insights. | Focuses on building self-service platforms that allow teams to access the data they need and build their own reports and analytics without relying on a centralized data team. This enables teams to be more agile and independent in their use of data. |
Messy Inc.’s challenge…
To illustrate the value data mesh implementation can bring to an organization, we will use an example of a fictitious large enterprise, Messy, Inc., throughout this article.
Messy’s current data lives in disconnected silos, in different formats, on various platforms, and with different access policies. Sales data is in Salesforce.com. Marketing is in Hubspot. Product release data is in Atlassian. HR data is in Workday. Customer success data is in Gainsight. Financial info is in Netsuite. They all have different export file formats and APIs with evolving schemas.
Information within each silo is understood partly by the owning department’s users and mainly by the development team that maintains it. Several data sets were understood only by Messy’s principal engineer Ashi, who left the company months ago. That data now sits unused in its silo as “dark data.” No one knows why it exists, but everyone quietly maintains it. As people join and leave the firm, precious knowledge about the data continues to deplete.
Messy’s senior management now decides to track departmental performance trends on a centralized KPI dashboard, including metrics such as sales forecast, customer success, gross margin, product release, marketing leads, and manufacturing efficiency. Connecting these silos to the KPI dashboard mandated by management would require heroic efforts from multiple departments and the development team. They will have to extract data from all the disparate data sources, understand and make sense of it, and normalize it for the dashboard before it can even begin to provide value to the firm. Even if they succeed by some miracle, the dark data, which could transform the firm’s business, will sit untapped and undiscovered, and will probably not even make it to the dashboard.
We will highlight how each data mesh principle solves unique challenges and how solving these problems at Messy without data mesh can get just plain messy!
Data mesh principle 1 – data as a product
The data mesh architecture treats data as a “product” and its consumers as “customers” to address the high friction and cost associated with discovering, understanding, trusting, and using quality data. For data to be considered a “product,” it must be discoverable and secure. At the same time, it must be explorable, understandable, trustworthy, and ready to use.
Organizations must appoint domain data product owners to ensure their data is presented with the contextual metadata that consumers in other departments need to understand and use it easily.
These data product owners would leverage pre-built data connectors (to popular ERP, CRM, or HR applications) available on platforms such as Nexla to
- describe their customized schema
- share sample data, and
- decide who should have access to the data.
The platform would then enhance the data product with features such as
- audit logs to trace usage and
- alerts to notify consumers when data is missing or duplicated.
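The capabilities listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Nexla or any other platform implements them: the `DataProduct` class, its fields, and the `quality_alerts` helper are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DataProduct:
    """Hypothetical descriptor for a data product published by a domain team."""
    name: str
    owner: str                    # domain data product owner
    schema: dict[str, str]        # field name -> type, as described by the owner
    sample_rows: list[dict]       # small sample shared with consumers
    allowed_consumers: set[str]   # who should have access to the data
    audit_log: list[dict] = field(default_factory=list)

    def read(self, consumer: str, rows: list[dict]) -> list[dict]:
        """Return rows if the consumer is allowed, recording every attempt."""
        granted = consumer in self.allowed_consumers
        self.audit_log.append({
            "consumer": consumer,
            "granted": granted,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        if not granted:
            raise PermissionError(f"{consumer} may not read {self.name}")
        return rows


def quality_alerts(rows: list[dict], key: str, required: list[str]) -> list[str]:
    """Flag duplicated keys and missing required values."""
    alerts = []
    seen = set()
    for i, row in enumerate(rows):
        value = row.get(key)
        if value in seen:
            alerts.append(f"row {i}: duplicate {key}={value!r}")
        seen.add(value)
        for column in required:
            if row.get(column) in (None, ""):
                alerts.append(f"row {i}: missing {column}")
    return alerts
```

The audit log records denied attempts as well as granted ones, so the owner can see unmet demand for the product and not only actual usage.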
The generic screenshot below shows how a data product would be presented on such a data-sharing platform:
To tie this back to the challenge at our fictitious company, Messy, Inc., pre-built data connectors would integrate into the applications used by various departments such as
- Netsuite
- Salesforce.com
- Hubspot
- Atlassian
- Workday
- Gainsight
and present the corresponding data products intuitively and safely in a format that is ready to ingest into a centralized cross-departmental KPI dashboard.
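As a rough sketch of what "ready to ingest" could mean for the KPI dashboard, the Python below maps records from two sources onto one common row shape. The raw field names (`Amount`, `CloseDate`, `lead_count`, `month`) and the source keys `crm` and `marketing` are assumptions made for illustration; real exports from these applications will differ.

```python
from typing import Callable

# Hypothetical raw field names; real exports from these applications will differ.
def from_crm(record: dict) -> dict:
    return {
        "department": "sales",
        "metric": "forecast_usd",
        "value": float(record["Amount"]),
        "period": record["CloseDate"][:7],   # "YYYY-MM-DD" -> "YYYY-MM"
    }


def from_marketing(record: dict) -> dict:
    return {
        "department": "marketing",
        "metric": "leads",
        "value": float(record["lead_count"]),
        "period": record["month"],
    }


# One normalizer per source; a new source only needs a new entry here.
NORMALIZERS: dict[str, Callable[[dict], dict]] = {
    "crm": from_crm,
    "marketing": from_marketing,
}


def to_kpi_rows(source: str, records: list[dict]) -> list[dict]:
    """Convert raw records from one source into the common KPI row shape."""
    normalize = NORMALIZERS[source]
    return [normalize(record) for record in records]


def total_by_metric(rows: list[dict]) -> dict[tuple[str, str], float]:
    """Sum values per (period, metric), the shape a dashboard tile might consume."""
    totals: dict[tuple[str, str], float] = {}
    for row in rows:
        key = (row["period"], row["metric"])
        totals[key] = totals.get(key, 0.0) + row["value"]
    return totals
```

Keeping each source's mapping in its own function means a schema change in one application touches only that function, not the dashboard logic.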
Data mesh principle 2 – domain-driven data ownership
Domain-driven data ownership is a data governance approach that assigns data ownership to individual business units, or domains, within an organization. It encourages organizations to view data as a strategic asset, and business units to take ownership of their data and be accountable for its quality, accuracy, and currency. It also fosters collaboration between domains to ensure the data is consistent and accurate across the enterprise.
Messy, Inc. would have to figure out a way to democratize its data and make it more visible, sharable, and easy to work with. Doing so manually would require an excessive amount of time and resources. Once again, their relief lies in deploying tooling that can help expose data from anywhere for everyone on their team. Only after they have access to all the data can they start to understand it, own it, and gain insights into what makes it inaccurate or stale and how they can improve its quality.
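Once a domain team owns a data set, it can start measuring the two problems named above, staleness and inaccuracy. The Python sketch below shows one simple way to do so; the function names, the seven-day threshold in the usage note, and the choice of completeness as a stand-in for accuracy are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(last_updated: datetime, max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    """Return True if the data set has not been refreshed within max_age."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated > max_age


def completeness(rows: list[dict], required: list[str]) -> float:
    """Fraction of rows in which every required field has a value."""
    if not rows:
        return 0.0
    complete = sum(
        all(row.get(column) not in (None, "") for column in required)
        for row in rows
    )
    return complete / len(rows)
```

For example, a domain owner might agree with consumers that a product is stale after `timedelta(days=7)` and publish the completeness score alongside the data so that consumers can judge whether to trust it.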
Data mesh principle 3 – self-service platform design
Data products require a lot of tools and technology to build, run, and access. It takes special skills to make sure all these pieces work together. To ensure teams can control their data products, they need an easy way to set up and manage this technology. This is what self-serve platform design is all about: a platform that removes or reduces the need for software development so teams can focus on consuming data products.
A self-serve data platform is a system that helps people who have limited specialized knowledge create, maintain, and run analytical data products. It provides tools like data storage and pipelines so people can build data products without spending much time and money getting special training. It also helps people keep track of where their data comes from and how it’s used.
To implement this principle and democratize data at Messy, Inc., the company would have to
- Identify and catalog data sources – by working with business units to identify and describe the data sources they use (including format, access policies, and APIs) and creating a directory of those sources,
- Create data access and governance policies – by establishing policies and guidelines for how data can be accessed and used within the organization, and
- Enable data discovery and access – by creating or buying off-the-shelf tools and mechanisms that let users discover and access data in a self-serve manner. This includes creating a user-friendly interface or portal that allows users to search for and access data, and providing tools and APIs that enable users to integrate data into their systems and applications.
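The catalog and discovery steps above can be sketched as a small in-memory directory with keyword search. This is only an illustration: `CatalogEntry`, `DataCatalog`, and their fields are hypothetical names, and a real platform would persist entries and enforce the access policy instead of merely recording it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical directory record describing one data source."""
    name: str
    domain: str
    description: str
    data_format: str
    access_policy: str   # e.g. "open" or "restricted"


class DataCatalog:
    """Minimal in-memory directory that users can search on their own."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[str]:
        """Return names of entries whose name, domain, or description match."""
        needle = keyword.lower()
        return sorted(
            entry.name
            for entry in self._entries.values()
            if needle in entry.name.lower()
            or needle in entry.domain.lower()
            or needle in entry.description.lower()
        )
```

A portal or API would sit on top of a structure like this, so that a user looking for "leads" finds the marketing data product without having to ask the data team where it lives.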
Data mesh principle 4 – federated governance
While domain-driven data ownership ensures the discoverability and usability of data products, organizations need to establish rules for ensuring data products across all parts of the organization are uniformly accessible. Data products will remain chaotic and challenging to use without rules or conventions. Data mesh is supposed to connect different pieces of data from different sources. The federated governance data principle defines the rules that data products follow. It involves standardizing the data products across the whole organization so that all data products comply with industry regulations and organizational rules. This helps to create a data ecosystem that is safe, secure, and efficient.
To implement federated governance, Messy, Inc. would need to
- Identify data stewards and owners – to define and enforce data policies, standards, and governance processes within their domains,
- Establish data governance processes and policies – these could include guidelines for data quality, accessibility, and security, as well as mechanisms for resolving disputes and ensuring compliance with data governance standards, and
- Create a data governance council – responsible for overseeing and coordinating data governance efforts across the organization, resolving disputes, and ensuring compliance with data governance standards and policies.
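One way to make organization-wide rules enforceable while leaving ownership with the domains is to express them as small automated checks that every data product must pass. The Python sketch below assumes three example policies and a plain dictionary of product metadata; the policy names and metadata keys are hypothetical and would in practice be set by the governance council.

```python
from typing import Callable

# A policy takes a product's metadata and returns True when the product complies.
Policy = Callable[[dict], bool]

# Hypothetical organization-wide rules agreed by the governance council.
GLOBAL_POLICIES: dict[str, Policy] = {
    "has_owner": lambda product: bool(product.get("owner")),
    "has_schema": lambda product: bool(product.get("schema")),
    "pii_is_restricted": lambda product: (
        not product.get("contains_pii")
        or product.get("access_policy") == "restricted"
    ),
}


def violations(product: dict) -> list[str]:
    """Return the names of the global policies this product fails."""
    return [name for name, check in GLOBAL_POLICIES.items() if not check(product)]
```

Each domain can run a check like `violations(product)` before publishing, which keeps the rules central but the enforcement local.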
Recommendations
When you are ready to begin your data mesh journey, here are some recommendations to consider:
- Choose the right pilot project: Start small with one pilot project on one domain team. Choose a data product with clear, quantifiable business value, and ensure the domain team you partner with has the right skills to build and support it from day one. Focus on developing ownership and data product solutions for both data producers and consumers.
- Don’t wait for the “perfect” platform: When implementing data mesh, be ambitious but realistic. Set achievable KPIs, and don’t try to overhaul the entire tech stack at once. Take an incremental approach, starting with one silo at a time, and focus on creating best practices for domain teams to follow.
- Define self-service: A data mesh’s specific architecture and capabilities depend on the organization’s business needs. When defining the domain-oriented architecture and self-service data infrastructure, asking questions about what capabilities will give you the most bang for the buck is essential.
- Define domains to thrive independently: To effectively decentralize data teams, start by defining them and then staff each team with the relevant cross-functional talent and domain expertise. Bring the domain experts and tech people as close as possible to enable them to create value quickly.
- Focus on building trustworthy data products: Teams must document rules for building data products and the associated governance of the system. These rules should emphasize trust and reliability of the data, as well as understanding and building back from what the business needs. Additionally, teams should focus on answering the why before they build products, as this will help ensure that the proper requirements are met, and trustworthiness is achieved.
- Governance will be about striking a balance between agility and risk: Decentralization doesn’t mean free-for-all access to data, but the critical change in your pilot efforts would be to start with a more open stance and have a goal to reduce friction. Remember that Data Mesh is also about enabling a culture of collaboration.
Platform | Data Extraction | Data Warehousing | No-Code Automation | Auto-Generated Connectors | Metadata-driven | Multi-Speed Data Integration |
---|---|---|---|---|---|---|
Informatica | ✔ | ✔ | | | | |
Fivetran | ✔ | ✔ | ✔ | | | |
Nexla | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Conclusion
The technical complexity of data lakes, data warehouses, and data integration has resulted in a centralized team structure where every data user has to depend on scarce technical resources to get to their data. That approach has several limitations, such as poor data quality, limited access to data, siloed ownership of data, and difficulty understanding and trusting the data. Additionally, these systems can be complex and require specialized skills, leading to bottlenecks and delays.
Data mesh is a new approach built around the concept of Data Products. It offers a more agile, flexible, and effective way of managing data. It emphasizes decentralized ownership of data, shared responsibility for data quality, and transparency and collaboration around data. It also focuses on building self-service platforms that enable teams to access and use data independently without relying on a centralized data team.