
Just as an accountant audits an organization's finances to certify their legitimacy and regulatory compliance, data must also be audited to establish validity, trust, and proof of effective data governance. In other words, data audits ensure data security and compliance with data regulations and enable a proactive approach to data privacy.

There are many types of data audits that organizations use. Data quality audits are among the most important because subsequent tasks, such as regulatory compliance, depend on them. Data quality issues could also lead to inaccurate reporting and, ultimately, wrong business decisions. Additionally, not having data lifecycle policies that deal with customer data retention laws can lead to fines.

This article will cover data audit key concepts and best practices, including how to implement data auditing techniques and the benefits of a modern data mesh platform with continuous data governance policy enforcement. 

Summary of key data audit concepts

Data audits can be daunting at first because there are many variables to consider. Audits require attention to detail, especially for system design. Investing enough time to understand and implement a strategic approach is key to a successful data audit. The table below summarizes the key data audit concepts we will explore in this article.

Concept | Benefit
Data governance & policies | Ensure that the data lifecycle and data output can be trusted and enforce policies to improve security and compliance.
Data access | Auditing of data access enables compliance reporting with clear records of who accessed what and when.
Data security | Reduces the risk of data breaches with controls including data encryption.
Data retention | Data retention policies help ensure that data privacy laws (e.g., CCPA and GDPR) are obeyed.

Data audit concepts in detail 

The sections below review the four key data audit concepts in detail. 


Data governance & policies 

Data governance, the set of methodologies and policies that govern how data is handled, is a fundamental requirement for any organization working with data. Governance has to be applied throughout the data lifecycle, from when data enters the system to when it is processed and used in its final form. Governance becomes more critical as data privacy concerns continue to rise and data leaks continue to cost millions in damages.

Data governance policies establish rules, processes, and responsibilities for data management. They cover data access, security, privacy, and usage, providing classification, ownership, and stewardship guidelines. These policies address compliance requirements, regulatory standards, and industry best practices, ensuring data compliance with relevant laws and regulations. Organizations mitigate risks, maintain data integrity, and foster trust in data-driven insights by implementing robust data governance.

Integrating data governance policies within data pipelines, such as detection and masking of sensitive attributes, can be very powerful, as governance is applied at ingest, eliminating the risk of leaking sensitive data stored in a data lake. Doing so within the Data Fabric framework brings automation to the process. Data fabric, with its centralized view of data and support for governance initiatives, aids in enforcing compliance and facilitating adherence to data quality standards, security measures, and regulatory requirements. Implementing effective data governance practices is vital for organizations to make informed decisions and maintain the reliability and trustworthiness of their data assets.

Data governance best practices (Source)


Internal or external auditors can conduct a data audit to evaluate an organization's overall data governance. Data duplication across databases is one of the most common issues an audit uncovers. This usually results from insufficient organizational structure in larger teams and can develop naturally over the years if it is not addressed from the beginning of the data lifecycle. 

For example, teams A and B could each receive the same dataset from a centralized data engineering (DE) team via centralized data processing software. Each team then provisions its data in silos, creating data duplication.  

A way to avoid this would have been for teams A and B to receive the same Data Product from DE. Well-designed Data Products such as Nexsets by Nexla can deliver data to multiple users without making new copies while also tracking the lineage of data. Even if the data products were to differ for each team, teams A and B could reuse what is already present for the common part of the pipeline instead of duplicating data. Such an approach reduces governance overhead and risks caused by copies of copies of data.

Data Quality is essential to the reliability of data-driven decisions and models. However, incoming data is often incomplete or erroneous (e.g.,  missing values and typos). Data engineers can use processes like blending data from multiple sources or alerting on missing, critical data fields to address data quality issues. Data engineering teams should define standard data quality practices to ensure data is clean, accurate, and consistent.
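As a minimal sketch of the alerting idea, a data quality gate can flag incoming records that are missing critical fields before they move downstream. The field names and records here are hypothetical:

```python
# Minimal data quality gate: flag records missing critical fields.
# Field names and sample records are illustrative.
CRITICAL_FIELDS = {"customer_id", "email", "created_at"}

def validate(record: dict) -> list[str]:
    """Return the sorted list of critical fields that are missing or empty."""
    return sorted(f for f in CRITICAL_FIELDS if record.get(f) in (None, ""))

records = [
    {"customer_id": "c1", "email": "a@example.com", "created_at": "2024-01-01"},
    {"customer_id": "c2", "email": "", "created_at": "2024-01-02"},
]

for rec in records:
    missing = validate(rec)
    if missing:
        print(f"ALERT: record {rec['customer_id']} missing {missing}")
```

In practice, a check like this would feed an alerting system rather than print, and the critical-field list would be defined per dataset.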


Data access

External and internal data auditing can require reports of who can access data and when access occurs. With data access audits in place, organizations can identify unusual access patterns, such as employees accessing unauthorized datasets, and highlight incorrect data classification patterns. 
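A basic access audit can be sketched as a comparison of an access log against a table of authorizations. The log format, users, and datasets below are illustrative:

```python
# Hypothetical access-log entries: (user, dataset, timestamp).
ACCESS_LOG = [
    ("alice", "sales_db", "2024-03-01T09:00"),
    ("bob", "hr_records", "2024-03-01T23:45"),
    ("alice", "hr_records", "2024-03-02T10:15"),
]

# Datasets each user is authorized to read (illustrative).
AUTHORIZED = {"alice": {"sales_db"}, "bob": {"hr_records"}}

def audit(log):
    """Return access events not covered by a user's authorizations."""
    return [(u, d, t) for u, d, t in log
            if d not in AUTHORIZED.get(u, set())]

for user, dataset, ts in audit(ACCESS_LOG):
    print(f"Unusual access: {user} read {dataset} at {ts}")
```

Real systems would pull the log from database or cloud audit trails, but the core report of who accessed what and when follows the same shape.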

Data must be classified as personally identifiable information (PII) or non-PII. Examples of PII can be seen in Figure 2. PII should be removed if it is not needed and masked if it is used for data analytics.

Figure 2. Examples of personally identifiable information (PII). (Source)

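A minimal PII masking pass might look like the following sketch, which assumes simple regular-expression rules for emails and SSN-like numbers. Production systems typically pair pattern matching with column-level classification:

```python
import re

# Illustrative masking rules for two common PII patterns.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_pii(text: str) -> str:
    """Replace emails and SSN-like numbers with fixed placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    return SSN.sub("<SSN>", text)

print(mask_pii("Contact jane@example.com, SSN 123-45-6789"))
```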

Authentication and authorization are two separate processes that every data-related system must have. Authentication means a person proves their identity to access a system. If the identity is verified, they are granted access. However, the system must also have policies regarding what areas and data each person can access. By default, the system should deny all access unless there is a legitimate reason to grant it. This is known as the principle of least privilege and is an authorization strategy.
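The least-privilege idea can be sketched as a deny-by-default check against an explicit grant table; all names here are illustrative:

```python
# Deny-by-default authorization: access is granted only when an explicit
# grant exists; everything else is refused (least privilege).
GRANTS = {
    ("alice", "sales_db"): "read",
}

def is_authorized(user: str, resource: str, action: str) -> bool:
    """Allow only if an explicit grant covers this exact action."""
    return GRANTS.get((user, resource)) == action

print(is_authorized("alice", "sales_db", "read"))    # explicitly granted
print(is_authorized("alice", "sales_db", "write"))   # no grant, denied
print(is_authorized("mallory", "sales_db", "read"))  # unknown user, denied
```

The key design choice is that the absence of a grant is a denial; nothing is reachable by accident.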

Without strong authentication and authorization policies, data breaches can happen. A recent example is the exposure of data relating to airport employees. In this example, data was stored in AWS S3 buckets without basic authentication enforced, resulting in data exposure. 

Data security

One of the most critical pillars in data auditing is data security. We have all seen through the years that data breaches are possible and do happen. Alongside data access, which we covered in the section above, another essential focus of a data security audit is ensuring that data is appropriately encrypted. Encryption is needed both for data in motion and for data at rest. Even if an attacker gets past access controls, a strong encryption implementation leaves the offender with no way to read the data. 

A typical example of data encryption in motion is the HTTPS you see before the website name in a web browser (including this article). This means all communication from you (the client) to the website is encrypted using SSL/TLS certificates. Encrypting "data in transit" protects against a man-in-the-middle attack, meaning an attacker who intercepts communication between the client and server cannot get access to the data. Encryption is also essential for "data at rest" stored in a database or on disks. Not having these encryptions present is a significant data audit red flag. Public-key encryption tools such as PGP and its open-source implementation GPG are popular solutions for encrypting data.
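On the "data in transit" side, Python's standard ssl module illustrates what a well-configured client enforces: its default context requires certificate validation and hostname checking, which is exactly what defeats a man-in-the-middle attack:

```python
import ssl

# Default client context: verifies server certificates and hostnames.
ctx = ssl.create_default_context()

print(ctx.verify_mode == ssl.CERT_REQUIRED)  # certificate must validate
print(ctx.check_hostname)                    # hostname must match the cert
print(ctx.minimum_version)                   # floor on the TLS version used
```

An audit red flag in application code is the opposite pattern: contexts with `check_hostname` disabled or `verify_mode` set to `CERT_NONE`.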


Data retention

While data storage is cheap, it is important for organizations to have a well-defined retention policy. Old data often has little business value but can still pose a significant data security risk, so it is best not to retain data that has no value, reducing risk exposure.
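A retention policy can be sketched as a simple age check. The two-year window below is an assumed example; real systems typically apply different windows per data class:

```python
from datetime import datetime, timedelta, timezone

# Illustrative retention window (assumption: two years).
RETENTION = timedelta(days=365 * 2)

def expired(created_at: datetime, now: datetime) -> bool:
    """True if the record has outlived the retention window."""
    return now - created_at > RETENTION

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
old = datetime(2021, 1, 1, tzinfo=timezone.utc)
recent = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(expired(old, now), expired(recent, now))  # → True False
```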

Data retention is also mandated by some regulations. One of the most common data audit applications is ensuring that laws around customer data are applied in their entirety. For example, if a European Union (EU) customer leaves a product or service, they can exercise their right to be forgotten. This means the company must have an automated process that deletes all of that customer's data from anywhere within its systems, including cached and long-term storage databases. 

Four essential data audit best practices

The four data audit best practices below can help organizations keep their data secure and compliant. 

Adopt a modern approach to data engineering

Data product-based platforms bring a powerful new approach because they allow federated data governance, enabling data owners to determine who can access which data products.  Data Engineering solutions that support this approach offer no-code connectors and integrations that help to produce Data Products. In addition, a platform of choice should also provide a framework to manage Data Products and their lifecycle, including creation, discovery, governance, and consumption. 

Apply access policies based on data classification 

Not every employee needs access to every piece of data. For example, marketing department users do not need access to trading department data, and vice versa. Data classification helps solve this problem: administrators can create two separate access groups to restrict access to data classified as "marketing" or "trading," respectively. In addition to classifying data based on business units, each data field can be classified as PII or non-PII to refine policies further and support compliance efforts. 
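One way to sketch classification-based access is to tag each dataset and clear each access group for a set of tags; all names here are illustrative:

```python
# Classification-based access: each dataset carries classification tags,
# and each group is cleared for a set of tags (all names illustrative).
DATASET_TAGS = {
    "campaign_stats": {"marketing"},
    "trade_history": {"trading", "pii"},
}
GROUP_ALLOWED_TAGS = {
    "marketing_analysts": {"marketing"},
    "trading_desk": {"trading", "pii"},
}

def can_access(group: str, dataset: str) -> bool:
    """A group may read a dataset only if cleared for all of its tags."""
    tags = DATASET_TAGS.get(dataset, set())
    return bool(tags) and tags <= GROUP_ALLOWED_TAGS.get(group, set())

print(can_access("marketing_analysts", "campaign_stats"))  # True
print(can_access("marketing_analysts", "trade_history"))   # False
```

Note that an unclassified dataset is denied to everyone, which forces classification to happen before access is possible.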

A more modern and scalable approach is to enforce these policies through Data Products, which carry their access controls and classifications along with the data they deliver. 

Administrators can apply these data access policies and classifications with a centralized data catalog solution shared with the compliance department. If more permissions are needed, users can raise a separate request with proper business justification, obtain approvals from data owners, and then receive access. 

Remember that data security techniques depend on the application 

Most modern data processing tools include options for data encryption. If you run custom data and web applications, you will likely need to budget for SSL/TLS certificates for those applications, with options ranging from free to around $50 a year. Additionally, if you are using the cloud, providers enable encryption by default for most data storage services, and you can customize it with your own certificates or keys.

Ensure adherence to General Data Protection Regulation (GDPR) laws

GDPR has strict policies around data retention and deleting customer data. For example, in some instances, you must ensure all of a specific customer’s data is removed from your data systems. In practice, there can be entries for a customer in more than one database table or across different databases. 

A practical approach to addressing this problem is to record all of the systems that store customer data and automate the process rather than manually deleting each record. To ensure accuracy, it is essential to have the process internally tested. Then, with a click of a button, you can achieve the goal of removing a specific customer’s data.
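The record-and-automate approach can be sketched as a registry with one deletion handler per system that stores customer data; the store names and handlers below are hypothetical, and the handlers return descriptions instead of touching real databases:

```python
# Right-to-be-forgotten sketch: register one deletion handler per data
# store, then run them all for a given customer (names are hypothetical).
DELETION_HANDLERS = {}

def deletes_from(store_name):
    """Decorator that records which store a handler covers."""
    def register(fn):
        DELETION_HANDLERS[store_name] = fn
        return fn
    return register

@deletes_from("orders_db")
def delete_orders(customer_id):
    return f"DELETE FROM orders WHERE customer_id = '{customer_id}'"

@deletes_from("cache")
def purge_cache(customer_id):
    return f"purged cache keys for {customer_id}"

def forget_customer(customer_id):
    """Run every registered handler and report which stores were covered."""
    return {store: fn(customer_id) for store, fn in DELETION_HANDLERS.items()}

print(sorted(forget_customer("c42")))  # every registered store must appear
```

The registry doubles as the audit record: if a system holding customer data has no registered handler, that gap is visible before a regulator finds it.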


Platform | Data Extraction | Data Warehousing | No-Code Automation | Auto-Generated Connectors | Metadata-driven | Multi-Speed Data Integration
Informatica | ✔ | ✔ | | | |
Fivetran | ✔ | ✔ | ✔ | | |
Nexla | ✔ | ✔ | ✔ | ✔ | ✔ | ✔

Conclusion

Data audits enable organizations to improve compliance, enhance security, and optimize data quality. Adopting best practices, such as utilizing a modern data platform and applying granular access policies, helps teams continuously enforce policies that strengthen overall data quality, streamline data audits, reduce data duplication, and avoid data silos through a clearly defined organizational team structure.

Data platforms like Nexla help organize data from different departments, making it easier to maintain a centralized inventory of data across the company and beyond. Teams can then cleanse that data and enforce encryption, retention, privacy, security, and access policies.
