Chapter 2 — AI Life Cycle Risk Management — Part D: AI Data and Asset Management

Part D: AI Data and Asset Management

A key aspect of the AI life cycle is understanding and managing the AI solutions deployed by the enterprise, as well as the data used to train, validate, and operate those solutions (e.g., identifying AI models running in HR systems).

Data plays an essential role both in the development of AI systems or models and in the use of AI, where input in the form of text, audio, and visuals elicits a response (e.g., prediction, generative text response) or action (e.g., turning on lights, scheduling a meeting). As such, the foundation of AI begins with good data management hygiene and controls. Many of the issues and risk associated with AI can be traced to issues with data. For example, AI researchers have offered several explanations for why AI hallucinates, and key factors include training data that is biased, incomplete, incorrectly labeled, misaligned, or irrelevant.

2.15 AI Asset Inventory

AI assets are distinct from traditional IT assets because each is composed of many components. In short, an AI asset is not a single application such as Microsoft Excel; rather, one AI solution may have multiple owners, models, and versions. AI solutions have multiple requirements (including workflow mappings), contain multiple datasets (e.g., training datasets and data sources for production), and use multiple algorithms. In addition, these solutions may have multiple SDLC considerations, use third-party tools (e.g., Open Source AI, MLflow, Kubeflow, Amazon SageMaker, Azure Machine Learning), and may require multiple licenses for various components. Finally, a single AI solution is also subject to a variety of legal and regulatory compliance considerations.

For these reasons, identifying AI assets, whether internal or external, is a critical task as organizations rely on AI for decision making, operations, efficiency, and innovation. Discovering what tools or systems the enterprise is using requires a structured approach that combines areas of governance, risk management, IT operations, and internal audit. Internal audit should not lead or manage this effort; rather, the effort should be led by the department that leads the organization in AI and/or data management.

An AI usage policy is often the starting point for an AI inventory, as many asset identification steps may have been completed in the development of the policy. Refer to the catalog or inventory,¹²⁵ if one has been created. The inventory should be updated at least annually; if this has not been recently completed, the inventory owner or manager should start an update.

Enterprises that develop their own AI products or services should create and maintain an AI model catalog. See 2.4 Build, Adapt, and Document Models for more information.

When reviewing the AI asset identification process, the AI auditor should consider several areas of the organization to ensure the completeness and accuracy of the process used to create and maintain the inventory.

2.15.1 Inventory Objectives and Procedures

Ideally, the AI inventory can be modeled after existing IT asset inventories and follow similar identification and maintenance processes. The purpose of an AI inventory is to identify AI solution assets currently in use, not to penalize departments or employees who are using these tools, perhaps without prior company knowledge. Organizationwide communication should state the purpose of the inventory, the requirement to identify any AI used, and how the inventory effort will increase governance and collaboration while ensuring security. Ensure that employees understand that no one will be reprimanded for disclosing AI use that is in accordance with company policy; rather, this supports the enterprise’s overall risk management and governance efforts.

When management creates a new baseline, collaboration, tool usage, surveys, and interviews are essential to ensure the effort is comprehensive. The organization should establish a standardized list of data fields to increase the quality and consistency of data collected. These fields include:

Name of the AI solution
Version number
License (if needed)
Cost
Deployment (e.g., web interface, installed app, APIs)
Purpose of the AI solution
Frequency of use
Stakeholders
Accountable organization owner
Details related to third-party use/vendor (if applicable)
Regulatory classification of AI system
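
As an illustration of how the standardized fields above might be captured consistently, the following is a minimal sketch of an inventory record. The class name `AIInventoryRecord`, the attribute names, and the choice of which fields are treated as required are assumptions made for this example, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AIInventoryRecord:
    """One entry in a hypothetical AI inventory; fields mirror the list above."""
    name: str
    version: str
    deployment: str                      # e.g., "web interface", "installed app", "API"
    purpose: str
    frequency_of_use: str                # e.g., "daily", "weekly"
    accountable_owner: str
    stakeholders: List[str] = field(default_factory=list)
    license: Optional[str] = None        # None when no license is needed
    cost: Optional[float] = None         # currency and period are left to the organization
    vendor_details: Optional[str] = None
    regulatory_classification: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Return the names of required text fields that were left blank."""
        required = ["name", "version", "deployment", "purpose",
                    "frequency_of_use", "accountable_owner"]
        return [f for f in required if not getattr(self, f).strip()]


# Example entry with the accountable owner left blank.
record = AIInventoryRecord(
    name="Example chatbot", version="1.2", deployment="web interface",
    purpose="Customer support triage", frequency_of_use="daily",
    accountable_owner="", stakeholders=["Support", "IT"],
)
print(record.missing_fields())   # ['accountable_owner']
```

A completeness check of this kind can help flag entries that need follow-up before the inventory is treated as a baseline.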

When conducting surveys and interviews, some level of anonymity encourages respondents to answer truthfully instead of how they think the enterprise “wants” them to respond, especially if it is clear that they will not be disciplined for answering honestly. Note that completely anonymous surveys may decrease accountability because respondents are not motivated to complete the survey. Quantifiable questions that capture structured data help to ensure the effectiveness of a survey.

Keep in mind that open-ended questions can add complexity that makes it hard to organize the information gathered from responses. For example, when asked, “What types of AI do you use?”, an employee might answer, “ChatGPT.” The interviewer should ask follow-up questions to confirm that the interviewee means OpenAI’s ChatGPT specifically and is not using the name generically for other tools such as Claude, Google Gemini, or Microsoft Copilot. The interviewer should also be able to group similar responses, such as aggregating a response of “ChatGPT-5” with other ChatGPT answers.
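
Grouping similar free-text answers can be partly automated. The sketch below assumes a small, hand-maintained alias table (`ALIASES`, hypothetical) that maps patterns found in responses to a canonical tool name; anything unmatched is routed to follow-up rather than guessed.

```python
import re
from collections import Counter

# Hypothetical alias table: pattern found in a free-text answer -> canonical tool name.
ALIASES = [
    (re.compile(r"chat\s*gpt", re.IGNORECASE), "ChatGPT"),
    (re.compile(r"claude", re.IGNORECASE), "Claude"),
    (re.compile(r"gemini", re.IGNORECASE), "Google Gemini"),
    (re.compile(r"copilot", re.IGNORECASE), "Microsoft Copilot"),
]


def canonical_tool(answer: str) -> str:
    """Map one survey answer to a canonical tool name, or mark it for follow-up."""
    for pattern, name in ALIASES:
        if pattern.search(answer):
            return name
    return "Unrecognized (follow up)"


responses = ["ChatGPT", "ChatGPT-5", "chat gpt", "Copilot", "Claude", "our internal bot"]
counts = Counter(canonical_tool(r) for r in responses)
print(counts["ChatGPT"])   # 3
```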

While minimally useful in establishing a baseline inventory, interviews can be helpful in areas such as development, where teams may be experimenting with AI tools for educational or professional development purposes rather than product development or commercial use.

2.15.2 AI Model Inventory

Maintaining a comprehensive AI model inventory is a foundational element of effective AI governance and risk management. An AI model inventory serves as an organized repository that captures detailed information about all AI models and systems deployed within an organization. This inventory enables stakeholders to gain a holistic view of AI assets, facilitating oversight, accountability, and timely response to operational or security incidents.

A well-structured model inventory should include critical attributes that uniquely identify and describe each AI model. Key attributes of a model inventory are shown in figure 2.11.

Establishing procedures that define the scope of the inventory, the parties responsible for its maintenance, and the attributes to be collected is critical. Organizations should aim to inventory all AI models or, at minimum, those classified as high risk or deployed in high-stakes settings. Regular updates to the inventory, at least annually, are necessary to reflect changes such as new deployments, model retirements, or version upgrades.

A diagram depicts 5 attributes of an AI model inventory. — Figure 2.11—Attributes of an AI Model Inventory

2.15.3 Documentation

Regardless of the process undertaken or where the organization is in the documentation cycle, multiple artifacts should be available from which the risk practitioner can obtain both a working knowledge of the organization’s mission and goals for its AI program (e.g., AI policy) and the management and sustainable processes for maintaining the AI solution. Artifacts should be integrated with standard asset management platforms if used (e.g., ServiceNow or Jira).

Maintenance of the AI inventory shares several characteristics with existing IT asset management practices. Decentralized ownership and unsanctioned deployments, also known as shadow IT or shadow AI, are common challenges for IT teams with all digital assets, not just AI. While AI solutions may evolve and propagate rapidly within organizations, effective asset management should ensure that all deployed systems are inventoried and governed, regardless of their underlying technology. Specific AI-related considerations (such as model drift, explainability, and unique security risk) augment, but do not replace, traditional asset governance principles. Model lineage documentation should capture how training data, features, and code evolved.

2.16 Data Collection for AI

For many organizations, the primary challenge is not data scarcity but rather ensuring data quality and usability. Although digitizing paper-based or manual records remains a concern during digital transformation, the key issue is that most organizations struggle to make their vast digital data assets useful for AI applications. Organizational data must be accurate, well-organized, standardized, and accessible to train effective AI models. Overcoming data silos, inconsistencies, and poor governance is essential for transforming raw data into a strategic asset that drives intelligent decision making and successful AI initiatives.

An organization needs a mechanism to collect and connect data sources in a central place where the data can be processed and later used for training AI models. Companies have long aggregated data into data warehouses, data lakes, or other solutions to centrally collect, analyze, and report on it. This helps with data accessibility, as AI models require large volumes of data with sufficient variety and veracity, depending on the AI model algorithm that is being used or developed. In general, the size, volume, and variety of data correlate positively with the accuracy of AI models.

The advent of big data introduced some key considerations, commonly called the five Vs, applicable to AI data collection and management. Figure 2.12 describes the five Vs of big data.

Data collection-related risk that must be considered as a part of AI operations includes consent, purpose fit, and data lag.

A diagram depicts the five V’s of big data. — Figure 2.12—The Five Vs of Big Data

2.16.1 Consent

Internally developed corporate data can frequently be used to train AI models, with some exceptions (e.g., sensitive data, personal information). Data entrusted to the organization by the consumer or customer (e.g., data subject), referred to as customer data, often requires additional consent, depending on the original terms and purposes at the time of data collection. Data protection regulations, such as the GDPR, may require organizations to obtain the consent of data subjects for the processing of their personal data, as consent is one of the lawful bases for processing.¹²⁶ Moreover, the EU AI Act extends the requirements of consent to the testing of high-risk AI systems in real-world conditions.¹²⁷ This requires organizations to obtain informed consent from the subjects of such tests prior to their participation.

Organizations should ensure that proper consent has been obtained before using customer data for AI training. It is important to involve privacy and legal stakeholders in AI governance efforts to evaluate the organization’s legal basis for obtaining consent from the subjects for the use of their data in AI model training and testing. Depending on the organization’s consent management practices, this may involve subject opt-in or opt-out for AI model training. Should a customer of the organization deny or revoke consent for AI model training, it is important to have processes in place to exclude the pertinent information from the training dataset. Failure to do so could result in regulatory or compliance penalties in some jurisdictions, as well as loss of customer trust.
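
The exclusion process described above can be sketched as a filter applied before each training run. This example assumes an opt-in model in which a subject with no recorded decision is excluded; the record layout and the names `filter_by_consent` and `consent_by_subject` are hypothetical.

```python
from typing import Dict, List


def filter_by_consent(records: List[dict],
                      consent_by_subject: Dict[str, bool]) -> List[dict]:
    """Keep only records whose data subject currently consents to AI training.

    Assumption: opt-in. Subjects with no recorded decision are excluded.
    """
    return [r for r in records
            if consent_by_subject.get(r["subject_id"], False)]


records = [
    {"subject_id": "S1", "text": "support ticket A"},
    {"subject_id": "S2", "text": "support ticket B"},
    {"subject_id": "S3", "text": "support ticket C"},
]
consent = {"S1": True, "S2": False}   # S3 has no recorded decision

training_set = filter_by_consent(records, consent)
print([r["subject_id"] for r in training_set])   # ['S1']
```

Under an opt-out model, the default passed to `get` would be `True` instead. A filter like this affects only future training runs; it does not remove a subject's data from a model that has already been trained.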

2.16.2 Fit for Purpose

Use of an AI solution should serve the enterprise’s goals and objectives. The solution should be “fit for purpose,” meaning it should be designed to accomplish a specific task or goal.

Indications that an AI solution is not fit for purpose include:

The required data exists somewhere in the organization, but it may not be readily available or easily accessible for the model training or inference engine.
The data is not at the right level of granularity, detail, volume, or veracity necessary to develop high-performing AI models. Depending on the AI, the types, quantity, and variety of data can affect the solution’s outcomes.
The use case is restricted or prohibited by AI regulations on high-risk use.

2.16.3 Data Lag

AI models are trained on historical data that extends, at most, up to the day when AI model training starts. The AI model training process can be intensive, depending on the volume of data and the complexity of the models. Training time depends on the number of parameters and the hardware employed, and it may take weeks to months. This could result in data lag or model drift, in which the data the model was trained on no longer reflects the new data the model encounters in use. This could make the model less relevant and accurate for making real-time decisions and cause model performance to deteriorate over time. Organizations should also monitor whether the duration of data usage exceeds consented retention periods.

Organizations can mitigate the effects of data lag by:

Collecting updated datasets and retraining a new version of the model, which is most practical when the model is small and its feature weights (numerical values that determine the importance or influence of a particular feature in a model’s prediction) can be updated
Applying retrieval augmented generation (RAG) techniques to ground large language model (LLM) prompts with specific and up-to-date data. The base model would stay the same, but the input into the model would be first grounded with relevant and specific data.
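
To make the grounding step concrete, the following is a minimal sketch of the retrieval and prompt-assembly parts of RAG. It is an illustration under simplifying assumptions: retrieval here uses plain word overlap as a stand-in for the embedding-based similarity search that production systems typically use, no model is called, and the function names and sample documents are hypothetical.

```python
import re
from typing import List


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return up to k documents sharing the most words with the query."""
    q = tokenize(query)
    scored = [(len(q & tokenize(d)), i, d) for i, d in enumerate(documents)]
    scored = [s for s in scored if s[0] > 0]          # drop documents with no overlap
    scored.sort(key=lambda s: (-s[0], s[1]))          # best score first, stable on ties
    return [d for _, _, d in scored[:k]]


def build_grounded_prompt(query: str, documents: List[str], k: int = 2) -> str:
    """Prepend retrieved, up-to-date context to the user's question."""
    context = retrieve(query, documents, k)
    context_block = "\n".join(f"- {c}" for c in context) or "- (no relevant context found)"
    return ("Answer using only the context below.\n"
            f"Context:\n{context_block}\n"
            f"Question: {query}")


docs = [
    "The travel policy was updated in March: economy class is required for flights under six hours.",
    "The cafeteria opens at 8 a.m. on weekdays.",
    "Expense reports for travel must be filed within 30 days.",
]
prompt = build_grounded_prompt("What class of travel is required for short flights?", docs)
print(prompt)
```

The base model is unchanged in this approach; only the prompt is enriched, so keeping the document store current is what addresses the lag.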

2.17 Data Classification

Risk related to data classification is exacerbated in the context of AI. Enterprises struggle to organize and classify their data (for safe and efficient handling at different levels of sensitivity) and to integrate with zero trust access policies. Over the years, data classification technologies have been created to help enterprises better classify their data, including highly regulated information such as personal, health, and financial data.

More complex use cases (e.g., IP and sensitive internal data) create more complex issues. For example, the source code an organization develops for its products and internal systems could be considered confidential and, in some cases, constitute IP. However, software developers often leverage open-source code, incorporating it into the overall product or system. It is hard to separate open source from proprietary code. The incorrect classification and handling of source code creates a risk for the organization. Acceptable use should discourage use of open-source code in highly regulated or high-risk (as defined in the EU AI Act) systems.

When sourcing and collecting data for model training, an organization needs to consider the end users of the models. If a model intended for public use is trained on confidential customer data, this mismatch could create risk. Attention should be paid to ensure alignment between the sensitivity of the data used for model training and the users of the model. This helps prevent the disclosure of sensitive data embedded in the model.
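
One way to check this alignment is to compare each training dataset's classification against the level the model's intended audience is permitted to see. The sketch below assumes a four-level classification scheme and hypothetical dataset names; an organization would substitute its own labels.

```python
from enum import IntEnum
from typing import Dict, List


class Classification(IntEnum):
    """Hypothetical classification levels, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def datasets_exceeding_audience(training_datasets: Dict[str, Classification],
                                audience_level: Classification) -> List[str]:
    """Return datasets classified above what the model's users may access."""
    return sorted(name for name, level in training_datasets.items()
                  if level > audience_level)


training_datasets = {
    "product_faq": Classification.PUBLIC,
    "support_tickets": Classification.CONFIDENTIAL,
    "employee_handbook": Classification.INTERNAL,
}

# A model intended for public use should only be trained on PUBLIC data.
flagged = datasets_exceeding_audience(training_datasets, Classification.PUBLIC)
print(flagged)   # ['employee_handbook', 'support_tickets']
```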

2.18 Data Confidentiality

Data confidentiality should be maintained throughout the AI development life cycle and platform. Figure 2.13 describes some data confidentiality considerations.

Figure 2.13—Data Confidentiality Considerations

Data Location	Confidentiality Control
Data source	Sources from which data is collected for model training may be restricted, and the data may be obfuscated or encrypted. This is to ensure the confidentiality of data on a need-to-know basis. As this data is extracted from the originating source, metadata (e.g., data owner, classification) may not be extracted and controls (e.g., access, masking) may not be carried forward to the next stage of the artificial intelligence (AI) development life cycle.
Data lake	Data is collected in a data lake so that structured and unstructured data can be aggregated for easier modeling. Organizations should use caution when data from different levels of classification and sensitivity are commingled. Data that was once restricted at the data source may be more widely available to users if access controls are not preserved.
Data exploration and training platform	This platform includes data extracted from data lakes and brought into Jupyter Notebooks or SageMaker platforms where it can be explored and used for training. Data of different classifications may be stored in notebooks and notebook files.
Vector database	These are new forms of databases in which documents, images, and audio are processed into vectors. Files are stored as mathematical representations with associated metadata. While the original text or data is not stored in the vector databases, new types of access control models need to be considered. For example, access may be controlled via an attribute and index or a query.
AI system production	Once an AI model has been trained and developed and is ready for production, software will be needed to implement the model for use. Often this requires taking live production data and running it through a series of data prep pipelines to get the data transformed and ready to be input into the model and obtain the resulting inference results. Again, depending on the data sources needed for inference, the data classification and handling protocols need to be implemented in the production AI system as well.

Source: ISACA, ISACA AAIA Official Review Manual, USA 2025

2.19 Data Quality

Unlike earlier advanced systems whose rules were explicitly defined by developers, AI systems leverage DL techniques that rely heavily on data. Therefore, data quality is paramount and is often closely tied to the model’s performance. The results from an AI system are only as good as the data the models were trained on, so auditors should test how data quality affects model accuracy metrics.

Raw data can contain multiple errors, omissions, and inaccuracies. A common practice in AI operations is to first explore and profile the data to evaluate the quality of the dataset and then perform data cleansing activities as needed. This includes correcting errors and imputing missing data. There are six dimensions for assessing data quality, as shown in figure 2.14.

Figure 2.14—Dimensions of Data Quality

Dimension	Definition	Example
Accuracy	Data is free from errors and is representative of real-world situations.	A customer’s address is collected from various data sources. Typos and translation or transposition errors create inaccuracies in customer databases or customer relationship management software.
Completeness	Data contains all the necessary fields and records.	Mandatory fields are populated, full, and complete (not truncated) for all records.
Consistency	Data is uniform and standard across the datasets; this includes formats, lengths, metadata, etc.	A date format can be represented as May 1, 2024, 2024/05/01, 05/01/2024, or 01/05/2024.
Timeliness	Data is up to date and available when needed.	Real-time global positioning system (GPS) navigation requires up-to-date GPS coordinates to calculate turn-by-turn instructions and arrival time.
Validity	Data adheres to defined business and technical logic. The organization can define the business and technical logic the data is expected to conform to in order to be accepted and considered valid.	Customer information is validated against a database of active user accounts.
Uniqueness	Data has no duplicative or redundant records in the dataset.	Each customer is assigned a unique identifier.

Source: ISACA, ISACA AAIA Official Review Manual, USA 2025
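
Several of these dimensions can be expressed as simple ratios when profiling a dataset. The following sketch scores completeness, uniqueness, and validity for a small set of customer records; the field names, the sample data, and the rule that validity means "the identifier exists in a set of active accounts" are assumptions chosen to mirror the examples in figure 2.14.

```python
from typing import Iterable, List, Set


def is_filled(value) -> bool:
    """A field counts as populated if it is not None and not blank."""
    return value is not None and str(value).strip() != ""


def completeness(records: List[dict], mandatory_fields: Iterable[str]) -> float:
    """Share of records in which every mandatory field is populated."""
    fields = list(mandatory_fields)
    complete = sum(all(is_filled(r.get(f)) for f in fields) for r in records)
    return complete / len(records)


def uniqueness(records: List[dict], key: str) -> float:
    """Share of distinct key values relative to the number of records."""
    values = [r.get(key) for r in records]
    return len(set(values)) / len(values)


def validity(records: List[dict], key: str, active_ids: Set[str]) -> float:
    """Share of records whose key is found in the reference set."""
    return sum(r.get(key) in active_ids for r in records) / len(records)


customers = [
    {"customer_id": "C1", "email": "a@example.com"},
    {"customer_id": "C2", "email": ""},                 # incomplete
    {"customer_id": "C2", "email": "b@example.com"},    # duplicate identifier
    {"customer_id": "C9", "email": "c@example.com"},    # not an active account
    {"customer_id": "C3", "email": None},               # incomplete
]
active_accounts = {"C1", "C2", "C3"}

scores = {
    "completeness": completeness(customers, ["customer_id", "email"]),
    "uniqueness": uniqueness(customers, "customer_id"),
    "validity": validity(customers, "customer_id", active_accounts),
}
print(scores)
```

Thresholds for acceptable scores are a business decision and are not implied by this sketch.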

2.20 Data Balancing

Modern systems and environments generate a tremendous amount of data. Think of the volume of text, images, videos, and audio files generated by daily communications or social media interactions or by the machines and systems used to support daily activities. However, not all that data is captured, collected, and made universally available in the systems used to train AI models. Uneven distributions may naturally occur. If an organization is training an AI model but only has access to—and trains it on—a skewed or imbalanced dataset, it could result in a higher rate of inaccurate outcomes and amplified systemic bias (e.g., underrepresentation of minority voices).

Data imbalance is a common challenge when developing AI models. It occurs when the training dataset includes insufficient samples of a minority “class” of data, which can result in AI models producing biased or inaccurate results. One root cause is the lack of sufficient and diverse data for training AI models to produce accurate, useful results. Organizations may also overcompensate by boosting the data of minority classes beyond the real-world distribution.¹²⁸

Organizations can mitigate bias by addressing data balance early in the development process of the model. Profiling the data to understand and evaluate its distribution during collection and preprocessing is an actionable mitigation step. Oversampling, undersampling, and applying cost-sensitive algorithms in model training are techniques that can improve model performance on imbalanced data.
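
As one illustration of these techniques, the sketch below performs simple random oversampling: it profiles the class distribution and then duplicates randomly chosen minority-class rows until every class matches the largest one. The function name, the `label` key, and the sample data are hypothetical, and the fixed seed is only there to make the example repeatable.

```python
import random
from collections import Counter
from typing import List


def random_oversample(rows: List[dict], label_key: str, seed: int = 0) -> List[dict]:
    """Duplicate randomly chosen minority-class rows until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Top up smaller classes with randomly repeated rows.
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced


rows = ([{"label": "legitimate", "amount": a} for a in range(8)]
        + [{"label": "fraud", "amount": a} for a in (900, 950)])

before = Counter(r["label"] for r in rows)        # profile the distribution first
balanced = random_oversample(rows, "label")
after = Counter(r["label"] for r in balanced)
print(before, after)
```

Oversampling of this kind is normally applied only to the training split, because repeated rows in a test set would distort evaluation. Duplicated rows also add no new information, so the technique does not substitute for collecting more diverse data.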

2.21 Data Scarcity

High-quality data that is relevant and fit for purpose, for which the organization has obtained consent or has license for use, is often hard to acquire. Organizations often find they have an abundance of data that they have collected and can access, but a scarcity of data to sufficiently develop and deploy the AI models they have prioritized while managing AI risk, compliance, and regulatory requirements. This can be the result of:

Data quality issues
The availability of “minority” or diverse classes of data
The availability of consented or licensed data
Data “trapped” in other source systems that are not available or accessible for model training and inferences
The availability of labeled data

Mitigation strategies for data scarcity include:

Augmentation—Missing data can be supplemented by collecting or procuring targeted data that is missing. For example, a customer database can be augmented with missing address information and parent/child ownership from reputable data brokers. Scarcity related to insufficient quantity and variations can be augmented with the generation of synthetic data that mimics the data distribution needed. Missing data can be imputed through carefully selected algorithms.
Model selection—The size and variety of the available data guide the organization’s choice of AI model. Choosing a model that works well with the dataset to avoid overfitting is a suitable mitigation strategy. Federated data partnerships allow sharing insights without raw data transfer.

2.22 Data Security

While data availability plays a critical role in the development of AI solutions, data security plays a more prominent role in protecting those solutions and the data behind them. Organizations are required to comply with privacy regulations such as the GDPR, and localized data residency laws have limited the use of personal data. Organizations must also safeguard financial and health data, IP, and confidential data from competitors and cybercriminals.

Implementing AI solutions requires additional data security considerations. Given the role that data plays in AI, organizations should evaluate data security risk and controls throughout the AI life cycle.

2.22.1 Data Encoding

Many DL models require raw data to be tokenized before it can be used in training. The tokenized data is typically stored as binary files, such as TFRecords for storing datasets using Google’s TensorFlow framework. Similar to securing other files stored on a system, securing tokenized data, even in binary, requires access and data encryption controls.

During the training process, the tokens and their embeddings are stored in memory or on GPUs. Some ML frameworks may store intermediate states of the model as checkpoint files (e.g., .ckpt or .pth).

Data also can be encoded into vectors, which are mathematical representations of the text, image, or audio file. GenAI and semantic search have caused vector databases to grow in popularity. These vector databases store the vector embeddings (an array of numbers) of unstructured data, making the comparison and searching of data much easier than in conventional relational databases.¹²⁹ Vector databases should be secured with access controls and encryption. However, other techniques specific to vector databases also should be employed, such as designing and managing the vector indexes to control who or what can access them.
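
The idea of controlling access at the index level can be illustrated with a toy similarity search that filters entries by the caller's group membership before ranking them. This is a sketch of the concept only: the in-memory list stands in for a vector database, the three-dimensional vectors are made up, and the `allowed_groups` attribute is a hypothetical metadata field rather than the feature of any particular product.

```python
import math
from typing import List, Set


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(index: List[dict], query_vec: List[float],
           user_groups: Set[str], k: int = 2) -> List[str]:
    """Rank only the entries the caller is allowed to see, then return the top k IDs."""
    allowed = [e for e in index if e["allowed_groups"] & user_groups]
    ranked = sorted(allowed, key=lambda e: cosine(e["vector"], query_vec), reverse=True)
    return [e["doc_id"] for e in ranked[:k]]


index = [
    {"doc_id": "hr-001", "vector": [0.9, 0.1, 0.0], "allowed_groups": {"hr"}},
    {"doc_id": "pub-001", "vector": [0.8, 0.2, 0.1], "allowed_groups": {"hr", "all-staff"}},
    {"doc_id": "pub-002", "vector": [0.0, 0.1, 0.9], "allowed_groups": {"all-staff"}},
]
query = [1.0, 0.0, 0.0]

print(search(index, query, {"all-staff"}))   # ['pub-001', 'pub-002']
print(search(index, query, {"hr"}))          # ['hr-001', 'pub-001']
```

Filtering before ranking means a restricted entry never appears in results for an unauthorized caller, even when it is the closest match.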

2.22.2 Data Access

Data is collected from various data sources, often with their own access and encryption controls in place. This aggregation of data brings additional risk, such as:

The concentration of data into a centralized location (e.g., a data lake) makes it a prime target for attackers, as it is easier for them to exfiltrate valuable data when it has already been aggregated.
When data is transferred from the originating source to the target source (e.g., the data lake), the access controls for that data might not have been properly translated into the target source.
AI model development involves exploration of the dataset by data scientists to evaluate the feasibility of their AI use case against the available data. This may broaden access beyond the principle of least privilege under which the data was originally collected and processed.
Data is replicated from various source systems into a centralized repository for data exploration and training. This creates additional systems in which the data must be secured, thereby increasing the in-scope systems of any compliance program.
Shared cloud AI services may expose data to cross-tenant risk.

The organization should review the AI system’s data flow to identify systems, data stores, and users of the data. Additional access controls may need to be applied on systems that are part of the data flow to ensure the congruency of the access controls from originating systems to target systems. Networking access policies should be reviewed in parallel to ensure containment of the data only to known and authorized systems.

2.22.3 Data Confidentiality/Secrecy

Data confidentiality may be diluted or impaired if an organization is not careful in the design and implementation of data confidentiality or secrecy controls. Some systems that process and handle sensitive personal, health, and financial data may have obfuscation and encryption controls in place, but the training and use of AI models often requires that data be machine-readable for many of the models to properly work. Many popular tokenization methods, such as byte-pair encoding (BPE) and WordPiece, operate on cleartext data and convert the text into sequences of numeric token IDs. Because of this constraint, some sensitive data might be decrypted during the AI development life cycle.

New encryption techniques could allow some AI models to train on data without the need for decryption.¹³⁰ Homomorphic encryption can allow for AI model training while preserving data privacy. However, its current use is limited due to the high computational costs of processing encrypted data. For now, the application of additional compensating controls, like limiting access to unencrypted production data and monitoring the use of sensitive data, is key to ensuring data confidentiality. Disk-level encryption should also be applied to provide defense in depth.

2.22.4 Data Backup

Many datasets used for training AI models originate from other sources. Therefore, backing up this data separately may be redundant if the source systems are already backed up. However, there are many new artifacts from the AI development process that should be backed up, including:

Postprocessed data being prepared for model training
Tokenized training data, often stored as binaries
Model weights stored in specialized binary formats
Model architecture parameters and definitions (e.g., number of layers, embedding size) stored in JSON or YAML files

Unlike conventional software development, where the source code provides the specific instructions for the behavior of the software, GenAI models can be nondeterministic. This makes it harder to explain how a model arrives at an outcome. To provide better explainability of AI models, organizations should preserve copies of the training and testing datasets, as well as the model performance and bias testing results.

2.22.5 Data Integrity

Although GenAI models are nondeterministic, the need to preserve the integrity of the datasets being used for AI model training, as well as the model itself (e.g., weights, architecture), is still critical. Scenarios where data integrity could be compromised include:

Data poisoning—An attacker modifies the training data to influence the model’s behavior.
Model tampering—The model weights, architecture, and parameters are modified to provide incorrect, biased, or malicious results.
Embedding tampering—The embedding matrices are tampered with to skew the results of the model.
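
A common, general-purpose way to detect modification of stored artifacts such as datasets, weights, or embedding files is to record a cryptographic digest for each file and compare it later. The sketch below uses SHA-256 from the Python standard library; the file name and the helper names are hypothetical, and the approach is only meaningful if the manifest itself is stored where an attacker cannot alter it.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths: Iterable[Path]) -> Dict[str, str]:
    """Record a digest for each artifact at a known-good point in time."""
    return {str(p): sha256_of(p) for p in paths}


def changed_files(manifest: Dict[str, str]) -> List[str]:
    """Return artifacts that are missing or whose digest no longer matches."""
    return sorted(p for p, expected in manifest.items()
                  if not Path(p).exists() or sha256_of(Path(p)) != expected)


# Demonstration with a temporary stand-in for a weights file.
with tempfile.TemporaryDirectory() as tmp:
    weights = Path(tmp) / "model_weights.bin"
    weights.write_bytes(b"\x00\x01\x02")
    manifest = build_manifest([weights])

    unchanged = changed_files(manifest)      # nothing modified yet
    weights.write_bytes(b"\x00\x01\xff")     # simulate tampering
    tampered = changed_files(manifest)

print(unchanged, len(tampered))
```

A digest comparison shows that a file changed, not who changed it or why, so it complements rather than replaces access controls and audit trails.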

Data integrity issues also could arise through the preprocessing of datasets. Organizations may need to frequently perform extract, transform, and load (ETL) operations on the data as part of their preprocessing activities for AI development. Errors can be introduced during various stages of the ETL processing. Defining, documenting, testing, and properly implementing the ETL requirements and logic are examples of data integrity controls.

To protect data sources from these issues, organizations should implement robust controls that ensure the integrity and authenticity of data throughout its life cycle. These include:

Access controls—Enforce strict access management across all systems involved in the AI data flow, applying the principle of least privilege to limit data access only to authorized personnel. This reduces the risk of unauthorized data manipulation or insertion of poisoned data.
Data validation and verification—Employ rigorous data validation techniques at ingestion points to detect anomalies or inconsistencies that may indicate poisoning attempts. This includes validating data formats, ranges, and consistency with expected patterns.
Supplier and source management—Establish clear data processing agreements and standard operating procedures (SOPs) with data suppliers and brokers to ensure the provenance and quality of acquired data. Regularly audit and verify supplier data to detect any unauthorized alterations.
Secure data transmission—Protect data in transit using encryption and secure communication protocols to prevent interception and tampering during transfer between systems.
Preprocessing controls—Monitor and control data preprocessing and cleansing routines to prevent malicious modifications. Implement change management and code review processes for scripts and tools used in data preparation.
Segregation of data environments—Separate training data from production data environments to minimize the risk of cross-contamination and facilitate monitoring.
Continuous monitoring and detection—Tools and techniques, such as anomaly detection, data lineage tracking, audit trails, logs, and model output monitoring, are essential to identify and respond to data integrity attacks and issues promptly.
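
The data validation and verification control above can be sketched as a gate at the ingestion point that checks format, range, and consistency with expected values, and quarantines anything that fails. The field names, the allowed country list, and the amount range are hypothetical placeholders for an organization's own rules.

```python
from datetime import datetime
from typing import List, Tuple

ALLOWED_COUNTRIES = {"US", "CA", "GB"}   # hypothetical reference list
AMOUNT_RANGE = (0, 10_000)               # hypothetical plausible range


def validate_record(rec: dict) -> List[str]:
    """Return a list of problems found in one incoming record (empty if none)."""
    problems = []

    # Format check: the date must parse as a year-month-day calendar date.
    try:
        datetime.strptime(str(rec.get("event_date")), "%Y-%m-%d")
    except ValueError:
        problems.append("event_date is not a valid YYYY-MM-DD date")

    # Range check: the amount must be numeric and within the expected bounds.
    amount = rec.get("amount")
    is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
    if not is_number or not (AMOUNT_RANGE[0] <= amount <= AMOUNT_RANGE[1]):
        problems.append("amount is missing or outside the expected range")

    # Consistency check: the country must appear in the reference list.
    if rec.get("country") not in ALLOWED_COUNTRIES:
        problems.append("country is not in the reference list")

    return problems


def split_batch(batch: List[dict]) -> Tuple[List[dict], List[tuple]]:
    """Separate a batch into accepted records and quarantined (record, problems) pairs."""
    accepted, quarantined = [], []
    for rec in batch:
        problems = validate_record(rec)
        if problems:
            quarantined.append((rec, problems))
        else:
            accepted.append(rec)
    return accepted, quarantined


batch = [
    {"event_date": "2024-05-01", "amount": 120.5, "country": "US"},
    {"event_date": "01/05/2024", "amount": 80, "country": "CA"},
    {"event_date": "2024-05-02", "amount": -5, "country": "ZZ"},
]
accepted, quarantined = split_batch(batch)
print(len(accepted), len(quarantined))   # 1 2
```

Quarantining rather than silently dropping failed records preserves evidence that may be needed to investigate a suspected poisoning attempt.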

2.23 Data Preparation and Normalization

Data preparation and normalization are critical processes in ensuring that data used for AI training and inference is accurate, consistent, and fit for purpose. These processes involve several key activities, including data cleansing, standardization, normalization, and profiling, all of which contribute to the overall quality and reliability of AI models.

2.23.1 Data Cleansing

Once raw data is collected, the next step is data cleaning, also referred to as data cleansing or data scrubbing. This process involves the identification and correction of errors, inconsistencies, and inaccuracies within raw datasets. Data cleaning is essential for ensuring the accuracy and reliability of data-driven analyses and decision making. Considering that raw data often comes with noise and lacks structure, data cleaning is a crucial step to help guarantee the integrity and dependability of the data used for the ML model.¹³¹

Data cleaning involves a series of tasks, including:

Handling missing values—Raw data may contain missing values or incomplete records. Data cleaning includes strategies for dealing with missing data, such as imputing missing values or removing records with missing data altogether.¹³²
Removing duplicates—Duplicate records can introduce biases and distort an analysis. Data cleaning involves identifying and removing duplicate entries from the dataset.¹³³
Correcting inconsistencies—Raw data may contain inconsistencies, such as variations in capitalization, spelling, or formatting. Data cleaning involves standardizing and correcting these inconsistencies to ensure uniformity across the dataset.¹³⁴
Fixing data entry errors—Data entry errors, such as typos or incorrect values, can introduce inaccuracies into the data. Data cleaning includes detecting and correcting these errors.¹³⁵ Auditors should also verify that inconsistent formats (e.g., mixed date formats) are handled.
Outlier detection and handling—Outliers are data points that deviate significantly from the norm. Data cleaning involves identifying outliers and deciding how to handle them (e.g., whether to remove, transform, or retain them in the dataset).¹³⁶ Outlier handling is especially critical in fraud detection models where anomalies may be meaningful signals.
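The tasks above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the records and field names are hypothetical, median imputation is only one of several possible strategies, and the outlier rule (a modified z-score above 3.5, based on the median and the median absolute deviation) is a common rule of thumb chosen here as an assumption. Outliers are flagged rather than deleted, since in use cases such as fraud detection they may be meaningful signals.

```python
from statistics import median

# Hypothetical raw records; field names and values are illustrative only.
raw = [
    {"id": 1, "country": " us ", "amount": 120.0},
    {"id": 2, "country": "US", "amount": None},      # missing value
    {"id": 2, "country": "US", "amount": None},      # duplicate record
    {"id": 3, "country": "Us", "amount": 95.0},
    {"id": 4, "country": "DE", "amount": 110.0},
    {"id": 5, "country": "de", "amount": 9000.0},    # possible outlier
]


def clean(records):
    # 1. Correct inconsistencies: trim whitespace and normalize capitalization.
    normalized = [{**r, "country": r["country"].strip().upper()} for r in records]

    # 2. Remove duplicates, keeping the first occurrence of each id.
    seen, deduped = set(), []
    for r in normalized:
        if r["id"] not in seen:
            seen.add(r["id"])
            deduped.append(r)

    # 3. Handle missing values: impute the median of the observed amounts.
    observed = [r["amount"] for r in deduped if r["amount"] is not None]
    fill = median(observed)
    imputed = [
        {**r, "amount": fill if r["amount"] is None else r["amount"]}
        for r in deduped
    ]

    # 4. Flag outliers using the modified z-score (median and MAD based).
    amounts = [r["amount"] for r in imputed]
    center = median(amounts)
    mad = median([abs(a - center) for a in amounts])
    result = []
    for r in imputed:
        score = 0.0 if mad == 0 else 0.6745 * abs(r["amount"] - center) / mad
        result.append({**r, "outlier": score > 3.5})
    return result


cleaned = clean(raw)
for row in cleaned:
    print(row)
```

In this sample, five records remain after deduplication, the missing amount is imputed as 115.0, and only the 9000.0 record is flagged for review.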

AI solutions are only as good as the data they are trained on or reference. When multiple data sources are correlated to create a larger pool of reference data, there is an even greater need to ensure that every source is accurate and that the sources are consistent with one another. The organization’s data governance framework should leverage general best practices to ensure quality for data used by AI solutions.

The DAMA UK Working Group on Data Quality Dimensions identified six key dimensions for measuring data quality:¹³⁷

Completeness—The proportion of stored data against the potential of “100% complete”
Uniqueness—The requirement that nothing will be recorded more than once based upon how that thing is identified
Timeliness—The degree to which data represents reality at the required point in time
Validity—Data is valid if it conforms to the syntax (format, type, range) of its definition
Accuracy—The degree to which data correctly describes the object or event being described
Consistency—The absence of difference, when comparing two or more representations of a thing against a definition
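Several of these dimensions can be expressed as simple ratios. The sketch below computes completeness, uniqueness, and validity for a small, hypothetical record set; the function names, the sample data, and the deliberately simple email pattern are assumptions for illustration and are not part of the DAMA definitions.

```python
import re

# Hypothetical records; one missing email, one duplicated id, one malformed email.
records = [
    {"id": "A1", "email": "ana@example.com"},
    {"id": "A2", "email": None},
    {"id": "A2", "email": "bo@example.com"},
    {"id": "A3", "email": "not-an-email"},
]

# Deliberately simple syntax rule used only for this illustration.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Share of rows in which the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)


def uniqueness(rows, key):
    """Share of rows that carry a distinct key value."""
    return len({r[key] for r in rows}) / len(rows)


def validity(rows, field, pattern):
    """Share of populated values that conform to the expected syntax."""
    populated = [r[field] for r in rows if r[field] is not None]
    return sum(bool(pattern.match(v)) for v in populated) / len(populated)


print("completeness(email):", completeness(records, "email"))
print("uniqueness(id):", uniqueness(records, "id"))
print("validity(email):", round(validity(records, "email", EMAIL_PATTERN), 2))
```

Timeliness, accuracy, and consistency generally require a reference point outside the dataset (a required date, a trusted source, or a second representation), so they are not shown here.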

The initial dataset and the outputs from AI solutions must be retained per the classification of that data. Data retention is the discipline of ensuring that persistent data is stored in compliance with legal and business data archival requirements through policies, standards, processes, and procedures.

Data is retained primarily for business and regulatory purposes according to established schedules, archival rules, data formats, and the permissible means of storage, access, and security protocols (e.g., tokenization, encryption, and anonymization).

Data retention requirements are derived from various sources:

Internal requirements should be consistent with data use limitation policies.
External requirements derived from regulations and laws may vary, even for the same data.

Given the high degree of requirements overlap, data retention policies should provide practical guidance with the goal of satisfying all parties.

Retention policies tend to focus on sensitive data and provide guidance for retention schedules and archival rules. Retention policies and schedules tend to address classes of data, such as PII and personal health information (PHI). Therefore, ensuring compliance requires authoritative and up-to-date identification of sensitive data in all applications.
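A retention schedule keyed by data class can be checked mechanically once data has been classified. The following is a minimal sketch under stated assumptions: the `RETENTION_DAYS` values are purely illustrative and do not reflect any legal requirement, and the `is_past_retention` helper is hypothetical.

```python
from datetime import date, timedelta

# Illustrative retention periods in days per data class. Real values come from
# the organization's policy and applicable law. None means no fixed limit.
RETENTION_DAYS = {"PII": 365, "PHI": 2190, "public": None}


def is_past_retention(data_class: str, created: date, today: date) -> bool:
    """Return True if a record of this class has exceeded its retention period."""
    # An unknown class raises KeyError, so unclassified data is not silently kept.
    limit = RETENTION_DAYS[data_class]
    if limit is None:
        return False
    return today - created > timedelta(days=limit)


today = date(2025, 1, 1)
created = date(2023, 6, 1)
for data_class in ("PII", "PHI", "public"):
    print(data_class, is_past_retention(data_class, created, today))
```

Raising an error for an unclassified record reflects the point above that compliance depends on authoritative, up-to-date identification of sensitive data.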

2.23.2 Data Optimization

Data optimization ensures that data is accurate, reliable, and accessible for data analysis and AI decision making. Techniques include data cleaning, integration, enrichment, and transformation. Optimization should balance performance with cost efficiency (e.g., when selecting cloud storage options).

Some aspects of data optimization are shown in figure 2.15.¹³⁸

Figure 2.15—Aspects of Data Optimization

Data governance	Establishing policies and practices for data governance to maintain the quality, security, and compliance of data during optimization
Data storage	Managing and controlling data storage infrastructure to minimize storage space requirements and consumption
Data processing	Enhancing the speed and efficiency of data transformation, analytics, and computation
Data cleansing and quality improvement	Resolving inconsistencies, errors, and missing values to ensure data accuracy and reliability
Data integration	Combining data from multiple sources into a coherent and unified form to facilitate easier reporting and analysis
Data life cycle management	Maintaining the entire data life cycle to ensure the availability and proper disposal of data
Query and access	Enhancing data access and querying to boost database performance
Cost	Developing cost-effective data strategies to reduce data management and analytics costs while maintaining performance and reliability
Data security and compliance	Implementing access controls, encryption, and auditing to ensure compliance with relevant regulations and the security of data
Scalability	Using scalable technologies and architectures to handle increasing data volumes

Source: ISACA, ISACA AAIA Official Review Manual, USA 2025

Data Standardization

Standardization involves transforming data into a consistent format and structure to facilitate integration and analysis. This includes aligning data types, units of measurement, and categorical values across datasets. Standardization ensures that AI models receive data in expected formats, reducing errors during training and inference. It also supports interoperability when combining multiple data sources, which is common in AI applications.
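As a minimal sketch of standardization, the example below aligns units of measurement and categorical values from two hypothetical sources into one target format. The field names, the `STATUS_MAP` vocabulary, and the choice of kilograms as the target unit are assumptions for illustration.

```python
# Hypothetical records from two sources with different units and labels.
source_a = [{"weight": 2.0, "unit": "kg", "status": "Active"}]
source_b = [
    {"weight": 1500.0, "unit": "g", "status": "ACT"},
    {"weight": 3.0, "unit": "lb", "status": "inactive"},
]

# Conversion factors to the target unit (kilograms).
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}

# Mapping of source-specific labels to one agreed vocabulary.
STATUS_MAP = {"active": "active", "act": "active", "inactive": "inactive"}


def standardize(record):
    """Convert a record to the target unit, data type, and category labels."""
    unit = record["unit"].lower()
    status = record["status"].strip().lower()
    # Unknown units or labels raise KeyError instead of passing through silently.
    return {
        "weight_kg": round(float(record["weight"]) * TO_KG[unit], 3),
        "status": STATUS_MAP[status],
    }


standardized = [standardize(r) for r in source_a + source_b]
for row in standardized:
    print(row)
```

Failing on unmapped values, rather than guessing, keeps unexpected formats visible so they can be reviewed and added to the mapping deliberately.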

Data Normalization

Normalization is the process of scaling numerical data to a common range or distribution, often between zero and one, or to have a mean of zero and standard deviation of one. This step is particularly important for ML models sensitive to the scale of input features, as it prevents features with larger ranges from disproportionately influencing the model. Normalization techniques help improve model convergence during training and enhance predictive accuracy.
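The two scalings described above can be sketched as follows: min-max scaling maps values into the range zero to one, and z-score scaling yields a mean of zero and a standard deviation of one. The function names and sample features are hypothetical, and the population standard deviation is used here as an assumption.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Two hypothetical features on very different scales.
income = [20_000, 40_000, 60_000, 100_000]
age = [20, 30, 40, 60]

print(min_max_scale(income))  # [0.0, 0.25, 0.5, 1.0]
print(min_max_scale(age))     # [0.0, 0.25, 0.5, 1.0]
print([round(z, 3) for z in z_score_scale(age)])
```

After min-max scaling, both features occupy the same range, so neither dominates because of its original units. In practice, the scaling parameters (minimum and maximum, or mean and standard deviation) are typically derived from the training data and then reused unchanged at inference time.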

Data Profiling

Data profiling entails analyzing datasets to understand their structure, content, and quality characteristics. Profiling activities include assessing data completeness, uniqueness, validity, accuracy, consistency, and timeliness. This analysis helps identify data quality issues early and informs the design of cleansing and normalization strategies. Profiling also supports compliance with data governance policies by ensuring that data used in AI processes meets organizational and regulatory standards.
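A basic structural profile can be produced with very little code. The sketch below summarizes each column of a small, hypothetical dataset by value type, missing count, distinct count, and numeric range; the `profile` function and the sample rows are assumptions for illustration.

```python
from collections import Counter


def profile(rows):
    """Summarize each column: value types, missing count, distinct count, range."""
    columns = {key for row in rows for key in row}
    report = {}
    for col in sorted(columns):
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        report[col] = {
            "types": dict(Counter(type(v).__name__ for v in present)),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
        }
    return report


# Hypothetical rows: age is stored as text in one row, and dates use two formats.
rows = [
    {"age": 34, "signup": "2024-01-05"},
    {"age": "41", "signup": "05/01/2024"},
    {"age": None, "signup": "2024-02-11"},
]

report = profile(rows)
for column, summary in report.items():
    print(column, summary)
```

Here the mixed `int` and `str` types reported for `age` point to a standardization issue, and the missing count points to a cleansing decision. A type-level profile like this does not reveal the mixed date formats in `signup`, which would need a pattern check.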

2.24 Data Minimization and Privacy Considerations

Data privacy concerns in AI arise from the use of large datasets, often containing sensitive or personal information, to train and operate AI systems.

At the core of privacy concerns is not only the ability of ML models to “remember” or regurgitate training data but also the risk that these models can reveal or reconstruct personal or sensitive information by learning and exposing correlations across data attributes. Adversaries may exploit these correlations to infer, derive, or even fabricate sensitive attributes, potentially compromising individuals’ privacy, sometimes without direct access to original training records. Effective privacy defense strategies must therefore address both direct memorization and indirect inference risk. Novel defenses have been proposed to reduce membership inference risk (the risk that an adversary can determine whether a specific record was part of the training data), and some have been developed. On balance, however, the strongest mitigating factor may be that these attacks have proven more difficult to carry out in practice than theory suggests.

For organizations leveraging ML or AI capabilities to evaluate sensitive data, the most important consideration may be liability for successful inference. This is one area in which the California Privacy Rights Act (CPRA) and the GDPR diverge: the California statute expressly takes inference into account, but concern regarding inference was not prevalent when the EU regulation was drafted. Organizations processing sensitive data should balance the potential efficiencies associated with ML and AI against the potential penalties for noncompliance with privacy laws and regulations in their jurisdictions.

2.24.1 Data Ownership With Regard to Governance and Privacy

It is important to understand the role of data governance in an enterprise. Data governance is a systematic process for proactively managing data and improving data quality to help the enterprise achieve its goals and objectives. It lays the foundation for the data management discipline, helping businesses improve efficiency and ensure that legitimate information is used in business processes, while keeping its primary purpose of managing and improving data quality intact.

Some roles and responsibilities related to data governance include:

Data custodian—The individual(s) and department(s) responsible for the storage and safeguarding of computerized data
Data owner—The individual(s) responsible for the integrity, accurate reporting, and use of computerized data
Data protection officer (DPO)—The individual(s) responsible for informing and advising organizations about their data protection obligations and for monitoring their compliance (under GDPR, some organizations are required to appoint a DPO)
Data steward—The individual(s) responsible for data quality

Managing and improving data quality means, in part, ensuring data integrity. Consequently, a data governance policy typically describes the security controls that will be applied to protect data at each phase of the data life cycle.

2.24.2 Privacy Regulation Considerations

Many privacy regulations address AI either directly or indirectly. Many general data governance and AI ethics best practices apply and recur across current and proposed privacy requirements. Some common themes and privacy-based considerations include:

Transparency—Inform users about AI decision-making processes and their impacts.
Consent—Obtain explicit consent for data use in AI systems.
Fairness and accountability—Ensure that AI systems are unbiased and auditable.
Explanations—Provide meaningful information on the logic of AI-based decisions.
Opt-out options—Allow individuals to decline participation in automated decision-making processes.

Organizations leveraging AI solutions to process personal information covered by privacy laws should ensure that the system takes their privacy-related requirements into account. This should be done through the organization’s data governance process and could include standard techniques, such as performing a data protection impact assessment (DPIA). A DPIA is a form of risk assessment of the impact of data processing operations on the protection of personal data; the GDPR requires one when processing is likely to result in a high risk to individuals’ rights and freedoms, particularly when new technologies are used.¹³⁹

2.24.3 Data Minimization Considerations

Data minimization is a foundational principle in managing AI data and assets, emphasizing the collection and use of only the data strictly necessary to fulfill a defined purpose. This approach mitigates privacy risk by reducing the volume of sensitive or personal information exposed to potential breaches or misuse. Enterprises often accumulate extensive datasets and retain them indefinitely, which can turn these data holdings from valuable assets into significant liabilities under stringent privacy regulations such as the GDPR and CPRA.

Implementing data minimization requires organizations to clearly define the purpose of data collection and to limit data acquisition accordingly (e.g., voice assistants should minimize retention of audio transcripts beyond immediate use). This includes establishing data requirements during the initial phases of AI development and ensuring that only relevant data elements are ingested and processed. Techniques such as data masking and tokenization can be employed to reduce the exposure of sensitive information while maintaining the utility of datasets for AI training and inference.
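The following minimal sketch illustrates three of these ideas together: an allow-list that drops fields the stated purpose does not require, masking of an email address, and replacement of an identifier with a token. All names and values are hypothetical. Note that the `tokenize` function uses a keyed hash (HMAC-SHA-256) as a simple stand-in; this is deterministic pseudonymization, not a vault-based tokenization service, and the hard-coded key is a placeholder only.

```python
import hashlib
import hmac

# Hypothetical allow-list: only the fields the defined purpose requires.
ALLOWED_FIELDS = {"customer_id", "email", "purchase_total"}

# Placeholder key for illustration; real keys belong in a managed secret store.
SECRET_KEY = b"replace-with-a-managed-secret"


def minimize(record):
    """Keep only the fields on the allow-list."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}


def mask_email(email):
    """Hide the local part of an email address except its first character."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain


def tokenize(value):
    """Replace a value with a keyed-hash pseudonym (stand-in for tokenization)."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


record = {
    "customer_id": "C-1001",
    "email": "ana@example.com",
    "purchase_total": 42.5,
    "date_of_birth": "1990-01-01",   # not needed for the stated purpose
    "home_address": "12 Sample St",  # not needed for the stated purpose
}

prepared = minimize(record)
prepared["email"] = mask_email(prepared["email"])
prepared["customer_id"] = tokenize(prepared["customer_id"])
print(prepared)
```

Because the same input always yields the same token, records can still be joined on `customer_id` for training, while the original identifier is not exposed in the dataset.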