Defining Data Needs for AI (II)

13/02/2026

Identifying Data Needs for AI Projects (Phase II)

In Phase II, we translate business objectives into concrete data requirements that can reliably power your AI solutions. We work with stakeholders to clarify use cases, define success metrics, and map the data sources needed to support them. This includes assessing current data assets, identifying critical gaps, and prioritizing what must be collected, cleaned, or integrated. The outcome is a clear, actionable data blueprint that aligns technical needs with strategic goals and reduces risk in later development stages.

Our approach covers both quantitative and qualitative dimensions of your data landscape. We evaluate data volume, quality, accessibility, and governance, while also considering privacy, compliance, and ethical constraints. Together, we define data schemas, labeling strategies, and minimum viable datasets for experimentation. By the end of Phase II, your organization has a prioritized roadmap of data initiatives, clear ownership, and realistic timelines, ensuring that subsequent AI modeling efforts are efficient, scalable, and aligned with business value.

Artificial Intelligence projects rarely fail because of weak algorithms; far more often they fail because of poor data foundations.
Phase II of the PMI-CPMAI® framework focuses on identifying, assessing, and validating data before any model is built.

This phase answers one critical question:

Do we have the right data—ethically, legally, and practically—to support the AI solution?

Why Data Understanding Comes Before Model Building

Many organizations rush into selecting algorithms and tools. CPMAI emphasizes a disciplined, business-first approach where data readiness determines AI feasibility.

Poor data quality and weak data governance are among the leading causes of AI failure.

If data is unreliable, AI decisions will also be unreliable.

1. Translating Business Goals into Data Requirements

Every AI initiative starts with a business problem, but Phase II converts it into clear data needs.

Example

Business Goal: Reduce customer churn
Derived Data Needs:

  • Historical customer behavior

  • Transaction frequency and value

  • Support tickets and complaints

  • Definition of "churn" and time horizon

Key Outputs

  • Target variable (what AI predicts)

  • Input features (what influences outcomes)

  • Historical data window

  • Prediction timeframe
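The four key outputs above can be written down as a small, reviewable specification. The following is a minimal sketch for the churn example, not a prescribed CPMAI artifact: the class name `DataRequirements`, the feature names, the 365-day history window, the 90-day horizon, and the rule "churned means no purchase within the horizon" are all illustrative assumptions that a real project would agree with business stakeholders.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DataRequirements:
    """Phase II outputs for one use case (all values here are illustrative)."""
    target_variable: str          # what the AI predicts
    input_features: tuple         # what influences the outcome
    history_window_days: int      # how far back the training data reaches
    prediction_horizon_days: int  # how far ahead the prediction looks


# Assumed churn definition: no purchase within the prediction horizon.
CHURN_SPEC = DataRequirements(
    target_variable="churned",
    input_features=("transaction_count", "transaction_value", "support_tickets"),
    history_window_days=365,
    prediction_horizon_days=90,
)


def label_churn(purchase_dates, as_of, spec=CHURN_SPEC):
    """Return True if the customer made no purchase in the horizon after `as_of`."""
    horizon_end = as_of + timedelta(days=spec.prediction_horizon_days)
    return not any(as_of < d <= horizon_end for d in purchase_dates)


as_of = date(2025, 1, 1)
active = label_churn([date(2024, 11, 3), date(2025, 2, 10)], as_of)
lapsed = label_churn([date(2024, 6, 1), date(2025, 6, 1)], as_of)
print(active, lapsed)  # False True
```

Writing the definition as code forces the team to settle the time horizon and the exact meaning of "churn" before any data is collected, which is the point of this step.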

2. Identifying the Right Data Types

Different AI use cases require different kinds of data.

Common Data Categories

  • Structured data: Databases, spreadsheets, metrics

  • Unstructured data: Emails, documents, images, audio

  • Semi-structured data: Logs, JSON, XML

  • External data: Market data, weather, social signals

CPMAI stresses selecting data based on business relevance—not availability.

3. Assessing Data Availability and Accessibility

Not all required data is immediately usable.

Key Questions

  • Does the data exist?

  • Who owns the data?

  • Is it siloed across departments?

  • Can it be legally accessed and shared?

Early identification of access constraints prevents costly rework later.
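The key questions above can be recorded per data source so that access constraints surface early. This is a minimal sketch under stated assumptions: the `DataSource` fields, the source names, and the owners are hypothetical, and a real inventory would normally live in a data catalog rather than in code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataSource:
    name: str
    exists: bool
    owner: Optional[str]      # None means no owner has been identified
    siloed: bool              # held by one department with no shared access
    legally_shareable: bool


def access_blockers(source):
    """Return the list of Phase II access constraints for one source."""
    blockers = []
    if not source.exists:
        blockers.append("data does not exist")
    if source.owner is None:
        blockers.append("no identified owner")
    if source.siloed:
        blockers.append("siloed in one department")
    if not source.legally_shareable:
        blockers.append("cannot be legally accessed or shared")
    return blockers


# Hypothetical inventory for the churn use case.
inventory = [
    DataSource("crm_transactions", True, "Sales Ops", False, True),
    DataSource("support_tickets", True, "Customer Care", True, True),
    DataSource("call_recordings", True, None, True, False),
]

report = {s.name: access_blockers(s) for s in inventory}
for name, blockers in report.items():
    print(name, "->", blockers or "ready")
```

A source with an empty blocker list is ready to assess for quality; any other source needs an owner decision or a legal review first.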

4. Evaluating Data Quality and Bias

Data quality is the single most critical success factor in AI projects.

Core Data Quality Dimensions

  • Accuracy: Correct and error-free

  • Completeness: Minimal missing values

  • Consistency: Uniform meaning across systems

  • Timeliness: Up-to-date and relevant

  • Bias & representativeness: Fair reflection of reality

⚠️ Biased data leads to biased AI outcomes—an important CPMAI exam theme.
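Several of these dimensions can be measured directly on a data sample. The sketch below computes completeness, timeliness, and per-group shares as a first look at representativeness. The records, field names, and the 365-day freshness threshold are invented for illustration. Accuracy and consistency usually need a trusted reference source and are not covered here.

```python
from collections import Counter
from datetime import date

# Hypothetical customer records; None marks a missing value.
records = [
    {"region": "EU", "income": 52000, "updated": date(2025, 11, 2)},
    {"region": "EU", "income": None, "updated": date(2025, 12, 15)},
    {"region": "EU", "income": 47000, "updated": date(2023, 3, 9)},
    {"region": "US", "income": 61000, "updated": date(2025, 10, 20)},
]


def completeness(rows, field):
    """Share of rows where `field` is present."""
    return sum(r[field] is not None for r in rows) / len(rows)


def timeliness(rows, as_of, max_age_days):
    """Share of rows updated within `max_age_days` of `as_of`."""
    return sum((as_of - r["updated"]).days <= max_age_days for r in rows) / len(rows)


def group_shares(rows, field):
    """Share of rows per group, a first look at representativeness."""
    counts = Counter(r[field] for r in rows)
    return {group: n / len(rows) for group, n in counts.items()}


as_of = date(2026, 1, 1)
income_completeness = completeness(records, "income")
fresh_share = timeliness(records, as_of, max_age_days=365)
region_shares = group_shares(records, "region")
print(income_completeness, fresh_share, region_shares)
```

Group shares only become a bias signal when compared with the population the AI solution will serve. A 75/25 split is acceptable if the customer base really looks like that, and a warning sign if it does not.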

5. Privacy, Security, and Compliance Considerations

AI data must comply with legal and ethical standards.

Phase II Compliance Checks

  • Personally Identifiable Information (PII)

  • Consent and purpose limitation

  • Data anonymization or masking

  • Industry and regional regulations

If data cannot be used legally or ethically, the AI solution must be redesigned or stopped.
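The masking check can be illustrated with a small sketch. It pseudonymizes a customer identifier with a salted hash and replaces email addresses in free text. The record, the salt, and the simplified email pattern are assumptions made for this example. Pseudonymized data can often still be re-identified, so this does not replace a proper legal and privacy review.

```python
import hashlib
import re

# Simplified email pattern, for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(value, salt):
    """Replace an identifier with a salted SHA-256 digest (truncated for readability)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def mask_emails(text):
    """Replace any email address in free text with a fixed placeholder."""
    return EMAIL_PATTERN.sub("[EMAIL]", text)


# Hypothetical support-ticket record.
ticket = {
    "customer_id": "C-10482",
    "body": "Please reply to jane.doe@example.com about my refund.",
}

masked = {
    "customer_id": pseudonymize(ticket["customer_id"], salt="demo-salt"),
    "body": mask_emails(ticket["body"]),
}
print(masked["body"])  # Please reply to [EMAIL] about my refund.
```

Because the same input and salt always produce the same digest, records for one customer can still be joined across tables without exposing the original identifier.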

6. Identifying Data Gaps and Remediation Options

After assessment, teams decide whether data is:

  • ✅ Sufficient to proceed

  • ⚠️ Requires enrichment or labeling

  • ❌ Inadequate for AI use

Remediation Strategies

  • Data cleaning and normalization

  • Feature engineering

  • External data acquisition

  • Manual or automated labeling

7. Phase II Deliverables

By the end of Phase II, organizations should have:

✔ Data requirements document
✔ Data inventory and source mapping
✔ Data quality and bias assessment
✔ Privacy and compliance review
✔ Go / No-Go recommendation

In CPMAI Phase II, understanding the properties of Big Data is essential to determine whether data is suitable, sufficient, and feasible for an AI initiative. These properties help decision-makers evaluate data readiness before any preparation or modeling begins.

  • Volume determines whether the quantity of available data is adequate to support statistically meaningful AI outcomes.

  • Velocity assesses how quickly data is generated and whether real-time or near-real-time processing is required to meet business objectives.

  • Variety evaluates the types of data involved (structured, unstructured, or semi-structured) and their relevance to the defined business problem.

  • Veracity measures data quality, reliability, bias, and uncertainty, which directly impact trust, fairness, and risk in AI outcomes.

  • Value ensures that the data, if used, can realistically contribute to measurable business benefits and decision improvement.

From a CPMAI perspective, these properties are not merely technical characteristics but decision criteria used to validate data feasibility, identify constraints, and support a Go / No-Go recommendation. If data volume is insufficient, velocity cannot be supported, variety is irrelevant, veracity is compromised, or value cannot be demonstrated, the AI initiative must be paused, redesigned, or stopped before moving to Phase III.
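The decision rule described above, where any failed property blocks progress, can be sketched as a simple checklist. The class and field names are hypothetical, and each yes/no judgment would come from the Phase II assessments rather than from code. The sketch also collapses "paused, redesigned, or stopped" into a single No-Go result and leaves that choice to the project sponsors.

```python
from dataclasses import dataclass


@dataclass
class FiveVsAssessment:
    volume_sufficient: bool     # enough data for statistically meaningful results
    velocity_supported: bool    # data arrives as fast as the use case needs
    variety_relevant: bool      # available data types match the business problem
    veracity_acceptable: bool   # quality and bias are within agreed tolerances
    value_demonstrated: bool    # data can plausibly drive a measurable benefit


def failed_criteria(assessment):
    """Names of the criteria that did not pass."""
    return [name for name, passed in vars(assessment).items() if not passed]


def recommendation(assessment):
    """'Go' only when every criterion passes; otherwise 'No-Go' plus the reasons."""
    failures = failed_criteria(assessment)
    return ("Go", []) if not failures else ("No-Go", failures)


# Real-time use case fed by a once-a-day batch: velocity fails.
daily_batch_case = FiveVsAssessment(True, False, True, True, True)
decision, reasons = recommendation(daily_batch_case)
print(decision, reasons)  # No-Go ['velocity_supported']
```

Returning the failed criteria along with the decision keeps the recommendation traceable to the specific constraint that has to be resolved.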

CPMAI Phase II – Scenario MCQs (Data Understanding)

Q1.

A retail organization wants to use AI to predict customer churn. During Phase II, the team realizes that "churn" is defined differently by sales and marketing teams. What should be done first?

A. Start data collection using both definitions
B. Select the definition with more data
C. Standardize the business definition of churn
D. Proceed and resolve later during modeling

Correct Answer: C
Explanation: Phase II requires a clear definition of the target variable before any data analysis begins.

Q2.

An AI project requires customer interaction data, but access is restricted due to departmental ownership. What Phase II issue does this represent?

A. Data quality problem
B. Data availability and accessibility constraint
C. Data labeling issue
D. Model feasibility issue

Correct Answer: B
Explanation: Phase II explicitly assesses data ownership, access, and silos.

Q3.

Which data type is MOST suitable for sentiment analysis in a customer feedback AI project?

A. Structured transactional data
B. Numerical time-series data
C. Unstructured textual data
D. Reference master data

Correct Answer: C
Explanation: Sentiment analysis relies primarily on unstructured text data.

Q4.

A dataset has high volume but contains outdated records that no longer reflect current business conditions. Which data quality dimension is MOST affected?

A. Accuracy
B. Completeness
C. Timeliness
D. Consistency

Correct Answer: C
Explanation: Timeliness ensures data reflects current and relevant conditions.

Q5.

An AI team discovers that historical loan approval data reflects past discriminatory practices. What Phase II concern does this raise?

A. Model overfitting
B. Algorithm selection
C. Data bias and fairness risk
D. Infrastructure limitation

Correct Answer: C
Explanation: Phase II requires bias identification before data is used for modeling.

Q6.

Which activity BEST represents translating a business problem into data requirements?

A. Choosing a machine learning algorithm
B. Defining input features and target variables
C. Selecting cloud infrastructure
D. Training a pilot model

Correct Answer: B
Explanation: Phase II focuses on mapping business objectives to data elements.

Q7.

A project team plans to use social media data for AI analysis. What Phase II check is MOST critical before proceeding?

A. Feature engineering complexity
B. Data storage cost
C. Legal and consent compliance
D. Model explainability

Correct Answer: C
Explanation: External data sources require privacy, consent, and legal validation.

Q8.

Which outcome indicates that Phase II should recommend a No-Go decision?

A. Data exists but requires cleaning
B. Data quality is low but improvable
C. Data cannot be legally or ethically used
D. Data labeling will take time

Correct Answer: C
Explanation: If data cannot be used legally or ethically, CPMAI requires stopping or redesigning the initiative.

Q9.

A team assumes more data will automatically improve AI outcomes. Which CPMAI principle does this violate?

A. Model interpretability
B. Data quality and relevance over volume
C. Algorithm efficiency
D. Automation maturity

Correct Answer: B
Explanation: CPMAI prioritizes data relevance and quality over volume.

Q10.

Which document is a key deliverable of Phase II?

A. Model performance report
B. Data requirements and inventory document
C. Deployment architecture
D. Monitoring dashboard

Correct Answer: B
Explanation: Phase II deliverables focus on data understanding, sources, and readiness.

Q11.

A dataset has inconsistent formats for the same attribute across systems. Which data quality issue is this?

A. Accuracy
B. Completeness
C. Consistency
D. Bias

Correct Answer: C
Explanation: Consistency ensures the same meaning and format across datasets.

Q12.

Which Phase II decision MOST directly impacts future model explainability?

A. Choice of algorithm
B. Feature selection and data definition
C. Compute infrastructure
D. Model deployment method

Correct Answer: B
Explanation: Explainability begins with transparent, well-defined data and features.

Q13.

An organization decides to enrich internal data with third-party market data. This activity belongs to which Phase II outcome?

A. Algorithm optimization
B. Data gap remediation
C. Model tuning
D. Performance validation

Correct Answer: B
Explanation: Phase II identifies data gaps and enrichment strategies.

Q14.

Which statement BEST reflects CPMAI guidance for Phase II?

A. Model accuracy determines AI success
B. Data preparation is a Phase IV activity
C. Data feasibility must be validated before modeling
D. Algorithms can compensate for poor data

Correct Answer: C
Explanation: Phase II exists specifically to validate data feasibility early.

Q15.

Who is MOST responsible for validating data relevance during Phase II?

A. AI engineer
B. Data scientist
C. Business stakeholder
D. Infrastructure architect

Correct Answer: C
Explanation: CPMAI stresses that the business owns data relevance; it is not left to technical teams alone.

Q16.

An AI initiative requires real-time decision-making, but available data is updated only once per day. Which Big Data property represents the primary constraint?

A. Volume
B. Variety
C. Velocity
D. Veracity

Correct Answer: C
Explanation: Velocity evaluates whether data is generated and processed at the speed required to meet business objectives.

Q17.

A project team has access to large amounts of customer data, but much of it is irrelevant to the defined business problem. Which Big Data property is MOST impacted?

A. Volume
B. Variety
C. Value
D. Velocity

Correct Answer: C
Explanation: Value assesses whether data can realistically contribute to measurable business outcomes, not just whether it exists.

Q18.

An AI model is trained using data from only one geographic region, leading to unfair outcomes when applied globally. Which Big Data property should have been evaluated more carefully in Phase II?

A. Volume
B. Veracity
C. Velocity
D. Value

Correct Answer: B
Explanation: Veracity includes data bias and representativeness, which directly affect fairness and trustworthiness of AI outcomes.

Q19.

An organization has high-quality, relevant data, but the dataset is too small to support statistically reliable predictions. Which Big Data property is the primary concern?

A. Variety
B. Velocity
C. Volume
D. Veracity

Correct Answer: C
Explanation: Volume determines whether sufficient data exists to support meaningful AI analysis and prediction.

Q20.

An AI initiative requires analyzing customer emails, call recordings, and transaction logs together. Which Big Data property is MOST relevant during Phase II assessment?

A. Volume
B. Velocity
C. Variety
D. Value

Correct Answer: C
Explanation: Variety evaluates the presence and relevance of multiple data types (structured, semi-structured, and unstructured) required for the AI use case.