AI for Good stories

AI for good starts with quality data


by Oussama Elmerrahi, Regional Lead – Paris Hub


AI is transforming how we work, govern, and live, but only as far as its data allows: the quality of data determines the quality of intelligence. Bigger models cannot fix bad inputs. To be trustworthy, AI must begin with clean, complete, well-governed data.

The Professional Reality

In real projects, the barrier is rarely the model; it’s the data. The UK’s National Health Service (NHS) learned this early: guidance on AI emphasized that safe innovation requires high-quality, well-curated datasets and strong governance (NHSX, 2019). Six years on, analyses still flag the same bottlenecks: data infrastructure, interoperability, and information governance. These remain key obstacles to scaling AI across clinical pathways (The King’s Fund, 2025). This persistence illustrates a broader truth: data quality quietly determines AI success long before deployment.

Evidence from Recent Research

The trend spans industries. Appen’s State of AI 2024 reported a 10-point year-over-year rise in bottlenecks tied to sourcing, cleaning, and annotating data, alongside a 9-point decline in data accuracy since 2021 (Appen, 2024). Anomalo’s State of Enterprise Data Quality 2024 found that 95% of leaders experienced data-quality issues that affected business outcomes, and 87% said traditional rules-based approaches no longer scale (Anomalo, 2024). Qlik’s 2025 research shows that 81% of organizations still struggle with AI data quality, putting ROI and trust at risk (Qlik, 2025). Precisely’s 2024 survey notes that 64% now cite data quality as their top integrity challenge (Precisely, 2024).

Advisory and policy bodies converge on the same diagnosis. Deloitte groups gen-AI headwinds into four linked issues: cross-type integrity, reliability of outputs, drift management, and governance (Deloitte, 2025). In parallel, the NIST AI Risk Management Framework elevates data provenance, quality, and bias mitigation as central to trustworthy AI (NIST, 2023; NIST, 2024).

What “Data Quality” Means in AI

Data quality spans multiple, testable dimensions:

  • Accuracy & validity: errors or stale entries distort learning.
  • Completeness: missing features or sparse labels blind models to real-world variation.
  • Consistency & standardization: conflicting schemas and semantics create silent failure modes.
  • Timeliness & freshness: stale data misguides time-sensitive decisions.
  • Representativeness & fairness: non-diverse samples yield biased, potentially discriminatory outputs.

These map to common data-quality checklists and to NIST AI RMF guidance on provenance and bias (NIST, 2023/2024).
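Each of these dimensions can be expressed as a testable metric over a batch of records. The sketch below is a minimal illustration, assuming hypothetical record fields (`id`, `age`, `region`, `updated`) and an arbitrary plausibility range for `age`; it is not a production check, only the shape of one:

```python
from datetime import datetime, timedelta

# Hypothetical records; field names and values are illustrative only.
RECORDS = [
    {"id": 1, "age": 34, "region": "north", "updated": "2025-06-01"},
    {"id": 2, "age": -5, "region": "north", "updated": "2025-06-02"},  # invalid age
    {"id": 3, "age": 61, "region": None,    "updated": "2023-01-15"},  # missing + stale
]

def validity(records):
    """Accuracy & validity: share of rows whose age falls in a plausible range."""
    ok = sum(1 for r in records if r["age"] is not None and 0 <= r["age"] <= 120)
    return ok / len(records)

def completeness(records, field):
    """Completeness: share of rows with a non-null value for `field`."""
    return sum(1 for r in records if r[field] is not None) / len(records)

def freshness(records, max_age_days=365, now=None):
    """Timeliness: share of rows updated within `max_age_days` of `now`.
    A fixed default `now` keeps this example deterministic; a real check
    would use the current time."""
    now = now or datetime(2025, 6, 30)
    cutoff = now - timedelta(days=max_age_days)
    return sum(1 for r in records
               if datetime.strptime(r["updated"], "%Y-%m-%d") >= cutoff) / len(records)
```

Scoring a batch this way turns "quality" from a slogan into numbers that can be thresholded and tracked over time.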

Why the Challenge Is Acute for AI

Modern AI magnifies imperfections. Unlike traditional analytics over curated tables, today’s systems ingest heterogeneous streams such as text, images, audio, and sensor data, which makes uniform quality harder to enforce. Supervised learning adds labeling error, and production workloads then encounter data and concept drift, requiring continuous monitoring and recalibration. Engineering playbooks reinforce this point: Google’s ML Test Score proposes concrete checks for production readiness (Breck et al., 2017), while TensorFlow Data Validation (TFDV) detects schema anomalies, skew between training and serving, and drift in pipelines (TFDV, 2024).
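The drift problem lends itself to a simple illustration. Below is a library-free sketch of a population-drift check using the Population Stability Index (PSI), a common monitoring metric; the thresholds in the comment are conventional rules of thumb, and a production pipeline would typically delegate this kind of check to tooling such as TFDV rather than hand-rolled code:

```python
import math

def psi(expected, actual, bins=10, lo=0.0, hi=1.0, eps=1e-6):
    """Population Stability Index between two samples of a feature,
    bucketed into equal-width bins over [lo, hi].
    Rule of thumb (illustrative): < 0.1 stable, 0.1-0.25 moderate shift,
    > 0.25 significant drift worth investigating."""
    width = (hi - lo) / bins

    def hist(xs):
        # Normalized histogram; eps avoids log(0) on empty bins.
        counts = [0] * bins
        for x in xs:
            i = min(bins - 1, max(0, int((x - lo) / width)))
            counts[i] += 1
        return [(c / len(xs)) + eps for c in counts]

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# A serving sample compressed into [0.5, 1.0) drifts sharply from
# training data spread uniformly over [0, 1).
train = [i / 100 for i in range(100)]
serving = [x * 0.5 + 0.5 for x in train]
```

Comparing `psi(train, train)` (zero) against `psi(train, serving)` (well above 0.25) shows why such a metric, alerted on continuously, catches shifts long before model accuracy visibly degrades.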

From Slogans to Systems: A Data-Quality Playbook

To move from awareness to durable capability, organizations need a data-quality mindset embedded across the AI lifecycle:

  1. Data contracts between producers and consumers
    Define schemas, ranges, null rules, units, and SLOs for freshness and completeness. Treat breaks as production incidents (Deloitte, 2025).
  2. Automated quality gates in CI/CD
    Validate schemas, statistical properties, uniqueness, referential integrity, and drift on every batch or deploy. Fail the build when checks fail, and promote only when all checks pass (Breck et al., 2017; TFDV, 2024).
  3. Lineage, versions, and provenance
    Record how raw data becomes features and training sets, version datasets, and document sources in model cards. Align to NIST AI RMF expectations on provenance and integrity (NIST, 2023/2024).
  4. Labeling quality
    Use gold standard sets, measure interrater agreement, audit edge cases, and relabel where errors cluster (Appen, 2024).
  5. Drift detection & response
    Monitor population, concept, and pipeline drift with alerts, canaries, and scheduled recalibration, and do not rely on periodic retraining alone (Breck et al., 2017; TFDV, 2024).

Ultimately, AI’s transformative potential rises or falls on data integrity. The biggest risks are not scarce compute or clever algorithms but the quiet erosion of trust caused by poor-quality inputs. Organizations that invest in clean, consistent, and well-governed data, and that operationalize validation, lineage, drift management, and governance, will see better performance and earn the credibility to sustain innovation over time. “AI for good” begins with getting the data right.

Data Consultant, DataNovaQ

Based in Paris, France, Oussama Elmerrahi is a Data Consultant and entrepreneur focused on optimizing data governance to ensure quality, security, and compliance. By leveraging innovative practices, he enables informed decision-making, mitigates risk, and tackles the challenges of complex data environments. Passionate about sustainability, he designs data solutions that are effective, ethical, and environmentally responsible.

With a background in Robotics, Industrial Engineering, and cross-functional leadership, Oussama has led collaborative projects across diverse teams, serving as Chairman and Section Lead at IEEE bodies. He is an experienced public speaker and a mentor with the Arqus European University Alliance, supporting impactful careers in technology and data.

As a Youth Ambassador for the Internet Society (ISOC, 2025 cohort), he champions a free and accessible internet, emphasizing innovation, inclusion, and global connectivity. Internationally, he contributes to AI ethics and governance as a Research Group Member with the Center for AI and Digital Policy (CAIDP). Committed to aligning technology with social impact, Oussama advances data-driven and AI-powered solutions that foster a sustainable, inclusive digital future, promoting cross-sector collaboration through AI for Good.
