Challenges for AI system providers
Art. 10 mandates that data used for training, validation, and testing must be:
- Relevant, sufficiently representative and, to the best extent possible, free of errors and complete.
- Statistically sound and appropriate for the intended purpose.
- Accompanied by documentation of collection methods, assumptions, and preprocessing steps.
- Evaluated for potential biases and gaps, with mitigation strategies in place.
These provisions apply across all high-risk AI systems.
Annex IV complements Art. 10 provisions by detailing what providers of high-risk AI systems must cover in the technical documentation:
- Dataset descriptions, including origin, scope, and characteristics.
- Data labeling and cleaning procedures.
- Versioning and traceability across the data lifecycle.
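The versioning and traceability requirement can be approached with content-addressed dataset fingerprints: if a snapshot's version ID is derived from its contents, any change to the data yields a new, auditable version. The sketch below is a minimal illustration (function and field names are hypothetical), not a prescribed mechanism:

```python
import hashlib
import json

def dataset_fingerprint(records, metadata):
    """Compute a reproducible content hash for a dataset snapshot.

    Sorting keys and records makes the fingerprint independent of
    insertion order, so identical data always yields the same version ID.
    """
    canonical = json.dumps(
        {"metadata": metadata, "records": sorted(records)},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# Two snapshots differing in a single record get distinct version IDs.
v1 = dataset_fingerprint(["alice,42", "bob,37"], {"origin": "survey-2024"})
v2 = dataset_fingerprint(["alice,42", "bob,38"], {"origin": "survey-2024"})
```

Storing such fingerprints alongside the documentation of Annex IV gives auditors a cheap way to verify that the dataset described is the dataset actually used.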
Bias mitigation is a cornerstone of the AI Act. Developers must proactively identify and address biases that could lead to discriminatory or otherwise erroneous AI system outputs. This includes:
- Using diverse and representative datasets.
- Applying fairness-aware algorithms and validation metrics.
- Documenting bias detection and mitigation techniques.
This aligns with the ISO/IEC 5259 series of standards (Data Quality for AI) and ISO/IEC 8183 (AI Data Lifecycle), which provide operational guidance for implementing robust data governance. Under the AI Act, the harmonized standards EN 18284 (Quality and governance of datasets in AI) and EN 18283 (Concepts, measures and requirements for managing bias in AI systems) will be the first choice for demonstrating compliance.
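As an illustration of a fairness-aware validation metric, the sketch below computes the demographic parity difference, the gap in positive-prediction rates between groups. It is one common measure among many, shown here with hypothetical names and data:

```python
def demographic_parity_difference(y_pred, groups):
    """Largest gap in positive-prediction rates across groups.

    y_pred: binary predictions (0/1); groups: group label per prediction.
    A value near 0 means the model flags members of all groups at
    similar rates; larger values indicate potential disparate impact.
    """
    rates = {}
    for g in set(groups):
        preds = [p for p, gg in zip(y_pred, groups) if gg == g]
        rates[g] = sum(preds) / len(preds)
    vals = sorted(rates.values())
    return vals[-1] - vals[0]

preds  = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
# Group "a" is flagged at 3/4, group "b" at 1/4.
gap = demographic_parity_difference(preds, groups)
```

Reporting such a metric for each protected attribute, together with the chosen threshold and rationale, is one concrete way to document bias detection.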
Data management process is key to AI Act compliance
The data management process in the quality management system of the AI system provider should encompass the following steps:
- Data requirement specification,
- Data management planning,
- Data collection,
- Data preparation,
- Data provision, and
- Data decommissioning.
Requirement specification and management planning typically occur within the AI model development process. The data management report serves as the central proof of compliance.
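The lifecycle steps above can be tracked explicitly within the quality management system. The following sketch (all names hypothetical, an illustration rather than a mandated structure) enforces the stage order and records a timestamped audit trail that could feed the data management report:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class Stage(Enum):
    REQUIREMENTS = "data requirement specification"
    PLANNING = "data management planning"
    COLLECTION = "data collection"
    PREPARATION = "data preparation"
    PROVISION = "data provision"
    DECOMMISSIONING = "data decommissioning"

@dataclass
class DatasetLifecycle:
    """Track a dataset's progress through the lifecycle stages in order."""
    name: str
    history: list = field(default_factory=list)

    def advance(self, stage: Stage, note: str = ""):
        # Stages must be completed in the declared order.
        expected = list(Stage)[len(self.history)]
        if stage is not expected:
            raise ValueError(f"expected {expected.name}, got {stage.name}")
        self.history.append((stage, datetime.now(timezone.utc), note))

lc = DatasetLifecycle("credit-scoring-v1")
lc.advance(Stage.REQUIREMENTS, "features agreed with compliance team")
lc.advance(Stage.PLANNING)
```

The point of the design is that skipping a stage is impossible by construction, which makes the audit trail trustworthy evidence rather than a retrospective narrative.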
Post-market monitoring and risk management
The AI Act requires continuous monitoring of deployed systems to detect performance degradation or data drift. This is especially critical for adaptive systems that evolve over time. Key practices include:
- Logging inputs and outputs for traceability.
- Monitoring prediction drift and triggering retraining or updates.
- Using predetermined change control plans (PCCPs) for systems that learn post-deployment.
Risk management must also account for data-specific hazards such as poisoning, distributional shifts, and adversarial manipulation.
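One common way to quantify the drift and distributional shifts mentioned above is the Population Stability Index (PSI), which compares live input distributions against a training-time reference. The sketch below is a minimal illustration; the thresholds in the docstring are a widely used rule of thumb, not an AI Act requirement:

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference distribution and live data.

    Conventional interpretation (an industry heuristic, not regulatory):
    PSI < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def proportions(data):
        counts = [0] * bins
        for x in data:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(data), 1e-6) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

reference = [x / 100 for x in range(100)]        # training-time feature values
live      = [0.3 + x / 200 for x in range(100)]  # shifted live feature values
psi = population_stability_index(reference, live)
```

In a deployed system, crossing the drift threshold would trigger the retraining or update path defined in the predetermined change control plan.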
Data protection
When personal data is involved, the AI Act intersects with the EU GDPR. Developers must:
- Assess whether data can be linked to individuals, even indirectly.
- Apply principles of data minimization, purpose limitation, and fairness.
- Ensure lawful bases for processing and implement privacy-preserving techniques.
This dual compliance challenge underscores the need for cross-functional collaboration between AI engineers, legal experts, and data protection officers.
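As a small illustration of combining data minimization with a privacy-preserving technique, the sketch below drops fields not needed for the stated purpose and replaces a direct identifier with a keyed hash. SECRET_KEY and all field names are hypothetical placeholders:

```python
import hashlib
import hmac

# Hypothetical placeholder: in practice the key would live in a secrets
# manager and be rotated, never hard-coded in source.
SECRET_KEY = b"rotate-me"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).

    Keyed hashing (rather than plain SHA-256) resists dictionary attacks
    by anyone who lacks the key. Note: under the GDPR the output is still
    personal data, because the key holder can re-link it to the person.
    """
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()

def minimize(record: dict, allowed_fields: set) -> dict:
    """Drop every field not strictly needed for the stated purpose."""
    return {k: v for k, v in record.items() if k in allowed_fields}

record = {"email": "alice@example.com", "age": 34, "shoe_size": 41}
clean = minimize(record, {"email", "age"})
clean["email"] = pseudonymize(clean["email"])
```

The comment about re-linkability matters legally: pseudonymized data remains in scope of the GDPR, whereas properly anonymized data does not.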
Data access by Notified Bodies
Access to provider data by Notified Bodies (Art. 43) presents a delicate intersection with GDPR compliance. To verify conformity of high-risk AI systems, Notified Bodies may require access to training, validation, and testing datasets possibly containing personal or pseudonymized data. Providers must ensure that such data sharing is explicitly covered by a legal basis, and that safeguards like anonymization, contractual controls, and audit trails are in place. Without these, the risk of unauthorized data access or secondary use could undermine both regulatory trust and data subject rights.
Additional considerations in data management
While the AI Act provides a strong foundation, the following critical aspects deserve additional attention:
- Synthetic data: Increasingly used to augment or replace real datasets, synthetic data must likewise be evaluated for quality, bias, and representativeness.
- Data preparation: The role of domain experts in labeling, validating, and interpreting data remains irreplaceable. Human oversight processes enhance quality and accountability.
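Representativeness of synthetic data can be checked by comparing its distribution against the real data it stands in for. The sketch below uses a two-sample Kolmogorov-Smirnov statistic on a single feature, a deliberate simplification of a full multivariate evaluation; all data shown is illustrative:

```python
import bisect

def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic:
    the maximum gap between the two empirical CDFs."""
    real_s, synth_s = sorted(real), sorted(synthetic)
    points = sorted(set(real_s + synth_s))

    def ecdf(sorted_data, x):
        # Fraction of samples less than or equal to x.
        return bisect.bisect_right(sorted_data, x) / len(sorted_data)

    return max(abs(ecdf(real_s, x) - ecdf(synth_s, x)) for x in points)

real       = [0.1 * i for i in range(100)]
good_synth = [0.1 * i + 0.01 for i in range(100)]  # near-identical distribution
bad_synth  = [0.1 * i + 5.0 for i in range(100)]   # badly shifted distribution
```

A small statistic indicates the synthetic feature tracks the real one closely; a large one flags a distributional mismatch that would propagate bias into any model trained on the synthetic set.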
Conclusion
The AI Act sets a standard for data governance in high-risk AI systems. By embracing its requirements and integrating emerging standards, developers can build systems that are not only compliant but also resilient, ethical, and future-proof. Data is no longer just a technical asset but also a regulatory cornerstone and a competitive differentiator.