As generative AI evolves from experimental novelty into hyper-realistic “multimodal” deepfakes—blending synthetic video, cloned audio, and forged text—Law Enforcement Agencies (LEAs) and Forensic Investigators (FIs) can no longer just ask, “Is this real?” They must also ask: “Where exactly was it changed? Who made it? And can I prove it in a court of law?”
These challenges are no longer hypothetical—they are already shaping how investigators and courts assess the credibility of digital evidence. Tackling them requires solutions built for real operational conditions, supported by datasets that reflect what practitioners actually encounter and by development practices that can stand up to independent review. DETECTOR aims to address these needs. As an EU-funded project bringing together research and practitioner expertise, DETECTOR is developing datasets, methods, and governance practices to strengthen deepfake detection and forensic analysis across Europe’s investigative and judicial contexts.
Creating datasets to train these tools is a major challenge. It demands advanced engineering and compliance with European legal frameworks such as the EU AI Act, the GDPR, and the Law Enforcement Directive (LED). Our project is developing a new framework for creating these datasets, ensuring they are not only technically robust but also FAIR (Findable, Accessible, Interoperable, and Reusable).
The Challenge: A “Messy” Reality
Most deepfake datasets contain studio-quality faces with clear glitches. Real-world investigations, however, are messy. They involve grainy CCTV footage, heavily compressed social media clips, and “hybrid” forgeries where a real person’s face is kept, but their words or surroundings are synthetically altered.
Current detection tools often provide a simple “fake” or “real” score. For a forensic investigator, this isn’t enough. They need fine-grained localisation (identifying the specific seconds in a video or the exact pixels in an image that have been altered) and explainability (understanding why the AI flagged a shadow as unnatural or a lip-sync as slightly off).
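To make that requirement concrete, here is a minimal sketch of what a localised, explainable detector output could look like. The class and field names are illustrative assumptions, not a DETECTOR specification:

```python
from dataclasses import dataclass, field

@dataclass
class ManipulatedRegion:
    """One localised finding inside a piece of media."""
    modality: str          # "video", "audio", or "text"
    start_s: float         # start of the affected span, in seconds
    end_s: float           # end of the affected span, in seconds
    bbox: tuple | None     # (x, y, w, h) pixel region, if spatially localised
    rationale: str         # why it was flagged, e.g. "lip-sync drift vs. audio track"
    confidence: float      # model confidence for this specific region

@dataclass
class ForensicReport:
    """Not a single score: where it was changed, why it was flagged, how sure we are."""
    media_id: str
    overall_score: float                         # global real/fake score, kept for triage
    regions: list = field(default_factory=list)  # the fine-grained localisation

# Example: a grainy CCTV clip with one suspect segment
report = ForensicReport(
    media_id="cctv_0042.mp4",
    overall_score=0.91,
    regions=[ManipulatedRegion("video", 12.4, 15.0, (220, 80, 96, 96),
                               "unnatural shadow boundary on the face", 0.88)],
)
```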
The Solution: Forensic-Grade Multimodal Datasets
Our project shifts the focus from simple detection to operational forensic capability. We are curating datasets that not only mirror the latest AI generation techniques but are also designed to meet the evidentiary requirements of the justice system.
Our Rationale: Trust Through Compliance
A tool is only as good as the data it’s trained on, and in Europe, that data must be beyond legal reproach. Our approach is built on three pillars:
1. Dual-Regime Governance
We recognise that research and police work have different rules. We maintain strict separation between Research Datasets (governed by GDPR) and Operational Validation Datasets (governed by the LED). This ensures that data used for general AI training doesn’t cross legal boundaries into sensitive investigative territory.
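As an illustration of how that separation could be enforced in code rather than by policy alone (the names below are hypothetical, not DETECTOR’s actual tooling), every dataset can carry its legal regime as metadata, with cross-regime operations refused outright:

```python
from enum import Enum

class Regime(Enum):
    GDPR = "research"       # general research datasets
    LED = "operational"     # law-enforcement validation datasets

class Dataset:
    def __init__(self, name: str, regime: Regime):
        self.name = name
        self.regime = regime

def merge(a: Dataset, b: Dataset) -> Dataset:
    """Combine two datasets only if they sit under the same legal regime."""
    if a.regime is not b.regime:
        raise PermissionError(
            f"Cannot mix {a.regime.value!r} data ({a.name}) "
            f"with {b.regime.value!r} data ({b.name})"
        )
    return Dataset(f"{a.name}+{b.name}", a.regime)

# merge(research_faces_gdpr, casework_samples_led) fails loudly,
# instead of silently leaking operational material into AI training.
```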
2. Sovereignty and Security
Forensic data is sensitive. To facilitate cross-border cooperation between European LEAs without centralising (and risking) data, we will use Sovereign Data Spaces. This federated architecture will allow agencies to share insights and validate models while keeping the actual data under their own jurisdiction.
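In caricature, the federated pattern works as sketched below: the model travels to the data, and only aggregate metrics travel back. This is a deliberately simplified illustration with hypothetical names; real data-space connectors add authentication, policy negotiation, and usage control on top:

```python
def validate_locally(model, labelled_samples) -> dict:
    """Runs entirely inside one agency's own infrastructure.

    The raw media never leaves the premises; only aggregate
    metrics are shared with the other participants."""
    correct = total = 0
    for sample, label in labelled_samples:  # data stays under local jurisdiction
        correct += int(model(sample) == label)
        total += 1
    return {"accuracy": correct / total, "n_samples": total}

# Each agency runs validate_locally() on its own sovereign data and
# shares back only the metrics dictionary, never the files themselves.
```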
3. Strategic Sourcing: Privacy and the Public Domain
To respect privacy while building large-scale datasets, we plan to use a multi-level sourcing approach (a metadata sketch follows the list):
- Synthetic Identities: We will use high-fidelity synthetic subjects (which have no data subject rights) as the source for mass deepfake permutations.
- Consented Actors: We will work with paid actors for “Real” footage to capture authentic sensor artifacts.
- Public Domain & Deceased Subjects: To reduce compliance friction around biometric data, we are considering, in line with national laws, archival sourcing from repositories such as the Internet Archive. Since GDPR protections generally do not extend to deceased persons, public domain footage of historical figures provides a high-quality “Real” baseline. If used, this would supply human-written and human-captured ground truth that is legally resilient and free from the “AI contamination” found in modern web-scraped data.
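To keep these sourcing tiers auditable downstream, each asset could carry its provenance as structured metadata from the moment of ingest. The schema below is a hypothetical illustration, not the project’s data model:

```python
from dataclasses import dataclass
from enum import Enum

class SourceTier(Enum):
    SYNTHETIC_IDENTITY = "synthetic"      # no data subject rights attach
    CONSENTED_ACTOR = "consented"         # informed consent on file
    PUBLIC_DOMAIN_ARCHIVE = "archival"    # e.g. deceased historical figures

@dataclass(frozen=True)
class AssetProvenance:
    asset_id: str
    tier: SourceTier
    consent_record: str | None = None     # reference to the consent document
    legal_basis_note: str | None = None   # national-law check for archival items

    def __post_init__(self):
        # Consented material must point at an actual consent record.
        if self.tier is SourceTier.CONSENTED_ACTOR and not self.consent_record:
            raise ValueError(f"{self.asset_id}: consented tier needs a consent record")
```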
How We Will Achieve It: The Technical Pipeline
To meet the transparency requirements of the EU AI Act, we don’t just “collect” data; we document its entire life cycle.
- Immutable Audit Trails: Every transformation, from a raw file to a “distorted” forensic sample, will be tracked via a lineage graph. This creates a “chain of custody” for the data itself (see the first sketch after this list).
- Visual Debugging and Bias Auditing: We will use specialised tools to ensure our datasets aren’t teaching AI to be biased. By “visualising” what the AI sees, we can confirm it is detecting a forgery rather than just a specific background or demographic trait (see the second sketch below).
- Calibration for the Courtroom: Our datasets will include specific “calibration pairs”—authentic and manipulated content linked by the same identity. This will allow models to output a Likelihood Ratio (LR), a standard forensic metric that helps judges and juries understand the weight of the evidence (see the third sketch below).
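First, the audit trail. Below is a minimal sketch of a hash-chained lineage record; the field names are assumptions for illustration, and a production lineage graph would carry far more context:

```python
import hashlib
import json
import time

def add_lineage_record(chain: list, operation: str, file_digest: str) -> list:
    """Append one transformation step to a hash-chained lineage trail (sketch)."""
    record = {
        "operation": operation,        # e.g. "h264_compress_crf35"
        "file_digest": file_digest,    # SHA-256 of the resulting file
        "timestamp": time.time(),
        # Linking to the previous record's hash is what makes the trail
        # tamper-evident: editing any earlier step invalidates every
        # later record when the chain is re-verified.
        "prev_hash": chain[-1]["record_hash"] if chain else "genesis",
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["record_hash"] = hashlib.sha256(payload).hexdigest()
    chain.append(record)
    return chain
```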
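Second, the bias audit. Visual debugging itself is hard to show in a few lines, but the quantitative half of the check is simple: error rates should stay flat across demographic or scene strata. A toy version, with assumed input shapes:

```python
from collections import defaultdict

def false_positive_rates_by_group(samples) -> dict:
    """samples: iterable of (group_label, predicted_fake, actually_fake).

    Returns the false-positive rate per group: a detector that has
    latched onto a demographic trait or a recurring background,
    rather than the forgery itself, shows up immediately here."""
    fp = defaultdict(int)    # real samples wrongly flagged as fake
    real = defaultdict(int)  # total real samples per group
    for group, predicted_fake, actually_fake in samples:
        if not actually_fake:
            real[group] += 1
            fp[group] += int(predicted_fake)
    return {g: fp[g] / real[g] for g in real}
```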
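Third, the likelihood ratio. With calibration pairs, the detector’s scores on authentic and on manipulated material form two distributions, and the LR for a new score is the ratio of their densities. The Gaussian fit below is a strong simplification made only for this sketch:

```python
from statistics import NormalDist

def likelihood_ratio(score: float, fake_scores: list, real_scores: list) -> float:
    """LR = p(score | manipulated) / p(score | authentic).

    Calibration pairs supply scores for the same identity under both
    hypotheses; each score distribution is modelled here as a Gaussian
    purely for illustration."""
    h_fake = NormalDist.from_samples(fake_scores)  # scores on manipulated items
    h_real = NormalDist.from_samples(real_scores)  # scores on authentic items
    return h_fake.pdf(score) / h_real.pdf(score)

# LR > 1 supports the manipulation hypothesis; LR < 1 supports authenticity.
# A court would hear this as, e.g., "this score is 40 times more likely
# if the video was manipulated than if it was genuine."
```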
Next Steps
By bridging the gap between academic research and the high-stakes environment of forensic investigation, we are ensuring that the next generation of deepfake detectors is ready for the courtroom.
Stay connected with DETECTOR to keep up with new materials and project updates. Follow us on LinkedIn for regular news and subscribe to our newsletter to receive key announcements directly. If you are interested in the project’s work or would like to explore opportunities to engage, please contact us through the contact form.