Deepfake audio refers to synthetic speech that closely mimics a person's voice, posing risks to security and privacy. This paper proposes a hybrid detection framework combining XLS-R, a multilingual speech representation model, with the Conformer architecture, which captures both local and global audio dependencies. XLS-R extracts rich multilingual embeddings, while the Conformer leverages temporal and contextual features to distinguish genuine from AI-generated speech. Evaluation on benchmark datasets demonstrates that the proposed system achieves improved accuracy and robustness across multiple languages and acoustic conditions.
Conformer, Deepfake Audio, Multilingual Speech Representation, XLS-R
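The abstract outlines a two-stage pipeline: a pretrained XLS-R encoder supplies multilingual speech embeddings, and a Conformer back end models local and global temporal dependencies before a binary bonafide/spoof decision. The sketch below illustrates one way such a detector could be wired together in PyTorch; the checkpoint name (facebook/wav2vec2-xls-r-300m), the Conformer hyperparameters, and the mean-pooling classification head are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of an XLS-R + Conformer deepfake-audio detector.
# Checkpoint, layer sizes, and the classification head are assumptions
# for illustration, not the paper's reported configuration.
import torch
import torch.nn as nn
import torchaudio
from transformers import Wav2Vec2Model


class XLSRConformerDetector(nn.Module):
    def __init__(self, xlsr_name: str = "facebook/wav2vec2-xls-r-300m"):
        super().__init__()
        # XLS-R front end: multilingual self-supervised speech embeddings.
        self.xlsr = Wav2Vec2Model.from_pretrained(xlsr_name)
        hidden = self.xlsr.config.hidden_size  # 1024 for the 300M checkpoint
        # Conformer back end: depthwise convolution captures local artifacts,
        # self-attention captures long-range context across the utterance.
        self.conformer = torchaudio.models.Conformer(
            input_dim=hidden,
            num_heads=8,
            ffn_dim=2048,
            num_layers=4,
            depthwise_conv_kernel_size=31,
        )
        # Mean-pool over time, then a binary bonafide/spoof head.
        self.classifier = nn.Linear(hidden, 2)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        feats = self.xlsr(waveform).last_hidden_state        # (B, T, hidden)
        lengths = torch.full((feats.size(0),), feats.size(1),
                             dtype=torch.long, device=feats.device)
        encoded, _ = self.conformer(feats, lengths)           # (B, T, hidden)
        return self.classifier(encoded.mean(dim=1))           # (B, 2) logits


if __name__ == "__main__":
    model = XLSRConformerDetector().eval()
    audio = torch.randn(1, 16000)  # one second of dummy 16 kHz audio
    with torch.no_grad():
        print(model(audio))  # logits for [bonafide, spoof]
```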
IRE Journals:
Usha Janakiraman, Priyadharshini Ambalavanan, Padmapriya S "Wav2Vec Meets Conformer: A Novel Hybrid Approach for Multilingual Deepfake Audio Detection" Iconic Research And Engineering Journals Volume 9 Issue 5 2025 Page 2438-2446 https://doi.org/10.64388/IREV9I5-1712477
IEEE:
Usha Janakiraman, Priyadharshini Ambalavanan, and Padmapriya S, "Wav2Vec Meets Conformer: A Novel Hybrid Approach for Multilingual Deepfake Audio Detection," Iconic Research And Engineering Journals, vol. 9, no. 5, pp. 2438-2446, 2025. https://doi.org/10.64388/IREV9I5-1712477