Proposed Idea: Listening Like a Human
Our goal was to design a model that listens like a human: it first learns the prosodic variation present in real speech, then uses that knowledge to distinguish prosodic variation in real versus fake speech, while also capturing other spoof-relevant cues.
Figure: Stage 1 models the natural prosodic variation of real speech, while Stage 2 discriminates between real and synthetic expressive speech.
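To make the Stage 1 idea concrete, the sketch below shows one plausible way to "model the natural prosodic variation of real speech": a pretrained self-supervised encoder feeds a small regression head that predicts a frame-level prosody target (here log-F0) on bona fide speech only. The choice of backbone (`WAV2VEC2_BASE`), the log-F0 target, and the layer sizes are illustrative assumptions, not the exact configuration used in HuLA.

```python
# Hypothetical sketch of Stage 1: regress a frame-level prosody target (log-F0)
# from a pretrained self-supervised encoder, using real speech only.
import torch
import torch.nn as nn
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE   # pretrained SSL backbone ("blue block")
encoder = bundle.get_model()                  # initialized from pretrained weights

class ProsodyHead(nn.Module):
    """Small regression head ("pink block") mapping SSL frames to a prosody value."""
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, frames):                # frames: [batch, T, dim]
        return self.proj(frames).squeeze(-1)  # [batch, T] predicted log-F0 per frame

prosody_head = ProsodyHead()

def stage1_loss(waveform, logf0_target):
    """waveform: [batch, samples] at 16 kHz; logf0_target: [batch, T] frame-level targets."""
    feats, _ = encoder.extract_features(waveform)   # list of per-layer frame features
    pred = prosody_head(feats[-1])                  # predict prosody from final layer
    T = min(pred.shape[1], logf0_target.shape[1])   # align frame counts
    return nn.functional.l1_loss(pred[:, :T], logf0_target[:, :T])
```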
HuLA
Figure: Training phase of HuLA, the proposed prosody-aware multi-task learning method for anti-spoofing. Blue blocks indicate initialization from pretrained models, while pink blocks represent training from scratch.
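Following the figure, a minimal assumed sketch of the multi-task training step: the encoder and prosody head carry over from Stage 1 (pretrained, blue), a spoof classifier is trained from scratch (pink), and the two losses are combined with a weighting factor. The pooling scheme, head sizes, and the weight `lambda_prosody` are placeholders rather than values reported for HuLA.

```python
# Hypothetical sketch of the Stage 2 multi-task objective: spoof classification
# plus the auxiliary prosody regression, computed on the same encoder frames.
import torch
import torch.nn as nn

class SpoofHead(nn.Module):
    """Utterance-level bona fide vs. spoof classifier, trained from scratch."""
    def __init__(self, dim=768):
        super().__init__()
        self.classifier = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, frames):          # frames: [batch, T, dim] from the SSL encoder
        pooled = frames.mean(dim=1)     # simple temporal average pooling
        return self.classifier(pooled)  # [batch, 2] logits

spoof_head = SpoofHead()
lambda_prosody = 0.5                    # assumed auxiliary-task weight, not from the paper

def stage2_loss(frames, logf0_target, labels, prosody_head):
    """Joint objective: spoof cross-entropy + weighted prosody regression."""
    ce = nn.functional.cross_entropy(spoof_head(frames), labels)
    pred = prosody_head(frames)                       # reuse the Stage 1 prosody head
    T = min(pred.shape[1], logf0_target.shape[1])
    l1 = nn.functional.l1_loss(pred[:, :T], logf0_target[:, :T])
    return ce + lambda_prosody * l1
```

In this setup, keeping the auxiliary prosody loss active during spoof training encourages the shared encoder to retain prosody-sensitive features rather than latching onto dataset-specific artifacts, in line with the prosody-aware design described above.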
Datasets Used for Evaluation
Results and Discussion
Our experiments demonstrate the effectiveness of HuLA, a two-stage MTL framework that improves spoof detection through explicit prosody modeling. Although trained only on ASVspoof 2019, which lacks the diversity and realism of recent attacks, HuLA generalizes well across datasets that differ substantially from the training domain. Several of these sets include expressive and emotional synthetic speech, which typically fools state-of-the-art baselines. HuLA benefits from prosody-aware training, which equips the model to detect mismatches in expressiveness that are not dataset-dependent. This aligns with our design principle of listening like a human: just as listeners use prosodic cues to judge naturalness, HuLA leverages prosodic variation in both real and spoofed speech to capture subtle differences in expressiveness.
References
- Wang, Xin, et al. "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech." Computer Speech & Language 64 (2020): 101114.
- Liu, Xuechen, et al. "ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild." IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023): 2507-2522.
- Wang, Xin, et al. "ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale." arXiv preprint arXiv:2408.08739 (2024).
- Zhao, Yan, et al. "EmoFake: An initial dataset for emotion fake audio detection." China National Conference on Chinese Computational Linguistics. Springer Nature Singapore, 2024.
- Zhou, Kun, et al. "Speech synthesis with mixed emotions." IEEE Transactions on Affective Computing 14.4 (2022): 3120-3134.
- Flórez, P. A. Tamayo, Rubén Manrique, and B. Pereira Nunes. "HABLA: A dataset of Latin American Spanish accents for voice anti-spoofing." Proc. Interspeech 2023.
- Yi, Jiangyan, et al. "ADD 2022: The first audio deep synthesis detection challenge." Proc. ICASSP 2022.