HuLA: Prosody-Aware Anti-Spoofing with Multi-task Learning for Expressive and Emotional Speech

(This work has been submitted to IEEE Transactions on Affective Computing)

Aurosweta Mahapatra, Ismail Rasim Ulgen, Berrak Sisman
amahapa2@jhu.edu   iulgen1@jhu.edu   sisman@jhu.edu

Electrical and Computer Engineering Department, Johns Hopkins University, USA

Code and Pre-trained Models (Coming Soon)

Paper Link

Abstract: Current anti-spoofing systems remain vulnerable to expressive and emotional synthetic speech, since they rarely leverage prosody as a discriminative cue. Prosody is central to human expressiveness and emotion, and humans instinctively use prosodic cues such as F0 contours and voiced/unvoiced structure to distinguish natural from synthetic speech. In this paper, we propose HuLA, a two-stage prosody-aware multi-task learning framework for spoof detection. In Stage 1, a self-supervised learning (SSL) backbone is trained on real speech with auxiliary tasks of F0 prediction and voiced/unvoiced classification, enhancing its ability to capture natural prosodic variation. In Stage 2, the model is jointly optimized for spoof detection and prosody tasks on both real and synthetic data, leveraging prosodic awareness to detect mismatches between natural and expressive synthetic speech. Experiments show that HuLA consistently outperforms strong baselines on challenging out-of-domain datasets, including expressive, emotional, and cross-lingual attacks.
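
To make the multi-task setup described above concrete, the following is a minimal sketch of how such a prosody-aware model could be laid out: a shared SSL backbone with a frame-level F0 regression head, a frame-level voiced/unvoiced (V/UV) head, and an utterance-level spoof detection head. The PyTorch framing, the wav2vec 2.0 backbone, and all layer sizes are illustrative assumptions rather than the exact HuLA configuration.

    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Model  # assumed SSL backbone; HuLA's exact SSL model may differ

    class ProsodyAwareSpoofModel(nn.Module):
        """Shared SSL encoder with F0, V/UV, and spoof heads (illustrative sketch)."""

        def __init__(self, ssl_name: str = "facebook/wav2vec2-base", hidden: int = 768):
            super().__init__()
            self.backbone = Wav2Vec2Model.from_pretrained(ssl_name)
            self.f0_head = nn.Linear(hidden, 1)    # frame-level F0 regression
            self.vuv_head = nn.Linear(hidden, 1)   # frame-level voiced/unvoiced logit
            self.spoof_head = nn.Sequential(       # utterance-level bona fide vs. spoof
                nn.Linear(hidden, 128), nn.ReLU(), nn.Linear(128, 2))

        def forward(self, waveform: torch.Tensor):
            # waveform: (batch, samples) raw 16 kHz audio
            frames = self.backbone(waveform).last_hidden_state  # (batch, frames, hidden)
            f0 = self.f0_head(frames).squeeze(-1)               # (batch, frames)
            vuv = self.vuv_head(frames).squeeze(-1)             # (batch, frames)
            spoof_logits = self.spoof_head(frames.mean(dim=1))  # (batch, 2), mean-pooled over time
            return f0, vuv, spoof_logits

The single shared encoder is what lets the prosody tasks shape the representation that the spoof head ultimately relies on.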

Proposed Idea: Listening Like a Human

Our goal is to design a model that listens like a human: it first learns the natural prosodic variation of real speech, and then uses that knowledge to model prosodic variation in both real and fake speech while also capturing other spoof-relevant cues.

Overview of Proposed Idea

Figure: Stage 1 models the natural prosodic variation of real speech, while Stage 2 discriminates between real and synthetic expressive speech.

HuLA

Training phase of HuLA
Training phase of HuLA (continued)

Figure: Training phase of HuLA, the proposed prosody-aware multi-task learning method for anti-spoofing. Blue blocks indicate initialization from pretrained models, while pink blocks represent training from scratch.
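
To illustrate how the two training stages differ, here is a hedged sketch of one way the stage-wise losses could be combined, assuming an L1 loss on F0 over voiced frames, binary cross-entropy for V/UV, and cross-entropy for spoof detection; the actual loss definitions and weighting used by HuLA may differ.

    import torch.nn.functional as F

    def stage1_loss(f0_pred, vuv_logits, f0_target, vuv_target, voiced_mask):
        # Stage 1 (real speech only): learn natural prosody via F0 regression + V/UV classification.
        f0_loss = F.l1_loss(f0_pred[voiced_mask], f0_target[voiced_mask])      # F0 error on voiced frames only
        vuv_loss = F.binary_cross_entropy_with_logits(vuv_logits, vuv_target)  # frame-level V/UV, targets in {0, 1}
        return f0_loss + vuv_loss

    def stage2_loss(spoof_logits, spoof_label, f0_pred, vuv_logits,
                    f0_target, vuv_target, voiced_mask, lam=0.5):
        # Stage 2 (real + synthetic speech): joint spoof detection and prosody tasks.
        # `lam` is an illustrative weight on the auxiliary prosody losses, not the paper's value.
        spoof_loss = F.cross_entropy(spoof_logits, spoof_label)
        prosody_loss = stage1_loss(f0_pred, vuv_logits, f0_target, vuv_target, voiced_mask)
        return spoof_loss + lam * prosody_loss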

Inference phase of HuLA
Figure: Inference phase of HuLA.
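
At inference time, a reasonable reading of the pipeline is that only the spoof detection head is used to score an utterance, with the prosody heads having served as auxiliary training signals; the snippet below assumes the ProsodyAwareSpoofModel sketch given earlier and is likewise illustrative.

    import torch

    @torch.no_grad()
    def spoof_score(model, waveform: torch.Tensor) -> float:
        # Return the bona fide probability for one utterance (illustrative convention: index 0 = bona fide).
        model.eval()
        _, _, spoof_logits = model(waveform.unsqueeze(0))  # add batch dimension
        return torch.softmax(spoof_logits, dim=-1)[0, 0].item()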

Datasets Used for Evaluation

Results and Discussion

ASVspoof Results
Emotional Datasets Results
Non-English Datasets Results

Our experiments demonstrate the effectiveness of HuLA, a two-stage multi-task learning framework that improves spoof detection through explicit prosody modeling. Although trained only on ASVspoof 2019, which lacks the diversity and realism of recent attacks, HuLA generalizes well across datasets that differ substantially from the training domain. Several of these sets include expressive and emotional synthetic speech, which typically fools state-of-the-art baselines. HuLA benefits from prosody-aware training, which equips the model to detect mismatches in expressiveness that are not dataset-dependent. This aligns with our design principle of listening like a human: just as listeners use prosodic cues to judge naturalness, HuLA leverages prosodic variation in both real and spoofed speech to capture subtle differences in expressiveness.

References

  1. Wang, Xin, et al. "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech." Computer Speech & Language 64 (2020): 101114.
  2. Liu, Xuechen, et al. "ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild." IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023): 2507-2522.
  3. Wang, Xin, et al. "ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale." arXiv preprint arXiv:2408.08739 (2024).
  4. Zhao, Yan, et al. "EmoFake: An initial dataset for emotion fake audio detection." China National Conference on Chinese Computational Linguistics. Springer Nature Singapore, 2024.
  5. Zhou, Kun, et al. "Speech synthesis with mixed emotions." IEEE Transactions on Affective Computing 14.4 (2022): 3120-3134.
  6. Flórez, P. A. Tamayo, Rubén Manrique, and B. Pereira Nunes. "HABLA: A dataset of Latin American Spanish accents for voice anti-spoofing." Proc. Interspeech 2023.
  7. Yi, Jiangyan, et al. "ADD 2022: The first audio deep synthesis detection challenge." Proc. ICASSP 2022.