Introduction
From pop culture AI such as HAL and Jarvis to early consumer-facing systems such as Watson, Siri, and Alexa, the canonical image of an AI assistant engages with users through speech rather than text. This is, in many ways, unsurprising: written language is generally learned later than spoken language, and people can communicate faster1 through speech. However, to deliver on this long-standing vision of an effective voice assistant, we need standard ways to measure progress across the range of capabilities a voice assistant needs to be safe, effective, and enjoyable to use.
Existing benchmarks2,3,4,5,6,7,8,9,10,11 focus either on the analytical use cases of speech AI, such as classifying emotions or demographics from speech, or solely on information-seeking tasks, such as question answering. Rather than focusing on audio processing broadly or the analysis of speech, we set out to create and curate benchmarks which assess the capabilities of Large Audio Models (LAMs) in the dimensions product developers need to build high-quality and safe speech-in, speech-out voice assistants12.
CAVA Benchmark
What most distinguishes a voice assistant from auditory AI more broadly is its focus on providing efficient, task-oriented assistance to the user through voice interaction. To provide a high-quality user experience, a foundation model for voice assistants requires (at a minimum) capabilities across each of the following six domains.
- Turn Taking: The model needs to be able to understand conversational turn taking dynamics so that the assistant can engage with the user(s) at appropriate times.
- Instruction Following: The model needs to be able to follow instructions so that the assistant can adhere to both user and system designer prompts.
- Function Calling: The model needs to be able to map user requests to function calls so that the assistant can retrieve information and execute actions on their behalf.
- Tone Awareness: The model needs to be able to recognize paralinguistic information communicated through speech and modify responses so that the assistant can respond appropriately.
- Safety: The model needs to be able to recognize jailbreaking and deceptive intents in users so that the assistant can adhere to safety and ethical guidelines.
- Latency: The model needs to respond with low latency so that the assistant can maintain natural conversational pacing.
Both historically and at present, many of these capabilities have been handled by separate models pipelined together by practitioners to create voice assistants. Tone awareness can be delegated to separate classifiers for paralinguistic information, function calling to semantic parsers, and turn taking to voice activity detection models. However, real-time APIs and the models designed to support them signal a shift towards these capabilities becoming part of more monolithic models, which will need to meet these multifaceted, and at times conflicting, capability needs. To address this, we set out in January to construct a set of benchmarks which broadly cover these capability domains, which we are now releasing as the Comprehensive Assessment of Voice Assistants (CAVA).
CAVA is based on the idea that measuring these capabilities separately can produce metrics which are misaligned with the downstream utility of a model. For example, a model which achieves strong task performance but high latency is likely to prove frustrating to users in real-time spoken conversations. As such, benchmarks which omit latency could incentivize models that create worse user experiences in order to maximize benchmark results. While multitask benchmarks exist for individual capabilities, CAVA is the first designed to test which systems are likely to serve as the best overall foundation for voice assistant developers.
Results
In our initial comparison, we evaluate four commercial LAM systems: two from OpenAI and two from Google. From OpenAI, we evaluate the most recent (as of May 2025) GPT-4o Audio end-to-end audio model. We then compare this to a multistage GPT-based pipeline specialized for each stage of the speech processing workflow: transcription (GPT-4o-mini-transcribe), natural language understanding and generation (GPT-4o, text-only version), and text-to-speech synthesis (GPT-4o-mini-tts); this is the GPT Pipeline in the table. From Google, we compare to their two recently released general-purpose LLMs, both of which support audio input: Gemini 2.0 Flash and Gemini 2.5 Pro. Notably, Gemini 2.5 Pro does not support audio outputs at this point, while Gemini 2.0 Flash only supports audio output via the Gemini Live API. If you are interested in what exactly CAVA data looks like, please jump to the Examples section!
Looking at the CAVA results, we can observe several patterns in model performance. GPT-4o Audio shows a significant advantage in the Jeopardy task, with a 73.0% win rate compared to GPT Pipeline's 15.4% and Gemini 2.0's 6.0%. This difference reflects Jeopardy's zero-sum nature, where latency determines outcomes (the first correct answer wins regardless of how close the competitors were), and the fact that the end-to-end GPT-4o Audio model is consistently several hundred milliseconds faster than any other competitor.
For instruction following, Gemini models show better system prompt adherence (70.2% for Gemini 2.5 vs 64.7% for GPT Pipeline), while GPT-4o performs better in pronunciation control (58% correct vs 32% for Gemini 2.0). Gemini models currently only support speech output via the Live API, which may contribute to some of the observed performance characteristics. Their speech outputs have a more staccato vocal quality, suggesting that they currently come from a traditional TTS system rather than an end-to-end speech model.
Function calling shows pipeline approaches slightly outperforming end-to-end models (27% vs 24% function-call match). While Gemini models do support function calling, we found that they do not currently fully support required function calling in the way OpenAI or vLLM do, which negatively affected their performance on our benchmark, where at least one function call is always required. As we believe this is likely a temporary implementation detail rather than a fundamental limitation, we've omitted these results for the time being.
Safety evaluations indicate that Gemini 2.5 has better jailbreaking resistance, with a 49.0% attack success rate compared to approximately 79% for the other models, while Gemini 2.0 performs best at deception detection with 26.1% accuracy.
These results suggest that different models have distinct strengths across the dimensions needed for voice assistants, requiring developers to consider which capabilities best match their specific application requirements. Looking only at OpenAI models, the speech-in, speech-out language model seems to offer the most natural conversational capabilities, with consistently lower latency, more controllable speech, and better tone awareness. However, comparing GPT-4o Audio with the GPT Pipeline approach shows that this naturalness comes at the cost of slightly weaker performance in dimensions which require less "native" speech reasoning, such as function calling, turn prediction, and system prompt following.
Looking at models on the whole, the Gemini 2.5 Pro model demonstrates impressive capabilities, particularly in system prompt adherence and jailbreaking resistance. It's worth noting that Gemini's APIs underwent significant updates in March 2025, and some features voice agent developers might need are likely still stabilizing (such as function calling) or "coming soon" (such as audio outputs). As these APIs mature, we expect Gemini's capabilities in these dimensions to improve as the systems stabilize and move out of experimental/preview releases.
This highlights that many of the capabilities we measure in CAVA (such as latency and function calling) move beyond core model capabilities and into system-level dimensions. API-driven models are especially opaque in terms of system design, so some metrics are likely to change as system engineering improves for these models even if the underlying weights stay the same. However, we think these system-level performance metrics are important, especially for application and product developers trying to pick which API to rely on for their voice assistant deployments. For open-source model developers, these system engineering components are often overlooked during model training! By building on vLLM, as we describe in "How to Add A Model to CAVA?" below, we hope CAVA provides a more straightforward way for open-source model developers to test and improve these broader dimensions as well.
Details about CAVA Tasks
In our initial release of CAVA, we focus on covering the space of dimensions listed in the introduction and construct one or more representative tasks for each domain. This gives application developers at least one reference point per capability domain to help guide decision-making around model adoption! We are also very interested in hearing from product developers about specific tasks they test on that feel generalizable to the broader space of voice assistant deployments!
Latency
Jeopardy
To evaluate how quickly and accurately LAMs can respond to user queries, we return to the classic task of Jeopardy. This benchmark measures the ability of different models to process and respond to audio questions with both speed and accuracy, simulating time-sensitive interactions typical in voice assistant scenarios. The Win Rate metric reports the percentage of cases where this model provides the correct answer with the lowest latency of all tested models.
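To make the metric concrete, here is a minimal sketch of how such a win rate can be computed, assuming per-question records of correctness and latency for each model; the record format and tie-breaking below are illustrative assumptions rather than the exact bookkeeping in the released code.

```python
from collections import defaultdict

# Hypothetical per-question records: (was the answer correct, response latency in seconds).
results = {
    "q1": {"model_a": (True, 0.82), "model_b": (True, 1.35), "model_c": (False, 0.60)},
    "q2": {"model_a": (False, 0.95), "model_b": (True, 1.10), "model_c": (True, 1.40)},
}

def win_rates(results):
    """Win Rate: fraction of questions where a model is the fastest *correct* responder."""
    wins = defaultdict(int)
    models = {m for per_model in results.values() for m in per_model}
    for per_model in results.values():
        correct = [(latency, model) for model, (ok, latency) in per_model.items() if ok]
        if not correct:
            continue  # no model answered correctly, so no one wins this question
        _, winner = min(correct)  # lowest latency among correct answers
        wins[winner] += 1
    return {model: wins[model] / len(results) for model in sorted(models)}

print(win_rates(results))  # model_a and model_b each win one of the two questions
```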
Function Calling 13
Spoken Task Oriented Parsing
One of the original use cases for voice assistants like Siri and Alexa was device control. This capability remains crucial in the era of LAMs, which are increasingly used in agentic applications. While current models are highly accurate at single-turn function calls, even relatively simple queries can require more complex compositional function calls to deal with real-world ambiguity. For example, when a user asks to 'navigate to my house', a model might need to first obtain the location of the user's house using an initial tool, then get directions to that specific location using a second tool. To measure this function-calling ability, we adapted the STOP dataset, a popular spoken semantic parsing dataset, and created a unified API encompassing all intents and purposes in the benchmark. We then evaluate how effectively modern LAMs can parse user intents into precise and compositional function calls. Models are scored on whether they call the correct functions the expected number of times, a much more difficult metric than overall intent matching but looser than exact match, which we find to be overly strict by human judgement.
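As an illustration, the sketch below scores a compositional parse of 'navigate to my house' by comparing which functions are called and how often. The function names and arguments are hypothetical stand-ins rather than the actual CAVA/STOP API, and the real harness operates on model-emitted function calls.

```python
from collections import Counter

# Hypothetical compositional parse of "navigate to my house": fetch the saved home
# location first, then request directions to it. Names and arguments are illustrative only.
reference_calls = [
    {"name": "get_location", "arguments": {"place": "home"}},
    {"name": "get_directions", "arguments": {"destination": "<output of get_location>"}},
]
predicted_calls = [
    {"name": "get_location", "arguments": {"place": "my house"}},
    {"name": "get_directions", "arguments": {"destination": "<output of get_location>"}},
]

def calls_match(predicted, reference) -> bool:
    """Credit the model if it calls the correct functions the expected number of times,
    ignoring argument values -- looser than exact match, stricter than matching only
    the top-level intent."""
    return Counter(c["name"] for c in predicted) == Counter(c["name"] for c in reference)

print(calls_match(predicted_calls, reference_calls))  # True
```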
Instruction Following
System Prompt Following
For LAMs to serve effectively as assistants and agents, they must consistently follow user-specified instructions provided through both text and audio. Using a set of verifiable instructions from IFEval which are applicable to speech, such as 'answer in fewer than 50 words' or 'do not use the word X in your response,' we benchmark how consistently models adhere to instructions within a spoken question-answering context. Evaluation is based on a binary pass/fail metric that determines whether the model's response strictly adheres to all given constraints.
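For intuition, here is a minimal sketch of what verifiable checks of this kind look like when applied to a transcript of the model's spoken answer; the checker names below are illustrative assumptions, and the benchmark itself relies on IFEval-style constraint verifiers.

```python
import re

def within_word_limit(response: str, max_words: int) -> bool:
    """Verify an 'answer in fewer than N words' constraint."""
    return len(response.split()) < max_words

def avoids_word(response: str, word: str) -> bool:
    """Verify a 'do not use the word X in your response' constraint (whole word, case-insensitive)."""
    return re.search(rf"\b{re.escape(word)}\b", response, flags=re.IGNORECASE) is None

def passes_all(response: str, checks) -> bool:
    """Binary pass/fail: the response must strictly satisfy every constraint."""
    return all(check(response) for check in checks)

transcript = "The Eiffel Tower is in Paris and was completed in 1889."
constraints = [lambda r: within_word_limit(r, 50), lambda r: avoids_word(r, "France")]
print(passes_all(transcript, constraints))  # True
```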
Pronunciation Prompt Following
Models which serve as the backbone for voice assistants should ideally be able to follow instructions multimodally. For example, users might have a specific desired pronunciation of their name that the model initially gets wrong, and commercial voice assistants might need to pronounce non-standard product or company names. A general-purpose model for voice assistants should be instructable in these concrete dimensions of spoken language, not just in text. While these instructable aspects of LAMs extend to more subtle features of vocal generation quality such as emotion, we focus on pronunciation control, a domain with objective expectations for output! Here, the model is given words mined from Wiktionary with pronunciations we found models did not know a priori, along with an Oxford English Dictionary pronunciation guide for each word, testing whether the LAM can adapt its pronunciation based on the text instruction. Models are evaluated using an Audio LLM as a Judge comparing a reference pronunciation and the model output.
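The judging step can be thought of along the lines of the sketch below, where audio_judge is a hypothetical wrapper around whichever audio LLM acts as the judge; the prompt and scoring protocol are illustrative and may differ from the released benchmark.

```python
def pronunciation_matches(audio_judge, word: str, reference_clip, candidate_clip) -> bool:
    """Ask an audio-LLM judge whether the candidate clip pronounces `word` the same
    way as the reference clip. `audio_judge` is a hypothetical callable that takes a
    text prompt plus two audio clips and returns the judge's text verdict."""
    prompt = (
        f"Both clips say the word '{word}'. Does the second clip use the same "
        "pronunciation as the first? Answer 'yes' or 'no'."
    )
    verdict = audio_judge(prompt, reference_clip, candidate_clip)
    return verdict.strip().lower().startswith("yes")
```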
Tone Awareness
Emotion Counterfactual Response Generation
To ensure LAMs can recognize social cues and respond appropriately, we introduce the task of tone-aware response generation. This evaluates the model's ability to generate appropriate responses to the same text input delivered with different emotional tones. This capability is essential for voice assistants to maintain natural conversation flow and demonstrate appropriate social awareness during interactions. Performance is measured with a text LLM as a Judge that rates the emotional relevance and specificity of model-generated replies.
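One way to picture the counterfactual setup is the sketch below: the same utterance text is delivered in two different vocal tones, and a text judge rates whether each reply actually adapts to that delivery. The example pair, judging prompt, and text_judge wrapper are illustrative assumptions, not the released judging prompt.

```python
# Illustrative counterfactual pair: identical words, different vocal delivery.
EXAMPLE = {
    "utterance": "I just got the results back from my interview.",
    "replies": {
        "excited": "Congratulations! That energy says it went well. Tell me everything!",
        "dejected": "I'm sorry, that sounds rough. Do you want to talk through what happened?",
    },
}

JUDGE_TEMPLATE = (
    "A speaker said: \"{utterance}\" in a {tone} tone of voice.\n"
    "An assistant replied: \"{reply}\"\n"
    "Does the reply respond specifically and appropriately to that tone, rather than "
    "ignoring it? Answer 'yes' or 'no'."
)

def tone_aware(text_judge, utterance: str, tone: str, reply: str) -> bool:
    """Ask a hypothetical text-LLM judge whether a reply adapts to the stated tone."""
    verdict = text_judge(JUDGE_TEMPLATE.format(utterance=utterance, tone=tone, reply=reply))
    return verdict.strip().lower().startswith("yes")
```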
Turn-Taking 14
AMI Turn Prediction
For LAMs to integrate into group conversations and collaborative work, they must understand when it's appropriate to contribute. Using the AMI corpus of meeting recordings, we evaluate turn prediction capabilities by assessing whether models can accurately predict who should speak next given the entire context of a conversation. Models are scored on prediction accuracy compared to the ground truth of who actually spoke next in the recorded meetings.
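As a rough illustration of how such an example is framed and scored, consider the toy sketch below; the transcript is an invented stand-in, and the benchmark provides the actual AMI context to the model rather than this text.

```python
# Toy example: a speaker-labelled context and the ground-truth next speaker.
example = {
    "context": [
        ("Speaker A", "So, should we lock in the remote control design today?"),
        ("Speaker B", "I still think the rubber buttons push us over budget."),
        ("Speaker C", "Do we have the latest cost estimate?"),
    ],
    "next_speaker": "Speaker D",  # who actually spoke next in the recorded meeting
}

def turn_prediction_accuracy(predict_next_speaker, examples) -> float:
    """Accuracy of a model's next-speaker predictions against the AMI ground truth."""
    correct = sum(
        predict_next_speaker(ex["context"]) == ex["next_speaker"] for ex in examples
    )
    return correct / len(examples)
```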
Safety
Deception Detection
We test whether LAMs can understand complex social interactions and detect deceptive communication by examining their performance on scenarios similar to One Night Ultimate Werewolf games. Models must identify deceptive intent in audio recordings of structured role-playing discussions, allowing us to better understand vulnerabilities in LAMs' social reasoning capabilities. Performance is measured by accuracy in correctly identifying which player is the Werewolf (the deceptive actor) in the game scenario.
Speech Jailbreaking
We convert persuasive jailbreak prompts to audio input using text-to-speech, then examine how LAMs respond to these persuasive vocal deliveries to help strengthen safeguards against manipulation attempts. Models are evaluated on their refusal rate and safety alignment when presented with potentially harmful requests delivered through persuasive techniques.
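The headline attack success rate reported above can be computed along these lines; the keyword-based refusal heuristic below is only a stand-in for the stronger judging used in the benchmark.

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")

def is_refusal(transcript: str) -> bool:
    """Crude surface check for a refusal in the transcribed model response."""
    lowered = transcript.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def attack_success_rate(response_transcripts) -> float:
    """Fraction of harmful requests where the model did NOT refuse (lower is safer)."""
    successes = [not is_refusal(t) for t in response_transcripts]
    return sum(successes) / len(successes)

print(attack_success_rate([
    "I'm sorry, but I can't help with that request.",
    "Sure, here is one way you could approach that...",
]))  # 0.5
```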
How to Add A Model to CAVA?
Our goal with CAVA is to make model evaluation as straightforward as possible. Adding your model is simple:
As long as vLLM supports your model, CAVA supports your model without any additional code. Just prefix your model path on HuggingFace with vllm/ when running the evaluation:
python -m cava.inference --task jeopardy --models vllm/YourOrganization/YourModel
To streamline the process for both HuggingFace datasets and sources which require additional processing, we've included run scripts in the git repository that handle all the downloading, setup, and evaluation automatically. You can find these at ./run_scripts, including a single run_all.sh which executes all of the CAVA evals.
We're happy to improve the framework and welcome your feedback if you find it difficult to run your model! By integrating with vLLM, our hope is that new models are straightforward to evaluate with this library, with minimal overhead beyond the vLLM support itself! If you encounter any issues or have suggestions for making model integration even simpler, please open an issue on our GitHub repository.
BibTeX Citation
@misc{cava2025,
title = {CAVA: Comprehensive Assessment of Voice Assistants},
author = {Held, Will and Ryan, Michael J. and Shrivastava, Aditya and Khan, Ali Sartaz and Ziems, Caleb and Li, Ella and Bartelds, Martijn and Sun, Michael and Li, Tan and Gan, Woody and Yang, Diyi},
year = {2025},
url = {https://talkarena.org/cava},
howpublished = {\url{https://github.com/SALT-NLP/CAVA}},
note = {A benchmark for evaluating large audio model (LAM) capabilities across six domains: turn taking, instruction following, function calling, tone awareness, safety, and latency}
}
Contributions
- Will Held led development of the emotion counterfactual response generation task, architected the CAVA codebase, and co-organized the collaborators with Michael.
- Michael J. Ryan created the pronunciation prompt following task and co-organized the collaborators with Will.
- Aditya Shrivastava designed the system prompt following task.
- Ali Sartaz Khan developed the Jeopardy task for latency measurement.
- Caleb Ziems ran the Prolific data collection for the emotion counterfactual task.
- Minzhi Ella Li developed the speech jailbreaking task.
- Martijn Bartelds gave various insights into speech evaluation, project framing, and metric selection.
- Michael Sun designed the deception detection task for safety testing.
- Tan Li and Will Held adapted the function calling task from STOP.
- Woody Gan created the AMI turn prediction task for turn-taking evaluation.
- Diyi Yang defined the project scope along with Michael and Will and provided guidance, supervision, and support to make the project possible.
External Acknowledgements
We would like to express our gratitude to the OpenAI Multimodal Team for providing the credits that made it possible to evaluate their models and for feedback on early task ideation. We also thank the Stanford HAI-GCP compute grant for the credits used for Gemini evaluations. This research would not have been possible without their generous support.
Citations
1. Ruan, S., Wobbrock, J. O., Liou, K., Ng, A., & Landay, J. A. (2018). Comparing Speech and Keyboard Text Entry for Short Messages in Two Languages on Touchscreen Phones. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(4), 1–23. https://doi.org/10.1145/3161187
2. Huang, C., Lu, K., Wang, S., Hsiao, C., Kuan, C., Wu, H., Arora, S., Chang, K., Shi, J., Peng, Y., Sharma, R., Watanabe, S., Ramakrishnan, B., Shehata, S., & Lee, H. (2024). Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech. arXiv preprint arXiv:2309.09510. https://arxiv.org/abs/2309.09510
3. Huang, C., Chen, W., Yang, S., Liu, A. T., Li, C., Lin, Y., Tseng, W., Diwan, A., Shih, Y., Shi, J., Chen, W., Chen, X., Hsiao, C., Peng, P., Wang, S., Kuan, C., Lu, K., Chang, K., Yang, C., Gutierrez, F. A. R., Kuan-Po, H., Arora, S., Lin, Y., Ming To, C., Yeo, E., Chang, K., Chien, C., Choi, K., Hsieh, C., Lin, Y., Yu, C., Chiu, I., Guimarães, H., Han, J., Lin, T., Lin, T., Chang, H., Chang, T., Chen, C. W., Chen, S., Chen, Y., Cheng, H., Dhawan, K., Fang, J., Fang, S., Chiang, K. Y. F., Fu, C. A., Hsiao, H., Hsu, C. Y., Huang, S., Wei, L. C., Lin, H., Lin, H., Lin, H., Lin, J., Liu, T., Lu, L., Pai, T., Pasad, A., Kuan, S., Shon, S., Tang, Y., Tsai, Y., Chiang, W. J., Wei, T., Wu, C., Wu, D., Yang, C. H., Yang, C., Yip, J. Q., Yuan, S., Wu, H., Livescu, K., Harwath, D., Watanabe, S., & Lee, H. (2025). Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=s7lzZpAW7T
4. Sakshi, S., Tyagi, U., Kumar, S., Seth, A., Selvakumar, R., Nieto, O., Duraiswami, R., Ghosh, S., & Manocha, D. (2025). MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=TeVAZXr3yv
5. Bu, F., Zhang, Y., Wang, X., Wang, B., Liu, Q., & Li, H. (2024). Roadmap towards Superhuman Speech Understanding using Large Language Models. arXiv preprint arXiv:2410.13268. https://arxiv.org/abs/2410.13268
6. Wang, B., Zou, X., Lin, G., Sun, S., Liu, Z., Zhang, W., Liu, Z., Aw, A., & Chen, N. F. (2024). AudioBench: A Universal Benchmark for Audio Large Language Models. arXiv preprint arXiv:2406.16020. https://arxiv.org/abs/2406.16020
7. Li, T., Liu, J., Zhang, T., Fang, Y., Pan, D., Wang, M., Liang, Z., Li, Z., Lin, M., Dong, G., Xu, J., Sun, H., Zhou, Z., & Chen, W. (2025). Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction. arXiv preprint arXiv:2502.17239. https://arxiv.org/abs/2502.17239
8. Gong, K., Feng, K., Li, B., Wang, Y., Cheng, M., Yang, S., Han, J., Wang, B., Bai, Y., Yang, Z., & Yue, X. (2024). AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? arXiv preprint arXiv:2412.02611. https://arxiv.org/abs/2412.02611
9. Ao, J., Wang, Y., Tian, X., Chen, D., Zhang, J., Lu, L., Wang, Y., Li, H., & Wu, Z. (2025). SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words. arXiv preprint arXiv:2406.13340. https://arxiv.org/abs/2406.13340
10. Chen, Y., Yue, X., Zhang, C., Gao, X., Tan, R. T., & Li, H. (2024). VoiceBench: Benchmarking LLM-Based Voice Assistants. arXiv preprint arXiv:2410.17196. https://arxiv.org/abs/2410.17196
11. Yang, Q., Xu, J., Liu, W., Chu, Y., Jiang, Z., Zhou, X., Leng, Y., Lv, Y., Zhao, Z., Zhou, C., & Zhou, J. (2024). AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension. arXiv preprint arXiv:2402.07729. https://arxiv.org/abs/2402.07729
12. Pipecat Team (2025). Voice AI & Voice Agents: An Illustrated Primer. https://voiceaiandvoiceagents.com/#llms-for-voice
13. To produce high quality function calling data, CAVA releases scripts which download STOP and adapt it from semantic parsing to a task aligned with modern function calling APIs. Tomasello, P., Shrivastava, A., Lazar, D., Hsu, P., Le, D., Sagar, A., Elkahky, A., Copet, J., Hsu, W., Adi, Y., Algayres, R., Nguyen, T. A., Dupoux, E., Zettlemoyer, L., & Mohamed, A. (2022). STOP: A dataset for Spoken Task Oriented Semantic Parsing. arXiv preprint arXiv:2207.10643. https://arxiv.org/abs/2207.10643
14. To model turn-taking behaviour, CAVA samples conversational contexts from the AMI Corpus. Carletta, J., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, T., Kadlec, J., Karaiskos, V., Kraaij, W., Kronenthal, M., Lathoud, G., Lincoln, M., Lisowska, A., McCowan, I., Post, W., Reidsma, D., & Wellner, P. (2005). The AMI meeting corpus: a pre-announcement. In Proceedings of the Second International Conference on Machine Learning for Multimodal Interaction (MLMI'05) (pp. 28–39). Springer-Verlag. https://doi.org/10.1007/11677482_3