
Static Evaluation

Tasks and Datasets

We select all speech comprehension benchmarks from existing holistic evaluation suites for audio models, namely AudioBench and AIR-Bench. This yields 18 datasets in total, on which we evaluate 11 large audio models.

The datasets cover a wide range of tasks that evaluate models' understanding of Speaker Cognitive State, Speaker Identity, and Speech Content. They include Humor Detection, Sarcasm Detection, Intent Detection, Emotion Recognition, Relationship Classification, Gender Classification, Age Classification, Accent Classification, Speech Grounding, Language Identification, Speech Entity Recognition, Speech Question Answering, and Speech Instruction Following.

| Category | Task | Dataset | Size |
|---|---|---|---|
| Cognitive State | Humor Detection | URFUNNY | 994 |
| Cognitive State | Sarcasm Detection | MUSTARD | 690 |
| Cognitive State | Pragmatic Intent Detection | SLURP | 753 |
| Cognitive State | Emotion Recognition | IEMOCAP | 1023 |
| Cognitive State | Emotion Recognition | MELD | 2608 |
| Relationship | Relationship Classification | CallHome | 24 |
| Speaker Identity | Language Identification | Covost2-lan | 1000 |
| Speaker Identity | Gender Classification | Commonvoice | 1258 |
| Speaker Identity | Gender Classification | FairSpeech | 1000 |
| Speaker Identity | Age Classification | FairSpeech | 1000 |
| Speaker Identity | Age Classification | Commonvoice | 1258 |
| Speaker Identity | Accent Classification | Commonvoice | 1086 |
| Speech Content | Speech Grounding | Librispeech-grounding | 1000 |
| Speech Content | Speech Entity Recognition | SLURP-ent | 1000 |
| Speech Content | Instruction Following | Alpaca-Audio | 100 |
| Speech Content | Instruction Following | Openhermes-Audio | 100 |
| Speech Content | Speech QA | CN-College-Listen | 2271 |
| Speech Content | Speech QA | Public_sg_speech | 688 |

Result Analysis

To ensure robustness, we report the average of model performance across three different prompt variations. For the public_sg_speech, openhermes, and alpaca datasets, we report the cfm metric; for all other tasks, we report macro F1 scores.
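As a rough illustration of this scoring scheme, the sketch below (not the project's actual evaluation code) averages macro F1 over three prompt variants for a single classification task; the prompt wordings and the `model.predict` helper are hypothetical placeholders.

```python
# Minimal sketch of prompt-averaged macro F1 scoring.
# Assumptions: `examples` is a list of (audio, gold_label) pairs and
# `model.predict(audio, prompt)` returns a predicted label string.
from statistics import mean
from sklearn.metrics import f1_score

PROMPT_VARIANTS = [  # hypothetical phrasings of the same task
    "Classify the speaker's emotion.",
    "What emotion does the speaker convey?",
    "Identify the emotion expressed in this audio clip.",
]

def score_task(model, examples):
    """Return macro F1 averaged over the three prompt variants."""
    per_prompt_scores = []
    for prompt in PROMPT_VARIANTS:
        gold = [label for _, label in examples]
        pred = [model.predict(audio, prompt) for audio, _ in examples]
        per_prompt_scores.append(f1_score(gold, pred, average="macro"))
    return mean(per_prompt_scores)
```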

Closed-source models like Gemini and GPT4o generally top the leaderboard: Gemini has the highest performance on SLURP intent classification (F1: 91.4), MELD emotion recognition (F1: 26.9), and CN-College-Listen speech QA (F1: 66.1), while GPT4o performs best on MUSTARD sarcasm detection (F1: 53.6), IEMOCAP emotion recognition (F1: 31.5), CallHome relationship classification (F1: 59.7), and Commonvoice accent classification (F1: 35.3).

Among the open-source models, Qwen2-Audio demonstrates outstanding performance on speech QA and gender/age classification, while DiVA shows excellent humor detection and speech instruction-following capability, outperforming all other models on those tasks. These two models also perform relatively well across the remaining tasks, demonstrating good generalizability. NextGPT and PandaGPT perform worse, especially on intent recognition, emotion recognition, accent classification, and instruction following. Both rely on the same encoder architecture (ImageBind), which suggests that ImageBind is a limiting factor for encoding audio features.

We also evaluate a sequential pipeline of Whisper followed by Llama3-8B-Instruct. It performs relatively well on tasks like emotion recognition and speech QA, indicating that some instances can be answered from the spoken content alone. However, on every task at least one end-to-end speech model outperforms the Whisper+Llama3 pipeline. This suggests that information such as emotion, relationship, and sarcasm is partly carried by vocal cues and requires understanding beyond the transcript.
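For reference, a cascaded baseline of this kind can be approximated in a few lines with Hugging Face transformers; the model IDs, prompt handling, and generation settings below are our assumptions for a minimal sketch, not the exact setup used in the evaluation.

```python
# Minimal sketch of a cascaded ASR -> LLM baseline (assumes a recent
# version of transformers with chat-format support in text-generation).
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
llm = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

def cascade_answer(audio_path: str, task_instruction: str) -> str:
    # Step 1: transcribe the audio to text with Whisper.
    transcript = asr(audio_path)["text"]
    # Step 2: pass only the transcript to the text-only LLM.
    messages = [
        {"role": "system", "content": task_instruction},
        {"role": "user", "content": transcript},
    ]
    out = llm(messages, max_new_tokens=128)
    # The pipeline returns the conversation with the assistant reply appended.
    return out[0]["generated_text"][-1]["content"]
```

Because only the transcript reaches the LLM, vocal cues such as tone, prosody, and speaker traits are discarded, which is consistent with end-to-end audio models beating this pipeline on tasks like emotion, relationship, and sarcasm detection.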

[Figure: Model Performance on Static Benchmarks — performance across different datasets, ranked from highest to lowest]