Static Evaluation
Tasks and Datasets
We select all speech comprehension benchmarks from existing holistic evaluation suites for audio models, namely AudioBench and AIR-Bench. This yields 18 datasets in total, on which we evaluate 11 different large audio models.
The datasets cover a wide range of tasks that evaluate models' understanding of Speaker Cognitive State, Speaker Identity, and Speech Content. The tasks include Humor Detection, Sarcasm Detection, Intent Detection, Emotion Recognition, Relationship Classification, Gender Classification, Age Classification, Accent Classification, Speech Grounding, Language Identification, Speech Entity Recognition, Speech Question Answering, and Speech Instruction Following.
| Category | Task | Dataset | Size |
|---|---|---|---|
| Cognitive State | Humor Detection | URFUNNY | 994 |
| Cognitive State | Sarcasm Detection | MUSTARD | 690 |
| Cognitive State | Pragmatic Intent Detection | SLURP | 753 |
| Cognitive State | Emotion Recognition | IEMOCAP | 1023 |
| Cognitive State | Emotion Recognition | MELD | 2608 |
| Cognitive State | Relationship Classification | CallHome | 24 |
| Speaker Identity | Language Identification | Covost2-lan | 1000 |
| Speaker Identity | Gender Classification | Commonvoice | 1258 |
| Speaker Identity | Age Classification | FairSpeech | 1000 |
| Speaker Identity | Age Classification | Commonvoice | 1258 |
| Speaker Identity | Gender Classification | FairSpeech | 1000 |
| Speaker Identity | Accent Classification | Commonvoice | 1086 |
| Speech Content | Speech Grounding | Librispeech-grounding | 1000 |
| Speech Content | Speech Entity Recognition | SLURP-ent | 1000 |
| Speech Content | Instruction Following | Alpaca-Audio | 100 |
| Speech Content | Instruction Following | Openhermes-Audio | 100 |
| Speech Content | Speech QA | CN-College-Listen | 2271 |
| Speech Content | Speech QA | Public_sg_speech | 688 |
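For illustration, the taxonomy above can be captured as a small registry that an evaluation harness could iterate over. The structure below is a hypothetical sketch (the name `STATIC_EVAL_DATASETS` and the layout are our own, not from any released codebase); the entries themselves come directly from the table.

```python
# Hypothetical registry of the 18 evaluation datasets, grouped by category.
# Each entry is (task, dataset, number of instances), taken from the table above.
STATIC_EVAL_DATASETS = {
    "Cognitive State": [
        ("Humor Detection", "URFUNNY", 994),
        ("Sarcasm Detection", "MUSTARD", 690),
        ("Pragmatic Intent Detection", "SLURP", 753),
        ("Emotion Recognition", "IEMOCAP", 1023),
        ("Emotion Recognition", "MELD", 2608),
        ("Relationship Classification", "CallHome", 24),
    ],
    "Speaker Identity": [
        ("Language Identification", "Covost2-lan", 1000),
        ("Gender Classification", "Commonvoice", 1258),
        ("Age Classification", "FairSpeech", 1000),
        ("Age Classification", "Commonvoice", 1258),
        ("Gender Classification", "FairSpeech", 1000),
        ("Accent Classification", "Commonvoice", 1086),
    ],
    "Speech Content": [
        ("Speech Grounding", "Librispeech-grounding", 1000),
        ("Speech Entity Recognition", "SLURP-ent", 1000),
        ("Instruction Following", "Alpaca-Audio", 100),
        ("Instruction Following", "Openhermes-Audio", 100),
        ("Speech QA", "CN-College-Listen", 2271),
        ("Speech QA", "Public_sg_speech", 688),
    ],
}

# Sanity check: 18 datasets across the three categories.
assert sum(len(entries) for entries in STATIC_EVAL_DATASETS.values()) == 18
```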
Result Analysis
To ensure robustness, we report the average of model performance across three different prompt variations. For the `public_sg_speech`, `openhermes`, and `alpaca` datasets, we report the cfm metric; for all other tasks, we report macro F1 scores.
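As a concrete sketch of this scoring scheme, the snippet below computes macro F1 for each prompt variant and averages the three scores. The helper name `average_macro_f1` and the toy labels are hypothetical, introduced only for illustration; `sklearn.metrics.f1_score` with `average="macro"` is the standard macro F1 implementation.

```python
import numpy as np
from sklearn.metrics import f1_score

def average_macro_f1(predictions_by_prompt, references):
    """Average macro F1 over prompt variations.

    predictions_by_prompt: one list of predicted labels per prompt variant
                           (three variants in this setup).
    references: gold labels for the dataset.
    """
    scores = [
        f1_score(references, preds, average="macro")
        for preds in predictions_by_prompt
    ]
    return float(np.mean(scores))

# Hypothetical usage with three prompt variants on a toy emotion-label set:
refs = ["happy", "sad", "angry", "happy"]
preds = [
    ["happy", "sad", "angry", "sad"],      # prompt variant 1
    ["happy", "happy", "angry", "happy"],  # prompt variant 2
    ["sad", "sad", "angry", "happy"],      # prompt variant 3
]
print(round(average_macro_f1(preds, refs), 3))
```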
Closed-source models like Gemini and GPT-4o generally top the leaderboard: Gemini has the highest performance on SLURP intent classification (F1: 91.4), MELD emotion recognition (F1: 26.9), and CN-College-Listen speech QA (F1: 66.1), while GPT-4o performs best on MUSTARD sarcasm detection (F1: 53.6), IEMOCAP emotion recognition (F1: 31.5), CallHome relationship classification (F1: 59.7), and Commonvoice accent classification (F1: 35.3).
Among the open-source models, Qwen2-Audio demonstrates outstanding performance on speech QA and gender/age classification, and DiVA shows excellent humor detection and speech instruction following, outperforming all other models on those tasks. Both also perform relatively well on the remaining tasks, demonstrating good generalizability. NExT-GPT and PandaGPT perform relatively poorly, especially on intent and emotion recognition, accent classification, and instruction following. Since both rely on the same encoder architecture (ImageBind), this suggests the limitations of using ImageBind to encode audio features.
We also evaluate a sequential pipeline of Whisper followed by Llama3-8B-Instruct. It performs relatively well on tasks like emotion recognition and speech QA, which indicates that some data instances can be answered from the transcribed content alone. However, for every task there is at least one speech model that outperforms the Whisper+Llama3 pipeline. This suggests that information such as emotion, relationship, and sarcasm is carried in vocal cues and requires understanding beyond the content.