
Static Evaluation

Tasks and Datasets

We select all speech comprehension benchmarks from existing holistic evaluation suites for audio models, namely AudioBench and AIR-Bench. This yields 18 datasets in total, on which we evaluate 11 large audio models.

The datasets cover a wide range of tasks that evaluate models' understanding of Speaker Cognitive State, Speaker Identity, and Speech Content. They include Humor Detection, Sarcasm Detection, Intent Detection, Emotion Recognition, Relationship Classification, Gender Classification, Age Classification, Accent Classification, Speech Grounding, Language Identification, Speech Entity Recognition, Speech Question Answering, and Speech Instruction Following.

| Category | Task | Dataset | Size |
|---|---|---|---|
| Cognitive State | Humor Detection | URFUNNY | 994 |
| Cognitive State | Sarcasm Detection | MUSTARD | 690 |
| Cognitive State | Pragmatic Intent Detection | SLURP | 753 |
| Cognitive State | Emotion Recognition | IEMOCAP | 1023 |
| Cognitive State | Emotion Recognition | MELD | 2608 |
| Relationship | Relationship Classification | CallHome | 24 |
| Speaker Identity | Language Identification | Covost2-lan | 1000 |
| Speaker Identity | Gender Classification | Commonvoice | 1258 |
| Speaker Identity | Gender Classification | FairSpeech | 1000 |
| Speaker Identity | Age Classification | FairSpeech | 1000 |
| Speaker Identity | Age Classification | Commonvoice | 1258 |
| Speaker Identity | Accent Classification | Commonvoice | 1086 |
| Speech Content | Speech Grounding | Librispeech-grounding | 1000 |
| Speech Content | Speech Entity Recognition | SLURP-ent | 1000 |
| Speech Content | Instruction Following | Alpaca-Audio | 100 |
| Speech Content | Instruction Following | Openhermes-Audio | 100 |
| Speech Content | Speech QA | CN-College-Listen | 2271 |
| Speech Content | Speech QA | Public_sg_speech | 688 |

Result Analysis

To ensure robustness, we report the average of model performance across three different prompt variations. For the public_sg_speech, openhermes, and alpaca datasets, we report the cfm metric; for all other tasks, we report macro F1 scores.
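As a rough illustration of this scoring scheme, the sketch below (not the project's actual evaluation code) averages macro F1 over three prompt variants for a single classification task; the prompt wordings and the `model.predict` helper are hypothetical placeholders.

```python
# Minimal sketch of prompt-averaged macro F1 scoring.
# Assumptions: `examples` is a list of (audio, gold_label) pairs and
# `model.predict(audio, prompt)` returns a predicted label string.
from statistics import mean
from sklearn.metrics import f1_score

PROMPT_VARIANTS = [  # hypothetical phrasings of the same task
    "Classify the speaker's emotion.",
    "What emotion does the speaker convey?",
    "Identify the emotion expressed in this audio clip.",
]

def score_task(model, examples):
    """Return macro F1 averaged over the three prompt variants."""
    per_prompt_scores = []
    for prompt in PROMPT_VARIANTS:
        gold = [label for _, label in examples]
        pred = [model.predict(audio, prompt) for audio, _ in examples]
        per_prompt_scores.append(f1_score(gold, pred, average="macro"))
    return mean(per_prompt_scores)
```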

Closed-source models like Gemini and GPT4o generally top the leaderboard: Gemini has the highest performance on SLURP intent classification (F1: 91.4), MELD emotion recognition (F1: 26.9), and CN-College-Listen speech QA (F1: 66.1), while GPT4o performs best on MUSTARD sarcasm detection (F1: 53.6), IEMOCAP emotion recognition (F1: 31.5), CallHome relationship classification (F1: 59.7), and Commonvoice accent classification (F1: 35.3).

Among the open-source models, Qwen2-Audio demonstrates outstanding performance on speech QA and gender/age classification, while DiVA shows excellent humor detection and speech instruction-following capability, outperforming all other models on those tasks. These two models also perform relatively well across the remaining tasks, demonstrating good generalizability. NextGPT and PandaGPT perform worse, especially on intent recognition, emotion recognition, accent classification, and instruction following. Both rely on the same encoder architecture (ImageBind), which suggests that ImageBind is a limiting factor for encoding audio features.

We also evaluate a sequential pipeline of Whisper followed by Llama3-8B-Instruct. It performs relatively well on tasks like emotion recognition and speech QA, indicating that some instances can be answered from the spoken content alone. However, on every task at least one end-to-end speech model outperforms the Whisper+Llama3 pipeline. This suggests that information such as emotion, relationship, and sarcasm is partly carried by vocal cues and requires understanding beyond the transcript.
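For reference, a cascaded baseline of this kind can be approximated in a few lines with Hugging Face transformers; the model IDs, prompt handling, and generation settings below are our assumptions for a minimal sketch, not the exact setup used in the evaluation.

```python
# Minimal sketch of a cascaded ASR -> LLM baseline (assumes a recent
# version of transformers with chat-format support in text-generation).
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
llm = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

def cascade_answer(audio_path: str, task_instruction: str) -> str:
    # Step 1: transcribe the audio to text with Whisper.
    transcript = asr(audio_path)["text"]
    # Step 2: pass only the transcript to the text-only LLM.
    messages = [
        {"role": "system", "content": task_instruction},
        {"role": "user", "content": transcript},
    ]
    out = llm(messages, max_new_tokens=128)
    # The pipeline returns the conversation with the assistant reply appended.
    return out[0]["generated_text"][-1]["content"]
```

Because only the transcript reaches the LLM, vocal cues such as tone, prosody, and speaker traits are discarded, which is consistent with end-to-end audio models beating this pipeline on tasks like emotion, relationship, and sarcasm detection.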

[Figure: Model Performance on Static Benchmarks — performance across different datasets, ranked from highest to lowest]