Talk Arena Leaderboard

Prolific Study (Paid Participants)

We applied a Bradley-Terry model to the pairwise voting results from Prolific to rank the five models tested. The final ranking, from most to least preferred, is DiVA, GPT4o, Gemini-1.5-pro, Qwen2-Audio, Typhoon-1.5.
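For readers unfamiliar with Bradley-Terry, here is a minimal sketch of how pairwise votes can be turned into a ranking. It is not the code used for the leaderboard: the function name `fit_bradley_terry`, the placeholder model names, and the toy votes are all hypothetical, and the fitting procedure shown (a simple iterative minorization-maximization update) is just one common way to estimate the model. Under Bradley-Terry, each model i has a positive strength p_i, and the probability that i is preferred over j is p_i / (p_i + p_j).

```python
from collections import defaultdict


def fit_bradley_terry(votes, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Illustrative sketch only. Assumes every model wins at least one
    comparison; ties and regularization are not handled.
    Returns a dict mapping model name -> strength, normalized to sum to 1.
    """
    models = sorted({m for pair in votes for m in pair})
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    for winner, loser in votes:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0

    # Start from equal strengths.
    strengths = {m: 1.0 / len(models) for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            # MM update: p_i <- wins_i / sum_j n_ij / (p_i + p_j)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strengths[i] + strengths[j])
            updated[i] = wins[i] / denom if denom else strengths[i]
        total = sum(updated.values())
        updated = {m: s / total for m, s in updated.items()}
        delta = max(abs(updated[m] - strengths[m]) for m in models)
        strengths = updated
        if delta < tol:
            break
    return strengths


# Toy votes with placeholder names (NOT the Prolific data).
toy_votes = (
    [("model_a", "model_b")] * 3 + [("model_b", "model_a")] * 1
    + [("model_b", "model_c")] * 3 + [("model_c", "model_b")] * 1
    + [("model_a", "model_c")] * 3 + [("model_c", "model_a")] * 1
)

strengths = fit_bradley_terry(toy_votes)
# Sort models from most to least preferred.
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Sorting the fitted strengths in descending order gives the ranking. A real analysis would typically also report uncertainty, for example by bootstrapping over votes, since close strengths may not reflect a meaningful difference.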

Talk Arena's leaderboard reflects how models perform across unconstrained user interaction, not just audio-specific tasks. Users are free to engage with models for any purpose, from general conversation to specialized audio processing. This means our rankings show which models excel at what people use them for, even if that usage could be reasonably well served by a text-only language model, rather than focusing solely on audio capabilities.

While we consider this usage-based approach a feature rather than a bug, we understand that some may prefer evaluation focused purely on audio. For those interested in audio processing performance specifically, we recommend reviewing our static benchmark results, which evaluate many speech capabilities that are more audio-specific!

Why did DiVA beat GPT4o on Prolific? GPT4o's voice mode is optimized for speech-in, speech-out interaction, whereas we are testing speech-in, text-out. GPT4o's outputs tend to be short and direct, while Prolific users mentioned preferring DiVA's longer-form text and structured lists. This suggests that speech AI providers should consider adapting model outputs to the target output modality.