
Interactive Evaluation

User Preference

As an initial effort, we collected a total of 5,000 votes on Prolific using Talk Arena for pairwise comparisons among GPT4o, Gemini-1.5-pro, Typhoon, Qwen2-Audio, and DiVA, the top-performing models from the static evaluation. For each of the ten model pairings, we collected 500 votes from more than 50 different crowdworkers. In total, around 359 distinct voters participated.

Pairwise vote counts (500 votes per pairing):

  DiVA vs. GPT4o:     DiVA 252, Tie 72, GPT4o 176
  DiVA vs. Gemini:    DiVA 313, Tie 57, Gemini 130
  DiVA vs. Qwen2:     DiVA 322, Tie 75, Qwen2 103
  GPT4o vs. Gemini:   GPT4o 186, Tie 101, Gemini 213
  GPT4o vs. Qwen2:    GPT4o 254, Tie 91, Qwen2 155
  Gemini vs. Qwen2:   Gemini 255, Tie 74, Qwen2 171
  Typhoon vs. DiVA:   Typhoon 50, Tie 33, DiVA 417
  Typhoon vs. GPT4o:  Typhoon 19, Tie 34, GPT4o 447
  Typhoon vs. Gemini: Typhoon 50, Tie 26, Gemini 424
  Typhoon vs. Qwen2:  Typhoon 82, Tie 71, Qwen2 347
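From these head-to-head counts one can derive an overall ranking. As a rough illustration (not the aggregation method Talk Arena itself uses, which could instead be Bradley-Terry- or Elo-style), the sketch below computes a simple per-model win rate, counting each tie as half a win for both sides:

```python
# A minimal sketch (not the official Talk Arena analysis code) that aggregates
# the pairwise vote counts above into per-model win rates.
from collections import defaultdict

# (model_a, votes_a, ties, votes_b, model_b) for each of the ten pairings
pairwise_votes = [
    ("DiVA", 252, 72, 176, "GPT4o"),
    ("DiVA", 313, 57, 130, "Gemini"),
    ("DiVA", 322, 75, 103, "Qwen2"),
    ("GPT4o", 186, 101, 213, "Gemini"),
    ("GPT4o", 254, 91, 155, "Qwen2"),
    ("Gemini", 255, 74, 171, "Qwen2"),
    ("Typhoon", 50, 33, 417, "DiVA"),
    ("Typhoon", 19, 34, 447, "GPT4o"),
    ("Typhoon", 50, 26, 424, "Gemini"),
    ("Typhoon", 82, 71, 347, "Qwen2"),
]

wins = defaultdict(float)
totals = defaultdict(int)
for model_a, votes_a, ties, votes_b, model_b in pairwise_votes:
    n = votes_a + ties + votes_b
    wins[model_a] += votes_a + 0.5 * ties   # a tie counts as half a win
    wins[model_b] += votes_b + 0.5 * ties
    totals[model_a] += n
    totals[model_b] += n

# Rank models by overall win rate across their pairwise comparisons
for model in sorted(wins, key=lambda m: wins[m] / totals[m], reverse=True):
    print(f"{model:8s} win rate: {wins[model] / totals[model]:.3f}")
```

Note that such a flat aggregate can disagree with individual head-to-head results: for example, Gemini wins its direct comparison against GPT4o even though GPT4o fares better against the other models.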

Comparison with Static Evaluation

We compare the user preference results from interactive evaluation with those from static evaluation by computing the top-k Kendall tau distance between the model ranking on each static benchmark and the ranking from interactive evaluation:

[Figure: Top-k Kendall tau ranking distance between rankings on static datasets and Talk Arena; a lower distance indicates closer agreement.]
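For reference, here is a minimal sketch of one way to compute a top-k Kendall tau distance, assuming both rankings cover the same set of models (as is the case for the five models here). The exact variant used for the plot above may handle ties and truncation differently, and the example rankings below are purely illustrative, not the actual results:

```python
# A minimal sketch of a top-k Kendall tau distance: among the top-k models of
# the reference ranking, count how many pairs the other ranking orders
# differently.
from itertools import combinations

def topk_kendall_tau_distance(reference: list[str], other: list[str], k: int) -> int:
    """Number of discordant pairs among the top-k items of `reference`."""
    pos_ref = {model: i for i, model in enumerate(reference)}
    pos_other = {model: i for i, model in enumerate(other)}
    discordant = 0
    for a, b in combinations(reference[:k], 2):
        # A pair is discordant if the two rankings disagree on its order.
        if (pos_ref[a] - pos_ref[b]) * (pos_other[a] - pos_other[b]) < 0:
            discordant += 1
    return discordant

# Hypothetical example rankings (illustrative only):
interactive = ["DiVA", "GPT4o", "Gemini", "Qwen2", "Typhoon"]
static_bench = ["GPT4o", "DiVA", "Qwen2", "Gemini", "Typhoon"]
for k in range(2, 6):
    print(f"top-{k} distance:", topk_kendall_tau_distance(interactive, static_bench, k))
```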

Here are some observations:

  1. None of the static benchmarks yields exactly the same ranking as the interactive evaluation.
  2. Rankings on the emotion recognition and language detection benchmarks are the most similar to the interactive evaluation ranking.
  3. Rankings on gender detection and nuanced intent (humor, sarcasm) detection correlate only weakly with the interactive evaluation ranking.

These are our observations from the Prolific study, and we hope to draw further conclusions as votes from the public accumulate.