Task

Detailed breakdown of individual task performance across different models.

Task Name (14 tasks)  claude-4-6-sonnet  gemini-3-flash  gemini-3.1-pro  glm-4.7  gpt-5.2-codex
1                     193.4s             339.0s          203.0s          189.5s   98.1s
2                     166.5s             600.1s          600.1s          159.0s   206.7s
3                     591.8s             600.1s          600.0s          83.4s    36.6s
4                     150.0s             123.4s          600.1s          426.3s   458.2s
5                     187.6s             129.0s          600.0s          600.1s   81.0s
6                     279.9s             182.2s          238.7s          600.1s   269.8s
7                     600.1s             600.1s          377.8s          575.0s   219.7s
8                     274.0s             600.1s          600.1s          260.4s   170.0s
9                     103.4s             600.0s          587.1s          276.2s   213.7s
10                    441.7s             236.2s          600.1s          582.7s   393.0s
11                    141.5s             268.1s          600.1s          271.2s   69.0s
12                    404.8s             137.9s          600.1s          319.2s   518.2s
13                    284.7s             344.3s          531.5s          600.1s   364.9s
14                    88.3s              466.9s          197.0s          112.2s   38.4s