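Step 3.5 Flash, released today by StepFun, performs very well for its size. Its performance is essentially better across the board than MiniMax M2.1, and it is very fast, holding steady at roughly 300 tokens/s. Would iFlow consider adding it for medium or lightweight tasks, and for assignment to subagents?
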
| Category | Benchmark | Step 3.5 Flash | GLM 4.7 | DeepSeek V3.2 | Kimi K2 Thinking | Gemini 3 Pro | GPT 5.2 |
|---|---|---|---|---|---|---|---|
| Reasoning | AIME 2025 | 97.3 | 95.7 | 93.1 | 96.1 | 95.0 | 100.0 |
| | IMOAnswerBench | 85.4 | 82.0 | 78.3 | 81.8 | 83.3 | 86.3 |
| | HMMT 2025 (Avg. Feb and Nov) | 96.2 | 95.3 | 91.4 | 95.4 | 95.4 | 97.1 |
| Coding | SWE-bench Verified | 74.4 | 73.8 | 73.1 | 76.8 | 76.2 | 80.0 |
| | Terminal-Bench 2.0 | 51.0 | 41.0 | 46.4 | 50.8 | 54.2 | 54.0 |
| | LiveCodeBench-V6 | 86.4 | 84.9 | 83.3 | 85.0 | 90.7 | 87.7 |
| Agent | $\tau^2$-Bench | 88.2 | 87.4 | 80.3 | 74.3 | 90.7 | 84.1 |
| | BrowseComp (w/ Context Manager) | 69.0 | 67.5 | 67.6 | 74.9 | 59.2 | 65.8 |
| | xbench-DeepSearch (2025.10) | 54.0 | 35.0 | - | 40.0 | - | 75.0 |

