We introduce CodeDance, a dynamic tool-integrated multimodal large language model (MLLM) that treats executable code as a general solver for visual reasoning. Instead of relying on rigid, text-only pipelines, CodeDance plans, composes, and executes visual-symbolic operations while rendering intermediate artifacts, enabling a transparent, self-checkable reasoning process.
CodeDance scales up multimodal tool-based reasoning by letting the model think, write code, execute it, and reflect in a single loop. Rather than following fixed schemas, CodeDance dynamically decides when and how to invoke tools, orchestrates visual-symbolic operations (crop, draw, count, plot) in a sandbox, and uses intermediate visual evidence to guide subsequent reasoning over multiple interaction turns.
This design yields transparent, self-checkable solutions to challenging visual search and reasoning tasks and offers a practical recipe for “thinking with images” via executable code and reinforcement learning.
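To make the loop concrete, here is a minimal Python sketch of the think-write-execute-reflect cycle; the `model` and `sandbox` interfaces and the `<code>` tag protocol are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of a CodeDance-style multi-turn tool loop. The `model`,
# `sandbox`, and <code>...</code> protocol are hypothetical placeholders.
import re

def solve(model, sandbox, image, question, max_turns=6):
    """Think -> write code -> execute -> reflect, over multiple turns."""
    context = [{"role": "user", "content": [image, question]}]
    for _ in range(max_turns):
        reply = model.generate(context)  # free-form reasoning, optionally with code
        context.append({"role": "assistant", "content": reply})
        match = re.search(r"<code>(.*?)</code>", reply, re.DOTALL)
        if match is None:                # the model chose to answer directly
            return reply
        # Run the emitted code in a sandbox; results may include text (counts,
        # coordinates) and rendered images (crops, plots) as new visual evidence.
        result = sandbox.run(match.group(1))
        context.append({"role": "tool", "content": [result.text, *result.images]})
    return model.generate(context + [{"role": "user", "content": "Answer now."}])
```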

Our training and inference pipeline consists of three stages: cold-start curation of executable multimodal trajectories for SFT, reinforcement learning with a composite reward, and multi-turn tool-integrated inference.
We optimize with a composite reward that integrates outcome and tool-use signals. Our two-level Balanced Adaptive Tool-call reward, denoted \( R_{\mathrm{BAT}} \), balances task difficulty against step-wise tool-call correctness:
\[ R_{\mathrm{BAT}} = R_{\mathrm{seq}} + R_{\mathrm{turn}}. \]

**Sequence-level.** Difficulty-aware incentives adapt to group accuracy, discouraging redundant calls on easy problems and promoting exploration on hard ones. Formally,

\[ R_{\mathrm{seq}} = \Big(0.5 + 0.5 \cdot \mathbb{I}_{R_{\mathrm{acc}}(\tau) > 0}\Big) \cdot d \cdot \frac{N_{\mathrm{succ}}(\tau)}{N_{\mathrm{total}}(\tau)}, \qquad d = \sigma\!\big(\gamma (0.5 - \mu_{\mathrm{acc}})\big) - \delta, \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \]

where \(N_{\mathrm{succ}}(\tau)\) and \(N_{\mathrm{total}}(\tau)\) denote the numbers of successful and total tool calls in trajectory \(\tau\). The difficulty factor \(d\) shrinks when the group accuracy \(\mu_{\mathrm{acc}}\) is high (easy queries) and grows when it is low (hard queries), suppressing redundant tool invocations on easy cases while amplifying rewards for helpful tool use on challenging ones.

**Turn-level.** Immediate penalties for failed executions, combined with a batch-normalized advantage, provide dense correction. For each turn \(m\), a failed code execution incurs an immediate penalty \(r_{\mathrm{turn}}^{m} = -0.5\) (otherwise \(0\)). To capture long-term effects, we define the accumulated discounted return and its advantage as

\[ G_{\mathrm{turn}}^{m} = r_{\mathrm{turn}}^{m} + \beta \cdot G_{\mathrm{turn}}^{m+1}, \qquad A_{\mathrm{turn}}^{m} = \frac{G_{\mathrm{turn}}^{m} - \mu_{\mathrm{batch}}}{\sigma_{\mathrm{batch}}}. \]

This turn-level signal assigns credit to correct intermediate tool executions and discourages reward hacking in which the final answer is correct but tool use is erroneous.
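For concreteness, here is a minimal Python sketch of both reward terms, mirroring the equations above; the default values of \(\gamma\), \(\delta\), and \(\beta\) are placeholders, not the settings used in the paper.

```python
# Sketch of the BAT reward terms defined above. Default gamma, delta, and
# beta values are illustrative placeholders, not the paper's settings.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def seq_reward(acc_reward: float, n_succ: int, n_total: int,
               mu_acc: float, gamma: float = 4.0, delta: float = 0.0) -> float:
    """Sequence-level reward R_seq: difficulty-scaled tool-call success rate."""
    if n_total == 0:
        return 0.0                                       # no tool calls in this trajectory
    d = sigmoid(gamma * (0.5 - mu_acc)) - delta          # difficulty factor
    gate = 1.0 if acc_reward > 0 else 0.5                # 0.5 + 0.5 * indicator
    return gate * d * n_succ / n_total

def turn_advantages(exec_ok: list[bool], mu_batch: float, sigma_batch: float,
                    beta: float = 0.9) -> list[float]:
    """Turn-level advantages A^m from per-turn execution outcomes."""
    rewards = [0.0 if ok else -0.5 for ok in exec_ok]    # r^m = -0.5 on failure
    G, returns = 0.0, []
    for r in reversed(rewards):                          # G^m = r^m + beta * G^{m+1}
        G = r + beta * G
        returns.append(G)
    returns.reverse()
    return [(G - mu_batch) / sigma_batch for G in returns]  # batch z-normalization
```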
The final advantage combines sequence- and turn-level signals:
\[ A(\tau) = A_{\mathrm{seq}}\!\big(R_{\mathrm{acc}}, R_{\mathrm{format}}, R_{\mathrm{seq}}\big) + A_{\mathrm{turn}}\!\big(G_{\mathrm{turn}}\big). \]

In this way, CodeDance invokes tools according to task difficulty, preventing both tool underuse (failing to invoke tools on challenging tasks) and tool overuse (redundant calls on easy tasks).
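As a sketch of how the two levels could be combined, the snippet below group-normalizes each trajectory's scalar reward (aggregating \(R_{\mathrm{acc}}\), \(R_{\mathrm{format}}\), and \(R_{\mathrm{seq}}\)) into \(A_{\mathrm{seq}}\) and broadcasts it over that trajectory's turn-level advantages; the GRPO-style group normalization is our assumption, and the paper's exact estimator may differ.

```python
# Hedged sketch: combine sequence- and turn-level signals into A(tau).
# Assumption: A_seq is z-normalized within the rollout group (GRPO-style).
import statistics

def combined_advantages(group_seq_rewards: list[float],
                        turn_advs_per_traj: list[list[float]]) -> list[list[float]]:
    mu = statistics.mean(group_seq_rewards)
    sd = statistics.pstdev(group_seq_rewards) or 1.0     # guard against zero spread
    combined = []
    for R, turn_advs in zip(group_seq_rewards, turn_advs_per_traj):
        a_seq = (R - mu) / sd                            # trajectory-level advantage
        combined.append([a_seq + a_turn for a_turn in turn_advs])
    return combined
```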

Figure: Motivation and effectiveness of dynamic tool invocation in CodeDance. Left: qualitative examples show that both tool underuse (failing to invoke tools on challenging tasks) and tool overuse (redundant calls on easy tasks) lead to hallucinated reasoning, incorrect answers, and unnecessary complexity (more reasoning turns and longer rollout time), whereas CodeDance invokes tools according to task difficulty and obtains correct solutions. Right: quantitative results show that CodeDance-7B consistently surpasses all Qwen2.5-VL-7B baselines and exceeds the 32B version on several tasks.
To initialize CodeDance before RL, we curate a high-quality cold-start dataset of executable multimodal trajectories. Following the pipeline described in our methodology and appendix, we apply weak-to-strong filtering to stratify difficulty and then build multi-turn atomic supervision.
Left: an example of our reasoning trajectory synthesis.
Right: an example of a resulting curated reasoning trajectory.
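A hypothetical sketch of the weak-to-strong filtering step is shown below: queries are bucketed by how often a weaker and a stronger model solve them over repeated rollouts. The model interfaces, sampling count, and thresholds are all illustrative assumptions, not the released pipeline.

```python
# Hypothetical weak-to-strong difficulty stratification (interfaces and
# thresholds are assumptions, not the released pipeline).
def stratify(queries, weak_model, strong_model, k=8):
    buckets = {"easy": [], "medium": [], "hard": []}
    for q in queries:
        weak_acc = sum(weak_model.solve(q) == q.answer for _ in range(k)) / k
        strong_acc = sum(strong_model.solve(q) == q.answer for _ in range(k)) / k
        if weak_acc >= 0.75:            # solved reliably even by the weak model
            buckets["easy"].append(q)
        elif strong_acc >= 0.25:        # within reach of the strong model
            buckets["medium"].append(q)
        else:                           # near-unsolvable; reserve for RL exploration
            buckets["hard"].append(q)
    return buckets
```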
We conduct systematic evaluations across visual search, math, counting, and chart understanding benchmarks. CodeDance consistently outperforms text-only and schema-driven baselines and, in several settings, surpasses larger open-source and advanced closed-source models.
Columns are grouped into visual counting (CountBenchQA, PixmoCount), visual search (V*Bench, HR4K, HR8K), and general understanding (ChartQA, CharXiv).

| Model | CountBenchQA | PixmoCount | V*Bench | HR4K | HR8K | ChartQA | CharXiv |
|---|---|---|---|---|---|---|---|
| **Closed-Source MLLMs** | | | | | | | |
| GPT-4o | 87.9 | - | 67.5 | 65.0 | 59.6 | 86.7 | 47.1 |
| **Open-Source MLLMs** | | | | | | | |
| LLaVA-OneVision-7B | 82.3 | 54.4 | 72.7 | 68.5 | 60.0 | 80.4 | 27.1 |
| LLaVA-OneVision-72B | - | 60.7 | 73.8 | 66.3 | 60.9 | 83.7 | - |
| InternVL2.5-8B | 55.9 | - | 73.7 | 72.0 | 65.5 | 82.8 | 37.2 |
| InternVL3-8B | 80.3 | - | 70.2 | 70.5 | 70.0 | 86.1 | 38.3 |
| InternVL3-78B | - | - | 76.4 | 75.5 | 67.3 | 89.7 | 46.0 |
| Qwen2.5-VL-72B | 93.6 | 62.3 | 84.8 | 79.4 | 76.3 | 89.5 | 49.7 |
| Qwen2.5-VL-32B | 87.8 | 56.0 | 85.9 | 74.8 | 71.6 | - | 47.6 |
| Qwen2.5-VL-7B | 76.5 | 50.4 | 76.4 | 69.0 | 66.0 | 86.3 | 42.1 |
| **Open-Source MLLMs with Tools** | | | | | | | |
| Pixel Reasoner-7B | - | - | 84.3 | 72.9 | 66.9 | - | - |
| DeepEyes-7B† | 80.4 | 57.2 | 90.4 | 74.8 | 71.9 | 78.2 | - |
| Thyme-VL-7B | 84.8† | - | 82.2 | 77.0 | 72.0 | 86.1 | 44.2† |
| CodeDance-7B | 91.2 | 77.1 | 84.8 | 75.2 | 72.3 | 87.5 | 44.1 |
| Δ vs. Qwen2.5-VL-7B (relative) | ↑19.2% | ↑53.0% | ↑11.0% | ↑9.0% | ↑9.5% | ↑1.4% | ↑4.7% |
| Model | MathVision | MathVista | MathVerse | WeMath |
|---|---|---|---|---|
| GPT-4o | 36.5 | 63.4 | 35.3 | 44.2 |
| Qwen2.5-VL-72B | 38.1 | 74.8 | 57.6 | - |
| R1-OneVision-7B† | 29.9 | 64.1 | 40.0 | - |
| R1-VL-7B† | 24.7 | 63.5 | 40.0 | - |
| InternVL2.5-8B | 22.0 | 64.4 | 39.5 | 23.9 |
| LLaVA-OneVision-7B | 18.4 | 63.2 | 26.2 | 17.3 |
| Qwen2.5-VL-7B | 25.0 | 68.1 | 45.1 | 35.4 |
| DeepEyes-7B | 26.6 | 70.1 | 47.3 | 38.9 |
| CodeDance-7B (Ours) | 29.6 | 70.3 | 46.8 | 39.6 |
Ablation of reward designs (each cell reports accuracy / average reasoning turns):

| Components | CountBench | PixmoCount | MathVision | MathVerse | V* | HR4K | HR8K | Avg. |
|---|---|---|---|---|---|---|---|---|
| SFT Cold-Start (w/o RL) | 85.3 / 1.2749 | 66.9 / 1.3902 | 23.0 / 2.8388 | 41.4 / 2.1904 | 82.7 / 2.0052 | 72.1 / 1.1713 | 67.1 / 1.0875 | 62.6 / 1.7083 |
| RL with \(R_{\mathrm{acc}} + R_{\mathrm{format}}\) | 88.4 / 1.0200 | 71.2 / 1.0170 | 26.0 / 2.1086 | 46.5 / 1.9569 | 82.7 / 1.1728 | 73.4 / 1.0413 | 69.0 / 1.0375 | 65.3 / 1.3363 |
| + \(R_{\mathrm{DeepEyes}}\) | 85.1 / 2.5960 | 64.4 / 2.5341 | 25.2 / 3.2270 | 44.0 / 2.5190 | 83.3 / 2.0000 | 74.6 / 2.0888 | 68.4 / 2.0525 | 63.6 / 2.4311 |
| + \(R_{\mathrm{BAT}}\) (Ours) | 89.0 / 1.0000 | 72.5 / 1.0000 | 27.0 / 2.0461 | 46.3 / 2.1662 | 82.7 / 1.2094 | 73.8 / 1.2251 | 69.4 / 1.1950 | 65.8 / 1.4060 |
Note: † denotes results reported in the corresponding official papers.
Throughout the RL process, we consistently observe novel and surprising reasoning trajectories (below) that go beyond the atomic supervision provided during SFT. These findings point toward the scalability of code as a general reasoning medium, and we empirically study this potential in the main paper.
@article{Song2025codedance,
title={CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning},
author={Song, Qi and Li, Honglin and Yu, Yingchen and Zhou, Haoyi and Yang, Lin and Bai, Song and She, Qi and Huang, Zilong and Zhao, Yunqing},
journal={arXiv preprint arXiv:2512.17312},
year={2025}
}