Mull-Tokens:
Modality-Agnostic Latent Thinking

Multimodal Latent Reasoning Tokens

Figure 1: Mull-Tokens enable modality-agnostic latent thinking. Our approach uses latent tokens that can hold intermediate reasoning information in either image or text modalities, allowing models to think through spatial reasoning problems more effectively than text-only or image-only approaches.
TL;DR: Instead of text-only reasoning or explicit image thoughts, modality-agnostic (multimodal) latent reasoning tokens pretrained with multimodal chain-of-thought are more effective for visual reasoning tasks.

Abstract

Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models that explore reasoning with images are brittle and do not scale: they rely on calling specialist tools, costly image generation, or handcrafted reasoning data to switch between text and image thoughts.

We offer a simpler alternative — Mull-Tokens — latent tokens that can be pre-trained to hold intermediate information in either image or text modalities so as to think towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers.

Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle-solving, reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offer a simple way to implicitly think in multiple modalities.

Prelude: We don't know of a template for solving visual reasoning problems.

Consider this spatial reasoning question:

A room with marked points — X marks where you sit, 12 marks the chair location
If I sit near the 'X' marked point in the image and face 90° to the left, will the chair (near marked 12) be to my left or right?

Maybe you'd try to visualize the scene from that viewpoint…

The scene viewed from the X point facing 90 degrees left — the chair is not visible
You can't even see the chair. If we train models to follow this blueprint for reasoning, it may not always work.

Other approaches use purely textual thoughts, but multiple recent studies show that text reasoning can actually hurt visual reasoning performance.

What if we let models abstractly define their own thought process for visual reasoning?

We propose Mull-Tokens: discrete tokens pretrained on multimodal chain-of-thought examples, then free-form optimized, with no intermediate supervision, to arrive at the correct answer.

Approach

Mull-Tokens training pipeline showing two-stage approach
Figure 2: Mull-Tokens training pipeline. Our Mull-Tokens training involves two stages inspired by approaches in latent reasoning. We first pre-train/warm-up our Mull-Tokens to hold both image and text modalities depending on the context image/video and query. Next, the model free-form optimizes these Mull-Tokens to achieve the final correct answer. We see that pre-training the Mull-Tokens to hold both image and text reasoning traces is key, as opposed to using them simply as extra compute without such pretraining or using text alone.
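To make the two-stage recipe concrete, here is a minimal numpy sketch. All names, shapes, and loss forms below are our illustrative assumptions, not the paper's implementation: a small bank of learnable latent-token embeddings is appended after the visual context and query; stage one regresses the latent states toward features of an interleaved image/text chain-of-thought trace, and stage two keeps only a cross-entropy loss on the final answer, leaving the latents free-form.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_MULL, DIM = 20, 8  # 20 latent tokens; toy embedding width

# Learnable bank of Mull-Token embeddings (hypothetical parameterization;
# in the real model the latent tokens live inside a VLM backbone).
mull_bank = rng.normal(size=(NUM_MULL, DIM))

def build_sequence(context_feats, query_embs):
    """Append the Mull-Token bank after the visual context and text query."""
    return np.concatenate([context_feats, query_embs, mull_bank], axis=0)

def warmup_loss(mull_states, cot_feats):
    """Stage 1 (warm-up): regress latent states toward features of the
    interleaved image/text chain-of-thought trace."""
    return float(np.mean((mull_states - cot_feats) ** 2))

def answer_loss(logits, answer_ids):
    """Stage 2 (fine-tuning): cross-entropy on the final answer only;
    the latent states are free-form, with no direct supervision."""
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return float(-np.mean(np.log(probs[np.arange(len(answer_ids)), answer_ids])))

# Toy usage: 6 image-patch features plus 4 query-token embeddings.
seq = build_sequence(rng.normal(size=(6, DIM)), rng.normal(size=(4, DIM)))
print(seq.shape)  # (30, 8): context + query + 20 Mull-Tokens
```

In stage two, gradients from the answer loss still flow back through the latent positions, which is what lets the model repurpose them freely once the warm-up supervision is removed.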

Results

Key Contributions

Just 20 Mull-Tokens outperform other reasoning techniques.

Unlike text chain-of-thought or interleaved image-text thoughts, which employ thousands of tokens as rationales, just 20 Mull-Tokens are effective: they improve performance while being faster for visual reasoning tasks.
| Model | BLINK MV | BLINK RelDep | BLINK SpRel | BLINK Jig | BLINK IQT | BLINK Avg | BLINK Reas | SAT-R | VSI Avg | VSI Reas | ERQA | Avg (All) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a. Qwen2.5-VL (7B) | 39.00 | 61.29 | **92.38** | 58.66 | 25.33 | 55.33 | 41.00 | 59.00 | 23.96 | 22.96 | **38.91** | 44.30 |
| b. + DirAns FT | 57.14 | **87.09** | 74.12 | 58.66 | 30.00 | 61.40 | 48.60 | 71.66 | 32.40 | 30.65 | 38.00 | 50.87 |
| *Text-Reasoning Baselines* | | | | | | | | | | | | |
| c. + TextCoT FT | 53.38 | 74.19 | 67.83 | 69.30 | 25.33 | 58.01 | 49.34 | 68.33 | 30.47 | 31.04 | 38.78 | 48.90 |
| Δ (vs. DirAns) | -3.76 | -12.90 | -6.29 | +10.64 | -4.67 | -3.39 | +0.74 | -3.33 | -1.93 | +0.39 | +0.78 | -1.97 |
| d. + GRPO | 54.88 | 71.77 | 69.23 | 72.00 | 25.33 | 58.64 | 50.74 | 69.00 | 30.34 | 30.15 | 36.00 | 48.50 |
| Δ (vs. DirAns) | -2.26 | -15.32 | -4.89 | +13.34 | -4.67 | -2.76 | +2.14 | -2.66 | -2.06 | -0.50 | -2.00 | -2.37 |
| *Interleaved Image-Text Reasoning* | | | | | | | | | | | | |
| e. + Interleave Im-Txt | 57.14 | 69.36 | 75.52 | 68.67 | 25.33 | 59.20 | 50.38 | 74.00 | 30.27 | 32.96 | 38.50 | 50.49 |
| Δ (vs. DirAns) | +0.00 | -17.73 | +1.40 | +10.01 | -4.67 | -2.20 | +1.78 | +2.34 | -2.13 | +2.31 | +0.50 | -0.38 |
| *Mull-Tokens (Ours)* | | | | | | | | | | | | |
| f. + Mull-Tokens | 63.15 | 83.06 | 81.81 | 74.00 | **32.00** | 66.80 | 56.38 | **77.66** | 32.98 | 32.85 | 38.25 | 53.92 |
| Δ (vs. DirAns) | +6.01 | -4.03 | +7.69 | +15.34 | +2.00 | +5.40 | +7.78 | +6.00 | +0.58 | +2.20 | +0.25 | +3.05 |
| g. + GRPO | **64.66** | 83.87 | 81.81 | **74.67** | 30.66 | **67.13** | **56.66** | 77.00 | **33.51** | **33.49** | 38.50 | **54.04** |
| Δ (vs. DirAns) | +7.52 | -3.22 | +7.69 | +16.01 | +0.66 | +5.73 | +8.06 | +5.34 | +1.11 | +2.84 | +0.50 | +3.17 |

Table 1: Performance comparison across spatial reasoning benchmarks. "Reas" denotes the reasoning-heavy split of BLINK and of VSI. Mull-Tokens (rows f, g) improve over direct-answer finetuning (row b) and over existing visual reasoning methods using text-only reasoning (rows c, d) or text-image interleaving (row e). Δ rows give the change relative to the DirAns baseline (row b): positive values indicate improvements, negative values degradation. Best score in each column is shown in bold.
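The fixed thinking budget behind these numbers is easy to see in how a prompt is assembled. The sketch below is purely illustrative: the `<mull>` placeholder string, the query, and the rationale length of 1200 are our assumptions for the example, not measurements from the paper.

```python
# Hypothetical prompt assembly: a fixed budget of 20 "<mull>" placeholders
# stands in for a free-length textual rationale before the answer.
MULL, NUM_MULL = "<mull>", 20

def mull_prompt(query_tokens):
    """Fixed thinking budget: always exactly NUM_MULL latent positions."""
    return query_tokens + [MULL] * NUM_MULL + ["<answer>"]

def cot_prompt(query_tokens, rationale_tokens):
    """Text chain-of-thought: the rationale length varies per example."""
    return query_tokens + rationale_tokens + ["<answer>"]

query = ["where", "is", "the", "chair", "?"]
rationale = ["step"] * 1200  # an illustrative long rationale

print(len(mull_prompt(query)))            # 26 positions
print(len(cot_prompt(query, rationale)))  # 1206 positions
```

Because the latent budget is constant and small, decoding cost no longer grows with the length of a rationale.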

Mull-Tokens are not simply extra compute: multimodal information in them is crucial.

| Model | BLINK MV | BLINK RelDep | BLINK SpRel | BLINK Jig | BLINK IQT | BLINK Avg | BLINK Reas | SAT-R | VSI Avg | VSI Reas | ERQA | Avg (All) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a. Qwen2.5-VL (7B) | 39.00 | 61.29 | **92.38** | 58.66 | 25.33 | 55.33 | 41.00 | 59.00 | 23.96 | 22.96 | **38.91** | 44.30 |
| b. + DirAns FT | 57.14 | **87.09** | 74.12 | 58.66 | 30.00 | 61.40 | 48.60 | 71.66 | 32.40 | 30.65 | 38.00 | 50.87 |
| *Ablation: Training Strategy* | | | | | | | | | | | | |
| c. + No warm-up | 55.63 | 79.03 | 81.11 | 51.33 | 28.66 | 59.15 | 45.21 | 67.33 | 29.77 | 27.38 | 37.75 | 48.50 |
| Δ (vs. DirAns) | -1.51 | -8.06 | +6.99 | -7.33 | -1.34 | -2.25 | -3.39 | -4.33 | -2.63 | -3.27 | -0.25 | -2.37 |
| d. + Text warm-up | 59.39 | 85.48 | 85.31 | 67.33 | **32.00** | 65.90 | 52.91 | 71.33 | 32.02 | **33.47** | 38.50 | 51.94 |
| Δ (vs. DirAns) | +2.25 | -1.61 | +11.19 | +8.67 | +2.00 | +4.50 | +4.31 | -0.33 | -0.38 | +2.82 | +0.50 | +1.07 |
| e. + MM warm-up | **63.15** | 83.06 | 81.81 | **74.00** | **32.00** | **66.80** | **56.38** | **77.66** | **32.98** | 32.85 | 38.25 | **53.92** |
| Δ (vs. DirAns) | +6.01 | -4.03 | +7.69 | +15.34 | +2.00 | +5.40 | +7.78 | +6.00 | +0.58 | +2.20 | +0.25 | +3.05 |
Table 2: Importance of multimodal warm-up. Warming up Mull-Tokens with multimodal image-text CoT (MM warm-up, row e) is crucial for performance, compared to no warm-up (row c) or text-only warm-up (row d). The multimodal pretraining enables the tokens to effectively hold both visual and textual reasoning, leading to the best overall performance across benchmarks.

Mull-Tokens can be used in conjunction with interpretable text reasoning

Qualitative examples showing Mull-Tokens used with and without text reasoning
Figure 3: Flexible reasoning with Mull-Tokens. Mull-Tokens can be used in conjunction with text reasoning, and the model can decide when to skip text reasoning depending on the task. In the example on the left, the model uses Mull-Tokens along with some descriptive textual cues to reason about the missing piece. In the example on the right, the model uses only Mull-Tokens to directly predict the answer, since textual descriptions are unlikely to help detect camera motion.

Citation

If you find our work useful, please consider citing:

@misc{ray2025mulltokensmodalityagnosticlatentthinking,
      title={Mull-Tokens: Modality-Agnostic Latent Thinking}, 
      author={Arijit Ray and Ahmed Abdelkader and Chengzhi Mao and Bryan A. Plummer and Kate Saenko and Ranjay Krishna and Leonidas Guibas and Wen-Sheng Chu},
      year={2025},
      eprint={2512.10941},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.10941}, 
}