Mull-Tokens:
Modality-Agnostic Latent Thinking

Google · University of Washington · Stanford University · Boston University
Figure 1: Mull-Tokens enable modality-agnostic latent thinking. Our approach uses latent tokens that can hold intermediate reasoning information in either image or text modalities, allowing models to think through spatial reasoning problems more effectively than text-only or image-only approaches.

Abstract

Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale: they rely on calling specialist tools, costly image generation, or handcrafted datasets.

We offer a simpler alternative: Mull-Tokens, latent tokens that can be pre-trained to hold intermediate information in either the image or text modality so the model can think its way to the correct answer. We investigate best practices for training Mull-Tokens, drawing inspiration from latent reasoning frameworks: we first train Mull-Tokens with supervision from interleaved text-image traces, and then fine-tune without any intermediate supervision, using only the final answers.

Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines that use text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a reasoning-heavy puzzle-solving split compared to our strongest baseline. Adding to conversations around the challenges of grounding textual and visual reasoning, Mull-Tokens offer a simple way to think implicitly in multiple modalities.

Approach

Mull-Tokens training pipeline showing two-stage approach
Figure 2: Mull-Tokens training pipeline. Our Mull-Tokens training involves two stages inspired by approaches in latent reasoning. We first pre-train (warm up) the Mull-Tokens to hold both image and text modalities, depending on the context image/video and the query. Next, the model freely optimizes these Mull-Tokens toward the final correct answer. Pre-training the Mull-Tokens to hold both image and text reasoning traces is key, as opposed to using them simply as extra compute without such pretraining, or warming them up with text alone.
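As a concrete illustration of this two-stage recipe, here is a minimal PyTorch-style sketch. It assumes a fixed budget of K latent tokens spliced between the query and the answer, an MSE warm-up loss that pools the embeddings of the interleaved image-text trace down to the K latent positions, and a hidden size matching the backbone LLM. These specifics (K, the pooling, the exact warm-up objective, and the helper names) are our illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the two-stage Mull-Token recipe (illustrative assumptions only).
import torch
import torch.nn.functional as F

K = 8      # number of Mull-Tokens per example (assumed budget)
D = 3584   # hidden size of the backbone LLM (e.g., Qwen2.5-VL-7B)

# Learnable latent embeddings spliced between the query and the answer.
mull_tokens = torch.nn.Parameter(0.02 * torch.randn(K, D))

def splice(query_embeds: torch.Tensor, answer_embeds: torch.Tensor) -> torch.Tensor:
    """Build the [query | Mull-Tokens | answer] embedding sequence fed to the VLM."""
    mull = mull_tokens.unsqueeze(0).expand(query_embeds.size(0), -1, -1)
    return torch.cat([query_embeds, mull, answer_embeds], dim=1)

def warmup_loss(hidden_states, trace_embeds, query_len, answer_logits, answer_labels):
    """Stage 1 (warm-up): supervise the K latent positions with the interleaved
    image/text reasoning trace (pooled to K slots), plus cross-entropy on the answer."""
    latent_h = hidden_states[:, query_len:query_len + K]                        # (B, K, D)
    pooled = F.adaptive_avg_pool1d(trace_embeds.transpose(1, 2), K).transpose(1, 2)
    latent_loss = F.mse_loss(latent_h, pooled)
    answer_ce = F.cross_entropy(answer_logits.flatten(0, 1),
                                answer_labels.flatten(), ignore_index=-100)
    return answer_ce + latent_loss

def finetune_loss(answer_logits, answer_labels):
    """Stage 2: no supervision on the latents; only the final answer drives them."""
    return F.cross_entropy(answer_logits.flatten(0, 1),
                           answer_labels.flatten(), ignore_index=-100)
```

In the second stage the latent loss is dropped entirely, so gradients reach the Mull-Token embeddings only through the final answer tokens, matching the free-form optimization described above.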

Results

Key Contributions

Mull-Tokens are effective for visual reasoning.

| Model | BLINK MV | BLINK RelDep | BLINK SpRel | BLINK Jig | BLINK IQT | BLINK Avg | BLINK Reas | SAT-R | VSI Avg | VSI Reas | ERQA | Avg (All) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a. Qwen2.5-VL (7B) | 39.00 | 61.29 | 92.38 | 58.66 | 25.33 | 55.33 | 41.00 | 59.00 | 23.96 | 22.96 | 38.91 | 44.30 |
| b. + DirAns FT | 57.14 | 87.09 | 74.12 | 58.66 | 30.00 | 61.40 | 48.60 | 71.66 | 32.40 | 30.65 | 38.00 | 50.87 |
| Text-Reasoning Baselines | | | | | | | | | | | | |
| c. + TextCoT FT | 53.38 | 74.19 | 67.83 | 69.30 | 25.33 | 58.01 | 49.34 | 68.33 | 30.47 | 31.04 | 38.78 | 48.90 |
| Δ (vs. DirAns) | -3.76 | -12.90 | -6.29 | +10.64 | -4.67 | -3.39 | +0.74 | -3.33 | -1.93 | +0.39 | +0.78 | -1.97 |
| d. + GRPO | 54.88 | 71.77 | 69.23 | 72.00 | 25.33 | 58.64 | 50.74 | 69.00 | 30.34 | 30.15 | 36.00 | 48.50 |
| Δ (vs. DirAns) | -2.26 | -15.32 | -4.89 | +13.34 | -4.67 | -2.76 | +2.14 | -2.66 | -2.06 | -0.50 | -2.00 | -2.37 |
| Interleaved Image-Text Reasoning | | | | | | | | | | | | |
| e. + Interleave Im-Txt | 57.14 | 69.36 | 75.52 | 68.67 | 25.33 | 59.20 | 50.38 | 74.00 | 30.27 | 32.96 | 38.50 | 50.49 |
| Δ (vs. DirAns) | +0.00 | -17.73 | +1.40 | +10.01 | -4.67 | -2.20 | +1.78 | +2.34 | -2.13 | +2.31 | +0.50 | -0.38 |
| Mull-Tokens (Ours) | | | | | | | | | | | | |
| f. + Mull-Tokens | 63.15 | 83.06 | 81.81 | 74.00 | 32.00 | 66.80 | 56.38 | 77.66 | 32.98 | 32.85 | 38.25 | 53.92 |
| Δ (vs. DirAns) | +6.01 | -4.03 | +7.69 | +15.34 | +2.00 | +5.40 | +7.78 | +6.00 | +0.58 | +2.20 | +0.25 | +3.05 |
| g. + GRPO | 64.66 | 83.87 | 81.81 | 74.67 | 30.66 | 67.13 | 56.66 | 77.00 | 33.51 | 33.49 | 38.50 | 54.04 |
| Δ (vs. DirAns) | +7.52 | -3.22 | +7.69 | +16.01 | +0.66 | +5.73 | +8.06 | +5.34 | +1.11 | +2.84 | +0.50 | +3.17 |
Table 1: Performance comparison across spatial reasoning benchmarks. Mull-Tokens (rows f, g) improve over direct-answer finetuning (row b) and over existing approaches to visual reasoning that use text-only reasoning (rows c, d) or text-image interleaving (row e). Positive Δ values indicate improvements over the DirAns baseline; negative values indicate degradation.

Multimodal pretraining of the Mull-Tokens is crucial for performance.

| Model | BLINK MV | BLINK RelDep | BLINK SpRel | BLINK Jig | BLINK IQT | BLINK Avg | BLINK Reas | SAT-R | VSI Avg | VSI Reas | ERQA | Avg (All) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a. Qwen2.5-VL (7B) | 39.00 | 61.29 | 92.38 | 58.66 | 25.33 | 55.33 | 41.00 | 59.00 | 23.96 | 22.96 | 38.91 | 44.30 |
| b. + DirAns FT | 57.14 | 87.09 | 74.12 | 58.66 | 30.00 | 61.40 | 48.60 | 71.66 | 32.40 | 30.65 | 38.00 | 50.87 |
| Ablation: Training Strategy | | | | | | | | | | | | |
| c. + No warm-up | 55.63 | 79.03 | 81.11 | 51.33 | 28.66 | 59.15 | 45.21 | 67.33 | 29.77 | 27.38 | 37.75 | 48.50 |
| Δ (vs. DirAns) | -1.51 | -8.06 | +6.99 | -7.33 | -1.34 | -2.25 | -3.39 | -4.33 | -2.63 | -3.27 | -0.25 | -2.37 |
| d. + Text warm-up | 59.39 | 85.48 | 85.31 | 67.33 | 32.00 | 65.90 | 52.91 | 71.33 | 32.02 | 33.47 | 38.50 | 51.94 |
| Δ (vs. DirAns) | +2.25 | -1.61 | +11.19 | +8.67 | +2.00 | +4.50 | +4.31 | -0.33 | -0.38 | +2.82 | +0.50 | +1.07 |
| e. + MM warm-up | 63.15 | 83.06 | 81.81 | 74.00 | 32.00 | 66.80 | 56.38 | 77.66 | 32.98 | 32.85 | 38.25 | 53.92 |
| Δ (vs. DirAns) | +6.01 | -4.03 | +7.69 | +15.34 | +2.00 | +5.40 | +7.78 | +5.99 | +0.58 | +2.20 | +0.25 | +3.05 |
Table 2: Importance of multimodal warm-up. Warming up Mull-Tokens with multimodal image-text CoT (MM warm-up, row e) is crucial for performance, compared to no warm-up (row c) or text-only warm-up (row d). The multimodal pretraining enables the tokens to effectively hold both visual and textual reasoning, leading to the best overall performance across benchmarks.

Mull-Tokens can be used in conjunction with interpretable text reasoning.

Qualitative examples showing Mull-Tokens used with and without text reasoning
Figure 3: Flexible reasoning with Mull-Tokens. Mull-Tokens can be used in conjunction with text reasoning, and the model can decide, depending on the task, when to forgo text reasoning altogether. In the example on the left, the model accurately uses Mull-Tokens along with some descriptive textual cues to reason about the missing piece. In the example on the right, the model simply uses the Mull-Tokens to directly predict the answer, since textual descriptions are unlikely to help in detecting camera motion.
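A minimal inference-time sketch of this behaviour, under the same assumptions as the training sketch above (a HuggingFace-style causal VLM that accepts `inputs_embeds`, and the `mull_tokens` parameter defined there). The helper name and the omission of the vision encoder are ours, for illustration only.

```python
# Hypothetical decoding helper: append the Mull-Token embeddings after the encoded
# query and let the model decide whether to emit text reasoning before the answer.
import torch

@torch.no_grad()
def answer_with_mull_tokens(model, tokenizer, question, max_new_tokens=256):
    # Visual features would be prepended via the model's vision encoder; omitted here.
    inputs = tokenizer(question, return_tensors="pt")
    query_embeds = model.get_input_embeddings()(inputs.input_ids)    # (1, L, D)
    mull = mull_tokens.unsqueeze(0)                                  # (1, K, D), from the training sketch
    inputs_embeds = torch.cat([query_embeds, mull], dim=1)
    out = model.generate(inputs_embeds=inputs_embeds, max_new_tokens=max_new_tokens)
    # The generated string may contain text reasoning followed by the answer,
    # or just the answer, as in Figure 3.
    return tokenizer.decode(out[0], skip_special_tokens=True)
```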

Citation

If you find our work useful, please consider citing:

Coming soon!