Mull-Tokens:
Modality-Agnostic Latent Thinking

Multimodal Latent Reasoning Tokens

Figure 1: Mull-Tokens enable modality-agnostic latent thinking. Our approach uses latent tokens that can hold intermediate reasoning information in either image or text modalities, allowing models to think through spatial reasoning problems more effectively than text-only or image-only approaches.
TL;DR: Instead of text-only reasoning or explicit image thoughts, modality-agnostic (multimodal) latent reasoning tokens pretrained with multimodal chain-of-thought are more effective for visual reasoning tasks.

Abstract

Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models that explore reasoning with images are brittle and do not scale: they rely on calling specialist tools, costly image generation, or handcrafted reasoning data to switch between text and image thoughts.

We offer a simpler alternative — Mull-Tokens — latent tokens that can be pre-trained to hold intermediate information in either image or text modalities so as to think towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers.

Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle-solving, reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offer a simple way to implicitly think in multiple modalities.

Prelude: We don't know of a template for solving visual reasoning problems.

Consider this spatial reasoning question:

A room with marked points — X marks where you sit, 12 marks the chair location
If I sit near the 'X' marked point in the image and face 90° to the left, will the chair (near marked 12) be to my left or right?

Maybe you'd try to visualize the scene from that viewpoint…

The scene viewed from the X point facing 90 degrees left — the chair is not visible
You can't even see the chair. If we train models to follow this blueprint for reasoning, it may not always work.

Other approaches use purely textual thoughts, but multiple recent studies show that text reasoning can actually hurt visual reasoning performance.

What if we let models abstractly define their own thought process for visual reasoning?

We propose Mull-Tokens: discrete tokens pretrained on multimodal chain-of-thought examples, then free-form optimized, with no intermediate supervision, to arrive at the correct answer.

Approach

Mull-Tokens training pipeline showing two-stage approach
Figure 2: Mull-Tokens training pipeline. Our Mull-Tokens training involves two stages inspired by approaches in latent reasoning. We first pre-train/warm-up our Mull-Tokens to hold both image and text modalities depending on the context image/video and query. Next, the model free-form optimizes these Mull-Tokens to achieve the final correct answer. We see that pre-training the Mull-Tokens to hold both image and text reasoning traces is key, as opposed to using them simply as extra compute without such pretraining or using text alone.
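To make the two-stage recipe concrete, here is a minimal numpy sketch. All names, shapes, and loss forms below are our illustrative assumptions, not the paper's implementation: a small bank of learnable latent-token embeddings is appended after the visual context and query; stage one regresses the latent states toward features of an interleaved image/text chain-of-thought trace, and stage two keeps only a cross-entropy loss on the final answer, leaving the latents free-form.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_MULL, DIM = 20, 8  # 20 latent tokens; toy embedding width

# Learnable bank of Mull-Token embeddings (hypothetical parameterization;
# in the real model the latent tokens live inside a VLM backbone).
mull_bank = rng.normal(size=(NUM_MULL, DIM))

def build_sequence(context_feats, query_embs):
    """Append the Mull-Token bank after the visual context and text query."""
    return np.concatenate([context_feats, query_embs, mull_bank], axis=0)

def warmup_loss(mull_states, cot_feats):
    """Stage 1 (warm-up): regress latent states toward features of the
    interleaved image/text chain-of-thought trace."""
    return float(np.mean((mull_states - cot_feats) ** 2))

def answer_loss(logits, answer_ids):
    """Stage 2 (fine-tuning): cross-entropy on the final answer only;
    the latent states are free-form, with no direct supervision."""
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return float(-np.mean(np.log(probs[np.arange(len(answer_ids)), answer_ids])))

# Toy usage: 6 image-patch features plus 4 query-token embeddings.
seq = build_sequence(rng.normal(size=(6, DIM)), rng.normal(size=(4, DIM)))
print(seq.shape)  # (30, 8): context + query + 20 Mull-Tokens
```

In stage two, gradients from the answer loss still flow back through the latent positions, which is what lets the model repurpose them freely once the warm-up supervision is removed.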

Results

Key Contributions

Just 20 Mull-Tokens outperform other reasoning techniques.

Unlike text chain-of-thought or interleaved image-text thoughts, which employ thousands of tokens as rationales, just 20 Mull-Tokens are effective: they improve performance while being faster for visual reasoning tasks.
| Model | BLINK MV | BLINK RelDep | BLINK SpRel | BLINK Jig | BLINK IQT | BLINK Avg | BLINK Reas | SAT-R | VSI Avg | VSI Reas | ERQA | Avg (All) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a. Qwen2.5-VL (7B) | 39.00 | 61.29 | **92.38** | 58.66 | 25.33 | 55.33 | 41.00 | 59.00 | 23.96 | 22.96 | **38.91** | 44.30 |
| b. + DirAns FT | 57.14 | **87.09** | 74.12 | 58.66 | 30.00 | 61.40 | 48.60 | 71.66 | 32.40 | 30.65 | 38.00 | 50.87 |
| *Text-Reasoning Baselines* | | | | | | | | | | | | |
| c. + TextCoT FT | 53.38 | 74.19 | 67.83 | 69.30 | 25.33 | 58.01 | 49.34 | 68.33 | 30.47 | 31.04 | 38.78 | 48.90 |
| Δ (vs. DirAns) | -3.76 | -12.90 | -6.29 | +10.64 | -4.67 | -3.39 | +0.74 | -3.33 | -1.93 | +0.39 | +0.78 | -1.97 |
| d. + GRPO | 54.88 | 71.77 | 69.23 | 72.00 | 25.33 | 58.64 | 50.74 | 69.00 | 30.34 | 30.15 | 36.00 | 48.50 |
| Δ (vs. DirAns) | -2.26 | -15.32 | -4.89 | +13.34 | -4.67 | -2.76 | +2.14 | -2.66 | -2.06 | -0.50 | -2.00 | -2.37 |
| *Interleaved Image-Text Reasoning* | | | | | | | | | | | | |
| e. + Interleave Im-Txt | 57.14 | 69.36 | 75.52 | 68.67 | 25.33 | 59.20 | 50.38 | 74.00 | 30.27 | 32.96 | 38.50 | 50.49 |
| Δ (vs. DirAns) | +0.00 | -17.73 | +1.40 | +10.01 | -4.67 | -2.20 | +1.78 | +2.34 | -2.13 | +2.31 | +0.50 | -0.38 |
| *Mull-Tokens (Ours)* | | | | | | | | | | | | |
| f. + Mull-Tokens | 63.15 | 83.06 | 81.81 | 74.00 | **32.00** | 66.80 | 56.38 | **77.66** | 32.98 | 32.85 | 38.25 | 53.92 |
| Δ (vs. DirAns) | +6.01 | -4.03 | +7.69 | +15.34 | +2.00 | +5.40 | +7.78 | +6.00 | +0.58 | +2.20 | +0.25 | +3.05 |
| g. + GRPO | **64.66** | 83.87 | 81.81 | **74.67** | 30.66 | **67.13** | **56.66** | 77.00 | **33.51** | **33.49** | 38.50 | **54.04** |
| Δ (vs. DirAns) | +7.52 | -3.22 | +7.69 | +16.01 | +0.66 | +5.73 | +8.06 | +5.34 | +1.11 | +2.84 | +0.50 | +3.17 |

Table 1: Performance comparison across spatial reasoning benchmarks. "Reas" denotes the reasoning-heavy split of BLINK and of VSI. Mull-Tokens (rows f, g) improve over direct-answer finetuning (row b) and over existing visual reasoning methods using text-only reasoning (rows c, d) or text-image interleaving (row e). Δ rows give the change relative to the DirAns baseline (row b): positive values indicate improvements, negative values degradation. Best score in each column is shown in bold.
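The fixed thinking budget behind these numbers is easy to see in how a prompt is assembled. The sketch below is purely illustrative: the `<mull>` placeholder string, the query, and the rationale length of 1200 are our assumptions for the example, not measurements from the paper.

```python
# Hypothetical prompt assembly: a fixed budget of 20 "<mull>" placeholders
# stands in for a free-length textual rationale before the answer.
MULL, NUM_MULL = "<mull>", 20

def mull_prompt(query_tokens):
    """Fixed thinking budget: always exactly NUM_MULL latent positions."""
    return query_tokens + [MULL] * NUM_MULL + ["<answer>"]

def cot_prompt(query_tokens, rationale_tokens):
    """Text chain-of-thought: the rationale length varies per example."""
    return query_tokens + rationale_tokens + ["<answer>"]

query = ["where", "is", "the", "chair", "?"]
rationale = ["step"] * 1200  # an illustrative long rationale

print(len(mull_prompt(query)))            # 26 positions
print(len(cot_prompt(query, rationale)))  # 1206 positions
```

Because the latent budget is constant and small, decoding cost no longer grows with the length of a rationale.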

Mull-Tokens are not simply extra compute: multimodal information in them is crucial.

| Model | BLINK MV | BLINK RelDep | BLINK SpRel | BLINK Jig | BLINK IQT | BLINK Avg | BLINK Reas | SAT-R | VSI Avg | VSI Reas | ERQA | Avg (All) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a. Qwen2.5-VL (7B) | 39.00 | 61.29 | **92.38** | 58.66 | 25.33 | 55.33 | 41.00 | 59.00 | 23.96 | 22.96 | **38.91** | 44.30 |
| b. + DirAns FT | 57.14 | **87.09** | 74.12 | 58.66 | 30.00 | 61.40 | 48.60 | 71.66 | 32.40 | 30.65 | 38.00 | 50.87 |
| *Ablation: Training Strategy* | | | | | | | | | | | | |
| c. + No warm-up | 55.63 | 79.03 | 81.11 | 51.33 | 28.66 | 59.15 | 45.21 | 67.33 | 29.77 | 27.38 | 37.75 | 48.50 |
| Δ (vs. DirAns) | -1.51 | -8.06 | +6.99 | -7.33 | -1.34 | -2.25 | -3.39 | -4.33 | -2.63 | -3.27 | -0.25 | -2.37 |
| d. + Text warm-up | 59.39 | 85.48 | 85.31 | 67.33 | **32.00** | 65.90 | 52.91 | 71.33 | 32.02 | **33.47** | 38.50 | 51.94 |
| Δ (vs. DirAns) | +2.25 | -1.61 | +11.19 | +8.67 | +2.00 | +4.50 | +4.31 | -0.33 | -0.38 | +2.82 | +0.50 | +1.07 |
| e. + MM warm-up | **63.15** | 83.06 | 81.81 | **74.00** | **32.00** | **66.80** | **56.38** | **77.66** | **32.98** | 32.85 | 38.25 | **53.92** |
| Δ (vs. DirAns) | +6.01 | -4.03 | +7.69 | +15.34 | +2.00 | +5.40 | +7.78 | +6.00 | +0.58 | +2.20 | +0.25 | +3.05 |
Table 2: Importance of multimodal warm-up. Warming up Mull-Tokens with multimodal image-text CoT (MM warm-up, row e) is crucial for performance, compared to no warm-up (row c) or text-only warm-up (row d). The multimodal pretraining enables the tokens to effectively hold both visual and textual reasoning, leading to the best overall performance across benchmarks.

Mull-Tokens can be used in conjunction with interpretable text reasoning

Qualitative examples showing Mull-Tokens used with and without text reasoning
Figure 3: Flexible reasoning with Mull-Tokens. Mull-Tokens can be used in conjunction with text reasoning, and the model can decide when to skip text reasoning depending on the task. In the example on the left, the model uses Mull-Tokens along with some descriptive textual cues to reason about the missing piece. In the example on the right, the model uses only Mull-Tokens to directly predict the answer, since textual descriptions are unlikely to help detect camera motion.

Citation

If you find our work useful, please consider citing:

@misc{ray2025mulltokensmodalityagnosticlatentthinking,
      title={Mull-Tokens: Modality-Agnostic Latent Thinking}, 
      author={Arijit Ray and Ahmed Abdelkader and Chengzhi Mao and Bryan A. Plummer and Kate Saenko and Ranjay Krishna and Leonidas Guibas and Wen-Sheng Chu},
      year={2025},
      eprint={2512.10941},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.10941}, 
}