Highlights:

  • 2025: Excited to spend my summer at Google, working with the Pixel team and Leonidas Guibas from DeepMind.
  • 2023: One of my students' papers was accepted as an oral at an ICCV 2023 workshop.
  • 2022: Excited to be spending my summer as a Research Scientist Intern at Meta (Facebook) AI (FAIR), working on the compositionality of large vision-language models.
  • 2022: We started [AI+X] at BU and Harvard, where we brainstorm research and venture ideas on how AI can impact different verticals.
  • 2019: I was the runner-up at the SRI CVT Shark Tank Competition, which supported my mini-project on understanding image-text content to reduce radicalization of opinions on social media.
  • 2017: The weed vs. plant detection system I helped develop for precision fertilizing played a key part in the acquisition of Blue River Technology by John Deere for 305 million USD.
  • 2014: Our UAV for helping locate natural-disaster victims was featured in national news: Deccan Chronicle and Indian Express.
  • 2012: I won an Academic Merit Scholarship from SRM University that waived part of my undergraduate tuition.

Selected publications (all):

2025

Mull-Tokens
Mull-Tokens: Modality-Agnostic Latent Thinking
arXiv 2025
TL;DR: Modality-agnostic latent thinking tokens, pretrained on multimodal thoughts, are more effective for visual reasoning tasks than text reasoning or explicit image thoughts.
SIMS-V
SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
arXiv 2025
TL;DR: A systematic data-generation framework using 3D simulators to create spatially-rich video training data for multimodal language models. Our 7B-parameter model fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on real-world spatial reasoning benchmarks.
GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation
TL;DR: Combine textual reasoning (e.g., one should grasp the "handle" of a cup to drink tea) and visual reasoning (e.g., in this image, the cup is grasped by the "handle") with an object-grasping dataset (6-DOF arm poses for grasping objects) to turn high-level task descriptions (e.g., how should I grasp a cup to drink tea?) into precise 6-DOF grasps.
SAT
SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models
TL;DR: Simulated spatial aptitude data (SAT) can improve spatial reasoning on real images and videos for MLMs while maintaining pretrained commonsense. When instruction-tuned on SAT, LLaVA-13B outperforms some larger MLMs, such as GPT-4V and Gemini-1.5-Pro, in spatial reasoning.

2024

R2D3
R2D3: Imparting Spatial Reasoning by Reconstructing 3D Scenes from 2D Images
Tech report 2024
TL;DR: A visual anchor with the corresponding 3D location in text during training helps multimodal language models perform 3D estimation.
FED
FED: Feedback-Guided Autonomous Driving
TL;DR: MLMs can benefit autonomous driving by understanding natural language feedback and refining the next waypoint prediction.

2023

Cola
Cola: A Benchmark for Compositional Text-to-image Retrieval
TL;DR: Tuning multimodal layers improves unseen compositional reasoning in CLIP-style vision-language models more than tuning any other part of the model.
Lasagna
Lasagna: Layered Score Distillation for Disentangled Object Relighting
TL;DR: Synthetically generated examples are effective in teaching physics-aware edits like relighting if we use score-distillation to avoid overfitting.
Socratis
Socratis: Are Large Multimodal Models Emotionally Aware?
TL;DR: A preliminary benchmark testing whether MLMs can explain why different people may feel different emotions, for different reasons, while viewing the same image-text content.

2021

User-targeted content generation
User-targeted content generation using multimodal embeddings
VQA Error Regions
Knowing What VQA Does Not: Pointing to Error-Inducing Regions to Improve Explanation Helpfulness

2019

VQA Consistency
Sunny and Dark Outside?! Improving Answer Consistency in VQA through Entailed Question Generation
EMNLP 2019, also at CVPR-W 2019 VQA and Visual Dialog Workshop

2016

VQA Relevance
Question Relevance in VQA: Identifying Non-Visual And False-Premise Questions

Press Coverage

Hobbies

When I am not training LLMs, I love going to techno (a subgenre of electronic music) festivals, making latte art, and engineering simple gadgets. In middle school, I started an informal research society to encourage fellow students to take an interest in science by building simple gadgets; we won multiple accolades at school and city-level exhibitions.

Miscellanea

Contact

The best way to reach me is to drop an email to array at bu dot edu.