Highlights:

  • 2025: Excited to spend my summer at Google, working with the Pixel team and Leonidas Guibas from DeepMind.
  • 2023: One of my students' papers was accepted as an oral at an ICCV 2023 workshop.
  • 2022: Excited to be spending my summer as a Research Scientist Intern at Meta (Facebook) AI (FAIR), working on the compositionality of large vision-language models.
  • 2022: We started [AI+X] at BU and Harvard, where we brainstorm research and venture ideas on how AI can impact different verticals.
  • 2019: I was the runner-up at the SRI CVT Shark Tank Competition, which supported my mini-project on understanding image-text content to reduce radicalization of opinions on social media.
  • 2017: The weed vs. plant detection system I helped develop for precision fertilizing played a key part in the acquisition of Blue River Technology by John Deere for 305 million USD.
  • 2014: Our UAV for helping locate natural-disaster victims was featured in national news: Deccan Chronicle and Indian Express.
  • 2012: I won an Academic Merit Scholarship from SRM University that waived part of my undergraduate tuition.

Selected publications (all):

2025

Mull-Tokens
Mull-Tokens: Modality-Agnostic Latent Thinking
arXiv 2025
TL;DR: Modality-agnostic latent thinking tokens, pretrained on multimodal thoughts, are more effective for visual reasoning tasks than text reasoning or explicit image thoughts.
SIMS-V
SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
arXiv 2025
TL;DR: A systematic data-generation framework using 3D simulators to create spatially-rich video training data for multimodal language models. Our 7B-parameter model fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on real-world spatial reasoning benchmarks.
GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation
TL;DR: Combine textual reasoning (e.g., one should grasp the "handle" of a cup to drink tea) and visual reasoning (e.g., in this image, the cup is grasped by the "handle") with an object-grasping dataset (6-DOF arm poses for grasping objects) to turn high-level task descriptions (e.g., how should I grasp a cup to drink tea?) into precise 6-DOF grasps.
SAT
SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models
TL;DR: Simulated spatial aptitude data (SAT) can improve spatial reasoning on real images and videos for MLMs while maintaining pretrained commonsense. When instruction-tuned on SAT, LLaVA-13B outperforms some larger MLMs, such as GPT-4V and Gemini-1.5-Pro, in spatial reasoning.

2024

R2D3
R2D3: Imparting Spatial Reasoning by Reconstructing 3D Scenes from 2D Images
Tech report 2024
TL;DR: A visual anchor with the corresponding 3D location in text during training helps multimodal language models perform 3D estimation.
FED
FED: Feedback-Guided Autonomous Driving
TL;DR: MLMs can benefit autonomous driving by understanding natural language feedback and refining the next waypoint prediction.

2023

Cola
Cola: A Benchmark for Compositional Text-to-image Retrieval
TL;DR: Tuning multimodal layers improves unseen compositional reasoning in CLIP-style vision-language models more than tuning any other part of the model.
Lasagna
Lasagna: Layered Score Distillation for Disentangled Object Relighting
TL;DR: Synthetically generated examples are effective in teaching physics-aware edits like relighting if we use score-distillation to avoid overfitting.
Socratis
Socratis: Are Large Multimodal Models Emotionally Aware?
TL;DR: A preliminary benchmark testing whether MLMs can explain why different people may feel different emotions, for different reasons, while viewing the same image-text content.

2021

User-targeted content generation
User-targeted content generation using multimodal embeddings
VQA Error Regions
Knowing What VQA Does Not: Pointing to Error-Inducing Regions to Improve Explanation Helpfulness

2019

VQA Consistency
Sunny and Dark Outside?! Improving Answer Consistency in VQA through Entailed Question Generation
EMNLP 2019, also at CVPR-W 2019 VQA and Visual Dialog Workshop

2016

VQA Relevance
Question Relevance in VQA: Identifying Non-Visual And False-Premise Questions

Press Coverage

Hobbies

When I am not training LLMs, I love going to techno (a subgenre of electronic music) festivals, making latte art, and engineering simple gadgets. In middle school, I started an informal research society to encourage fellow students to take an interest in science by building simple gadgets; we won multiple accolades at school and city-level exhibitions.

Miscellanea

Contact

The best way to reach me is to drop an email to array at bu dot edu.