GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation
TL;DR: Combine textual reasoning (e.g., one should grasp the "handle" of a cup to drink tea) and visual reasoning
(e.g., in this image, the cup is grasped by the "handle")
with an object-grasping dataset (6-DOF gripper poses for grasping objects) to
map high-level task descriptions (e.g., "how should I grasp the cup to drink tea?") to precise 6-DOF grasps.