Research Area:  Machine Learning
Language-conditioned segmentation and grasping (LCSG) requires the robot to simultaneously identify and grasp a specific object in accordance with human linguistic instruction. Existing methods generally involve semantic matching between the entire instruction and raw RGB images to ground the desired object, generating a grasp pose for that object. However, they overlook the semantic effectiveness of the nouns in the instruction, struggling with fine-grained object localization. Furthermore, they lack sufficient geometry context of the object to reason about the optimal grasp pose, resulting in instability and failures in cluttered environments. In this article, we propose a knowledge-augmented refinement network (KARNet) to jointly conduct fine-grained object segmentation and grasping detection to tackle these challenges. Specifically, to mine the semantic context of the nouns, we introduce an entity semantic enhancement (ESE) module to fuse the knowledge from both the external knowledge base and the contrastive language-image pretraining (CLIP). Besides, a refinement decoder is proposed to generate segmentation masks and incorporate the geometry-aware features to yield suitable grasp poses for the desired objects. Notably, to achieve better context fusion between linguistics and vision, we further introduce a language-guided object parsing (LGOP) module to conduct coarse-to-fine multimodal fusion. We conduct extensive experiments on the cluttered household dataset and demonstrate that our proposed approach attains a grasping accuracy of 97.97% and segmentation OIoU of 94.35%, reaching state-of-the-art performance. The effectiveness of our method is further validated in real-world applications.
Keywords:  
Author(s) Name:  Jialong Xie; Jin Liu; Zhenwei Zhu; Chaoqun Wang; Peng Duan; Fengyu Zhou
Journal name:  IEEE Transactions on Instrumentation and Measurement
Conference name:  
Publisher name:  IEEE
DOI:  10.1109/TIM.2024.3446625
Volume Information:  Volume: 73 (2024)
Paper Link:   https://ieeexplore.ieee.org/document/10640111