Generalized Robotic Vision-Language Learning Model via Linguistic Foreground-Aware Contrast
Published in International Journal of Computer Vision, 2025
Abstract: Contrastive learning has recently demonstrated great potential for unsupervised pre-training in 3D scene understanding tasks. However, most existing work randomly selects point features as anchors while building contrast, leading to a clear bias toward background points, which often dominate in 3D scenes. Moreover, object awareness and foreground-to-background discrimination are neglected, making contrastive learning less effective. To tackle these issues, we propose a general foreground-aware feature contrast (FAC++) framework to learn more effective point cloud representations in pre-training. FAC++ consists of two novel contrast designs that construct more effective and informative contrast pairs. The first builds positive pairs within the same foreground segment, where points tend to share the same semantics. The second prevents over-discrimination between 3D segments/objects and instead encourages grouped foreground-to-background distinctions at the segment level, using adaptive feature learning in a Siamese correspondence network that adaptively learns feature correlations within and across point cloud views. Our proposed approach enhances both local coherence and overall feature discrimination. Moreover, we design linguistic foreground-aware regional point sampling to promote more balanced foreground-aware learning; the resulting framework is termed FAC++. Visualization with point activation maps shows that our contrast pairs capture clear correspondences among foreground regions during pre-training. Quantitative experiments also show that FAC++ achieves superior knowledge transfer and data efficiency on various downstream tasks, including 3D semantic segmentation, instance segmentation, and object detection. All codes, data, and models are available at: (https://github.com/KangchengLiu/FAC_Foreground_Aware_Contrast).
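To make the segment-level contrast concrete, the following is a minimal sketch, not the authors' released implementation, of a foreground-aware InfoNCE loss over matched foreground segments across two augmented views. All names here (`foreground_aware_info_nce`, `pool_by_segment`, `temperature`) are illustrative assumptions, and per-point features are assumed to be pooled into one embedding per segment before contrast.

```python
# A minimal sketch (assumed, not the released FAC++ code) of a
# segment-level foreground-aware InfoNCE contrast in PyTorch.
import torch
import torch.nn.functional as F

def foreground_aware_info_nce(segment_feats_v1: torch.Tensor,
                              segment_feats_v2: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Contrast matched foreground segments across two views.

    segment_feats_v1, segment_feats_v2: (S, D) embeddings of the same S
    foreground segments under two augmented views; row i of each tensor
    describes the same segment, so pairs (i, i) are positives and all
    pairs (i, j) with j != i serve as negatives.
    """
    z1 = F.normalize(segment_feats_v1, dim=1)
    z2 = F.normalize(segment_feats_v2, dim=1)
    logits = z1 @ z2.t() / temperature           # (S, S) similarity logits
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)      # positives on the diagonal

def pool_by_segment(feats: torch.Tensor, seg_ids: torch.Tensor) -> torch.Tensor:
    """Mean-pool point features (N, D) into per-segment embeddings (S, D),
    given foreground segment labels seg_ids of shape (N,)."""
    segs = seg_ids.unique()
    return torch.stack([feats[seg_ids == s].mean(dim=0) for s in segs])
```

In this sketch, restricting both anchors and positives to foreground segments is what counters the background bias described above; the grouped foreground-to-background distinction and the Siamese correspondence network of FAC++ are not modeled here.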
Recommended citation: Liu, K., Wang, C., Han, X. et al. Generalized Robot Vision-Language Model via Linguistic Foreground-Aware Contrast. Int J Comput Vis (2025). https://doi.org/10.1007/s11263-024-02340-z