[1] Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning. PMLR, 2021: 8748-8763.
[2] Zhou C, Loy C C, Dai B. Extract free dense labels from CLIP[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 696-712.
[3] Dai A, Chang A X, Savva M, et al. ScanNet: Richly-annotated 3D reconstructions of indoor scenes[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 5828-5839.
[4] Caesar H, Bankiti V, Lang A H, et al. nuScenes: A multimodal dataset for autonomous driving[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 11621-11631.
[5] Chen R, Liu Y, Kong L, et al. CLIP2Scene: Towards label-efficient 3D scene understanding by CLIP[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 7020-7030.
[6] Chen N, Chu L, Pan H, et al. Self-supervised image representation learning with geometric set consistency[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 19292-19302.
[7] Chen R, Liu Y, Kong L, et al. Towards label-free scene understanding by vision foundation models[J]. Advances in Neural Information Processing Systems, 2024, 36.