Yi Xin, Junlong Du, Qiang Wang, Ke Yan, Shouhong Ding: MmAP: Multi-modal Alignment Prompt for Cross-domain Multi-task Learning. https://arxiv.org/pdf/2312.08636
Jingsheng Gao, Jiacheng Ruan, Suncheng Xiang, Zefang Yu, Ke Ji, Mingye Xie, Ting Liu, Yuzhuo Fu: LAMM: Label Alignment for Multi-Modal Prompt Learning. https://arxiv.org/pdf/2312.08212
Vincent Tao Hu, Wei Zhang, Meng Tang, Pascal Mettes, Deli Zhao, Cees Snoek: Latent Space Editing in Transformer-Based Flow Matching.
Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, Zhuowen Tu: BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions.
Xiaoming Hu, Zilei Wang: A Dynamic Learning Method towards Realistic Compositional Zero-Shot Learning.
Guibiao Liao, Jiankun Li, Xiaoqing Ye: VLM2Scene: Self-Supervised Image-Text-LiDAR Learning with Foundation Models for Autonomous Driving Scene Understanding.
Yuqi Lin, Minghao Chen, Kaipeng Zhang, Hengjia Li, Mingming Li, Zheng Yang, Dongqin Lv, Binbin Lin, Haifeng Liu, Deng Cai: TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP without Training.
Chao Liu, Ting Zhao, Nenggan Zheng: DeepBranchTracer: A Generally-Applicable Approach to Curvilinear Structure Reconstruction Using Multi-Feature Learning.
Fan Ma, Xiaojie Jin, Heng Wang, Jingjia Huang, Linchao Zhu, Yi Yang: Stitching Segments and Sentences towards Generalization in Video-Text Pre-training.
Dejie Yang, Zijing Zhao, Yang Liu: PlanLLM: Video Procedure Planning with Refinable Large Language Models.
3D Scene Reconstruction
Hao Wu, Yuxuan Liang, Wei Xiong, Zhengyang Zhou, Wei Huang, Shilong Wang, Kun Wang: Earthfarseer: Versatile Spatio-Temporal Dynamical Systems Modeling in One Model.
Zechen Li, Weiming Huang, Kai Zhao, Min Yang, Yongshun Gong, Meng Chen: Urban Region Embedding via Multi-View Contrastive Prediction.
Yubin Hu, Sheng Ye, Wang Zhao, Matthieu Lin, Yuze He, Yu-Hui Wen, Ying He, Yong-Jin Liu: O^2-Recon: Completing 3D Reconstruction of Occluded Objects in the Scene with a Pre-trained 2D Diffusion Model.
Shijian Jiang, Qi Ye, Rengan Xie, Yuchi Huo, Xiang Li, Yang Zhou, Jiming Chen: In-Hand 3D Object Reconstruction from a Monocular RGB Video.
GeonU Kim, Kim Youwang, Tae-Hyun Oh: FPRF: Feed-Forward Photorealistic Style Transfer of Large-Scale 3D Neural Radiance Fields.
Shengtao Li, Ge Gao, Yudong Liu, Yu-Shen Liu, Ming Gu: GridFormer: Point-Grid Transformer for Surface Reconstruction.
Xin Lin, Chong Shi, Yibing Zhan, Zuopeng Yang, Yaqi Wu, Dacheng Tao: TD²-Net: Toward Denoising and Debiasing for Video Scene Graph Generation.
Youtian Lin: Ced-NeRF: A Compact and Efficient Method for Dynamic Neural Radiance Fields.
Object Detection
Yuhao Huang, Sanping Zhou, Junjie Zhang, Jinpeng Dong, Nanning Zheng: Voxel or Pillar: Exploring Efficient Point Cloud Representation for 3D Object Detection.
Joonhyun Jeong, Geondo Park, Jayeon Yoo, Hyungsik Jung, Heesu Kim: ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open-Vocabulary Object Detection.
Xiaohui Jiang, Shuailin Li, Yingfei Liu, Shihao Wang, Fan Jia, Tiancai Wang, Lijin Han, Xiangyu Zhang: Far3D: Expanding the Horizon for Surround-View 3D Object Detection.
Yang Jiao, Zequn Jie, Shaoxiang Chen, Lechao Cheng, Jingjing Chen, Lin Ma, Yu-Gang Jiang: Instance-Aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning.
Xin Jin, Kai Liu, Cong Ma, Ruining Yang, Fei Hui, Wei Wu: SwiftPillars: High-Efficiency Pillar Encoder for Lidar-Based 3D Detection.
Seunggu Kang, WonJun Moon, Euiyeon Kim, Jae-Pil Heo: VLCounter: Text-Aware Visual Representation for Zero-Shot Object Counting.
Yogesh Kumar, Saswat Mallick, Anand Mishra, Sowmya Rasipuram, Anutosh Maitra, Roshni R. Ramnani: QDETRv: Query-Guided DETR for One-Shot Object Localization in Videos.
Jinxiang Lai, Wenlong Wu, Bin-Bin Gao, Jun Liu, Jiawei Zhan, Congchong Nie, Yi Zeng, Chengjie Wang: MatchDet: A Collaborative Framework for Image Matching and Object Detection.
Jiaming Liu, Yue Wu, Maoguo Gong, Qiguang Miao, Wenping Ma, Cai Xu, Can Qin: M3SOT: Multi-Frame, Multi-Field, Multi-Space 3D Single Object Tracking.
Liu Liu, Anran Huang, Qi Wu, Dan Guo, Xun Yang, Meng Wang: KPA-Tracker: Towards Robust and Real-Time Category-Level Articulated Object 6D Pose Tracking.
Sahal Shaji Mullappilly, Abhishek Singh Gehlot, Rao Muhammad Anwer, Fahad Shahbaz Khan, Hisham Cholakkal: Semi-supervised Open-World Object Detection.
3D Scene Understanding
Bohan Li, Yasheng Sun, Jingxin Dong, Zheng Zhu, Jinming Liu, Xin Jin, Wenjun Zeng: One at a Time: Progressive Multi-Step Volumetric Probability Learning for Reliable 3D Scene Perception.
Hanxuan Li, Bin Fu, Ruiping Wang, Xilin Chen: Point2Real: Bridging the Gap between Point Cloud and Realistic Image for Open-World 3D Recognition.
Xiawei Li, Qingyuan Xu, Jing Zhang, Tianyi Zhang, Qian Yu, Lu Sheng, Dong Xu: Multi-Modality Affinity Inference for Weakly Supervised 3D Semantic Segmentation.
Matthieu Lin, Jenny Sheng, Yubin Hu, Yangguang Li, Lu Qi, Andrew Zhao, Gao Huang, Yong-Jin Liu: Exploring Temporal Feature Correlation for Efficient and Stable Video Semantic Segmentation.
Xingyu Liu, Pengfei Ren, Yuanyuan Gao, Jingyu Wang, Haifeng Sun, Qi Qi, Zirui Zhuang, Jianxin Liao: Keypoint Fusion for RGB-D Based 3D Hand Pose Estimation.
Ziyang Lu, Yunqiang Pei, Guoqing Wang, Peiwei Li, Yang Yang, Yinjie Lei, Heng Tao Shen: ScanERU: Interactive 3D Visual Grounding Based on Embodied Reference Understanding.
Run Luo, Zikai Song, Lintao Ma, Jinlin Wei, Wei Yang, Min Yang: DiffusionTrack: Diffusion Model for Multi-Object Tracking.
Zhipeng Luo, Gongjie Zhang, Changqing Zhou, Zhonghua Wu, Qingyi Tao, Lewei Lu, Shijian Lu: Modeling Continuous Motion for 3D Point Cloud Object Tracking.
Wentao Mo, Yang Liu: Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA.
Wenzhe Ouyang, Xiaolin Song, Bailan Feng, Zenglin Xu: OctOcc: High-Resolution 3D Occupancy Prediction with Octree.
Zhiyi Pan, Nan Zhang, Wei Gao, Shan Liu, Ge Li: Less Is More: Label Recommendation for Weakly Supervised Point Cloud Semantic Segmentation.
Vision-Language Navigation
Xiulong Liu, Sudipta Paul, Moitreya Chatterjee, Anoop Cherian: CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments.