Putting large models into production vehicles has undoubtedly been one of the main themes of this year's autonomous driving mass-production and pre-research work, and in December 2024 Xiaomi also announced that its large model had been rolled out via OTA. So where do multimodal large models for autonomous driving stand today, and which work is worth learning from? Today, 自动驾驶之心 (Autonomous Driving Heart) takes stock of the surveys and related work. All materials have been compiled in the '自动驾驶之心知识星球' (Autonomous Driving Heart Knowledge Planet); you are welcome to join to get them.
I. Awesome Lists and Surveys
1. LLMs in intelligent transportation systems and autonomous driving
https://github.com/ge25nab/Awesome-VLM-AD-ITS
2. AIGC and LLMs
https://github.com/coderonion/awesome-llm-and-aigc
3. A survey of vision-language models
https://github.com/jingyi0000/VLM_survey
4. Prompt/adapter learning methods for vision-language models such as CLIP (a zero-shot CLIP baseline sketch follows this list)
https://github.com/zhengli97/Awesome-Prompt-Adapter-Learning-for-VLMs
5. A list of LLM/VLM inference papers, with accompanying code
https://github.com/DefTruth/Awesome-LLM-Inference
6. A reading list on large-model safety, security, and privacy (including awesome LLM security, safety, etc.)
https://github.com/ThuCCSLab/Awesome-LM-SSP
7. A knowledge base covering single/multi-agent systems, robotics, LLM/VLM/VLA, scientific discovery, and more
https://github.com/weleen/awesome-agent
8. A curated paper list on Embodied AI and related research/industry-driven resources
https://github.com/haoranD/Awesome-Embodied-AI
9. A curated list of inference strategies and algorithms for improving vision-language model (VLM) performance
https://github.com/Patchwork53/awesome-vlm-inference-strategies
10. Notable vision-language models and their architectures
https://github.com/gokayfem/awesome-vlm-architectures
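For readers new to item 4 above: the prompt/adapter learning methods collected there build on CLIP's zero-shot classification interface, where class names are wrapped in hand-written text prompts and scored against the image. The sketch below shows that baseline using Hugging Face transformers; the checkpoint name, image file, and prompt strings are illustrative assumptions, not taken from any of the repositories listed above.

```python
# Minimal zero-shot CLIP classification sketch (assumes `torch`, `transformers`,
# and `Pillow` are installed; checkpoint, image file, and prompts are illustrative).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("street_scene.jpg")  # hypothetical driving-scene image
# Hand-crafted prompts; prompt/adapter learning methods replace these fixed
# templates with learnable context tokens or lightweight adapter layers.
labels = ["a photo of a pedestrian", "a photo of a traffic light", "a photo of a truck"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-text similarity -> class probabilities
print({label: float(p) for label, p in zip(labels, probs[0])})
```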
II. Vision-Language Model (VLM) Fundamentals
1. Pretraining (a CLIP-style contrastive-loss sketch follows the paper list)
[arXiv 2024] RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness Paper(https://github.com/RLHF-V/RLAIF-V)
[CVPR 2024] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback Paper(https://github.com/RLHF-V/RLHF-V)
[CVPR 2024] Do Vision and Language Encoders Represent the World Similarly? Paper(https://github.com/mayug/0-shot-llm-vision)
[CVPR 2024] Efficient Vision-Language Pre-training by Cluster Masking Paper(https://github.com/Zi-hao-Wei/Efficient-Vision-Language-Pre-training-by-Cluster-Masking)
[CVPR 2024] Non-autoregressive Sequence-to-Sequence Vision-Language Models [Paper]
[CVPR 2024] ViTamin: Designing Scalable Vision Models in the Vision-Language Era Paper(https://github.com/Beckschen/ViTamin)
[CVPR 2024] Iterated Learning Improves Compositionality in Large Vision-Language Models [Paper]
[CVPR 2024] FairCLIP: Harnessing Fairness in Vision-Language Learning Paper(https://ophai.hms.harvard.edu/datasets/fairvlmed10k)
[CVPR 2024] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks Paper(https://github.com/OpenGVLab/InternVL)
[CVPR 2024] VILA: On Pre-training for Visual Language Models [Paper]
[CVPR 2024] Generative Region-Language Pretraining for Open-Ended Object Detection Paper(https://github.com/FoundationVision/GenerateU)
[CVPR 2024] Enhancing Vision-Language Pre-training with Rich Supervisions [Paper]
[ICLR 2024] Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization Paper(https://github.com/jy0205/LaVIT)
[ICLR 2024] MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning Paper(https://github.com/PKUnlp-icler/MIC)
[ICLR 2024] Retrieval-Enhanced Contrastive Vision-Text Models [Paper]
[arXiv 2024] CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions Paper(https://github.com/UCSC-VLAA/CLIPS)
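Many of the pretraining papers above start from the CLIP-style symmetric contrastive objective: paired image and text embeddings are pulled together while mismatched pairs in the batch are pushed apart. Below is a minimal sketch of that loss; the tensor shapes, temperature value, and toy random embeddings are assumptions for illustration, not code from any paper listed above.

```python
# Minimal sketch of the CLIP-style symmetric contrastive (InfoNCE) loss.
# Assumptions: `img_emb` and `txt_emb` are paired [batch, dim] embeddings from
# an image encoder and a text encoder; the temperature value is illustrative.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # [batch, batch] similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)         # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)     # and each caption to its image
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```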
2. Transfer learning methods (a frozen-backbone transfer sketch follows the list)
[NeurIPS 2024] Historical Test-time Prompt Tuning for Vision Foundation Models [Paper]
[NeurIPS 2024] AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation Paper(https://github.com/MCG-NJU/AWT)
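The transfer methods above adapt a pretrained VLM to downstream tasks without retraining the backbone. A common point of reference is a frozen CLIP image encoder with a small trainable head; the sketch below shows that baseline. The checkpoint name, number of classes, and the random toy batch are assumptions for illustration, not details from the papers above.

```python
# Minimal transfer-learning sketch: frozen CLIP image encoder + trainable linear head.
# Assumptions: `torch` and `transformers` are installed; the checkpoint name,
# class count, and the random toy batch below are illustrative only.
import torch
import torch.nn as nn
from transformers import CLIPVisionModelWithProjection

backbone = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
backbone.requires_grad_(False)  # keep the pretrained VLM frozen
head = nn.Linear(backbone.config.projection_dim, 10)  # 10 downstream classes (assumed)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

pixel_values = torch.randn(4, 3, 224, 224)   # stand-in for a preprocessed image batch
labels = torch.randint(0, 10, (4,))          # stand-in downstream labels

with torch.no_grad():
    feats = backbone(pixel_values=pixel_values).image_embeds  # [batch, projection_dim]
logits = head(feats)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
print(loss.item())
```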