Source: 具身智能之心
Robot manipulation is a core area of robotics concerned with a robot's ability to interact with and act on objects in the physical world. Its goal is to let robots autonomously perceive, plan, and execute tasks such as grasping, moving, rotating, and fine-grained manipulation of objects. Manipulation is widely applied in industrial automation, surgical robotics, household assistance, and logistics, and it underpins a robot's ability to adapt to diverse tasks.
This project collects the key research papers in robot manipulation, covering tasks, methods, and applications from grasping to complex manipulation, and tracks recent progress in representation learning, reinforcement learning, multimodal learning, 3D representations, and related techniques, as a reading guide for researchers and practitioners in the field.
We recently collected and organized 300+ papers on robotics manipulation and released them on GitHub: https://github.com/BaiShuanghao/Awesome-Robotics-Manipulation
Grasping
1) Rectangle-based Grasp (see the sketch after this list)
- HMT-Grasp: A Hybrid Mamba-Transformer Approach for Robot Grasping in Cluttered Environments | https://arxiv.org/abs/2410.03522
- Lightweight Language-driven Grasp Detection using Conditional Consistency Model | https://arxiv.org/abs/2407.17967
- grasp_det_seg_cnn: End-to-end Trainable Deep Neural Network for Robotic Grasp Detection and Semantic Segmentation from RGB | https://arxiv.org/abs/2107.05287
- GR-ConvNet: Antipodal Robotic Grasping using Generative Residual Convolutional Neural Network | https://arxiv.org/abs/1909.04810
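The papers in this subsection share the planar grasp-rectangle parameterization: a grasp is a 2D oriented rectangle (center, angle, gripper opening) predicted from an image and optionally lifted to 3D with depth. Below is a minimal sketch of that representation, assuming a top-down camera with known intrinsics; the class and function names are illustrative, not taken from any of the papers above.

```python
import numpy as np

# A planar grasp rectangle, as predicted by rectangle-based detectors:
# center (x, y) in pixels, in-plane rotation theta, gripper opening width,
# and a quality score used to rank candidates.
class GraspRect:
    def __init__(self, x, y, theta, width, quality):
        self.x, self.y, self.theta = x, y, theta
        self.width, self.quality = width, quality

def rect_to_3d(g, depth, fx, fy, cx, cy):
    """Back-project the rectangle center to a 3D top-down grasp point."""
    z = depth[int(g.y), int(g.x)]          # metric depth at the grasp center
    X = (g.x - cx) * z / fx                # pinhole back-projection
    Y = (g.y - cy) * z / fy
    return np.array([X, Y, z]), g.theta    # 3D position + in-plane rotation

# Rank candidate rectangles by predicted quality and pick the best one.
candidates = [GraspRect(320, 240, 0.4, 40, 0.91),
              GraspRect(300, 250, 1.2, 35, 0.77)]
best = max(candidates, key=lambda g: g.quality)
```

Detectors such as GR-ConvNet output dense quality, angle, and width maps; taking per-pixel maxima of the quality map yields candidates like those ranked above.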
2) 6-DoF Grasp (see the sketch after this list)
- Real-to-Sim Grasp: Rethinking the Gap between Simulation and Real World in Grasp Detection | https://arxiv.org/abs/2410.06521
- OrbitGrasp: SE(3)-Equivariant Grasp Learning | https://arxiv.org/abs/2407.03531
- EquiGraspFlow: SE(3)-Equivariant 6-DoF Grasp Pose Generative Flows | https://openreview.net/pdf?id=5lSkn5v4LK
- An Economic Framework for 6-DoF Grasp Detection | https://arxiv.org/abs/2407.08366
- Generalizing 6-DoF Grasp Detection via Domain Prior Knowledge | https://arxiv.org/abs/2404.01727
- Rethinking 6-Dof Grasp Detection: A Flexible Framework for High-Quality Grasping | https://arxiv.org/abs/2403.15054
- AnyGrasp: Robust and Efficient Grasp Perception in Spatial and Temporal Domains | https://arxiv.org/abs/2212.08333
- GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping | https://openaccess.thecvf.com/content_CVPR_2020/papers/Fang_GraspNet-1Billion_A_Large-Scale_Benchmark_for_General_Object_Grasping_CVPR_2020_paper.pdf
- 6-DOF GraspNet: Variational Grasp Generation for Object Manipulation | https://arxiv.org/abs/1905.10520
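Unlike rectangles, the 6-DoF methods above predict full end-effector poses in SE(3), so grasps can approach objects from any direction. Here is a minimal sketch of the pose representation plus a pre-grasp offset composed in the grasp frame; the names and the 10 cm standoff are illustrative assumptions.

```python
import numpy as np

# A 6-DoF grasp is a full end-effector pose: rotation R (3x3) plus
# translation t (3,), i.e. an element of SE(3). `width` covers parallel-jaw
# openings; `score` lets a planner rank sampled candidates.
def make_grasp(R, t, width, score):
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return {"pose": T, "width": width, "score": score}

def pregrasp(grasp, standoff=0.10):
    """Retreat along the approach axis (assumed +z of the grasp frame)."""
    offset = np.eye(4)
    offset[2, 3] = -standoff               # 10 cm back along local z
    return grasp["pose"] @ offset          # compose in the grasp frame

# Example: identity orientation, 40 cm in front of the camera.
g = make_grasp(np.eye(3), np.array([0.0, 0.0, 0.4]), width=0.06, score=0.88)
approach_pose = pregrasp(g)
```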
3) Grasp with 3D Techniques
- Implicit Grasp Diffusion: Bridging the Gap between Dense Prediction and Sampling-based Grasping | https://openreview.net/pdf?id=VUhlMfEekm
- Learning Any-View 6DoF Robotic Grasping in Cluttered Scenes via Neural Surface Rendering | https://arxiv.org/abs/2306.07392
- Language Embedded Radiance Fields for Zero-Shot Task-Oriented Grasping | https://arxiv.org/abs/2309.07970
- GraspNeRF: Multiview-based 6-DoF Grasp Detection for Transparent and Specular Objects Using Generalizable NeRF | https://arxiv.org/abs/2210.06575
- GraspSplats: Efficient Manipulation with 3D Feature Splatting | https://arxiv.org/abs/2409.02084
- GaussianGrasper: 3D Language Gaussian Splatting for Open-vocabulary Robotic Grasping | https://arxiv.org/abs/2403.09637
4) Language-Driven Grasp
- RTAGrasp: Learning Task-Oriented Grasping from Human Videos via Retrieval, Transfer, and Alignment | https://arxiv.org/abs/2409.16033
- Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance | https://arxiv.org/abs/2407.13842
- Reasoning Grasping via Multimodal Large Language Model | https://arxiv.org/abs/2402.06798
- ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter | https://arxiv.org/abs/2407.11298
- Towards Open-World Grasping with Large Vision-Language Models | https://arxiv.org/abs/2406.18722
- Reasoning Tuning Grasp: Adapting Multi-Modal Large Language Models for Robotic Grasping | https://openreview.net/pdf?id=3mKb5iyZ2V
5) Grasp for Transparent Objects
- T2SQNet: A Recognition Model for Manipulating Partially Observed Transparent Tableware Objects | https://openreview.net/pdf?id=M0JtsLuhEE
- ASGrasp: Generalizable Transparent Object Reconstruction and Grasping from RGB-D Active Stereo Camera | https://arxiv.org/abs/2405.05648
- Dex-NeRF: Using a Neural Radiance Field to Grasp Transparent Objects | https://arxiv.org/abs/2110.14217
Manipulation
1) Representation Learning with Auxiliary Tasks (see the sketch after this list)
- Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation | https://arxiv.org/abs/2406.09738
- Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers | https://arxiv.org/abs/2403.12943
- R3M: A Universal Visual Representation for Robot Manipulation | https://arxiv.org/abs/2203.12601
- HULC: What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data | https://arxiv.org/abs/2204.06252
- BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning | https://arxiv.org/abs/2202.02005
- Spatiotemporal Predictive Pre-training for Robotic Motor Control | https://arxiv.org/abs/2403.05304
- MUTEX: Learning Unified Policies from Multimodal Task Specifications | https://arxiv.org/abs/2309.14320
- Language-Driven Representation Learning for Robotics | https://arxiv.org/abs/2302.12766
- Real-World Robot Learning with Masked Visual Pre-training | https://arxiv.org/abs/2210.03109
- RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning | https://arxiv.org/abs/2409.14674
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | https://arxiv.org/abs/2305.15021
- Chain-of-Thought Predictive Control | https://arxiv.org/abs/2304.00776
- VIRT: Vision Instructed Transformer for Robotic Manipulation | https://arxiv.org/abs/2410.07169
- KOI: Accelerating Online Imitation Learning via Hybrid Key-state Guidance | https://www.arxiv.org/abs/2408.02912
- GENIMA: Generative Image as Action Models | https://arxiv.org/abs/2407.07875
- ATM: Any-point Trajectory Modeling for Policy Learning | https://arxiv.org/abs/2401.00025
- Learning Manipulation by Predicting Interaction | https://www.arxiv.org/abs/2406.00439
- Object-Centric Instruction Augmentation for Robotic Manipulation | https://arxiv.org/abs/2401.02814
- Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans | https://arxiv.org/abs/2312.00775
- CALAMARI: Contact-Aware and Language conditioned spatial Action MApping for contact-RIch manipulation | https://openreview.net/pdf?id=Nii0_rRJwN
- GHIL-Glue: Hierarchical Control with Filtered Subgoal Images | https://arxiv.org/abs/2410.20018
- FoAM: Foresight-Augmented Multi-Task Imitation Policy for Robotic Manipulation | https://arxiv.org/abs/2409.19528
- VideoAgent: Self-Improving Video Generation | https://arxiv.org/abs/2410.10076
- GR-MG: Leveraging Partially Annotated Data via Multi-Modal Goal Conditioned Policy | https://arxiv.org/abs/2408.14368
- GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation | https://arxiv.org/abs/2410.06158
- VLMPC: Vision-Language Model Predictive Control for Robotic Manipulation | https://arxiv.org/abs/2407.09829
- GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation | https://arxiv.org/abs/2312.13139
- SuSIE: Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models | https://arxiv.org/abs/2310.10639
- VLP: Video Language Planning | https://arxiv.org/abs/2310.10625
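A recurring recipe in this subsection is imitation learning with an auxiliary objective: the policy is trained to predict expert actions while a side head predicts something extra (future frames, trajectories, language, or masked inputs) to shape the representation. Below is a minimal PyTorch-style sketch with a reconstruction side task; the architecture, dimensions, and 0.1 weighting are illustrative, not from any specific paper.

```python
import torch
import torch.nn as nn

# Behavior cloning plus an auxiliary objective: the shared encoder is trained
# to predict actions AND to solve a side task (here, reconstructing the input
# features; all names and sizes are illustrative stand-ins).
class AuxBCPolicy(nn.Module):
    def __init__(self, obs_dim=512, act_dim=7):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.action_head = nn.Linear(256, act_dim)   # main BC head
        self.aux_head = nn.Linear(256, obs_dim)      # auxiliary decoder

    def forward(self, obs):
        z = self.encoder(obs)
        return self.action_head(z), self.aux_head(z)

policy = AuxBCPolicy()
obs = torch.randn(32, 512)                 # stand-in for image features
expert_act = torch.randn(32, 7)            # stand-in expert actions
pred_act, recon = policy(obs)
loss = nn.functional.mse_loss(pred_act, expert_act) \
     + 0.1 * nn.functional.mse_loss(recon, obs)     # weighted auxiliary term
loss.backward()
```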
2) Visual Representation Learning
- Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets | https://arxiv.org/abs/2410.22325
- Theia: Distilling Diverse Vision Foundation Models for Robot Learning | https://arxiv.org/abs/2407.20179
- Learning Manipulation by Predicting Interaction | https://www.arxiv.org/abs/2406.00439
- Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware | https://arxiv.org/abs/2304.13705
- Language-Driven Representation Learning for Robotics | https://arxiv.org/abs/2302.12766
- VIMA: General Robot Manipulation with Multimodal Prompts | https://arxiv.org/abs/2210.03094
- Real-World Robot Learning with Masked Visual Pre-training | https://arxiv.org/abs/2210.03109
- R3M: A Universal Visual Representation for Robot Manipulation | https://arxiv.org/abs/2203.12601
- LIV: Language-Image Representations and Rewards for Robotic Control | https://arxiv.org/abs/2306.00958
- VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training | https://arxiv.org/abs/2210.00030
- Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation? | https://arxiv.org/abs/2204.11134
3) Multimodal Representation Learning
- Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation | https://arxiv.org/abs/2408.01366
- MUTEX: Learning Unified Policies from Multimodal Task Specifications | https://arxiv.org/abs/2309.14320
4) Latent Action Learning
- Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation | https://arxiv.org/abs/2409.18707
- IGOR: Image-GOal Representations Atomic Control Units for Foundation Models in Embodied AI | https://www.microsoft.com/en-us/research/uploads/prod/2024/10/Project_IGOR_for_arXiv.pdf
- Latent Action Pretraining from Videos | https://arxiv.org/abs/2410.11758
- Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control | https://arxiv.org/abs/2307.00117
- MimicPlay: Long-Horizon Imitation Learning by Watching Human Play | https://arxiv.org/abs/2302.12422
- Imitation Learning with Limited Actions via Diffusion Planners and Deep Koopman Controllers | https://arxiv.org/abs/2410.07584
- Learning to Act without Actions | https://arxiv.org/abs/2312.10812
- Imitating Latent Policies from Observation | https://arxiv.org/abs/1805.07914
5) World Model
- MOTO: Offline Pre-training to Online Fine-tuning for Model-based Robot Learning | https://arxiv.org/abs/2401.03306
- Finetuning Offline World Models in the Real World | https://arxiv.org/abs/2310.16029
- Surfer: Progressive Reasoning with World Models for Robotic Manipulation | https://arxiv.org/abs/2306.11335
6) Asynchronous Action Learning
- PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation | https://arxiv.org/abs/2410.10394
- HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers | https://arxiv.org/abs/2410.05273
- MResT: Multi-Resolution Sensing for Real-Time Control with Vision-Language Models | https://arxiv.org/abs/2401.14502
7) Diffusion Policy Learning (see the sketch after this list)
- Diffusion Transformer Policy | https://arxiv.org/abs/2410.15959
- SDP: Spiking Diffusion Policy for Robotic Manipulation with Learnable Channel-Wise Membrane Thresholds | https://arxiv.org/abs/2409.11195
- The Ingredients for Robotic Diffusion Transformers | https://arxiv.org/abs/2410.10088
- GenDP: 3D Semantic Fields for Category-Level Generalizable Diffusion Policy | https://arxiv.org/abs/2410.17488
- EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning | https://arxiv.org/abs/2407.01479
- Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning | https://arxiv.org/abs/2407.01531
- MDT: Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals | https://arxiv.org/abs/2407.05996
- Render and Diffuse: Aligning Image and Action Spaces for Diffusion-based Behaviour Cloning | https://arxiv.org/abs/2405.18196
- DP3: 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations | https://arxiv.org/abs/2403.03954
- PlayFusion: Skill Acquisition via Diffusion from Language-Annotated Play | https://arxiv.org/abs/2312.04549
- Equivariant Diffusion Policy | https://arxiv.org/abs/2407.01812
- StructDiffusion: Language-Guided Creation of Physically-Valid Structures using Unseen Objects | https://arxiv.org/abs/2211.04604
- Goal-Conditioned Imitation Learning using Score-based Diffusion Policies | https://arxiv.org/abs/2304.02532
- Diffusion Policy: Visuomotor Policy Learning via Action Diffusion | https://arxiv.org/abs/2303.04137
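The papers above all instantiate the same inference pattern: a network trained to predict the noise added to expert action sequences is run backwards at test time, turning Gaussian noise into an action plan conditioned on the observation. Here is a minimal DDPM-style sketch; the schedule, horizon, and toy noise model are illustrative stand-ins.

```python
import torch

# Reverse-diffusion inference loop shared by diffusion policies (e.g.
# Diffusion Policy, arXiv:2303.04137): start from Gaussian noise over an
# action sequence and iteratively denoise it. `eps_model` predicts the noise.
T = 50
betas = torch.linspace(1e-4, 0.02, T)      # illustrative noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_actions(eps_model, obs, horizon=16, act_dim=7):
    a = torch.randn(1, horizon, act_dim)              # pure-noise action plan
    for t in reversed(range(T)):
        eps = eps_model(a, obs, t)                    # predicted added noise
        a = (a - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:                                     # re-inject noise except at t=0
            a = a + betas[t].sqrt() * torch.randn_like(a)
    return a                                          # denoised action sequence

# Toy epsilon-network so the sketch runs end to end.
toy_eps = lambda a, obs, t: torch.zeros_like(a)
plan = sample_actions(toy_eps, obs=None)
```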
8) Other Policies
- Autoregressive Action Sequence Learning for Robotic Manipulation | https://arxiv.org/abs/2410.03132
- MaIL: Improving Imitation Learning with Selective State Space Models | https://arxiv.org/abs/2406.08234
9) Vision Language Action Models (see the sketch after this list)
- Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust | https://arxiv.org/abs/2410.01971
- TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation | https://arxiv.org/abs/2409.12514
- RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation | https://arxiv.org/abs/2406.04339
- A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM | https://arxiv.org/abs/2410.15549
- OpenVLA: An Open-Source Vision-Language-Action Model | https://arxiv.org/abs/2406.09246
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning | https://arxiv.org/abs/2406.11815
- Robotic Control via Embodied Chain-of-Thought Reasoning | https://arxiv.org/abs/2407.08693
- 3D-VLA: A 3D Vision-Language-Action Generative World Model | https://arxiv.org/abs/2403.09631
- Octo: An Open-Source Generalist Robot Policy | https://arxiv.org/abs/2405.12213
- RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators | https://arxiv.org/abs/2311.01378
- RT-H: Action Hierarchies Using Language | https://arxiv.org/abs/2403.01823
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models | https://arxiv.org/abs/2310.08864
- MOO: Open-World Object Manipulation using Pre-trained Vision-Language Models | https://arxiv.org/abs/2303.00905
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | https://arxiv.org/abs/2307.15818
- RT-1: Robotics Transformer for Real-World Control at Scale | https://arxiv.org/abs/2212.06817
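A common mechanism behind RT-2-style VLAs is action tokenization: each dimension of the continuous action is discretized into a fixed number of bins so the action becomes a short token sequence the language model can emit. A minimal sketch follows; the bin count and action bounds are illustrative assumptions.

```python
import numpy as np

# Casting control as next-token prediction: each continuous action dimension
# is mapped to one of N_BINS integer bins, and the bin indices become tokens
# in the language model's vocabulary.
N_BINS = 256
LOW, HIGH = -1.0, 1.0

def actions_to_tokens(action):
    """Map a continuous action vector to discrete bin indices."""
    clipped = np.clip(action, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(int)

def tokens_to_actions(tokens):
    """Invert the discretization (up to quantization error)."""
    return tokens / (N_BINS - 1) * (HIGH - LOW) + LOW

a = np.array([0.12, -0.53, 0.98, 0.0, 0.0, -0.25, 1.0])  # e.g. delta-pose + gripper
tok = actions_to_tokens(a)       # integers in [0, 255], fed to the LM as tokens
a_hat = tokens_to_actions(tok)   # decoded back to a command at execution time
```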
10) Reinforcement Learning
- Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning | https://arxiv.org/abs/2410.21845
- PointPatchRL -- Masked Reconstruction Improves Reinforcement Learning on Point Clouds | https://arxiv.org/abs/2410.18800
- SPIRE: Synergistic Planning, Imitation, and Reinforcement for Long-Horizon Manipulation | https://arxiv.org/abs/2410.18065
- Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning | https://arxiv.org/abs/2407.15815
- Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks | https://arxiv.org/abs/2405.01534
- Expansive Latent Planning for Sparse Reward Offline Reinforcement Learning | https://openreview.net/pdf?id=xQx1O7WXSA
- Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions | https://arxiv.org/abs/2309.10150
- Sim2Real Transfer for Reinforcement Learning without Dynamics Randomization | https://arxiv.org/abs/2002.11635
- Pre-Training for Robots: Offline RL Enables Learning New Tasks from a Handful of Trials | https://arxiv.org/abs/2210.05178
11) Motion, Trajectory and Flow
- Language-Conditioned Path Planning | https://arxiv.org/abs/2308.16893
- DiffusionSeeder: Seeding Motion Optimization with Diffusion for Rapid Motion Planning | https://arxiv.org/abs/2410.16727
- ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation | https://arxiv.org/abs/2409.01652
- CoPa: General Robotic Manipulation through Spatial Constraints of Parts with Foundation Models | https://arxiv.org/abs/2403.08248
- Task Generalization with Stability Guarantees via Elastic Dynamical System Motion Policies | https://arxiv.org/abs/2309.01884
- ORION: Vision-based Manipulation from Single Human Video with Open-World Object Graphs | https://arxiv.org/abs/2405.20321
- Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching | https://arxiv.org/abs/2409.07343
- RoboTAP: Tracking Arbitrary Points for Few-Shot Visual Imitation | https://arxiv.org/abs/2308.15975
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models | https://arxiv.org/abs/2307.05973
- LATTE: LAnguage Trajectory TransformEr | https://arxiv.org/abs/2208.02918
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation | https://arxiv.org/abs/2405.01527
- Any-point Trajectory Modeling for Policy Learning | https://arxiv.org/abs/2401.00025
- Waypoint-Based Imitation Learning for Robotic Manipulation | https://arxiv.org/abs/2307.14326
- Flow as the Cross-Domain Manipulation Interface | https://www.arxiv.org/abs/2407.15208
- Learning to Act from Actionless Videos through Dense Correspondences | https://arxiv.org/abs/2310.08576
12) Data Collection, Selection and Augmentation
- SkillMimicGen: Automated Demonstration Generation for Efficient Skill Learning and Deployment | https://arxiv.org/abs/2410.18907
- Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models | https://arxiv.org/abs/2410.17772
- Autonomous Improvement of Instruction Following Skills via Foundation Models | https://arxiv.org/abs/2407.20635
- Manipulate-Anything: Automating Real-World Robots using Vision-Language Models | https://arxiv.org/abs/2406.18915
- DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation | https://arxiv.org/abs/2403.07788
- SPRINT: Scalable Policy Pre-Training via Language Instruction Relabeling | https://arxiv.org/abs/2306.11886
- Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition | https://arxiv.org/abs/2307.14535
- Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models | https://arxiv.org/abs/2211.11736
- RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation | https://arxiv.org/abs/2306.11706
- Active Fine-Tuning of Generalist Policies | https://arxiv.org/abs/2410.05026
- Re-Mix: Optimizing Data Mixtures for Large Scale Imitation Learning | https://arxiv.org/abs/2408.14037
- An Unbiased Look at Datasets for Visuo-Motor Pre-Training | https://arxiv.org/abs/2310.09289
- Retrieval-Augmented Embodied Agents | https://arxiv.org/abs/2404.11699
- Behavior Retrieval: Few-Shot Imitation Learning by Querying Unlabeled Datasets | https://arxiv.org/abs/2304.08742
- RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning | https://arxiv.org/abs/2409.03403
- Diffusion Augmented Agents: A Framework for Efficient Exploration and Transfer Learning | https://arxiv.org/abs/2407.20798
- Diffusion Meets DAgger: Supercharging Eye-in-hand Imitation Learning | https://arxiv.org/abs/2402.17768
- GenAug: Retargeting behaviors to unseen situations via Generative Augmentation | https://arxiv.org/abs/2302.06671
- Contrast Sets for Evaluating Language-Guided Robot Policies | https://arxiv.org/abs/2406.13636
13) Affordance Learning (see the sketch after this list)
- UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models | https://arxiv.org/abs/2409.20551
- A3VLM: Actionable Articulation-Aware Vision Language Model | https://arxiv.org/abs/2406.07549
- AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation | https://arxiv.org/abs/2406.11548
- SAGE: Bridging Semantic and Actionable Parts for Generalizable Manipulation of Articulated Objects | https://arxiv.org/abs/2312.01307
- Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs | https://arxiv.org/abs/2311.02847
- Ditto: Building Digital Twins of Articulated Objects from Interaction | https://arxiv.org/abs/2202.08227
- Language-Conditioned Affordance-Pose Detection in 3D Point Clouds | https://arxiv.org/abs/2309.10911
- Composable Part-Based Manipulation | https://arxiv.org/abs/2405.05876
- PartManip: Learning Cross-Category Generalizable Part Manipulation Policy from Point Cloud Observations | https://arxiv.org/abs/2303.16958
- GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts | https://arxiv.org/abs/2211.05272
- SpatialBot: Precise Spatial Understanding with Vision Language Models | https://arxiv.org/abs/2406.13642
- RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | https://arxiv.org/abs/2406.10721
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | https://arxiv.org/abs/2401.12168
- RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation | https://arxiv.org/abs/2407.04689
- MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting | https://arxiv.org/abs/2403.03174
- SLAP: Spatial-Language Attention Policies | https://arxiv.org/abs/2304.11235
- KITE: Keypoint-Conditioned Policies for Semantic Manipulation | https://arxiv.org/abs/2306.16605
- HULC++: Grounding Language with Visual Affordances over Unstructured Data | https://arxiv.org/abs/2210.01911
- CLIPort: What and Where Pathways for Robotic Manipulation | https://arxiv.org/abs/2109.12098
- Affordance Learning from Play for Sample-Efficient Policy Learning | https://arxiv.org/abs/2203.00352
- Transporter Networks: Rearranging the Visual World for Robotic Manipulation | https://arxiv.org/abs/2010.14406
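Several of the works above (notably Transporter Networks and CLIPort) cast affordance learning as dense per-pixel prediction: the network scores every pixel and the action is the best-scoring location, back-projected to 3D with depth. A minimal sketch, with the network replaced by a random stand-in and made-up camera intrinsics:

```python
import numpy as np

# Pick-location selection from a dense affordance heatmap: take the argmax
# pixel and back-project it through a pinhole camera model to a 3D point.
def pick_from_heatmap(heatmap, depth, fx, fy, cx, cy):
    """Choose the highest-scoring pixel and lift it to a 3D pick point."""
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

H, W = 240, 320
heatmap = np.random.rand(H, W)   # stand-in for affordance_net(rgb, text)
depth = np.full((H, W), 0.6)     # flat tabletop 0.6 m from the camera
pick_xyz = pick_from_heatmap(heatmap, depth, fx=450, fy=450, cx=W/2, cy=H/2)
```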
14) 3D Representation for Manipulation
- MSGField: A Unified Scene Representation Integrating Motion, Semantics, and Geometry for Robotic Manipulation | https://arxiv.org/abs/2410.15730
- Splat-MOVER: Multi-Stage, Open-Vocabulary Robotic Manipulation via Editable Gaussian Splatting | https://arxiv.org/abs/2405.04378
- IMAGINATION POLICY: Using Generative Point Cloud Models for Learning Manipulation Policies | https://arxiv.org/abs/2406.11740
- Physically Embodied Gaussian Splatting: A Realtime Correctable World Model for Robotics | https://arxiv.org/abs/2406.10788
- RiEMann: Near Real-Time SE(3)-Equivariant Robot Manipulation without Point Cloud Segmentation | https://arxiv.org/abs/2403.19460
- RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation | https://arxiv.org/abs/2402.15487
- D3Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Rearrangement | https://arxiv.org/abs/2309.16118
- Object-Aware Gaussian Splatting for Robotic Manipulation | https://openreview.net/pdf?id=gdRI43hDgo
- Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation | https://arxiv.org/abs/2308.07931
- Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation | https://arxiv.org/abs/2112.05124
- SE(3)-Equivariant Relational Rearrangement with Neural Descriptor Fields | https://arxiv.org/abs/2211.09786
15) 3D Representation Policy Learning
- GravMAD: Grounded Spatial Value Maps Guided Action Diffusion for Generalized 3D Manipulation | https://arxiv.org/abs/2409.20154
- 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations | https://arxiv.org/abs/2402.10885
- DP3: 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations | https://arxiv.org/abs/2403.03954
- ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation | https://arxiv.org/abs/2403.08321
- SGRv2: Leveraging Locality to Boost Sample Efficiency in Robotic Manipulation | https://arxiv.org/abs/2406.10615
- GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields | https://arxiv.org/abs/2308.16891
- Visual Reinforcement Learning with Self-Supervised 3D Representations | https://arxiv.org/abs/2210.07241
- PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation | https://arxiv.org/abs/2309.15596
- M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place | https://arxiv.org/abs/2311.00926
- PerAct: Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation | https://arxiv.org/abs/2209.05451
- 3D-MVP: 3D Multiview Pretraining for Robotic Manipulation | https://arxiv.org/abs/2406.18158
- Discovering Robotic Interaction Modes with Discrete Representation Learning | https://arxiv.org/abs/2410.20258
- SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation | https://arxiv.org/abs/2405.19586
- RVT: Robotic View Transformer for 3D Object Manipulation | https://arxiv.org/abs/2306.14896
- Learning Generalizable Manipulation Policies with Object-Centric 3D Representations | https://arxiv.org/abs/2310.14386
- SGR: A Universal Semantic-Geometric Representation for Robotic Manipulation | https://arxiv.org/abs/2306.10474
16) Reasoning, Planning and Code Generation (see the sketch after this list)
- AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation | https://arxiv.org/abs/2410.00371
- REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction | https://arxiv.org/abs/2306.15724
- Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models | https://arxiv.org/abs/2408.07975
- Physically Grounded Vision-Language Models for Robotic Manipulation | https://arxiv.org/abs/2309.02561
- Socratic Planner: Inquiry-Based Zero-Shot Planning for Embodied Instruction Following | https://arxiv.org/abs/2404.15190
- SayCan: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances | https://arxiv.org/abs/2204.01691
- LLM+P: Empowering Large Language Models with Optimal Planning Proficiency | https://arxiv.org/abs/2304.11477
- Inner Monologue: Embodied Reasoning through Planning with Language Models | https://arxiv.org/abs/2207.05608
- Teaching Robots with Show and Tell: Using Foundation Models to Synthesize Robot Policies from Language and Visual Demonstrations | https://openreview.net/pdf?id=G8UcwxNAoD
- RoCo: Dialectic Multi-Robot Collaboration with Large Language Models | https://arxiv.org/abs/2307.04738
- Gesture-Informed Robot Assistance via Foundation Models | https://arxiv.org/abs/2309.02721
- Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model | https://arxiv.org/abs/2305.11176
- ProgPrompt: Generating Situated Robot Task Plans using Large Language Models | https://arxiv.org/abs/2209.11302
- ChatGPT for Robotics: Design Principles and Model Abilities | https://arxiv.org/abs/2306.17582
- Code as Policies: Language Model Programs for Embodied Control | https://arxiv.org/abs/2209.07753
- TidyBot: Personalized Robot Assistance with Large Language Models | https://arxiv.org/abs/2305.05658
- Statler: State-Maintaining Language Models for Embodied Reasoning | https://arxiv.org/abs/2306.17840
- InterPreT: Interactive Predicate Learning from Language Feedback for Generalizable Task Planning | https://arxiv.org/abs/2405.19758
- Text2Motion: From Natural Language Instructions to Feasible Plans | https://arxiv.org/abs/2303.12153
- Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations | https://arxiv.org/abs/2410.00436
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | https://arxiv.org/abs/2305.15021
- ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation | https://arxiv.org/abs/2312.16217
- Chat with the Environment: Interactive Multimodal Perception Using Large Language Models | https://arxiv.org/abs/2303.08268
- PaLM-E: An Embodied Multimodal Language Model | https://arxiv.org/abs/2303.03378
- Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | https://arxiv.org/abs/2204.00598
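The code-generation line of work above (e.g. Code as Policies) has an LLM write short Python programs against a library of perception and motion primitives, which the robot then executes. A minimal sketch follows; detect/pick/place are hypothetical primitives, and the "generated" program is hard-coded where a real system would query an LLM.

```python
# Illustrative Code-as-Policies-style loop: the host process exposes a small
# primitive API and executes LLM-written Python against it.
def detect(name):
    """Stand-in perception primitive: return a fake 3D position for `name`."""
    return {"block": (0.4, 0.0, 0.02), "bowl": (0.2, 0.3, 0.05)}[name]

def pick(pos):
    print(f"picking at {pos}")       # stand-in for a real motion primitive

def place(pos):
    print(f"placing at {pos}")       # stand-in for a real motion primitive

# What the LLM might return for the instruction "put the block in the bowl":
generated = """
block = detect("block")
bowl = detect("bowl")
pick(block)
place(bowl)
"""
# The host executes the generated program against the primitive API.
exec(generated, {"detect": detect, "pick": pick, "place": place})
```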
17) Generalization
- Mirage: Cross-Embodiment Zero-Shot Policy Transfer with Cross-Painting | https://arxiv.org/abs/2402.19249
- Policy Architectures for Compositional Generalization in Control | https://arxiv.org/abs/2203.05960
- Programmatically Grounded, Compositionally Generalizable Robotic Manipulation | https://arxiv.org/abs/2304.13826
- Efficient Data Collection for Robotic Manipulation via Compositional Generalization | https://arxiv.org/abs/2403.05110
- Natural Language Can Help Bridge the Sim2Real Gap | https://arxiv.org/abs/2405.10020
- Reconciling Reality through Simulation: A Real-to-Sim-to-Real Approach for Robust Manipulation | https://arxiv.org/abs/2403.03949
- Local Policies Enable Zero-shot Long-horizon Manipulation | https://arxiv.org/abs/2410.22332
- A Backbone for Long-Horizon Robot Task Understanding | https://arxiv.org/abs/2408.01334
- STAP: Sequencing Task-Agnostic Policies | https://arxiv.org/abs/2210.12250
- BOSS: Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance | https://arxiv.org/abs/2310.10021
- Learning Compositional Behaviors from Demonstration and Language | https://openreview.net/pdf?id=fR1rCXjCQX
- Policy Adaptation via Language Optimization: Decomposing Tasks for Few-Shot Imitation | https://arxiv.org/abs/2408.16228
18) Generalist
- Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation | https://arxiv.org/abs/2408.11812
- All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents | https://arxiv.org/abs/2408.10899
- Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers | https://arxiv.org/abs/2409.20537
- An Embodied Generalist Agent in 3D World | https://arxiv.org/abs/2311.12871
- Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation | https://arxiv.org/abs/2410.08001
- Effective Tuning Strategies for Generalist Robot Manipulation Policies | https://arxiv.org/abs/2410.01220
- Octo: An Open-Source Generalist Robot Policy | https://arxiv.org/abs/2405.12213
- Steering Your Generalists: Improving Robotic Foundation Models via Value Guidance | https://arxiv.org/abs/2410.13816
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models | https://arxiv.org/abs/2310.08864
- RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking | https://arxiv.org/abs/2309.01918
- Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning | https://arxiv.org/abs/2407.15815
- CAGE: Causal Attention Enables Data-Efficient Generalizable Robotic Manipulation | https://arxiv.org/abs/2410.14974
- Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments | https://arxiv.org/abs/2409.05865
19) Human-Robot Interaction and Collaboration
- Vocal Sandbox: Continual Learning and Adaptation for Situated Human-Robot Collaboration | https://openreview.net/pdf?id=ypaYtV1CoG
- APRICOT: Active Preference Learning and Constraint-Aware Task Planning with LLMs | https://openreview.net/pdf?id=nQslM6f7dW
- Text2Interaction: Establishing Safe and Preferable Human-Robot Interaction | https://arxiv.org/abs/2408.06105
- KNOWNO: Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners | https://arxiv.org/abs/2307.01928