Paper List for Robotics & Embodied AI - Tianxing Chen
- Diffusion Model for Planning, Policy, and RL
- 3D-based Manipulation
- 2D-based Manipulation
- LLM for robotics
- LLM Agent (Planning)
- Generative Model for Embodied
- Visual Feature: Correspondence, Affordance
- Detection & Segmentation
- Pose Estimation and Tracking
- Humanoid
- Dataset & Benchmark
- Hardware
- 2D to 3D Generation
- Gaussian Splatting
- Robotics for Medical
- Companies
- [arXiv] Diffusion Models for Reinforcement Learning: A Survey, arXiv
- [ICLR 2023 (Top 5% Notable)] Is Conditional Generative Modeling all you need for Decision-Making?, website
- [RSS 2023] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, website
- [ICML 2022 (Long Talk)] Planning with Diffusion for Flexible Behavior Synthesis, website
- [ICML 2023 Oral] AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners, website
- [CVPR 2024] SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution, website
- [arXiv] Learning a Diffusion Model Policy From Reward via Q-Score Matching, arXiv
- [CoRL 2023] ChainedDiffuser: Unifying Trajectory Diffusion and Keypose Prediction for Robotic Manipulation, website
- [CVPR 2023] Affordance Diffusion: Synthesizing Hand-Object Interactions, website
- [arXiv] DiffuserLite: Towards Real-time Diffusion Planning, arXiv
- [arXiv] 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations, website
- [arXiv] 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations, website
- [arXiv] SafeDiffuser: Safe Planning with Diffusion Probabilistic Models, arXiv
- [CVPR 2024] Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation, arXiv
- [arXiv 2024] Render and Diffuse: Aligning Image and Action Spaces for Diffusion-based Behaviour Cloning, arXiv
- [arXiv 2024] Surgical Robot Transformer: Imitation Learning for Surgical Tasks, website
- [CoRL 2024] GenDP: 3D Semantic Fields for Category-Level Generalizable Diffusion Policy, website
- [RSS 2024] RVT-2: Learning Precise Manipulation from Few Examples, website
- [arXiv 2023] D3 Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation, website
- [arXiv 2024] UniDoorManip: Learning Universal Door Manipulation Policy Over Large-scale and Diverse Door Manipulation Environments, website
- [CoRL 2023 (Oral)] GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields, website
- [ECCV 2024] ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation, website
- [IROS 2024] RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective, website
- GraspNet website:
- [TRO 2023] AnyGrasp: Robust and Efficient Grasp Perception in Spatial and Temporal Domains, arXiv
- [arXiv 2024] ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter, website
- [arXiv 2024] GaussianGrasper: 3D Language Gaussian Splatting for Open-vocabulary Robotic Grasping, website
- [CVPR 2022 Oral] Ditto: Building Digital Twins of Articulated Objects from Interaction, website
- [ICRA 2024] RGBManip: Monocular Image-based Robotic Manipulation through Active Object Pose Estimation, website
- [NeurIPS 2023] MoVie: Visual Model-Based Policy Adaptation for View Generalization, website
- [arXiv 2024] OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics, website
- [CoRL 2023] VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models, website
- [arXiv 2023] ChatGPT for Robotics: Design Principles and Model Abilities, arXiv
- [arXiv 2024] Language-Guided Object-Centric Diffusion Policy for Collision-Aware Robotic Manipulation, arXiv
- [PMLR 2023] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, website
- [NeurIPS 2023] Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning, website
- [arXiv 2024] Generative Image as Action Models, website
- [arXiv 2024] Genie: Generative Interactive Environments, website
- [arXiv 2023] D3 Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation, website
- [CoRL 2020] Transporter Networks: Rearranging the Visual World for Robotic Manipulation, website
- [ICLR 2024] SparseDFF: Sparse-View Feature Distillation for One-Shot Dexterous Manipulation, website
- [ICRA 2024] UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence, website
- [CoRL 2018] Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation, PDF
- [arXiv 2024] Theia: Distilling Diverse Vision Foundation Models for Robot Learning, website, Github repo
- [CoRL 2022] Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation, website
- [arXiv 2024] Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation, arXiv
- [arXiv 2024] PreAfford: Universal Affordance-Based Pre-Grasping for Diverse Objects and Environments, arXiv
- [ICLR 2022] VAT-Mart: Learning Visual Action Trajectory Proposals for Manipulating 3D ARTiculated Objects, website
- [ICLR 2023] DualAfford: Learning Collaborative Visual Affordance for Dual-gripper Object Manipulation, arXiv
- [CVPR 2022] Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos, website
- [ICCV 2023] AffordPose: A Large-scale Dataset of Hand-Object Interactions with Affordance-driven Hand Pose, website
- [ECCV 2024] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, Github repo
- [arXiv 2024] Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything, Github repo
- [ICCV 2023] DEVA: Tracking Anything with Decoupled Video Segmentation, website
- [ECCV 2022] XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model, website
- [ICCV 2023] VLPart: Going Denser with Open-Vocabulary Part Segmentation, website
- LangSAM Github repo, combining Grounding DINO and SAM
- [CVPR 2024 (Highlight)] FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects, website
- [CVPR 2023 (Highlight)] GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts, website
- [arXiv 2023] GAMMA: Generalizable Articulation Modeling and Manipulation for Articulated Objects, website
- [arXiv 2024] ManiPose: A Comprehensive Benchmark for Pose-aware Object Manipulation in Robotics, website
- [ICCV 2023] AffordPose: A Large-scale Dataset of Hand-Object Interactions with Affordance-driven Hand Pose, website
- [CVPR 2023] BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects, website
- [arXiv 2024] HumanPlus: Humanoid Shadowing and Imitation from Humans, website
- [arXiv 2024] Empowering Embodied Manipulation: A Bimanual-Mobile Robot Manipulation Dataset for Household Tasks, website, zhihu
- [arXiv 2024] GRUtopia: Dream General Robots in a City at Scale, Github Repo
- [ICLR 2024] AgentBoard: An Analytical Evaluation Board of Multi-Turn LLM Agents, website
- [arXiv 2024] RoboCAS: A Benchmark for Robotic Manipulation in Complex Object Arrangement Scenarios, Github repo
- [arXiv 2024] BiGym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark, website
- [arXiv 2024] Evaluating Real-World Robot Manipulation Policies in Simulation, website
- [arXiv 2024] DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation, website
- [arXiv 2024] Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image, website
- [SIGGRAPH 2024] 2DGS: 2D Gaussian Splatting for Geometrically Accurate Radiance Fields, website
- [arXiv 2024] Surgical Robot Transformer: Imitation Learning for Surgical Tasks, website
- Where2Act: From Pixels to Actions for Articulated 3D Objects
- PreAfford: Universal Affordance-Based Pre-Grasping for Diverse Objects and Environments
- Decision Transformer: Reinforcement Learning via Sequence Modeling
- Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis
- AO-Grasp: Articulated Object Grasp Generation
- Human-to-Robot Imitation in the Wild
- RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
- [ICML 2024] SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation, https://sam-embodied.github.io/
- PerAct, Act3D
- Probing the 3D Awareness of Visual Foundation Models: https://arxiv.org/pdf/2404.08636
- ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models
- CLIP: Zero-shot Jack of All Trades, website, CLIP GradCAM (CLIP_GradCAM_Visualization)
- Articulated Object Manipulation with Coarse-to-fine Affordance for Mitigating the Effect of Point Cloud Noise: https://arxiv.org/pdf/2402.18699
- 3D-VLA: A 3D Vision-Language-Action Generative World Model
- PDDLGym: Gym Environments from PDDL Problems: https://arxiv.org/abs/2002.06432
- TravelPlanner: A Benchmark for Real-World Planning with Language Agents
- VisionLLM: https://arxiv.org/abs/2305.11175
- Ferret: Refer and Ground Anything Anywhere at Any Granularity: https://github.com/apple/ml-ferret
- LangSplat
- Embodied AI with Two Arms: Zero-shot Learning, Safety and Modularity
- SparseDFF
- ManiPose: A Comprehensive Benchmark for Pose-aware Object Manipulation in Robotics
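- Stabilizing Transformers for Reinforcement Learning
  - Summary: This paper proposes Gated Transformer-XL (GTrXL), a modified Transformer architecture that addresses the optimization difficulties of standard Transformers in reinforcement learning. By introducing layer normalization and a gating mechanism, GTrXL outperforms LSTMs in partially observable environments.
  - Link
- CoBERL: Contrastive BERT for Reinforcement Learning
  - Summary: The paper introduces CoBERL, which combines a contrastive loss with a Transformer architecture, using bidirectional masked prediction and contrastive learning to improve data efficiency and performance in reinforcement learning.
  - Link
- Adaptive Transformers in RL
  - Summary: This study explores Transformer models with adaptive attention span in reinforcement learning and finds that the approach improves performance in environments that require long-term dependencies.
  - Link
- Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation
  - Summary: This paper proposes Actor-Learner Distillation (ALD), which distills knowledge from a large learner model into a small actor model to improve the sample efficiency of Transformers in reinforcement learning.
  - Link
- Deep Transformer Q-Networks for Partially Observable Reinforcement Learning
  - Summary: Introduces Deep Transformer Q-Networks (DTQN), a reinforcement learning architecture that uses Transformer self-attention to handle partially observable tasks and demonstrates its effectiveness on several challenging environments.
  - Link
- CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer
  - Summary: CtrlFormer is a Transformer architecture that improves the sample efficiency of visual control tasks by learning transferable state representations, with particular emphasis on its advantages in cross-task transfer learning.
  - Link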
- Sapiens: Foundation for Human Vision Models: https://about.meta.com/realitylabs/codecavatars/sapiens
- General Flow as Foundation Affordance for Scalable Robot Learning: https://general-flow.github.io/