[晓理紫]每日论文分享(有中文摘要,源码或项目地址)--强化学习、模仿学习、机器人

专属领域论文订阅

VX 关注{晓理紫|},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持

如果你感觉对你有所帮助,请关注我,每日准时为你推送最新论文。

为了答谢各位网友的支持,从今日起免费为300名读者提供订阅主题论文服务,只需VX关注公号并回复{邮箱+论文主题}(如:[email protected] + chatgpt@large language model @LLM),主题必须是同一个领域,最多三个关键词。解释权归博主所有


分类:

  • 大语言模型LLM
  • 视觉模型VLM
  • 扩散模型
  • 视觉语言导航VLN
  • 强化学习 RL
  • 模仿学习 IL
  • 机器人
  • 开放词汇,检测分割

== RL ==

标题: M2CURL: Sample-Efficient Multimodal Reinforcement Learning via Self-Supervised Representation Learning for Robotic Manipulation

作者: Fotios Lygerakis, Vedant Dave, Elmar Rueckert

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.17032v1

Project: https://sites.google.com/view/M2CURL/home|

中文摘要: 多模态强化学习(RL)最关键的环节之一是不同观测模态的有效融合。从这些模态中获得鲁棒而准确的表征,是提升RL算法鲁棒性和样本效率的关键。然而,在视觉-触觉数据的RL场景中学习表征面临重大挑战,主要源于数据的高维度,以及将视觉和触觉输入与动态环境和任务目标关联起来的复杂性。为应对这些挑战,我们提出了多模态对比无监督强化学习(M2CURL)。该方法采用一种新的多模态自监督学习技术,学习高效的表征并帮助RL算法更快收敛。我们的方法与具体RL算法无关,因此可以与任何现有RL算法集成。我们在Tactile Gym 2模拟器上评估了M2CURL,结果表明它显著提高了不同操作任务的学习效率:与不使用该表征学习方法的标准RL算法相比,收敛更快,每个回合的累积奖励也更高。

摘要: One of the most critical aspects of multimodal Reinforcement Learning (RL) is the effective integration of different observation modalities. Having robust and accurate representations derived from these modalities is key to enhancing the robustness and sample efficiency of RL algorithms. However, learning representations in RL settings for visuotactile data poses significant challenges, particularly due to the high dimensionality of the data and the complexity involved in correlating visual and tactile inputs with the dynamic environment and task objectives. To address these challenges, we propose Multimodal Contrastive Unsupervised Reinforcement Learning (M2CURL). Our approach employs a novel multimodal self-supervised learning technique that learns efficient representations and contributes to faster convergence of RL algorithms. Our method is agnostic to the RL algorithm, thus enabling its integration with any available RL algorithm. We evaluate M2CURL on the Tactile Gym 2 simulator and we show that it significantly enhances the learning efficiency in different manipulation tasks. This is evidenced by faster convergence rates and higher cumulative rewards per episode, compared to standard RL algorithms without our representation learning approach.
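下面给出一个极简的示意代码(基于PyTorch的假设性实现,并非论文官方代码),说明摘要中"跨模态对比自监督表征学习"的基本思路:对同一时刻的视觉与触觉观测分别编码,再用InfoNCE式的对比损失把配对样本拉近、非配对样本推远;得到的编码器可以接入任意RL算法。其中网络结构、维度等均为假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """把单一模态的观测映射到低维表征(结构为假设,仅作示意)。"""
    def __init__(self, in_dim: int, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # 单位球面上的表征

def multimodal_infonce(z_vis, z_tac, temperature: float = 0.1):
    """InfoNCE式跨模态对比损失:同一时间步的(视觉, 触觉)为正样本对。"""
    logits = z_vis @ z_tac.t() / temperature          # [B, B] 相似度矩阵
    labels = torch.arange(z_vis.size(0), device=z_vis.device)
    # 对称地在"视觉->触觉"和"触觉->视觉"两个方向上计算交叉熵
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# 用法示意: 将该表征损失与任意RL算法的损失相加即可联合训练
vis_enc, tac_enc = Encoder(in_dim=512), Encoder(in_dim=64)
obs_vis, obs_tac = torch.randn(32, 512), torch.randn(32, 64)
loss_repr = multimodal_infonce(vis_enc(obs_vis), tac_enc(obs_tac))
```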


标题: CORE: Towards Scalable and Efficient Causal Discovery with Reinforcement Learning

作者: Andreas W. M. Sauter, Nicolò Botteghi, Erman Acar

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.16974v1

GitHub: https://github.com/sa-and/CORE|

中文摘要: 因果发现是从数据中推断因果结构的挑战性任务。珀尔因果层次(PCH)告诉我们,仅靠被动观测不足以区分相关与因果,受此启发,近来出现了将干预纳入机器学习研究的趋势。强化学习为这种主动学习方式提供了便利的框架。本文提出CORE,一种基于深度强化学习的因果发现与干预规划方法。CORE学习从数据中顺序重建因果图,同时学习执行信息量大的干预。实验结果表明,CORE能够泛化到未见过的图,并高效地揭示因果结构。此外,CORE可以扩展到多达10个变量的更大图,并在结构估计精度和样本效率上优于现有方法。所有相关代码和补充材料见 https://github.com/sa-and/CORE 。

摘要: Causal discovery is the challenging task of inferring causal structure from data. Motivated by Pearl’s Causal Hierarchy (PCH), which tells us that passive observations alone are not enough to distinguish correlation from causation, there has been a recent push to incorporate interventions into machine learning research. Reinforcement learning provides a convenient framework for such an active approach to learning. This paper presents CORE, a deep reinforcement learning-based approach for causal discovery and intervention planning. CORE learns to sequentially reconstruct causal graphs from data while learning to perform informative interventions. Our results demonstrate that CORE generalizes to unseen graphs and efficiently uncovers causal structures. Furthermore, CORE scales to larger graphs with up to 10 variables and outperforms existing approaches in structure estimation accuracy and sample efficiency. All relevant code and supplementary material can be found at https://github.com/sa-and/CORE


标题: Reinforcement Unlearning

作者: Dayong Ye, Tianqing Zhu, Congcong Zhu

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2312.15910v2

Project: https://anonymous.4open.science/r/Reinforcement-Unlearning-D347|

摘要: Machine unlearning refers to the process of mitigating the influence of specific training data on machine learning models based on removal requests from data owners. However, one important area that has been largely overlooked in the research of unlearning is reinforcement learning. Reinforcement learning focuses on training an agent to make optimal decisions within an environment to maximize its cumulative rewards. During the training, the agent tends to memorize the features of the environment, which raises a significant concern about privacy. As per data protection regulations, the owner of the environment holds the right to revoke access to the agent's training data, thus necessitating the development of a novel and pressing research field, known as reinforcement unlearning. Reinforcement unlearning focuses on revoking entire environments rather than individual data samples. This unique characteristic presents three distinct challenges: 1) how to propose unlearning schemes for environments; 2) how to avoid degrading the agent's performance in remaining environments; and 3) how to evaluate the effectiveness of unlearning. To tackle these challenges, we propose two reinforcement unlearning methods. The first method is based on decremental reinforcement learning, which aims to erase the agent's previously acquired knowledge gradually. The second method leverages environment poisoning attacks, which encourage the agent to learn new, albeit incorrect, knowledge to remove the unlearning environment. Particularly, to tackle the third challenge, we introduce the concept of "environment inference attack" to evaluate the unlearning outcomes. The source code is available at https://anonymous.4open.science/r/Reinforcement-Unlearning-D347.


标题: Optimal service resource management strategy for IoT-based health information system considering value co-creation of users

作者: Ji Fang, Vincent CS Lee, Haiyan Wang

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2204.02521v2

Project: https://doi.org/10.1108/IMDS-03-2023-0173|

中文摘要: 本文研究最优服务资源管理策略,这是健康信息服务在提升服务绩效、优化服务资源利用并提供交互式服务方面面临的持续挑战。我们考虑健康信息服务中的价值共创模型,开发了一种自适应的最优服务资源管理策略,重点在于与用户的协作和互动。深度强化学习算法被嵌入基于物联网(IoT)的健康信息服务系统(I-HISS)中,根据用户参与行为控制服务提供与服务适配,从而分配服务资源。仿真实验评估了所提算法在用户对健康信息服务做出不同反应情形下的效果。

摘要: This paper explores optimal service resource management strategy, a continuous challenge for health information service to enhance service performance, optimise service resource utilisation and deliver interactive health information service. An adaptive optimal service resource management strategy was developed considering a value co-creation model in health information service with a focus on collaborative and interactive with users. The deep reinforcement learning algorithm was embedded in the Internet of Things (IoT)-based health information service system (I-HISS) to allocate service resources by controlling service provision and service adaptation based on user engagement behaviour. The simulation experiments were conducted to evaluate the significance of the proposed algorithm under different user reactions to the health information service.


标题: A comparison of RL-based and PID controllers for 6-DOF swimming robots: hybrid underwater object tracking

作者: Faraz Lotfi, Khalil Virji, Nicholas Dudek

PubTime: 2024-01-29

Downlink: http://arxiv.org/abs/2401.16618v1

GitHub: https://github.com/FARAZLOTFI/underwater-object-tracking|

中文摘要: 本文探索并评估了用集中式深度Q网络(DQN)控制器替代6自由度(6DOF)游泳机器人中常用的PID控制器,并以水下目标跟踪为具体案例说明这一转变。DQN具有数据效率高、可离策略学习等优点,同时比其他强化学习方法更易实现。鉴于我们的机器人缺乏动力学模型,我们提出用RL智能体控制这一多输入多输出(MIMO)系统,集中式控制器可以提供比多个独立PID更鲁棒的控制。我们的方案是先用经典控制器进行安全探索,再逐步过渡到由DQN完全控制机器人。我们将水下跟踪任务分为视觉和控制两个模块:视觉部分沿用成熟的基于视觉的跟踪方法,控制部分引入集中式DQN控制器。视觉模块只向控制模块传递边界框数据,因此既能适应各种目标物体,也便于更换视觉系统;同时,低维数据也使控制器能够以较低代价进行在线学习。在基于Unity的模拟器中进行的实验验证了集中式RL智能体相对于独立PID控制器的有效性,展示了该框架用于训练水下RL智能体的适用性,以及相对传统控制方法的性能提升。真实与仿真实现的代码见 https://github.com/FARAZLOTFI/underwater-object-tracking 。

摘要: In this paper, we present an exploration and assessment of employing a centralized deep Q-network (DQN) controller as a substitute for the prevalent use of PID controllers in the context of 6DOF swimming robots. Our primary focus centers on illustrating this transition with the specific case of underwater object tracking. DQN offers advantages such as data efficiency and off-policy learning, while remaining simpler to implement than other reinforcement learning methods. Given the absence of a dynamic model for our robot, we propose an RL agent to control this multi-input-multi-output (MIMO) system, where a centralized controller may offer more robust control than distinct PIDs. Our approach involves initially using classical controllers for safe exploration, then gradually shifting to DQN to take full control of the robot. We divide the underwater tracking task into vision and control modules. We use established methods for vision-based tracking and introduce a centralized DQN controller. By transmitting bounding box data from the vision module to the control module, we enable adaptation to various objects and effortless vision system replacement. Furthermore, dealing with low-dimensional data facilitates cost-effective online learning for the controller. Our experiments, conducted within a Unity-based simulator, validate the effectiveness of a centralized RL agent over separated PID controllers, showcasing the applicability of our framework for training the underwater RL agent and improved performance compared to traditional control methods. The code for both real and simulation implementations is at https://github.com/FARAZLOTFI/underwater-object-tracking.
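下面是一个示意性的控制回路骨架(假设性实现,并非论文代码),体现摘要中的两点:控制器输入只是视觉模块给出的低维边界框信息;训练初期由经典PID控制器负责安全探索,随后逐步把控制权交给DQN。其中切换概率的具体日程、图像尺寸等都是为说明而假设的。

```python
import random
import numpy as np

def bbox_error(bbox, img_w=640, img_h=480):
    """把边界框转成低维跟踪误差: 目标中心相对图像中心的偏移与归一化面积。"""
    x, y, w, h = bbox
    cx, cy = x + w / 2, y + h / 2
    return np.array([(cx - img_w / 2) / img_w,
                     (cy - img_h / 2) / img_h,
                     w * h / (img_w * img_h)])

def control_step(step, total_steps, state, pid_action_fn, dqn_action_fn):
    """训练初期使用PID安全探索, 按假设的线性日程逐步切换到DQN。"""
    p_dqn = min(1.0, step / (0.5 * total_steps))   # DQN接管概率线性增长
    if random.random() < p_dqn:
        return dqn_action_fn(state)    # 由Q网络argmax得到的离散推进指令
    return pid_action_fn(state)        # 经典PID给出的指令
```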


标题: Zero-Shot Reinforcement Learning via Function Encoders

作者: Tyler Ingebrand, Amy Zhang, Ufuk Topcu

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.17173v1

摘要: Although reinforcement learning (RL) can solve many challenging sequential decision making problems, achieving zero-shot transfer across related tasks remains a challenge. The difficulty lies in finding a good representation for the current task so that the agent understands how it relates to previously seen tasks. To achieve zero-shot transfer, we introduce the function encoder, a representation learning algorithm which represents a function as a weighted combination of learned, non-linear basis functions. By using a function encoder to represent the reward function or the transition function, the agent has information on how the current task relates to previously seen tasks via a coherent vector representation. Thus, the agent is able to achieve transfer between related tasks at run time with no additional training. We demonstrate state-of-the-art data efficiency, asymptotic performance, and training stability in three RL fields by augmenting basic RL algorithms with a function encoder task representation.
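摘要中"函数编码器"的思想可以用下面的极简示意来理解(NumPy实现,仅为说明系数如何得到,并非论文实现;这里用随机特征代替学习得到的基函数):给定若干(状态-动作, 奖励)样本,把当前任务的奖励函数投影到一组非线性基函数上,得到的系数向量就是该任务的表征,可直接作为策略的条件输入,实现无需再训练的迁移。

```python
import numpy as np

rng = np.random.default_rng(0)

def basis(x, W, b):
    """k个非线性基函数 phi_1..phi_k(此处用随机特征代替学习得到的基)。"""
    return np.tanh(x @ W + b)          # [N, k]

def encode_task(xs, ys, W, b, reg=1e-3):
    """用最小二乘求系数 c, 使 sum_i c_i * phi_i(x) ≈ f(x); c 即任务表征。"""
    Phi = basis(xs, W, b)                              # [N, k]
    A = Phi.T @ Phi + reg * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ ys)              # [k]

# 用法示意: 以奖励函数为例, 不同任务得到不同的系数向量
k, d = 16, 4
W, b = rng.normal(size=(d, k)), rng.normal(size=k)
xs = rng.normal(size=(200, d))                 # 状态-动作样本
ys = np.sin(xs[:, 0]) + 0.5 * xs[:, 1]         # 某个任务的奖励标签(举例)
task_repr = encode_task(xs, ys, W, b)          # 零样本迁移时的任务向量
```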


== Imitation Learning ==

标题: Dual RL: Unification and New Methods for Reinforcement and Imitation Learning

作者: Harshit Sikchi, Qinqing Zheng, Amy Zhang

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2302.08560v3

Project: https://hari-sikchi.github.io/dual-rl|

中文摘要: 强化学习(RL)的目标是找到最大化期望累积回报的策略。已有工作表明,该目标可以表示为线性约束下关于状态-动作访问分布的优化问题。该形式的对偶问题(我们称之为对偶RL)没有约束,更易于优化。在这项工作中,我们首先将多个最先进的离线RL和离线模仿学习(IL)算法统一为具有共享结构的对偶RL方法的实例。这一统一使我们能够找出已有方法缺陷的根本原因。对于离线IL,我们的分析表明,已有方法依赖一个限制性的覆盖假设,这在实践中极大地限制了其性能。为了解决这一局限,我们提出了一种新的无判别器方法ReCOIL,它可以从任意离策略数据中学习模仿,达到接近专家的性能。对于离线RL,我们的分析将近期的离线RL方法XQL纳入对偶框架,并进一步提出新方法f-DVL,为Gumbel回归损失提供了替代选择,从而修复了XQL已知的训练不稳定问题。我们在大量模拟机器人运动与操作任务上验证了ReCOIL和f-DVL这两种方法分别在IL和RL中的性能提升。项目代码和细节见 https://hari-sikchi.github.io/dual-rl 。

摘要: The goal of reinforcement learning (RL) is to find a policy that maximizes the expected cumulative return. It has been shown that this objective can be represented as an optimization problem of state-action visitation distribution under linear constraints. The dual problem of this formulation, which we refer to as dual RL, is unconstrained and easier to optimize. In this work, we first cast several state-of-the-art offline RL and offline imitation learning (IL) algorithms as instances of dual RL approaches with shared structures. Such unification allows us to identify the root cause of the shortcomings of prior methods. For offline IL, our analysis shows that prior methods are based on a restrictive coverage assumption that greatly limits their performance in practice. To fix this limitation, we propose a new discriminator-free method ReCOIL that learns to imitate from arbitrary off-policy data to obtain near-expert performance. For offline RL, our analysis frames a recent offline RL method XQL in the dual framework, and we further propose a new method f-DVL that provides alternative choices to the Gumbel regression loss that fixes the known training instability issue of XQL. The performance improvements by both of our proposed methods, ReCOIL and f-DVL, in IL and RL are validated on an extensive suite of simulated robot locomotion and manipulation tasks. Project code and details can be found at this https://hari-sikchi.github.io/dual-rl.


标题: Multi-task robot data for dual-arm fine manipulation

作者: Heecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2401.07603v2

Project: https://sites.google.com/view/multi-task-fine|

中文摘要: 在机器人操作领域,深度模仿学习被认为是获取操作技能的一种有前景的方法。此外,从多样化的机器人数据集中学习被视为实现通用性与适应性的可行途径。在此类研究中,机器人通过学习多种任务获得了跨多种物体的泛化能力。然而,现有多任务机器人数据集主要集中在精度要求相对较低的单臂任务上,没有涵盖机器人在真实世界中需要执行的细粒度物体操作。本文提出了一个多样化物体操作数据集,包含双臂任务和/或需要精细操作的任务。为此,我们构建了含22.4万个(224k)回合(150小时、1,104条语言指令)的数据集,涵盖移动碗、打开铅笔盒、剥香蕉等双臂精细任务,且数据已公开。此外,数据集还包含视觉注意力信号、双动作标签(将动作分解为鲁棒的伸取轨迹和与物体的精确交互),以及用于实现鲁棒而精确物体操作的语言指令。我们将该数据集用于我们的双动作与注意力模型(Dual-Action and Attention, DAA),该模型专为细粒度双臂操作任务设计,并对协变量偏移具有鲁棒性。该模型在真实机器人操作任务中进行了超过7千次试验,证明了其精细操作能力。数据集见 https://sites.google.com/view/multi-task-fine 。

摘要: In the field of robotic manipulation, deep imitation learning is recognized as a promising approach for acquiring manipulation skills. Additionally, learning from diverse robot datasets is considered a viable method to achieve versatility and adaptability. In such research, by learning various tasks, robots achieved generality across multiple objects. However, such multi-task robot datasets have mainly focused on single-arm tasks that are relatively imprecise, not addressing the fine-grained object manipulation that robots are expected to perform in the real world. This paper introduces a dataset of diverse object manipulations that includes dual-arm tasks and/or tasks requiring fine manipulation. To this end, we have generated dataset with 224k episodes (150 hours, 1,104 language instructions) which includes dual-arm fine tasks such as bowl-moving, pencil-case opening or banana-peeling, and this data is publicly available. Additionally, this dataset includes visual attention signals as well as dual-action labels, a signal that separates actions into a robust reaching trajectory and precise interaction with objects, and language instructions to achieve robust and precise object manipulation. We applied the dataset to our Dual-Action and Attention (DAA), a model designed for fine-grained dual arm manipulation tasks and robust against covariate shifts. The model was tested with over 7k total trials in real robot manipulation tasks, demonstrating its capability in fine manipulation. The dataset is available at https://sites.google.com/view/multi-task-fine.


标题: Interpretable Imitation Learning with Dynamic Causal Relations

作者: Tianxiang Zhao, Wenchao Yu, Suhang Wang

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2310.00489v4

中文摘要: 模仿学习通过模仿专家演示来学习智能体策略,在医疗方案制定和自动驾驶等许多应用中展现出良好效果。然而,解释智能体学到的控制策略仍然困难,主要来自两个方面:1)模仿学习中的智能体通常由深度神经网络实现,属于黑盒模型,缺乏可解释性;2)智能体决策背后的潜在因果机制可能沿轨迹变化,而非在所有时间步上保持不变。为了提高透明度并为神经智能体提供更好的可解释性,我们建议以有向无环因果图的形式呈现其捕获的知识,其中节点是动作和状态变量,边表示预测背后的因果关系。此外,我们将因果发现过程设计为依赖于状态,使其能够对潜在因果图的动态变化建模。具体而言,我们从格兰杰因果关系的角度进行因果发现,并提出一个可自我解释的模仿学习框架{\method}。该框架由动态因果发现模块、因果编码模块和预测模块三部分组成,并以端到端方式训练。模型训练完成后,我们可以得到其决策背后状态与动作变量之间的因果关系,从而揭示其学到的策略。在合成与真实数据集上的实验结果表明,所提出的{\method}能够学习动态因果图以理解模仿学习的决策过程,同时保持较高的预测精度。

摘要: Imitation learning, which learns agent policy by mimicking expert demonstration, has shown promising results in many applications such as medical treatment regimes and self-driving vehicles. However, it remains a difficult task to interpret control policies learned by the agent. Difficulties mainly come from two aspects: 1) agents in imitation learning are usually implemented as deep neural networks, which are black-box models and lack interpretability; 2) the latent causal mechanism behind agents’ decisions may vary along the trajectory, rather than staying static throughout time steps. To increase transparency and offer better interpretability of the neural agent, we propose to expose its captured knowledge in the form of a directed acyclic causal graph, with nodes being action and state variables and edges denoting the causal relations behind predictions. Furthermore, we design this causal discovery process to be state-dependent, enabling it to model the dynamics in latent causal graphs. Concretely, we conduct causal discovery from the perspective of Granger causality and propose a self-explainable imitation learning framework, {\method}. The proposed framework is composed of three parts: a dynamic causal discovery module, a causality encoding module, and a prediction module, and is trained in an end-to-end manner. After the model is learned, we can obtain causal relations among states and action variables behind its decisions, exposing policies learned by it. Experimental results on both synthetic and real-world datasets demonstrate the effectiveness of the proposed {\method} in learning the dynamic causal graphs for understanding the decision-making of imitation learning meanwhile maintaining high prediction accuracy.


标题: Extrinsicaly Rewarded Soft Q Imitation Learning with Discriminator

作者: Ryoma Furuyama, Daiki Kuyoshi, Satoshi Yamane

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.16772v1

中文摘要: 在奖励难以设计或奖励稀疏的环境中,模仿学习常与强化学习结合使用,但仅凭少量专家数据和采样数据,很难在未知状态下做到很好的模仿。行为克隆等监督学习方法不需要采样数据,但通常存在分布偏移问题。基于强化学习的方法,如逆强化学习和生成对抗模仿学习(GAIL),仅凭少量专家数据即可学习,但往往需要与环境交互。软Q模仿学习(SQIL)解决了这些问题,并表明将行为克隆与采用恒定奖励的软Q学习相结合可以高效学习。为使该算法对分布偏移更加鲁棒,我们在此方法上增加了一个基于对抗逆强化学习的奖励函数,当智能体在与演示相似的状态下执行动作时给予奖励,从而得到更高效、更鲁棒的算法。我们将该算法称为判别器软Q模仿学习(DSQIL),并在MuJoCo环境中对其进行了评估。

摘要: Imitation learning is often used in addition to reinforcement learning in environments where reward design is difficult or where the reward is sparse, but it is difficult to be able to imitate well in unknown states from a small amount of expert data and sampling data. Supervised learning methods such as Behavioral Cloning do not require sampling data, but usually suffer from distribution shift. The methods based on reinforcement learning, such as inverse reinforcement learning and Generative Adversarial imitation learning (GAIL), can learn from only a few expert data. However, they often need to interact with the environment. Soft Q imitation learning (SQIL) addressed the problems, and it was shown that it could learn efficiently by combining Behavioral Cloning and soft Q-learning with constant rewards. In order to make this algorithm more robust to distribution shift, we propose more efficient and robust algorithm by adding to this method a reward function based on adversarial inverse reinforcement learning that rewards the agent for performing actions in status similar to the demo. We call this algorithm Discriminator Soft Q Imitation Learning (DSQIL). We evaluated it on MuJoCo environments.
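摘要所述思路可以粗略示意如下(假设性的PyTorch片段,非论文实现):SQIL给示范转移固定奖励1、给智能体自采样转移奖励0;在此基础上再叠加一个AIRL风格的判别器奖励(log D − log(1−D)),鼓励智能体在与示范相似的状态下采取相似动作。网络结构与权重 w 均为假设的超参数。

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D(s, a) 接近1表示该(状态, 动作)像示范数据(结构仅作示意)。"""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def dsqil_reward(disc, s, a, is_demo, w=0.5, eps=1e-6):
    """SQIL的常数奖励 + AIRL风格的判别器奖励(权重w为假设的超参数)。"""
    base = is_demo.float()                                # 示范=1, 采样=0
    d = disc(s, a).clamp(eps, 1 - eps)
    shaped = torch.log(d) - torch.log(1 - d)              # log D - log(1-D)
    return base + w * shaped
```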


标题: ILBiT: Imitation Learning for Robot Using Position and Torque Information based on Bilateral Control with Transformer

作者: Masato Kobayashi, Thanpimon Buamanee, Yuki Uranishi

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.16653v1

摘要: Autonomous manipulation in robot arms is a complex and evolving field of study in robotics. This paper introduces an innovative approach to this challenge by focusing on imitation learning (IL). Unlike traditional imitation methods, our approach uses IL based on bilateral control, allowing for more precise and adaptable robot movements. The conventional IL based on bilateral control method have relied on Long Short-Term Memory (LSTM) networks. In this paper, we present the IL for robot using position and torque information based on Bilateral control with Transformer (ILBiT). This proposed method employs the Transformer model, known for its robust performance in handling diverse datasets and its capability to surpass LSTM’s limitations, especially in tasks requiring detailed force adjustments. A standout feature of ILBiT is its high-frequency operation at 100 Hz, which significantly improves the system’s adaptability and response to varying environments and objects of different hardness levels. The effectiveness of the Transformer-based ILBiT method can be seen through comprehensive real-world experiments.


标题: Inverse Reinforcement Learning without Reinforcement Learning

作者: Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell

PubTime: 2024-01-29

Downlink: http://arxiv.org/abs/2303.14623v4

摘要: Inverse Reinforcement Learning (IRL) is a powerful set of techniques for imitation learning that aims to learn a reward function that rationalizes expert demonstrations. Unfortunately, traditional IRL methods suffer from a computational weakness: they require repeatedly solving a hard reinforcement learning (RL) problem as a subroutine. This is counter-intuitive from the viewpoint of reductions: we have reduced the easier problem of imitation learning to repeatedly solving the harder problem of RL. Another thread of work has proved that access to the side-information of the distribution of states where a strong policy spends time can dramatically reduce the sample and computational complexities of solving an RL problem. In this work, we demonstrate for the first time a more informed imitation learning reduction where we utilize the state distribution of the expert to alleviate the global exploration component of the RL subroutine, providing an exponential speedup in theory. In practice, we find that we are able to significantly speed up the prior art on continuous control tasks.


== Robotic Agent ==

标题: Generative Expressive Robot Behaviors using Large Language Models

作者: Karthik Mahadevan, Jonathan Chien, Noah Brown

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.14673v2

Project: https://generative-expressive-motion.github.io/|

中文摘要: 人们通过富有表现力的行为来有效地交流并与他人协调行动,例如向注视自己的人点头致意,或在拥挤的走廊里说"借过"以从人群中通过。我们希望机器人在人机交互中也能表现出这类富有表现力的行为。已有工作提出的基于规则的方法难以扩展到新的交流模态或社交情境,而数据驱动的方法则需要为机器人所处的每种社交情境准备专门的数据集。我们提出利用大语言模型(LLMs)提供的丰富社会语境,及其根据指令或用户偏好生成动作的能力,来生成可适应、可组合并能彼此叠加的表现力机器人动作。我们的方法使用少样本思维链提示,将人类语言指令翻译成基于机器人现有及已学技能的参数化控制代码。通过用户研究和仿真实验,我们证明该方法生成的行为被用户认为是胜任且易于理解的。补充材料见 https://generative-expressive-motion.github.io/ 。

摘要: People employ expressive behaviors to effectively communicate and coordinate their actions with others, such as nodding to acknowledge a person glancing at them or saying “excuse me” to pass people in a busy corridor. We would like robots to also demonstrate expressive behaviors in human-robot interaction. Prior work proposes rule-based methods that struggle to scale to new communication modalities or social situations, while data-driven methods require specialized datasets for each social situation the robot is used in. We propose to leverage the rich social context available from large language models (LLMs) and their ability to generate motion based on instructions or user preferences, to generate expressive robot motion that is adaptable and composable, building upon each other. Our approach utilizes few-shot chain-of-thought prompting to translate human language instructions into parametrized control code using the robot’s available and learned skills. Through user studies and simulation experiments, we demonstrate that our approach produces behaviors that users found to be competent and easy to understand. Supplementary material can be found at https://generative-expressive-motion.github.io/.
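摘要中"用少样本思维链提示把人类指令翻译成参数化控制代码"的流程,大致可以如下示意。注意:`call_llm` 是假设的占位接口而非真实API,技能名(look_at、nod等)与示例也只是举例,实际技能集由机器人自身决定。

```python
FEW_SHOT = """你是一个机器人动作生成器, 只能调用以下技能:
look_at(x, y), nod(times, speed), say(text), move_base(dx, dy, speed)

指令: 有人从旁边走过, 向他点头示意
思考: 需要先看向对方, 再缓慢点头一次表示认可
代码: look_at(0.8, 1.6); nod(times=1, speed="slow")
"""

def expressive_behavior(instruction: str, call_llm) -> str:
    """把自然语言指令翻译成参数化技能调用序列(call_llm为假设的LLM接口)。"""
    prompt = FEW_SHOT + f"\n指令: {instruction}\n思考:"
    completion = call_llm(prompt)          # 期望返回"思考 + 代码"两段
    # 只保留"代码:"之后的部分, 交给机器人技能执行器
    return completion.split("代码:")[-1].strip()
```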


标题: Towards Unified Interactive Visual Grounding in The Wild

作者: Jie Xu, Hanbo Zhang, Qingyi Si

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.16699v1

GitHub: https://github.com/jxu124/TiO|

摘要: Interactive visual grounding in Human-Robot Interaction (HRI) is challenging yet practical due to the inevitable ambiguity in natural languages. It requires robots to disambiguate the user input by active information gathering. Previous approaches often rely on predefined templates to ask disambiguation questions, resulting in performance reduction in realistic interactive scenarios. In this paper, we propose TiO, an end-to-end system for interactive visual grounding in human-robot interaction. Benefiting from a unified formulation of visual dialogue and grounding, our method can be trained on a joint of extensive public data, and show superior generality to diversified and challenging open-world scenarios. In the experiments, we validate TiO on GuessWhat?! and InViG benchmarks, setting new state-of-the-art performance by a clear margin. Moreover, we conduct HRI experiments on the carefully selected 150 challenging scenes as well as real-robot platforms. Results show that our method demonstrates superior generality to diversified visual and language inputs with a high success rate. Codes and demos are available at https://github.com/jxu124/TiO.


标题: MResT: Multi-Resolution Sensing for Real-Time Control with Vision-Language Models

作者: Saumya Saxena, Mohit Sharma, Oliver Kroemer

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.14502v1

Project: http://tinyurl.com/multi-res-realtime-control|

中文摘要: 利用不同空间和时间分辨率的传感模态可以提高机器人操作任务的性能。多空间分辨率传感提供了在不同空间尺度下捕获的层级信息,同时支持粗略和精细的运动;多时间分辨率传感则使智能体具备高反应性和实时控制能力。在这项工作中,我们提出了MResT(Multi-Resolution Transformer)框架,用于学习可泛化的、语言条件化的多任务策略,它利用不同空间和时间分辨率的传感,并使用容量不同的网络,高效地对精确任务和反应性任务进行实时控制。我们利用现成的预训练视觉语言模型处理低频全局特征,并用小型非预训练模型适应高频局部反馈。通过在粗略、精确和动态操作任务三个领域的大量实验,我们表明该方法相比近期多任务基线有显著提升(平均2倍)。此外,该方法对目标物体的视觉和几何变化以及不同交互力具有良好的泛化能力。

摘要: Leveraging sensing modalities across diverse spatial and temporal resolutions can improve performance of robotic manipulation tasks. Multi-spatial resolution sensing provides hierarchical information captured at different spatial scales and enables both coarse and precise motions. Simultaneously multi-temporal resolution sensing enables the agent to exhibit high reactivity and real-time control. In this work, we propose a framework, MResT (Multi-Resolution Transformer), for learning generalizable language-conditioned multi-task policies that utilize sensing at different spatial and temporal resolutions using networks of varying capacities to effectively perform real time control of precise and reactive tasks. We leverage off-the-shelf pretrained vision-language models to operate on low-frequency global features along with small non-pretrained models to adapt to high frequency local feedback. Through extensive experiments in 3 domains (coarse, precise and dynamic manipulation tasks), we show that our approach significantly improves (2X on average) over recent multi-task baselines. Further, our approach generalizes well to visual and geometric variations in target objects and to varying interaction forces.
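摘要中的"多时间分辨率"控制可以用下面的骨架来示意(假设性实现,非论文代码):大容量的预训练视觉-语言模型以低频更新全局特征并缓存,小网络在每个控制周期用高频局部反馈与缓存特征一起输出动作。各维度、频率比例均为假设。

```python
import torch
import torch.nn as nn

class FastPolicy(nn.Module):
    """小容量策略头: 融合缓存的全局特征与高频局部反馈(结构为示意)。"""
    def __init__(self, global_dim=512, local_dim=32, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(global_dim + local_dim, 128), nn.ReLU(),
            nn.Linear(128, act_dim),
        )

    def forward(self, g, l):
        return self.net(torch.cat([g, l], dim=-1))

def control_loop(vlm_encode, policy, get_image, get_local_feedback,
                 steps=1000, slow_every=20):
    """每 slow_every 步调用一次大模型刷新全局特征, 其余步只跑小网络。"""
    global_feat = None
    for t in range(steps):
        if t % slow_every == 0:
            with torch.no_grad():
                global_feat = vlm_encode(get_image())        # 低频、高延迟
        action = policy(global_feat, get_local_feedback())   # 高频、低延迟
        yield action
```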


标题: Memory-centered and Affordance-based Framework for Mobile Manipulation

作者: Christoph Pohl, Fabian Reister, Fabian Peller-Konrad

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.16899v1

中文摘要: 在以人为中心的环境中执行多样的移动操作动作,需要高度复杂的软件框架:既要足够灵活以处理特殊用例,又要足够通用以适用于不同的机器人系统、任务和环境。本文提出了一个全面的、以记忆为中心、基于可供性(affordance)的模块化框架,支持单手和多手抓取及移动操作,适用于仿人机器人等自由度很高的复杂机器人系统。通过用可供性(即机器人与环境的交互可能性)来表示移动操作动作,我们统一了任意环境中针对已知和未知物体的自主操作流程。该框架被集成并嵌入ARMAR仿人机器人家族以记忆为中心的认知架构中。这样,机器人不仅可以与物理世界交互,还能使用关于物体的常识,并学习和调整操作策略。我们在真实世界实验中展示了该框架的适用性,包括在两个不同的仿人机器人平台上抓取已知和未知物体、放置物体,以及半自主的双手抓取。

摘要: Performing versatile mobile manipulation actions in human-centered environments requires highly sophisticated software frameworks that are flexible enough to handle special use cases, yet general enough to be applicable across different robotic systems, tasks, and environments. This paper presents a comprehensive memory-centered, affordance-based, and modular uni- and multi-manual grasping and mobile manipulation framework, applicable to complex robot systems with a high number of degrees of freedom such as humanoid robots. By representing mobile manipulation actions through affordances, i.e., interaction possibilities of the robot with its environment, we unify the autonomous manipulation process for known and unknown objects in arbitrary environments. Our framework is integrated and embedded into the memory-centric cognitive architecture of the ARMAR humanoid robot family. This way, robots can not only interact with the physical world but also use common knowledge about objects, and learn and adapt manipulation strategies. We demonstrate the applicability of the framework in real-world experiments, including grasping known and unknown objects, object placing, and semi-autonomous bimanual grasping of objects on two different humanoid robot platforms.


标题: Excitation Trajectory Optimization for Dynamic Parameter Identification Using Virtual Constraints in Hands-on Robotic System

作者: Huanyu Tian, Martin Huber, Christopher E. Mower

PubTime: 2024-01-29

Downlink: http://arxiv.org/abs/2401.16566v1

中文摘要: 本文提出了一种新的、计算效率更高的机器人激励轨迹优化方法,用于动力学参数辨识,并强调自碰撞规避。这解决了为可装配多种工具、与人协同操作的机械臂获取高质量训练数据的系统辨识难题,这类场景在工业以及临床和研究环境中都很常见。该方法利用统一机器人描述格式(URDF)实现了递归牛顿-欧拉算法(RNEA)的符号化Python实现,从而能通过对真实机器人数据的回归分析动态估计惯性等参数。与未考虑自碰撞和工具标定的最新已报道结果相比,所评估并实现的激励轨迹达到了同等水平的指标。此外,我们在外科手术场景中进行了物理人机交互(pHRI)导纳控制实验以评估所得到的逆动力学模型,NASA TLX问卷结果显示工作负荷降低了30.1%。

摘要: This paper proposes a novel, more computationally efficient method for optimizing robot excitation trajectories for dynamic parameter identification, emphasizing self-collision avoidance. This addresses the system identification challenges for getting high-quality training data associated with co-manipulated robotic arms that can be equipped with a variety of tools, a common scenario in industrial but also clinical and research contexts. Utilizing the Unified Robotics Description Format (URDF) to implement a symbolic Python implementation of the Recursive Newton-Euler Algorithm (RNEA), the approach aids in dynamically estimating parameters such as inertia using regression analyses on data from real robots. The excitation trajectory was evaluated and achieved on par criteria when compared to state-of-the-art reported results which didn’t consider self-collision and tool calibrations. Furthermore, physical Human-Robot Interaction (pHRI) admittance control experiments were conducted in a surgical context to evaluate the derived inverse dynamics model showing a 30.1% workload reduction by the NASA TLX questionnaire.
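摘要提到"用回归分析从真实机器人数据中估计惯性等动力学参数",其依据是机械臂动力学对参数向量线性:τ = Y(q, q̇, q̈)·θ。下面是一个与论文实现无关的最小二乘示意,其中回归矩阵 Y 假设已由(例如RNEA符号推导得到的)`regressor` 计算好:

```python
import numpy as np

def identify_dynamic_params(Y_list, tau_list, reg=1e-8):
    """堆叠各采样时刻的回归矩阵与关节力矩, 最小二乘求参数向量theta。

    Y_list:   每个时刻的回归矩阵 Y(q, dq, ddq), 形状 [n_joints, n_params]
    tau_list: 对应时刻测得的关节力矩, 形状 [n_joints]
    """
    Y = np.vstack(Y_list)                      # [T*n_joints, n_params]
    tau = np.concatenate(tau_list)             # [T*n_joints]
    A = Y.T @ Y + reg * np.eye(Y.shape[1])     # 轻微正则, 防止病态
    theta = np.linalg.solve(A, Y.T @ tau)
    residual = np.linalg.norm(Y @ theta - tau) / np.sqrt(len(tau))
    return theta, residual
```

激励轨迹优化通常以堆叠回归矩阵的条件数等指标为目标,使上述最小二乘的数值性态更好;按摘要所述,本文在此基础上额外考虑了自碰撞规避约束。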


标题: Security Considerations in AI-Robotics: A Survey of Current Methods, Challenges, and Opportunities

作者: Subash Neupane, Shaswata Mitra, Ivan A. Fernandez

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2310.08565v3

中文摘要: 机器人和人工智能(AI)从一开始就密不可分地交织在一起。今天,人工智能机器人系统已经成为我们日常生活中不可或缺的一部分,从机器人吸尘器到半自动汽车。这些系统建立在三个基本架构元素之上:感知、导航和规划以及控制。然而,尽管人工智能机器人系统的集成提高了我们的生活质量,但它也带来了一个严重的问题——这些系统容易受到安全攻击。构成人工智能机器人系统的物理组件、算法和数据可能会被恶意行为者利用,潜在地导致可怕的后果。出于解决人工智能机器人系统中安全问题的需要,本文提出了一个跨三个维度的全面调查和分类:攻击面,伦理和法律问题,以及人机交互(HRI)安全。我们的目标是为用户、开发者和其他利益相关者提供对这些领域的整体理解,以增强整体人工智能机器人系统的安全性。我们从调查潜在的攻击面开始,并提供缓解防御策略。然后,我们深入研究伦理问题,如依赖性和心理影响,以及关于这些系统问责制的法律问题。此外,还讨论了HRI等新兴趋势,考虑了隐私、完整性、安全性、可信度和可解释性问题。最后,我们提出了我们对这一充满活力和前景的领域的未来研究方向的展望。

摘要: Robotics and Artificial Intelligence (AI) have been inextricably intertwined since their inception. Today, AI-Robotics systems have become an integral part of our daily lives, from robotic vacuum cleaners to semi-autonomous cars. These systems are built upon three fundamental architectural elements: perception, navigation and planning, and control. However, while the integration of AI-Robotics systems has enhanced the quality our lives, it has also presented a serious problem - these systems are vulnerable to security attacks. The physical components, algorithms, and data that make up AI-Robotics systems can be exploited by malicious actors, potentially leading to dire consequences. Motivated by the need to address the security concerns in AI-Robotics systems, this paper presents a comprehensive survey and taxonomy across three dimensions: attack surfaces, ethical and legal concerns, and Human-Robot Interaction (HRI) security. Our goal is to provide users, developers and other stakeholders with a holistic understanding of these areas to enhance the overall AI-Robotics system security. We begin by surveying potential attack surfaces and provide mitigating defensive strategies. We then delve into ethical issues, such as dependency and psychological impact, as well as the legal concerns regarding accountability for these systems. Besides, emerging trends such as HRI are discussed, considering privacy, integrity, safety, trustworthiness, and explainability concerns. Finally, we present our vision for future research directions in this dynamic and promising field.


== Object Detection ==

标题: YOLO-World: Real-Time Open-Vocabulary Object Detection

作者: Tianheng Cheng, Lin Song, Yixiao Ge

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.17270v1

GitHub: https://github.com/AILab-CVC/YOLO-World|

中文摘要: You Only Look Once(YOLO)系列检测器已经成为高效实用的工具。然而,它们依赖预定义并经过训练的目标类别,这限制了其在开放场景中的适用性。针对这一限制,我们提出了YOLO-World,一种创新方法,通过视觉-语言建模和大规模数据集预训练,为YOLO赋予开放词汇检测能力。具体来说,我们提出了新的可重参数化视觉-语言路径聚合网络(RepVL-PAN)和区域-文本对比损失,以促进视觉信息与语言信息之间的交互。我们的方法能够以零样本方式高效地检测大范围物体。在具有挑战性的LVIS数据集上,YOLO-World在V100上以52.0 FPS达到35.4 AP,在精度和速度上都超过了许多最先进的方法。此外,经过微调的YOLO-World在目标检测和开放词汇实例分割等多个下游任务上也取得了出色的性能。

摘要: The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.
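摘要中的"区域-文本对比损失"可以粗略示意如下(假设性的PyTorch片段,非官方实现):把每个候选区域的视觉嵌入与各类别名的文本嵌入做内积得到相似度,再以区域的真实类别作为监督做交叉熵式的对比学习。维度、温度系数等均为假设。

```python
import torch
import torch.nn.functional as F

def region_text_contrastive_loss(region_emb, text_emb, target_cls, tau=0.05):
    """region_emb: [N, D] 区域嵌入; text_emb: [C, D] 类别文本嵌入;
    target_cls: [N] 每个区域对应的类别下标(示意: 仅含已匹配的正样本区域)。"""
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = region_emb @ text_emb.t() / tau      # [N, C] 区域-文本相似度
    return F.cross_entropy(logits, target_cls)

# 用法示意
r = torch.randn(8, 256)          # 8个候选区域的嵌入
t = torch.randn(20, 256)         # 20个词表类别的文本嵌入(开放词汇时可在线替换)
y = torch.randint(0, 20, (8,))
loss = region_text_contrastive_loss(r, t, y)
```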


标题: H-SynEx: Using synthetic images and ultra-high resolution ex vivo MRI for hypothalamus subregion segmentation

作者: Livia Rodrigues, Martina Bocchetta, Oula Puonti

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.17104v1

GitHub: https://github.com/liviamarodrigues/hsynex|

中文摘要: 目的:开发一种由超高分辨率离体磁共振图像(MRI)提供信息的下丘脑亚区自动分割方法,该方法无需重新训练即可泛化到不同MRI序列和分辨率。材料与方法:我们用合成图像训练了深度学习方法H-SynEx,这些合成图像来自由超高分辨率离体MRI扫描构建的标签图,与1毫米各向同性的活体图像相比,这使得更细粒度的手工分割成为可能。我们使用来自六个数据集、六种MRI序列的1535幅活体图像对这项回顾性研究进行了验证。定量评估采用Dice系数(DC)和平均Hausdorff距离(AVD)。统计分析使用曲线下面积(AUC)和Wilcoxon秩和检验,比较了对照组、阿尔茨海默病(AD)和行为变异型额颞叶痴呆(bvFTD)受试者的下丘脑亚区体积。结果:H-SynEx可以在多种MRI序列中分割下丘脑,包括层间距较大(5mm)的FLAIR序列。使用T1w图像上的下丘脑体积区分对照组与AD、bvFTD患者,AUC分别为0.74和0.79;此外,在FLAIR扫描上比较对照组与非患者的体积变化时,AUC=0.66。结论:结果表明,H-SynEx成功利用了超高分辨率扫描中的信息,可对T1w、T2w、PD、qT1、FA和FLAIR等不同MRI序列的活体图像进行分割。我们还发现,该自动分割方法能够在层间距5毫米的FLAIR图像上区分对照组与患者。H-SynEx已在 https://github.com/liviamarodrigues/hsynex 开源。

摘要: Purpose: To develop a method for automated segmentation of hypothalamus subregions informed by ultra-high resolution ex vivo magnetic resonance images (MRI), which generalizes across MRI sequences and resolutions without retraining. Materials and Methods: We trained our deep learning method, H-synEx, with synthetic images derived from label maps built from ultra-high resolution ex vivo MRI scans, which enables finer-grained manual segmentation when compared with 1mm isometric in vivo images. We validated this retrospective study using 1535 in vivo images from six datasets and six MRI sequences. The quantitative evaluation used the Dice Coefficient (DC) and Average Hausdorff distance (AVD). Statistical analysis compared hypothalamic subregion volumes in controls, Alzheimer’s disease (AD), and behavioral variant frontotemporal dementia (bvFTD) subjects using the area under the curve (AUC) and Wilcoxon rank sum test. Results: H-SynEx can segment the hypothalamus across various MRI sequences, encompassing FLAIR sequences with significant slice spacing (5mm). Using hypothalamic volumes on T1w images to distinguish control from AD and bvFTD patients, we observed AUC values of 0.74 and 0.79 respectively. Additionally, AUC=0.66 was found for volume variation on FLAIR scans when comparing control and non-patients. Conclusion: Our results show that H-SynEx successfully leverages information from ultra-high resolution scans to segment in vivo from different MRI sequences such as T1w, T2w, PD, qT1, FA, and FLAIR. We also found that our automated segmentation was able to discriminate controls versus patients on FLAIR images with 5mm spacing. H-SynEx is openly available at https://github.com/liviamarodrigues/hsynex.


标题: MF-MOS: A Motion-Focused Model for Moving Object Segmentation

作者: Jintao Cheng, Kang Zeng, Zhuoxu Huang

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.17023v1

GitHub: https://github.com/SCNU-RISLAB/MF-MOS|

中文摘要: 移动目标分割(MOS)为检测交通参与者提供了可靠的解决方案,因此在自动驾驶领域备受关注。动态信息的捕获在MOS问题中始终至关重要。以往的方法直接从距离图像中提取运动特征;与之不同,我们认为残差图蕴含着更大的运动信息潜力,而距离图像则包含丰富的语义引导。基于这一直觉,我们提出了MF-MOS,一种面向激光雷达移动目标分割、以运动为核心的双分支模型。我们通过从残差图中捕获运动、从距离图像中生成语义特征来解耦时空信息,并将语义特征作为可移动物体的引导提供给运动分支。这一简洁而独特的方案能够充分利用距离图像和残差图,从而大幅提升基于激光雷达的MOS任务性能。值得注意的是,MF-MOS在提交时于SemanticKITTI数据集的MOS榜单上取得了76.7%的领先IoU,达到当前最先进水平。MF-MOS的实现已发布于 https://github.com/SCNU-RISLAB/MF-MOS 。

摘要: Moving object segmentation (MOS) provides a reliable solution for detecting traffic participants and thus is of great interest in the autonomous driving field. Dynamic capture is always critical in the MOS problem. Previous methods capture motion features from the range images directly. Differently, we argue that the residual maps provide greater potential for motion information, while range images contain rich semantic guidance. Based on this intuition, we propose MF-MOS, a novel motion-focused model with a dual-branch structure for LiDAR moving object segmentation. Novelly, we decouple the spatial-temporal information by capturing the motion from residual maps and generating semantic features from range images, which are used as movable object guidance for the motion branch. Our straightforward yet distinctive solution can make the most use of both range images and residual maps, thus greatly improving the performance of the LiDAR-based MOS task. Remarkably, our MF-MOS achieved a leading IoU of 76.7% on the MOS leaderboard of the SemanticKITTI dataset upon submission, demonstrating the current state-of-the-art performance. The implementation of our MF-MOS has been released at https://github.com/SCNU-RISLAB/MF-MOS.
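摘要中反复出现的"残差图",在基于激光雷达的MOS工作中通常指:把过去帧点云变换到当前帧坐标系后重新投影成距离图像,再与当前帧距离图像逐像素做归一化差值,动态物体处差值较大。下面是一个简化示意(假设历史帧距离图像已完成配准与重投影,非论文实现):

```python
import numpy as np

def residual_map(range_cur, range_past_aligned, eps=1e-6):
    """逐像素归一化距离残差: |r_t - r_aligned| / r_t, 无效像素置0。

    range_cur:          当前帧距离图像 [H, W], 无效处为0
    range_past_aligned: 已变换到当前帧坐标并重投影的历史帧距离图像 [H, W]
    """
    valid = (range_cur > eps) & (range_past_aligned > eps)
    res = np.zeros_like(range_cur)
    res[valid] = np.abs(range_cur[valid] - range_past_aligned[valid]) / range_cur[valid]
    return res
```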


标题: CoSSegGaussians: Compact and Swift Scene Segmenting 3D Gaussians with Dual Feature Fusion

作者: Bin Dou, Tianyu Zhang, Yongjia Ma

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.05925v3

Project: https://David-Dou.github.io/CoSSegGaussians|

中文摘要: 我们提出紧凑快速的分割3D高斯(CoSSegGaussians),这是一种仅以RGB图像为输入、以较快渲染速度实现紧凑且3D一致的场景分割的方法。以往基于NeRF的分割方法依赖耗时的神经场景优化;最近的3D高斯泼溅(3D Gaussian Splatting)虽然显著提升了速度,但现有基于高斯的分割方法很难生成紧凑的掩码,尤其是在零样本分割中。这一问题可能源于它们直接为每个高斯分配可学习参数,导致对跨视角不一致的2D机器生成标签缺乏鲁棒性。我们的方法通过采用双特征融合网络作为高斯的分割场来解决这一问题。具体来说,我们先在RGB监督下优化3D高斯;完成高斯定位后,通过显式反投影将从图像中提取的DINO特征赋予高斯,并进一步与来自高效点云处理网络的空间特征相结合。随后利用特征聚合,以全局到局部的策略将二者融合为紧凑的分割特征。实验结果表明,我们的模型在语义和全景零样本分割任务上均优于基线,同时推理时间不到基于NeRF方法的10%。代码和更多结果将发布于 https://David-Dou.github.io/CoSSegGaussians 。

摘要: We propose Compact and Swift Segmenting 3D Gaussians(CoSSegGaussians), a method for compact 3D-consistent scene segmentation at fast rendering speed with only RGB images input. Previous NeRF-based segmentation methods have relied on time-consuming neural scene optimization. While recent 3D Gaussian Splatting has notably improved speed, existing Gaussian-based segmentation methods struggle to produce compact masks, especially in zero-shot segmentation. This issue probably stems from their straightforward assignment of learnable parameters to each Gaussian, resulting in a lack of robustness against cross-view inconsistent 2D machine-generated labels. Our method aims to address this problem by employing Dual Feature Fusion Network as Gaussians’ segmentation field. Specifically, we first optimize 3D Gaussians under RGB supervision. After Gaussian Locating, DINO features extracted from images are applied through explicit unprojection, which are further incorporated with spatial features from the efficient point cloud processing network. Feature aggregation is utilized to fuse them in a global-to-local strategy for compact segmentation features. Experimental results show that our model outperforms baselines on both semantic and panoptic zero-shot segmentation task, meanwhile consumes less than 10% inference time compared to NeRF-based methods. Code and more results will be available at https://David-Dou.github.io/CoSSegGaussians


标题: Fourier Prompt Tuning for Modality-Incomplete Scene Segmentation

作者: Ruiping Liu, Jiaming Zhang, Kunyu Peng

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.16923v1

GitHub: https://github.com/RuipingL/MISS|

中文摘要: 整合来自多个模态的信息可以增强自动驾驶汽车中场景感知系统的鲁棒性,提供更全面、更可靠的感知框架。然而,多模态分割中的模态不完整问题仍未得到充分研究。在这项工作中,我们建立了一个称为模态不完整场景分割(MISS)的任务,涵盖系统级的模态缺失和传感器级的模态错误。为避免多模态融合中对主导模态的依赖,我们引入了缺失感知模态切换(MMS)策略,在训练期间主动管理缺失模态;利用比特级的批内采样可同时提升模型在完整和不完整测试场景下的性能。此外,我们提出了傅立叶提示调优(FPT)方法,将有代表性的频谱信息注入少量可学习提示中,以在所有MISS场景下保持鲁棒性,其效果接近微调,但可调参数更少(仅1.1%)。大量实验证明了所提方法的有效性:在模态缺失情形下,相比此前最先进的参数高效方法,mIoU提升了5.84%。源代码将公开于 https://github.com/RuipingL/MISS 。

摘要: Integrating information from multiple modalities enhances the robustness of scene perception systems in autonomous vehicles, providing a more comprehensive and reliable sensory framework. However, the modality incompleteness in multi-modal segmentation remains under-explored. In this work, we establish a task called Modality-Incomplete Scene Segmentation (MISS), which encompasses both system-level modality absence and sensor-level modality errors. To avoid the predominant modality reliance in multi-modal fusion, we introduce a Missing-aware Modal Switch (MMS) strategy to proactively manage missing modalities during training. Utilizing bit-level batch-wise sampling enhances the model’s performance in both complete and incomplete testing scenarios. Furthermore, we introduce the Fourier Prompt Tuning (FPT) method to incorporate representative spectral information into a limited number of learnable prompts that maintain robustness against all MISS scenarios. Akin to fine-tuning effects but with fewer tunable parameters (1.1%). Extensive experiments prove the efficacy of our proposed approach, showcasing an improvement of 5.84% mIoU over the prior state-of-the-art parameter-efficient methods in modality missing. The source code will be publicly available at https://github.com/RuipingL/MISS.


标题: Pixel-Wise Recognition for Holistic Surgical Scene Understanding

作者: Nicolás Ayobi, Santiago Rodríguez, Alejandra Pérez

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2401.11174v2

Project: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_42|https://ieeexplore.ieee.org/document/10230819|

GitHub: https://github.com/BCV-Uniandes/GraSP|

摘要: This paper presents the Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies (GraSP) dataset, a curated benchmark that models surgical scene understanding as a hierarchy of complementary tasks with varying levels of granularity. Our approach enables a multi-level comprehension of surgical activities, encompassing long-term tasks such as surgical phases and steps recognition and short-term tasks including surgical instrument segmentation and atomic visual actions detection. To exploit our proposed benchmark, we introduce the Transformers for Actions, Phases, Steps, and Instrument Segmentation (TAPIS) model, a general architecture that combines a global video feature extractor with localized region proposals from an instrument segmentation model to tackle the multi-granularity of our benchmark. Through extensive experimentation, we demonstrate the impact of including segmentation annotations in short-term recognition tasks, highlight the varying granularity requirements of each task, and establish TAPIS’s superiority over previously proposed baselines and conventional CNN-based models. Additionally, we validate the robustness of our method across multiple public benchmarks, confirming the reliability and applicability of our dataset. This work represents a significant step forward in Endoscopic Vision, offering a novel and comprehensive framework for future research towards a holistic understanding of surgical procedures.



