[晓理紫]每日论文分享(有中文摘要,源码或项目地址)--大模型、扩散模型、视觉语言导航

专属领域论文订阅

VX关注{晓理紫},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持

如果你感觉对你有所帮助,请关注我,每日准时为你推送最新论文。

为了答谢各位网友的支持,从今日起免费为300名读者提供订阅主题论文服务,只需VX关注公号并回复{邮箱+论文主题}(如:[email protected] + chatgpt@large language model @LLM),主题必须是同一个领域,最多三个关键词。解释权归博主所有


分类:

  • 大语言模型LLM
  • 视觉模型VLM
  • 扩散模型
  • 视觉语言导航VLN
  • 强化学习 RL
  • 模仿学习 IL
  • 机器人
  • 开放词汇,检测分割

== LLM ==

标题: Small LLMs Are Weak Tool Learners: A Multi-LLM Agent

作者: Weizhou Shen, Chenliang Li, Hongzhan Chen

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2401.07324v2

GitHub: https://github.com/X-PLUG/Multi-LLM-Agent

中文摘要: 大型语言模型(LLM)代理显著扩展了独立LLM的功能,使它们能够与外部工具(例如,API、函数)进行交互,并以自我指导的方式完成复杂的任务。工具使用的挑战要求LLMs不仅理解用户查询并生成答案,而且在任务规划、内存管理、工具调用和结果汇总方面表现出色。虽然传统的方法集中于训练具有所有这些能力的单个LLM,但是性能限制变得明显,特别是对于较小的模型。此外,当工具更新时,整个LLM可能需要再培训。为了克服这些挑战,我们提出了一种新颖的策略,将上述功能分解为计划器、调用器和摘要器。每个组件都由一个LLM实现,该LLM专注于特定的功能,并与其他组件协作来完成任务。这种模块化框架有助于单独更新,并可能使用较小的LLM来构建每种能力。为了有效地训练这个框架,我们引入了一个两阶段的训练范式。首先,我们在整个数据集上微调主干LLM,而不区分子任务,为模型提供对任务的全面理解。其次,微调的LLM分别用于实例化计划器、调用器和摘要器,它们在各自的子任务上不断地被微调。跨各种工具使用基准的评估表明,我们提出的多LLM框架超越了传统的单LLM方法,突出了其在工具学习方面的功效和优势。

摘要: Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs, empowering them to interact with external tools (e.g., APIs, functions) and complete complex tasks in a self-directed fashion. The challenge of tool use demands that LLMs not only understand user queries and generate answers but also excel in task planning, memory management, tool invocation, and result summarization. While traditional approaches focus on training a single LLM with all these capabilities, performance limitations become apparent, particularly with smaller models. Moreover, the entire LLM may require retraining when tools are updated. To overcome these challenges, we propose a novel strategy that decomposes the aforementioned capabilities into a planner, caller, and summarizer. Each component is implemented by a single LLM that focuses on a specific capability and collaborates with other components to accomplish the task. This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability. To effectively train this framework, we introduce a two-stage training paradigm. First, we fine-tune a backbone LLM on the entire dataset without discriminating sub-tasks, providing the model with a comprehensive understanding of the task. Second, the fine-tuned LLM is used to instantiate the planner, caller, and summarizer respectively, which are continually fine-tuned on respective sub-tasks. Evaluation across various tool-use benchmarks illustrates that our proposed multi-LLM framework surpasses the traditional single-LLM approach, highlighting its efficacy and advantages in tool learning.
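
下面给出一个极简的 Python 示意代码,说明摘要中描述的"规划器-调用器-摘要器"分工如何协同完成一次工具调用任务。其中 `chat()` 与工具注册表均为假设的占位符,并非论文的官方实现(官方代码见上方 GitHub 链接)。

```python
# Minimal sketch of a planner / caller / summarizer loop (illustrative only).
# `chat(model, prompt)` is a hypothetical stand-in for any LLM inference call.

def chat(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM backend here")

TOOLS = {  # illustrative tool registry, not the paper's actual API
    "weather_api": lambda city: f"Sunny in {city}",
}

def run_agent(user_query: str, max_steps: int = 5) -> str:
    history = [f"User: {user_query}"]
    for _ in range(max_steps):
        # 1) Planner: decide the next sub-task or declare the task finished.
        plan = chat("planner-llm", "\n".join(history) + "\nNext step (or FINISH):")
        history.append(f"Plan: {plan}")
        if plan.strip().startswith("FINISH"):
            break
        # 2) Caller: turn the plan into a concrete tool invocation.
        call = chat("caller-llm", f"Plan: {plan}\nEmit: tool_name|argument")
        name, _, arg = call.partition("|")
        result = TOOLS.get(name.strip(), lambda a: f"unknown tool {name}")(arg.strip())
        history.append(f"Observation: {result}")
    # 3) Summarizer: compose the final answer from the whole trajectory.
    return chat("summarizer-llm", "\n".join(history) + "\nFinal answer:")
```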


标题: Meta Prompting for AGI Systems

作者: Yifan Zhang, Yang Yuan, Andrew Chi-Chih Yao

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2311.11482v4

GitHub: https://github.com/meta-prompting/meta-prompting

中文摘要: 本文介绍了对元提示的全面研究,这是一种创新技术,重塑了大型语言模型(LLMs)、多模态基础模型和人工智能系统在问题解决和数据交互中的应用。基于类型理论和范畴理论,元提示强调信息的结构和语法,而不是传统的以内容为中心的方法。本文探讨了元提示(MP)的正式定义,将其与少样本提示(Few-Shot Prompting)区分开来,并强调了其在各种人工智能应用中的有效性。一个关键的焦点是将元提示应用于复杂推理(MP-CR)任务,展示它如何有效地将复杂的问题解构为更简单的子问题,提高令牌效率,并实现更公平的问题解决比较,特别是与少样本提示方法相比。此外,本文还介绍了用于提示任务本身的元提示,允许LLMs以递归的、类似元编程的方式自行生成新的提示。这种方法标志着人工智能自主和适应能力的重大飞跃。本文还介绍了将元提示集成到多模态基础模型设置中,解决了在结构化元提示框架中整合各种数据类型(如图像、音频和视频)的挑战和机遇。实证实验,包括以100%的成功率解决24点游戏(Game of 24)任务,展示了MP-CR代理增强的推理能力,实现了高准确性和高效率,并展示了元提示对人工智能问题解决的变革性影响。(代码见https://github.com/meta-prompting/meta-prompting)

摘要: This paper presents a comprehensive study of Meta Prompting, an innovative technique reshaping the utilization of large language models (LLMs), multi-modal foundation models, and AI systems in problem-solving and data interaction. Grounded in type theory and category theory, Meta Prompting emphasizes the structure and syntax of information over traditional content-centric methods. The paper explores the formal definitions of Meta Prompting (MP), sets it apart from Few-Shot Prompting, and underlines its effectiveness in various AI applications. A key focus is applying Meta Prompting for complex reasoning (MP-CR) tasks, showing how it effectively deconstructs intricate problems into simpler sub-problems, enhancing token efficiency, and enabling more equitable problem-solving comparisons, especially against few-shot prompting methods. Additionally, the paper introduces Meta Prompting for prompting tasks, allowing LLMs to self-generate new prompts in a recursive, metaprogramming-like manner. This approach marks a significant leap in AI’s autonomous and adaptive capabilities. The paper also introduces the integration of Meta Prompting into multi-modal foundation model settings, tackling the challenges and opportunities of incorporating varied data types such as images, audio, and video within the structured Meta Prompting framework. Empirical experiments, including solving the Game of 24 tasks with 100% success rate, demonstrate the MP-CR Agent’s enhanced reasoning capabilities, achieving high accuracy and efficiency, and showcasing Meta Prompting’s transformative impact on AI problem-solving. (The code is available at https://github.com/meta-prompting/meta-prompting)
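
下面是一个示意性的 Python 草图,演示"先让模型按结构化模板生成任务专用提示、再用该提示求解实例"这一递归式元提示思路。`llm()` 与提示模板均为假设示例,并非论文的实际实现。

```python
# Illustrative two-stage "prompt that writes a prompt" loop in the spirit of
# Meta Prompting; `llm()` is a hypothetical completion function, and the
# template below is an example of a structure-oriented (not content-oriented) scaffold.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM backend")

META_PROMPT = """You are an expert prompt engineer.
Given a task description, output a reusable prompt with this structure:
1. Problem restatement (one sentence)
2. Step-by-step solution plan (numbered)
3. Final answer format: `Answer: <value>`
Task: {task}"""

def meta_solve(task: str, instance: str) -> str:
    # Stage 1: the model generates a structured, task-specific prompt.
    task_prompt = llm(META_PROMPT.format(task=task))
    # Stage 2: the generated prompt is applied to the concrete instance.
    return llm(task_prompt + "\n\nInput: " + instance)

# e.g. meta_solve("Game of 24: combine four numbers with + - * / to reach 24",
#                 "3 3 8 8")
```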


标题: Augmenting Math Word Problems via Iterative Question Composing

作者: Haoxiong Liu, Yifan Zhang, Yifan Luo

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.09003v3

Project: https://huggingface.co/datasets/Vivacem/MMIQC

GitHub: https://github.com/iiis-ai/IterativeQuestionComposing

中文摘要: 尽管用于数学推理的大型语言模型(LLM)取得了进步,但解决竞赛级别的数学问题仍然是一个重大挑战,尤其是对于没有外部工具的开源LLM。我们引入了MMIQC数据集,它由处理过的网络数据和合成的问题-回答对混合组成,旨在增强基础语言模型的数学推理能力。在MMIQC上微调的模型,在各种模型规模下于MATH基准测试上的性能始终超过同类模型。值得注意的是,Qwen-72B-MMIQC实现了45.0%的准确率,比之前的开源最优水平高出8.2%,并超过了2023年发布的初始版本GPT-4。对匈牙利高中期末考试的广泛评估结果表明,这种改进可以推广到未见过的数据。我们对MMIQC的消融研究表明,很大一部分改进可以归因于我们的新增强方法——迭代问题合成(IQC),它使用一个LLM从种子问题迭代合成新问题,并通过另一个LLM进行拒绝采样。MMIQC数据集可在HuggingFace Hub获取:https://huggingface.co/datasets/Vivacem/MMIQC。我们的代码见https://github.com/iiis-ai/IterativeQuestionComposing。

摘要: Despite the advancements in large language models (LLMs) for mathematical reasoning, solving competition-level math problems remains a significant challenge, especially for open-source LLMs without external tools. We introduce the MMIQC dataset, comprising a mixture of processed web data and synthetic question-response pairs, aimed at enhancing the mathematical reasoning capabilities of base language models. Models fine-tuned on MMIQC consistently surpass their counterparts in performance on the MATH benchmark across various model sizes. Notably, Qwen-72B-MMIQC achieves a 45.0% accuracy, exceeding the previous open-source state-of-the-art by 8.2% and outperforming the initial version GPT-4 released in 2023. Extensive evaluation results on Hungarian high school finals suggest that such improvement can generalize to unseen data. Our ablation study on MMIQC reveals that a large part of the improvement can be attributed to our novel augmentation method, Iterative Question Composing (IQC), which involves iteratively composing new questions from seed problems using an LLM and applying rejection sampling through another LLM. The MMIQC dataset is available on the HuggingFace hub at https://huggingface.co/datasets/Vivacem/MMIQC. Our code is available at https://github.com/iiis-ai/IterativeQuestionComposing.
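
下面用一个简化的 Python 草图示意 IQC 的核心流程:由一个 LLM 从种子问题迭代合成新问题,再用另一个 LLM 做拒绝采样式的过滤。其中 `llm()`、提示词与基于多数一致性的过滤规则均为假设的示意写法,细节与作者的实际流程可能不同。

```python
# Sketch of Iterative Question Composing with rejection sampling; `llm()` is a
# hypothetical completion call and the consistency filter is illustrative.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM backend")

def iqc(seed_question: str, iterations: int = 3, samples: int = 8):
    pairs, question = [], seed_question
    for _ in range(iterations):
        # Compose a new problem that builds on the current one.
        question = llm(f"Compose a new math problem that builds on:\n{question}")
        # Rejection sampling: keep an answer only if a majority of sampled
        # solutions from a second model agree on the final result.
        candidates = [llm(f"Solve step by step, end with 'Answer: ...':\n{question}")
                      for _ in range(samples)]
        finals = [c.rsplit("Answer:", 1)[-1].strip() for c in candidates]
        best = max(set(finals), key=finals.count)
        if finals.count(best) >= samples // 2:      # crude consistency filter
            kept = [c for c, f in zip(candidates, finals) if f == best][0]
            pairs.append((question, kept))
    return pairs
```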


标题: Engineering A Large Language Model From Scratch

作者: Abiodun Finbarrs Oketunji

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2401.16736v2

中文摘要: 自然语言处理(NLP)中深度学习的普及,催生并发布了能够以非凡的熟练程度理解和生成人类语言的创新技术。Atinuke是一种基于Transformer的神经网络,通过利用独特的配置来优化各种语言任务的性能。该体系结构将处理顺序数据的层与注意力机制交织在一起,以建立输入和输出之间有意义的关联。由于其拓扑结构配置和超参数调优,它可以通过提取特征和学习复杂映射来模拟类人语言。Atinuke是模块化、可扩展的,并能与现有的机器学习管道无缝集成。softmax、嵌入和多头注意力等高级矩阵运算能够细致入微地处理文本、声音和视觉信号。通过将现代深度学习技术与软件设计原则和数学理论相结合,该系统在自然语言任务上实现了最先进的结果,同时保持了可解释性和鲁棒性。

摘要: The proliferation of deep learning in natural language processing (NLP) has led to the development and release of innovative technologies capable of understanding and generating human language with remarkable proficiency. Atinuke, a Transformer-based neural network, optimises performance across various language tasks by utilising a unique configuration. The architecture interweaves layers for processing sequential data with attention mechanisms to draw meaningful affinities between inputs and outputs. Due to the configuration of its topology and hyperparameter tuning, it can emulate human-like language by extracting features and learning complex mappings. Atinuke is modular, extensible, and integrates seamlessly with existing machine learning pipelines. Advanced matrix operations like softmax, embeddings, and multi-head attention enable nuanced handling of textual, acoustic, and visual signals. By unifying modern deep learning techniques with software design principles and mathematical theory, the system achieves state-of-the-art results on natural language tasks whilst remaining interpretable and robust.


标题: Hierarchical Continual Reinforcement Learning via Large Language Model

作者: Chaofan Pan, Xin Yang, Hao Wang

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2401.15098v2

中文摘要: 在动态环境中持续学习的能力是强化学习(RL)智能体应用于现实世界的关键要求。尽管在持续强化学习(CRL)方面取得了进展,但现有方法经常受到知识迁移不足的影响,特别是当任务多样化时。为了应对这一挑战,我们提出了一个新的框架——基于大型语言模型的分层持续强化学习(Hi-Core),旨在促进高层知识的迁移。Hi-Core采用双层结构:由大型语言模型(LLM)进行高层策略制定,生成一系列目标;低层策略学习则与面向目标的RL实践紧密结合,生成智能体为达成既定目标所执行的动作。该框架利用反馈来迭代调整和验证高层策略,并将它们与低层策略一起存储在技能库中。当遇到新任务时,Hi-Core从该库中检索相关经验以辅助学习。在Minigrid上的实验表明,Hi-Core能够有效处理多样化的CRL任务,其性能优于流行的基线方法。

摘要: The ability to learn continuously in dynamic environments is a crucial requirement for reinforcement learning (RL) agents applied in the real world. Despite the progress in continual reinforcement learning (CRL), existing methods often suffer from insufficient knowledge transfer, particularly when the tasks are diverse. To address this challenge, we propose a new framework, Hierarchical Continual reinforcement learning via large language model (Hi-Core), designed to facilitate the transfer of high-level knowledge. Hi-Core orchestrates a two-layer structure: high-level policy formulation by a large language model (LLM), which generates a sequence of goals, and low-level policy learning that closely aligns with goal-oriented RL practices, producing the agent’s actions in response to the goals set forth. The framework employs feedback to iteratively adjust and verify high-level policies, storing them along with low-level policies within a skill library. When encountering a new task, Hi-Core retrieves relevant experience from this library to aid learning. Through experiments on Minigrid, Hi-Core has demonstrated its effectiveness in handling diverse CRL tasks, outperforming popular baselines.
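
下面是对上述双层结构的一个粗略 Python 示意:LLM 生成目标序列,目标条件化的低层策略执行动作,经验存入技能库供新任务检索。`llm()`、环境接口与检索方式均为假设的占位符,并非论文的实际实现。

```python
# Rough sketch of an LLM-driven high-level / goal-conditioned low-level loop
# with a skill library; all callables and the env API are hypothetical.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM backend")

skill_library = []   # list of (task_description, goals, low_level_policy)

def retrieve(task: str):
    # Toy retrieval: largest shared-word overlap with stored task descriptions.
    def overlap(entry):
        return len(set(task.split()) & set(entry[0].split()))
    return max(skill_library, key=overlap, default=None)

def solve_task(env, task: str, low_level_policy):
    prior = retrieve(task)
    hint = f"\nRelated experience: {prior[1]}" if prior else ""
    goals = llm(f"Decompose the task into goals, one per line:\n{task}{hint}").splitlines()
    for goal in goals:
        obs = env.reset_goal(goal)            # hypothetical goal-conditioned env API
        done = False
        while not done:
            obs, done = env.step(low_level_policy(obs, goal))
    skill_library.append((task, goals, low_level_policy))
```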


标题: Object-Centric Instruction Augmentation for Robotic Manipulation

作者: Junjie Wen, Yichen Zhu, Minjie Zhu

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2401.02814v2

摘要: Humans interpret scenes by recognizing both the identities and positions of objects in their observations. For a robot to perform tasks such as "pick and place", understanding both what the objects are and where they are located is crucial. While the former has been extensively discussed in the literature that uses the large language model to enrich the text descriptions, the latter remains underexplored. In this work, we introduce the Object-Centric Instruction Augmentation (OCI) framework to augment highly semantic and information-dense language instruction with position cues. We utilize a Multi-modal Large Language Model (MLLM) to weave knowledge of object locations into natural language instruction, thus aiding the policy network in mastering actions for versatile manipulation. Additionally, we present a feature reuse mechanism to integrate the vision-language features from off-the-shelf pre-trained MLLM into policy networks. Through a series of simulated and real-world robotic tasks, we demonstrate that robotic manipulator imitation policies trained with our enhanced instructions outperform those relying solely on traditional language instructions.
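
下面的简短 Python 草图示意"把物体位置信息织入语言指令"的基本做法,其中 `detect_objects()` 为假设的检测器/MLLM 占位接口,拼接格式也仅作示例,并非论文的具体实现。

```python
# Minimal sketch of object-centric instruction augmentation: positions of the
# detected objects are woven into the language instruction before it reaches
# the policy network. The detector call is a hypothetical placeholder.

def detect_objects(image):
    """Hypothetical detector / MLLM call returning (label, (x, y)) pairs."""
    raise NotImplementedError

def augment_instruction(instruction: str, image) -> str:
    cues = [f"{label} at ({x}, {y})" for label, (x, y) in detect_objects(image)]
    return instruction + " Scene objects: " + "; ".join(cues) + "."

# e.g. "pick up the red mug"  ->
#      "pick up the red mug Scene objects: red mug at (120, 85); bowl at (300, 210)."
```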


== VLM ==

标题: Language-Conditioned Robotic Manipulation with Fast and Slow Thinking

作者: Minjie Zhu, Yichen Zhu, Jinming Li

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2401.04181v2

Project: https://jlm-z.github.io/RSFT/

中文摘要: 语言条件机器人操作旨在将自然语言指令转换为可执行的动作,从简单的拾取和放置到需要意图识别和视觉推理的任务。受认知科学中的双重过程理论的启发,该理论提出了人类决策中快速和慢速思维的两个平行系统,我们引入了快速和慢速思维机器人(RFST),这是一个模仿人类认知架构对任务进行分类并根据指令类型在两个系统上做出决策的框架。我们的RFST由两个关键组件组成:1)根据当前用户指令确定应该激活哪个系统的指令鉴别器,以及2)由与策略网络一致的微调视觉语言模型组成的慢速思考系统,该系统允许机器人识别用户意图或执行推理任务。为了评估我们的方法,我们建立了一个以真实世界轨迹为特色的数据集,捕捉从自发冲动到需要深思熟虑的任务的各种行为。我们在模拟和真实世界场景中的结果证实,我们的方法能够熟练地管理需要意图识别和推理的复杂任务。该项目可在https://jlm-z.github.io/RSFT/

摘要: The language-conditioned robotic manipulation aims to transfer natural language instructions into executable actions, from simple pick-and-place to tasks requiring intent recognition and visual reasoning. Inspired by the dual process theory in cognitive science, which suggests two parallel systems of fast and slow thinking in human decision-making, we introduce Robotics with Fast and Slow Thinking (RFST), a framework that mimics human cognitive architecture to classify tasks and makes decisions on two systems based on instruction types. Our RFST consists of two key components: 1) an instruction discriminator to determine which system should be activated based on the current user instruction, and 2) a slow-thinking system that is comprised of a fine-tuned vision language model aligned with the policy networks, which allows the robot to recognize user intention or perform reasoning tasks. To assess our methodology, we built a dataset featuring real-world trajectories, capturing actions ranging from spontaneous impulses to tasks requiring deliberate contemplation. Our results, both in simulation and real-world scenarios, confirm that our approach adeptly manages intricate tasks that demand intent recognition and reasoning. The project is available at https://jlm-z.github.io/RSFT/
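
下面的 Python 草图示意快/慢系统路由的基本思路:由一个判别器决定指令走快系统(直接策略)还是慢系统(先经视觉语言模型推理)。判别规则与各回调函数均为假设示例,并非 RFST 的实际实现。

```python
# Sketch of the fast/slow routing idea: a discriminator decides whether an
# instruction needs deliberate reasoning (slow system with a VLM) or can go
# straight to the policy (fast system). All callables here are placeholders.

def discriminator(instruction: str) -> str:
    # Could itself be a small classifier or an LLM; here a trivial heuristic.
    keywords = ("why", "which", "if", "reason", "figure out")
    return "slow" if any(k in instruction.lower() for k in keywords) else "fast"

def act(instruction: str, observation, fast_policy, slow_vlm, slow_policy):
    if discriminator(instruction) == "fast":
        return fast_policy(observation, instruction)
    # Slow path: the VLM first resolves intent into an explicit goal.
    resolved_goal = slow_vlm(observation, instruction)
    return slow_policy(observation, resolved_goal)
```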


标题: MPTQ-ViT: Mixed-Precision Post-Training Quantization for Vision Transformer

作者: Yu-Shan Tai, An-Yeu Wu

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2401.14895v2

摘要: While vision transformers (ViTs) have shown great potential in computer vision tasks, their intense computation and memory requirements pose challenges for practical applications. Existing post-training quantization methods leverage value redistribution or specialized quantizers to address the non-normal distribution in ViTs. However, without considering the asymmetry in activations and relying on hand-crafted settings, these methods often struggle to maintain performance under low-bit quantization. To overcome these challenges, we introduce SmoothQuant with bias term (SQ-b) to alleviate the asymmetry issue and reduce the clamping loss. We also introduce optimal scaling factor ratio search (OPT-m) to determine quantization parameters by a data-dependent mechanism automatically. To further enhance the compressibility, we incorporate the above-mentioned techniques and propose a mixed-precision post-training quantization framework for vision transformers (MPTQ-ViT). We develop greedy mixed-precision quantization (Greedy MP) to allocate layer-wise bit-width considering both model performance and compressibility. Our experiments on ViT, DeiT, and Swin demonstrate significant accuracy improvements compared with SOTA on the ImageNet dataset. Specifically, our proposed methods achieve accuracy improvements ranging from 0.90% to 23.35% on 4-bit ViTs with single-precision and from 3.82% to 78.14% on 5-bit fully quantized ViTs with mixed-precision.
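
下面给出一个与 Greedy MP 思路相仿的贪心逐层位宽分配玩具示例:从最高精度出发,每次降低对代理指标影响最小的那一层的位宽,直到满足模型大小预算。`evaluate()`、层参数量与位宽选项均为假设输入,并非论文的实际算法细节。

```python
# Toy greedy layer-wise bit-width allocation in the spirit of "Greedy MP":
# lower the bit width of whichever layer hurts a proxy metric least until a
# size budget is met. `layers` maps layer name -> parameter count (assumed),
# `evaluate(assignment)` returns a proxy accuracy for a bit-width assignment.

def greedy_mixed_precision(layers, evaluate, bit_options=(8, 6, 5, 4), budget_bits=None):
    assign = {name: bit_options[0] for name in layers}          # start at 8 bit
    def total_bits():
        return sum(layers[n] * assign[n] for n in layers)       # params * bits
    while budget_bits is not None and total_bits() > budget_bits:
        best = None
        for name in layers:
            lower = [b for b in bit_options if b < assign[name]]
            if not lower:
                continue
            trial = dict(assign, **{name: lower[0]})             # try next lower bit width
            score = evaluate(trial)                              # proxy accuracy
            if best is None or score > best[0]:
                best = (score, name, lower[0])
        if best is None:
            break                                                # nothing left to lower
        _, name, bits = best
        assign[name] = bits
    return assign
```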


标题: What Do Self-Supervised Speech Models Know About Words?

作者: Ankita Pasad, Chung-Ming Chien, Shane Settle

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2307.00162v3

中文摘要: 在过去几年中引入了许多自监督语音模型(S3Ms),提高了各种语音任务的性能和数据效率。然而,仅凭这些经验上的成功并不能完整地描述模型在预训练中学到了什么。最近的工作已经开始分析S3Ms如何编码某些属性,如音素和说话人信息,但我们仍然缺乏对单词级及更高层级所编码知识的充分理解。在这项工作中,我们使用轻量级分析方法来研究S3Ms中编码的片段级语言属性——单词身份、边界、发音、句法特征和语义特征。我们对来自10个S3Ms的逐层表示进行了比较研究,发现(i)每个词段内的帧级表示并不都具有相同的信息量,以及(ii)预训练目标和模型大小严重影响语言信息在各层之间的可访问性和分布。我们还发现,在几项任务上——单词辨别、单词分割和语义句子相似性——以视觉接地(visual grounding)方式训练的S3Ms优于纯语音版本。最后,我们基于任务的分析表明,在单词分割和声学单词辨别上,使用比先前工作更简单的方法即可获得更好的性能。

摘要: Many self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks. However, these empirical successes alone do not give a complete picture of what is learned during pre-training. Recent work has begun analyzing how S3Ms encode certain properties, such as phonetic and speaker information, but we still lack a proper understanding of knowledge encoded at the word level and beyond. In this work, we use lightweight analysis methods to study segment-level linguistic properties – word identity, boundaries, pronunciation, syntactic features, and semantic features – encoded in S3Ms. We present a comparative study of layer-wise representations from ten S3Ms and find that (i) the frame-level representations within each word segment are not all equally informative, and (ii) the pre-training objective and model size heavily influence the accessibility and distribution of linguistic information across layers. We also find that on several tasks – word discrimination, word segmentation, and semantic sentence similarity – S3Ms trained with visual grounding outperform their speech-only counterparts. Finally, our task-based analyses demonstrate improved performance on word segmentation and acoustic word discrimination while using simpler methods than prior work.
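
下面是一个可运行的小例子,示意摘要所说的轻量级探针分析:把词段内的帧级特征做平均池化,再用线性探针预测词级属性。特征与标注均为随机生成的玩具数据,仅用于说明流程,并非论文的真实实验设置。

```python
# A minimal example of segment-level probing: frame features inside each word
# segment are mean-pooled and a linear probe predicts a word-level property.
# The "features" here are random toy data standing in for one S3M layer.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pool_segments(frames: np.ndarray, segments):
    """frames: (T, D) layer activations; segments: list of (start, end, label)."""
    X = np.stack([frames[s:e].mean(axis=0) for s, e, _ in segments])
    y = np.array([label for _, _, label in segments])
    return X, y

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 768))                      # 200 frames, 768-dim
segments = [(i * 20, i * 20 + 20, i % 2) for i in range(10)]  # toy word segments

X, y = pool_segments(frames, segments)
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy on the toy data:", probe.score(X, y))
```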


标题: Towards Few-shot Out-of-Distribution Detection

作者: Jiuqing Dong, Yongbin Gao, Heng Zhou

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2311.12076v3

中文摘要: 分布外(OOD)检测对于确保开放世界智能系统的可靠性至关重要。尽管现有的OOD检测方法取得了显著进展,但我们的研究发现,在训练样本稀缺的情况下,其性能会显著下降。在这种情况下,我们提出了一个新的少样本OOD检测基准,经过精心构建以填补这一空白。我们的实证分析表明,在少样本OOD检测任务中,参数高效微调(PEFT)策略(如视觉提示微调和视觉适配器微调)优于传统技术(包括全量微调和线性探测)。考虑到预训练模型中一些对OOD检测至关重要的信息可能在微调过程中丢失,我们提出了一种称为领域特定与通用知识融合(DSGF)的方法。该方法旨在与不同的微调框架兼容。我们的实验表明,集成DSGF可以显著增强各种方法和微调方式(包括全量微调、视觉适配器微调和视觉提示微调)下的少样本OOD检测能力。代码将开源发布。

摘要: Out-of-distribution (OOD) detection is critical for ensuring the reliability of open-world intelligent systems. Despite the notable advancements in existing OOD detection methodologies, our study identifies a significant performance drop under the scarcity of training samples. In this context, we introduce a novel few-shot OOD detection benchmark, carefully constructed to address this gap. Our empirical analysis reveals the superiority of Parameter-Efficient Fine-Tuning (PEFT) strategies, such as visual prompt tuning and visual adapter tuning, over conventional techniques, including fully fine-tuning and linear probing tuning in the few-shot OOD detection task. Recognizing some crucial information from the pre-trained model, which is pivotal for OOD detection, may be lost during the fine-tuning process, we propose a method termed Domain-Specific and General Knowledge Fusion (DSGF). This approach is designed to be compatible with diverse fine-tuning frameworks. Our experiments show that the integration of DSGF significantly enhances the few-shot OOD detection capabilities across various methods and fine-tuning methodologies, including fully fine-tuning, visual adapter tuning, and visual prompt tuning. The code will be released.
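
DSGF 的具体融合方式摘要中未给出细节;下面仅用一个假设性的 Python 草图示意"融合预训练通用特征与微调后领域特征,并据此计算 OOD 分数"的一般思路,与作者的真实方法可能不同。

```python
# Hedged sketch of fusing general (pre-trained) and domain-specific
# (fine-tuned) features for OOD scoring; the concrete fusion and score used
# by DSGF may differ -- this is only an illustration.
import numpy as np

def fuse(general_feat: np.ndarray, specific_feat: np.ndarray) -> np.ndarray:
    # Simple fusion by L2-normalising each view and concatenating.
    g = general_feat / (np.linalg.norm(general_feat, axis=-1, keepdims=True) + 1e-8)
    s = specific_feat / (np.linalg.norm(specific_feat, axis=-1, keepdims=True) + 1e-8)
    return np.concatenate([g, s], axis=-1)

def ood_score(x_fused: np.ndarray, class_prototypes: np.ndarray) -> np.ndarray:
    # Higher score = more likely out-of-distribution (far from every prototype).
    dists = np.linalg.norm(x_fused[:, None, :] - class_prototypes[None], axis=-1)
    return dists.min(axis=1)

# Usage idea: prototypes are mean fused features of the few-shot ID classes;
# test samples whose score exceeds a validation threshold are flagged as OOD.
```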


标题: Image Translation as Diffusion Visual Programmers

作者: Cheng Han, James C. Liang, Qifan Wang

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.09742v2

中文摘要: 我们介绍了新颖的扩散视觉编程器(DVP),一个神经符号图像翻译框架。我们提出的DVP在GPT架构中无缝嵌入了条件灵活的扩散模型,为各种亲符号步骤编排了一系列连贯的视觉程序(即计算机视觉模型),这些步骤涵盖RoI识别、风格迁移和位置操纵,促进了透明且可控的图像翻译过程。大量的实验证明了DVP的卓越性能,超越了同期方法。这一成功可以归因于DVP的几个关键特征:首先,DVP通过实例归一化实现了条件灵活的翻译,使模型能够消除由手动指导引起的敏感性,并最优地专注于文本描述,以生成高质量的内容。第二,该框架通过将特征空间中复杂的高维概念解译为更容易访问的低维符号(例如,【提示】、【RoI对象】)来增强上下文内推理,支持局部化的、上下文无关的编辑,同时保持整体一致性。最后但同样重要的是,DVP通过在每个编程阶段提供明确的符号表示,使用户能够直观地解释和修改结果,从而提高了系统的可控性和可解释性。我们的研究标志着向协调人工图像翻译过程与认知智能迈出了实质性的一步,有望获得更广泛的应用。

摘要: We introduce the novel Diffusion Visual Programmer (DVP), a neuro-symbolic image translation framework. Our proposed DVP seamlessly embeds a condition-flexible diffusion model within the GPT architecture, orchestrating a coherent sequence of visual programs (i.e., computer vision models) for various pro-symbolic steps, which span RoI identification, style transfer, and position manipulation, facilitating transparent and controllable image translation processes. Extensive experiments demonstrate DVP’s remarkable performance, surpassing concurrent arts. This success can be attributed to several key features of DVP: First, DVP achieves condition-flexible translation via instance normalization, enabling the model to eliminate sensitivity caused by the manual guidance and optimally focus on textual descriptions for high-quality content generation. Second, the framework enhances in-context reasoning by deciphering intricate high-dimensional concepts in feature spaces into more accessible low-dimensional symbols (e.g., [Prompt], [RoI object]), allowing for localized, context-free editing while maintaining overall coherence. Last but not least, DVP improves systemic controllability and explainability by offering explicit symbolic representations at each programming stage, empowering users to intuitively interpret and modify results. Our research marks a substantial step towards harmonizing artificial image translation processes with cognitive intelligence, promising broader applications.


标题: SAM-based instance segmentation models for the automation of structural damage detection

作者: Zehao Ye, Lucy Lovell, Asaad Faramarzi

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.15266v2

中文摘要: 基于土木结构外观捕捉缺陷的视觉检测目前仍是劳动密集且耗时的工作,因此实现其自动化至关重要。自动检测的一个重要环节是图像采集;考虑到近年来软硬件计算的普遍发展,图像采集已变得快速且经济。以前的研究主要集中在混凝土和沥青上,很少关注砖石裂缝,后者也缺乏公开可用的数据集。在本文中,我们首先提出了一个用于实例分割的数据集MCrack1300,包含1,300张标注图像(640像素×640像素),涵盖砖块、碎砖和裂缝。然后,我们测试了若干领先算法作为基准,包括最新的大规模模型——基于提示的Segment Anything Model(SAM)。我们使用低秩自适应(LoRA)对编码器进行微调,并提出了两种自动化SAM执行的新方法。第一种方法放弃提示编码器并将SAM编码器连接到其他解码器,第二种方法则引入了可学习的自生成提示器。为了保证这两种方法与SAM编码器部分的无缝集成,我们重新设计了特征提取器。两种方法均超越了最先进的性能,在所有类别上比最佳基准高约3%,在裂缝类别上高约6%。在成功检测的基础上,我们提出了一种基于单目相机和霍夫线变换的方法,自动将图像转换成正射投影图。通过结合砖单元的已知真实尺寸,我们精确地估计裂缝尺寸,结果与激光扫描获得的结果相差不到10%。总的来说,我们填补了自动化砌体裂缝检测和尺寸估计方面的重要研究空白。

摘要: Automating visual inspection for capturing defects based on civil structures appearance is crucial due to its currently labour-intensive and time-consuming nature. An important aspect of automated inspection is image acquisition, which is rapid and cost-effective considering the pervasive developments in both software and hardware computing in recent years. Previous studies largely focused on concrete and asphalt, with less attention to masonry cracks. The latter also lacks publicly available datasets. In this paper, we first present a corresponding data set for instance segmentation with 1,300 annotated images (640 pixels x 640 pixels), named MCrack1300, covering bricks, broken bricks, and cracks. We then test several leading algorithms for benchmarking, including the latest large-scale model, the prompt-based Segment Anything Model (SAM). We fine-tune the encoder using Low-Rank Adaptation (LoRA) and propose two novel methods for automation of SAM execution. The first method involves abandoning the prompt encoder and connecting the SAM encoder to other decoders, while the second method introduces a learnable self-generating prompter. In order to ensure the seamless integration of the two proposed methods with the SAM encoder, we redesign the feature extractor. Both proposed methods exceed state-of-the-art performance, surpassing the best benchmark by approximately 3% for all classes and around 6% for cracks specifically. Based on successful detection, we propose a method based on a monocular camera and the Hough Line Transform to automatically transform images into orthographic projection maps. By incorporating known real sizes of brick units, we accurately estimate crack dimensions, with the results differing by less than 10% from those obtained by laser scanning. Overall, we address important research gaps in automated masonry crack detection and size estimation.
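
下面用一个简化的 Python 片段示意最后的尺寸估计一步:在图像被矫正为正射投影后,利用砖块的已知真实尺寸得到像素-毫米比例,再换算裂缝尺寸。其中 215 毫米的砖长只是示例假设(常见英制标准砖长),单目相机加霍夫线变换的矫正过程此处未展示。

```python
# Simplified illustration of the final measurement step: once the image has
# been rectified to an orthographic view, a known real brick dimension gives a
# pixel-to-millimetre scale that converts crack measurements.
def crack_size_mm(crack_length_px: float, crack_width_px: float,
                  brick_length_px: float, brick_length_mm: float = 215.0):
    # 215 mm is a common UK standard brick length -- an assumption for this
    # example, not a value taken from the paper.
    mm_per_px = brick_length_mm / brick_length_px
    return crack_length_px * mm_per_px, crack_width_px * mm_per_px

# e.g. crack_size_mm(480, 6, brick_length_px=360) -> (~287 mm, ~3.6 mm)
```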


== diffusion model ==

标题: BlockFusion: Expandable 3D Scene Generation using Latent Tri-plane Extrapolation

作者: Zhennan Wu, Yang Li, Han Yan

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2401.17053v2

Project: https://www.youtube.com/watch?v=PxIBtd6G0mA

中文摘要: 我们提出了BlockFusion,这是一个基于扩散的模型,它以单元块的形式生成3D场景,并能无缝地合并新块来扩展场景。BlockFusion使用从完整3D场景网格中随机裁剪的3D块数据集进行训练。通过逐块拟合,所有训练块被转换成混合神经场:由包含几何特征的三平面,以及随后用于解码有符号距离值的多层感知器(MLP)组成。采用变分自编码器将三平面压缩到潜在三平面空间中,并在该空间上进行去噪扩散过程。应用于潜在表示的扩散允许高质量和多样化的3D场景生成。为了在生成过程中扩展场景,只需添加与当前场景重叠的空块,并外推现有的潜在三平面以填充新块。外推是通过在去噪迭代期间用来自重叠三平面的特征样本调节生成过程来完成的。潜在三平面外推产生语义和几何上有意义的过渡,与现有场景和谐融合。2D布局调节机制用于控制场景元素的放置和排列。实验结果表明,BlockFusion能够在室内和室外场景中生成多样、几何一致且无界的大型3D场景,并具有前所未有的高质量形状。

摘要: We present BlockFusion, a diffusion-based model that generates 3D scenes as unit blocks and seamlessly incorporates new blocks to extend the scene. BlockFusion is trained using datasets of 3D blocks that are randomly cropped from complete 3D scene meshes. Through per-block fitting, all training blocks are converted into the hybrid neural fields: with a tri-plane containing the geometry features, followed by a Multi-layer Perceptron (MLP) for decoding the signed distance values. A variational auto-encoder is employed to compress the tri-planes into the latent tri-plane space, on which the denoising diffusion process is performed. Diffusion applied to the latent representations allows for high-quality and diverse 3D scene generation. To expand a scene during generation, one needs only to append empty blocks to overlap with the current scene and extrapolate existing latent tri-planes to populate new blocks. The extrapolation is done by conditioning the generation process with the feature samples from the overlapping tri-planes during the denoising iterations. Latent tri-plane extrapolation produces semantically and geometrically meaningful transitions that harmoniously blend with the existing scene. A 2D layout conditioning mechanism is used to control the placement and arrangement of scene elements. Experimental results indicate that BlockFusion is capable of generating diverse, geometrically consistent and unbounded large 3D scenes with unprecedented high-quality shapes in both indoor and outdoor scenarios.


标题: Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance

作者: Qingcheng Zhao, Pengyu Long, Qixuan Zhang

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.15687v2

Project: https://sites.google.com/view/media2face

中文摘要: 从语音合成3D面部动画已经获得了相当大的关注。由于缺乏高质量的4D面部数据和标注良好的丰富多模态标签,以前的方法往往真实感有限,且缺乏灵活的条件控制。我们通过三部曲来应对这一挑战。我们首先介绍了广义神经参数面部资产(GNPFA),这是一种高效的变分自编码器,将面部几何和图像映射到高度泛化的表情潜在空间,解耦表情和身份。然后,我们利用GNPFA从大量视频中提取高质量的表情和准确的头部姿势,由此构建了M2F-D数据集——一个大型、多样化、扫描级的协同语音3D面部动画数据集,带有良好标注的情感和风格标签。最后,我们提出了Media2Face,这是一个在GNPFA潜在空间中用于协同语音面部动画生成的扩散模型,接受来自音频、文本和图像的丰富多模态指导。大量实验表明,该模型不仅实现了高保真的面部动画合成,而且拓宽了三维面部动画的表现力范围和风格适应性。

摘要: The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and well-annotated abundant multi-modality labels, previous methods often suffer from limited realism and a lack of flexible conditioning. We address this challenge through a trilogy. We first introduce Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space, decoupling expressions and identities. Then, we utilize GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. This presents the M2F-D dataset, a large, diverse, and scan-level co-speech 3D facial animation dataset with well-annotated emotional and style labels. Finally, we propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation, accepting rich multi-modality guidances from audio, text, and image. Extensive experiments demonstrate that our model not only achieves high fidelity in facial animation synthesis but also broadens the scope of expressiveness and style adaptability in 3D facial animation.


标题: Diffusion Model Conditioning on Gaussian Mixture Model and Negative Gaussian Mixture Gradient

作者: Weiguo Lu, Xuan Wu, Deng Ding

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2401.11261v2

中文摘要: 扩散模型(DMs)是一种生成模型,对图像合成和其他方面有着巨大的影响。它们在各种生成任务中实现最先进的生成结果。各种各样的条件输入,如文本或边界框,都可以用来控制生成。在这项工作中,我们提出了一种利用高斯混合模型(GMM)作为特征条件来指导去噪过程的条件机制。基于集合论,我们提供了一个全面的理论分析,表明基于特征和类的条件潜在分布是显著不同的,因此特征上的条件潜在分布比基于类的条件潜在分布产生更少的缺陷生成。分别训练了基于高斯混合模型的两个扩散模型进行比较。实验支持我们的发现。提出了一种新的梯度函数,称为负高斯混合梯度(NGMG),并将其应用于带有附加分类器的扩散模型训练中。训练稳定性提高了。我们还从理论上证明,当学习低维流形支持的分布时,NGMG作为一个更合理的成本函数,与推土机距离(Wasserstein)具有相同的优势。

摘要: Diffusion models (DMs) are a type of generative model that has a huge impact on image synthesis and beyond. They achieve state-of-the-art generation results in various generative tasks. A great diversity of conditioning inputs, such as text or bounding boxes, are accessible to control the generation. In this work, we propose a conditioning mechanism utilizing Gaussian mixture models (GMMs) as feature conditioning to guide the denoising process. Based on set theory, we provide a comprehensive theoretical analysis that shows that conditional latent distribution based on features and classes is significantly different, so that conditional latent distribution on features produces fewer defect generations than conditioning on classes. Two diffusion models conditioned on the Gaussian mixture model are trained separately for comparison. Experiments support our findings. A novel gradient function called the negative Gaussian mixture gradient (NGMG) is proposed and applied in diffusion model training with an additional classifier. Training stability has improved. We also theoretically prove that NGMG shares the same benefit as the Earth Mover distance (Wasserstein) as a more sensible cost function when learning distributions supported by low-dimensional manifolds.


标题: Wind speed super-resolution and validation: from ERA5 to CERRA via diffusion models

作者: Fabio Merizzi, Andrea Asperti, Stefano Colamonaco

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2401.15469v2

中文摘要: 哥白尼欧洲区域再分析(CERRA)是欧洲区域的高分辨率区域再分析数据集。近年来,它在各种与气候相关的任务中显示出重要的效用,从预报和气候变化研究到可再生能源预测、资源管理、空气质量风险评估和罕见事件的预报等。遗憾的是,由于获取所需外部数据的限制以及生成CERRA固有的巨大计算需求,CERRA数据的发布滞后于当前日期约两年。作为解决方案,本文介绍了一种新方法,使用扩散模型以数据驱动的方式近似CERRA降尺度,而无需额外的信息。通过利用为CERRA提供边界条件的低分辨率ERA5数据集,我们将其作为超分辨率任务来处理。我们聚焦于意大利周边的风速,基于现有CERRA数据训练的模型取得了有希望的结果,与原始CERRA数据高度吻合。与现场观测的对比验证进一步证实了模型在逼近地面测量方面的准确性。

摘要: The Copernicus Regional Reanalysis for Europe, CERRA, is a high-resolution regional reanalysis dataset for the European domain. In recent years it has shown significant utility across various climate-related tasks, ranging from forecasting and climate change research to renewable energy prediction, resource management, air quality risk assessment, and the forecasting of rare events, among others. Unfortunately, the availability of CERRA is lagging two years behind the current date, due to constraints in acquiring the requisite external data and the intensive computational demands inherent in its generation. As a solution, this paper introduces a novel method using diffusion models to approximate CERRA downscaling in a data-driven manner, without additional information. By leveraging the lower resolution ERA5 dataset, which provides boundary conditions for CERRA, we approach this as a super-resolution task. Focusing on wind speed around Italy, our model, trained on existing CERRA data, shows promising results, closely mirroring original CERRA data. Validation with in-situ observations further confirms the model’s accuracy in approximating ground measurements.
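
下面是一个示意性的 PyTorch 训练步骤,展示"以上采样后的低分辨率 ERA5 场为条件、对高分辨率 CERRA 场做去噪扩散"这一超分辨率思路。网络结构、噪声调度与数据形状均为占位假设,并非作者的实际配置。

```python
# Schematic training step for a conditional denoising diffusion model that
# maps low-resolution ERA5 wind fields to high-resolution CERRA-like fields.
# The model, shapes and schedule are illustrative placeholders.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, optimizer, hr_wind, lr_wind):
    """hr_wind: (B,1,H,W) CERRA patch; lr_wind: (B,1,h,w) matching ERA5 patch."""
    B = hr_wind.shape[0]
    t = torch.randint(0, T, (B,), device=hr_wind.device)
    noise = torch.randn_like(hr_wind)
    a = alphas_cumprod.to(hr_wind.device)[t].view(B, 1, 1, 1)
    noisy = a.sqrt() * hr_wind + (1 - a).sqrt() * noise
    # Condition by concatenating the upsampled low-resolution field.
    cond = F.interpolate(lr_wind, size=hr_wind.shape[-2:], mode="bilinear",
                         align_corners=False)
    pred = model(torch.cat([noisy, cond], dim=1), t)   # model predicts the noise
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```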


标题: Generative Design of Crystal Structures by Point Cloud Representations and Diffusion Model

作者: Zhelin Li, Rami Mrad, Runxian Jiao

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2401.13192v2

中文摘要: 有效地产生能量稳定的晶体结构长期以来一直是材料设计中的一个挑战,这主要是由于晶格中原子的巨大排列。为了促进稳定材料的发现,我们提出了一个生成可合成材料的框架,利用点云表示来编码复杂的结构信息。这个框架的核心是引入一个扩散模型作为其基础支柱。为了衡量我们的方法的有效性,我们使用它从我们的训练数据集重建输入结构,严格验证其高重建性能。此外,我们通过产生全新的材料,强调它们的可合成性,展示了基于点云的晶体扩散(PCCD)的巨大潜力。我们的研究通过创成式设计的前沿途径,而不是传统的替代或基于经验的发现,为材料设计和合成的进步做出了显著的贡献。

摘要: Efficiently generating energetically stable crystal structures has long been a challenge in material design, primarily due to the immense arrangement of atoms in a crystal lattice. To facilitate the discovery of stable material, we present a framework for the generation of synthesizable materials, leveraging a point cloud representation to encode intricate structural information. At the heart of this framework lies the introduction of a diffusion model as its foundational pillar. To gauge the efficacy of our approach, we employ it to reconstruct input structures from our training datasets, rigorously validating its high reconstruction performance. Furthermore, we demonstrate the profound potential of Point Cloud-Based Crystal Diffusion (PCCD) by generating entirely new materials, emphasizing their synthesizability. Our research stands as a noteworthy contribution to the advancement of materials design and synthesis through the cutting-edge avenue of generative design instead of the conventional substitution or experience-based discovery.


== Visual Language Navigation ==

标题: SubPipe: A Submarine Pipeline Inspection Dataset for Segmentation and Visual-inertial Localization

作者: Olaya Álvarez-Tuñón, Luiza Ribeiro Marnet, László Antal

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2401.17907v1

GitHub: https://github.com/remaro-network/SubPipe-dataset

中文摘要: 本文介绍了SubPipe,这是一个用于SLAM、目标检测和图像分割的水下数据集。SubPipe使用由OceanScan MST运营的LAUV进行记录,该航行器搭载了包括两台相机、一台侧扫声纳和一套惯性导航系统在内的多种传感器。该AUV被部署在管道检查环境中,海底管道部分被沙子覆盖。AUV的位姿真值由导航传感器估计得到。侧扫声纳图像和RGB图像分别带有目标检测和分割标注。我们在SubPipe上对最先进的分割、目标检测和SLAM方法进行了基准测试,以展示该数据集在应用计算机视觉算法方面的挑战和机遇。据作者所知,这是第一个提供真实管道检查场景的带标注水下数据集。数据集和实验已公开于https://github.com/remaro-network/SubPipe-dataset

摘要: This paper presents SubPipe, an underwater dataset for SLAM, object detection, and image segmentation. SubPipe has been recorded using a LAUV, operated by OceanScan MST, and carrying a sensor suite including two cameras, a side-scan sonar, and an inertial navigation system, among other sensors. The AUV has been deployed in a pipeline inspection environment with a submarine pipe partially covered by sand. The AUV’s pose ground truth is estimated from the navigation sensors. The side-scan sonar and RGB images include object detection and segmentation annotations, respectively. State-of-the-art segmentation, object detection, and SLAM methods are benchmarked on SubPipe to demonstrate the dataset’s challenges and opportunities for leveraging computer vision algorithms. To the authors’ knowledge, this is the first annotated underwater dataset providing a real pipeline inspection scenario. The dataset and experiments are publicly available online at https://github.com/remaro-network/SubPipe-dataset


标题: ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

作者: Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13311v1

Project: https://con-textual.github.io/

中文摘要: 人工智能的最新进展导致了大型多模态模型(LMM)的发展,这些模型能够处理复杂的任务,包括对图像中的文本和视觉内容进行联合推理(例如,在公共场所导航地图)。本文介绍了ConTextual,这是一个新颖的基准测试,包括明确设计的指令,用于评估LMMs执行上下文敏感的文本丰富的可视化推理的能力。上下文强调不同的真实世界场景(例如,时间阅读、导航、购物等),要求更深入地理解文本和视觉元素之间的交互。我们的发现揭示了表现最好的LMM、GPT-4V(ision)和使用人类评估的人类能力之间30.8%的显著性能差距,表明在上下文敏感的文本丰富的视觉推理方面有很大的改进空间。值得注意的是,虽然GPT-4V在模因和引用解释等抽象类别中表现出色,但其整体表现仍落后于人类。除了人工评估,我们还采用了使用GPT-4的自动评估指标,揭示了绩效差异的类似趋势。我们还在不同的视觉环境中进行细粒度的评估,并提供定性分析,为LMM设计的未来发展提供了一个强大的框架。https://con-textual.github.io/

摘要: Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in the image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark comprising instructions designed explicitly to evaluate LMMs’ ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time-reading, navigation, shopping and more) demanding a deeper understanding of the interactions between textual and visual elements. Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities using human evaluation indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories like meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluations, we also employed automatic evaluation metrics using GPT-4, uncovering similar trends in performance disparities. We also perform a fine-grained evaluation across diverse visual contexts and provide qualitative analysis which provides a robust framework for future advancements in the LMM design. https://con-textual.github.io/
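
下面的 Python 草图示意摘要提到的"用 GPT-4 做自动评估"这类 LLM 评审流程。`judge()` 调用与提示词均为假设写法,并非该基准的官方评估协议。

```python
# Sketch of an LLM-as-judge style automatic evaluation; `judge()` is a
# hypothetical call into whatever judge model you use.

def judge(prompt: str) -> str:
    raise NotImplementedError("call your preferred judge model here")

def auto_eval(samples):
    """samples: iterable of dicts with 'instruction', 'reference' and 'response'."""
    samples = list(samples)
    accepted = 0
    for s in samples:
        verdict = judge(
            "You are grading an answer about an image that contains text.\n"
            f"Instruction: {s['instruction']}\n"
            f"Reference answer: {s['reference']}\n"
            f"Model answer: {s['response']}\n"
            "Reply ACCEPT if the model answer is correct, otherwise REJECT."
        )
        accepted += verdict.strip().upper().startswith("ACCEPT")
    return accepted / max(len(samples), 1)   # acceptance rate
```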


标题: SemanticSLAM: Learning based Semantic Map Construction and Robust Camera Localization

作者: Mingyang Li, Yue Ma, Qinru Qiu

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.13076v1

GitHub: https://github.com/Leomingyangli/SemanticSLAM

中文摘要: 视觉同步定位与建图(VSLAM)的现有技术通过比较连续场景的图像特征来估计相机位移。这些算法依赖于场景的连续性,因此需要频繁的相机输入。然而,频繁处理图像会导致大量的内存使用和计算开销。在这项研究中,我们介绍了SemanticSLAM,这是一个端到端的视觉惯性里程计系统,它利用从RGB-D传感器提取的语义特征。这种方法能够创建环境的语义图,并确保可靠的相机定位。SemanticSLAM是场景无关的,这意味着它不需要针对不同的环境进行重新训练。即使相机输入不频繁且没有先验知识,它也能在室内环境中有效工作。SemanticSLAM的优势在于它能够逐步细化语义图并改进位姿估计。这是通过卷积长短期记忆(ConvLSTM)网络实现的,该网络经过训练可以在地图构建过程中纠正错误。与现有的VSLAM算法相比,SemanticSLAM将位姿估计精度提高了17%。由此产生的语义图提供了关于环境的可解释信息,并且可以容易地应用于各种下游任务,例如路径规划、避障和机器人导航。代码将公开于https://github.com/Leomingyangli/SemanticSLAM

摘要: Current techniques in Visual Simultaneous Localization and Mapping (VSLAM) estimate camera displacement by comparing image features of consecutive scenes. These algorithms depend on scene continuity, hence requires frequent camera inputs. However, processing images frequently can lead to significant memory usage and computation overhead. In this study, we introduce SemanticSLAM, an end-to-end visual-inertial odometry system that utilizes semantic features extracted from an RGB-D sensor. This approach enables the creation of a semantic map of the environment and ensures reliable camera localization. SemanticSLAM is scene-agnostic, which means it doesn’t require retraining for different environments. It operates effectively in indoor settings, even with infrequent camera input, without prior knowledge. The strength of SemanticSLAM lies in its ability to gradually refine the semantic map and improve pose estimation. This is achieved by a convolutional long-short-term-memory (ConvLSTM) network, trained to correct errors during map construction. Compared to existing VSLAM algorithms, SemanticSLAM improves pose estimation by 17%. The resulting semantic map provides interpretable information about the environment and can be easily applied to various downstream tasks, such as path planning, obstacle avoidance, and robot navigation. The code will be publicly available at https://github.com/Leomingyangli/SemanticSLAM


标题: ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments

作者: Dong An, Hanqing Wang, Wenguan Wang

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2304.03047v3

GitHub: https://github.com/MarSaKi/ETPNav

中文摘要: 视觉语言导航是一项需要智能体按照指令在环境中导航的任务。它在具身智能领域变得越来越重要,在自主导航、搜索救援以及人机交互方面具有潜在的应用。在本文中,我们提出了一个更实际但更具挑战性的对应设置——连续环境中的视觉语言导航(VLN-CE)。为了开发一个鲁棒的VLN-CE智能体,我们提出了一个新的导航框架ETPNav,它专注于两个关键技能:1)抽象环境并生成长程导航规划的能力,以及2)在连续环境中的避障控制能力。ETPNav通过沿着已走过的路径自组织预测的航路点来对环境进行在线拓扑建图,而无需先验的环境经验。它使智能体能够将导航过程分解为高层规划和低层控制。同时,ETPNav利用基于Transformer的跨模态规划器,根据拓扑图和指令生成导航规划。然后,该规划由避障控制器执行,该控制器利用试错启发式方法防止导航陷入障碍物。实验结果证明了该方法的有效性:在R2R-CE和RxR-CE数据集上,ETPNav相比之前的最优方法分别取得了超过10%和20%的提升。我们的代码可在https://github.com/MarSaKi/ETPNav获取。

摘要: Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. It becomes increasingly crucial in the field of embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we propose to address a more practical yet challenging counterpart setting - vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments. ETPNav performs online topological mapping of environments by self-organizing predicted waypoints along a traversed path, without prior environmental experience. It privileges the agent to break down the navigation procedure into high-level planning and low-level control. Concurrently, ETPNav utilizes a transformer-based cross-modal planner to generate navigation plans based on topological maps and instructions. The plan is then performed through an obstacle-avoiding controller that leverages a trial-and-error heuristic to prevent navigation from getting stuck in obstacles. Experimental results demonstrate the effectiveness of the proposed method. ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets, respectively. Our code is available at https://github.com/MarSaKi/ETPNav.
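
下面用一个玩具级的 Python 类示意"沿途自组织预测航路点、在线构建拓扑图"的基本思路:每个预测航路点要么并入附近已有节点,要么新建节点并与当前节点连边。阈值与航路点预测器均为假设,并非 ETPNav 的实际实现。

```python
# Toy illustration of online topological mapping by self-organising predicted
# waypoints; thresholds and the waypoint predictor are placeholders.
import math

class TopoMap:
    def __init__(self, merge_radius: float = 0.5):
        self.nodes = {0: (0.0, 0.0)}   # node_id -> (x, y); node 0 = start pose
        self.edges = set()             # undirected pairs of node_ids
        self.merge_radius = merge_radius
        self._next_id = 1

    def _nearest(self, pos):
        node_id, node_pos = min(self.nodes.items(),
                                key=lambda kv: math.dist(kv[1], pos))
        return node_id if math.dist(node_pos, pos) <= self.merge_radius else None

    def add_waypoints(self, current_id, predicted_positions):
        for pos in predicted_positions:
            node = self._nearest(pos)
            if node is None:                       # genuinely new place
                node = self._next_id
                self.nodes[node] = pos
                self._next_id += 1
            if node != current_id:                 # link it to the current node
                self.edges.add(tuple(sorted((current_id, node))))
        return self

# Usage idea: after every panoramic observation, feed the waypoint predictor's
# outputs into add_waypoints(), then let the planner reason over nodes/edges.
```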


标题: Multimotion Visual Odometry (MVO)

作者: Kevin M. Judd, Jonathan D. Gammell

PubTime: 2024-01-15

Downlink: http://arxiv.org/abs/2110.15169v3

Project: https://www.youtube.com/watch?v=mNj3s1nf-6A | https://www.youtube.com/playlist?list=PLbaQBz4TuPcxMIXKh5Q80s0N9ISezFcpi

中文摘要: 视觉运动估计是自主导航中一个研究得很好的挑战。最近的工作集中于解决高度动态环境中的多运动估计问题。这些环境不仅包括多个复杂的运动,而且往往表现出显著的遮挡。很难同时估计第三方运动和传感器自运动,因为物体的观测运动包括其真实运动和传感器运动。先前在多运动估计中的大多数工作通过依赖于基于外观的对象检测或特定于应用程序的运动约束来简化这个问题。这些方法在特定的应用程序和环境中是有效的,但不能很好地推广到完整的多运动估计问题(MEP)。本文介绍了Multimotion Visual Odometry(MVO),这是一种多运动估计管道,它估计场景中每个运动的完整SE(3)轨迹,包括传感器自身运动,而不依赖于基于外观的信息。MVO通过多运动分割和跟踪技术扩展了传统的视觉里程计(VO)管道。它使用物理建立的运动先验来推断通过临时遮挡的运动,并通过运动闭合来识别运动的再现。对牛津多运动数据集(OMD)和KITTI Vision Benchmark Suite的真实世界数据的评估表明,与类似方法相比,MVO实现了良好的估计精度,并适用于各种多运动估计挑战

摘要: Visual motion estimation is a well-studied challenge in autonomous navigation. Recent work has focused on addressing multimotion estimation in highly dynamic environments. These environments not only comprise multiple, complex motions but also tend to exhibit significant occlusion. Estimating third-party motions simultaneously with the sensor egomotion is difficult because an object’s observed motion consists of both its true motion and the sensor motion. Most previous works in multimotion estimation simplify this problem by relying on appearance-based object detection or application-specific motion constraints. These approaches are effective in specific applications and environments but do not generalize well to the full multimotion estimation problem (MEP). This paper presents Multimotion Visual Odometry (MVO), a multimotion estimation pipeline that estimates the full SE(3) trajectory of every motion in the scene, including the sensor egomotion, without relying on appearance-based information. MVO extends the traditional visual odometry (VO) pipeline with multimotion segmentation and tracking techniques. It uses physically founded motion priors to extrapolate motions through temporary occlusions and identify the reappearance of motions through motion closure. Evaluations on real-world data from the Oxford Multimotion Dataset (OMD) and the KITTI Vision Benchmark Suite demonstrate that MVO achieves good estimation accuracy compared to similar approaches and is applicable to a variety of multimotion estimation challenges.


标题: Learning Interactive Real-World Simulators

作者: Mengjiao Yang, Yilun Du, Kamyar Ghasemipour

PubTime: 2024-01-13

Downlink: http://arxiv.org/abs/2310.06114v2

Project: https://universal-simulator.github.io

中文摘要: 基于互联网数据训练的生成模型彻底改变了文本、图像和视频内容的创建方式。生成模型的下一个里程碑也许是模拟现实体验,以响应人类、机器人和其他交互式智能体所采取的行动。真实世界模拟器的应用范围从游戏和电影中的可控内容创建,到纯粹在模拟中训练可直接部署到现实世界的具身智能体。我们探索了通过生成建模学习真实世界交互的通用模拟器的可能性。我们首先提出了一个重要的观察结果,即可用于学习真实世界模拟器的自然数据集通常在不同维度上是丰富的(例如,图像数据中的大量物体、机器人数据中的密集采样动作以及导航数据中的多样化运动)。通过仔细编排不同的数据集,让每个数据集提供整体体验的不同方面,我们可以从静态场景和物体出发,模拟高级指令(如"打开抽屉")和低级控制(如"按x、y移动")的视觉结果。我们使用模拟器来训练高级视觉语言策略和低级强化学习策略,在纯模拟训练后,每种策略都可以零样本部署到现实世界中。我们还表明,其他类型的智能,如视频字幕模型,也可以从模拟经验的训练中受益,从而开辟更广泛的应用。视频演示见https://universal-simulator.github.io。

摘要: Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies, to training embodied agents purely in simulation that can be directly deployed in the real world. We explore the possibility of learning a universal simulator of real-world interaction through generative modeling. We first make the important observation that natural datasets available for learning a real-world simulator are often rich along different dimensions (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing a different aspect of the overall experience, we can simulate the visual outcome of both high-level instructions such as “open the drawer” and low-level controls such as “move by x, y” from otherwise static scenes and objects. We use the simulator to train both high-level vision-language policies and low-level reinforcement learning policies, each of which can be deployed in the real world in zero shot after training purely in simulation. We also show that other types of intelligence such as video captioning models can benefit from training with simulated experience, opening up even wider applications. Video demos can be found at https://universal-simulator.github.io.

