[晓理紫]每日论文分享(有源码或项目地址、中文摘要)--大模型、扩散模型、视觉

专属领域论文订阅

关注{晓理紫|小李子},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持

如果你感觉对你有所帮助,请关注我,每日准时为你推送最新论文。

为了答谢各位网友的支持,从今日起免费为300名读者提供订阅主题论文服务,只需VX关注公号并回复{邮箱+论文主题}(如:[email protected] + chatgpt@large language model @LLM),主题必须是同一个领域,最多三个关键词。解释权归博主所有


分类:

  • 大语言模型LLM
  • 视觉模型VLM
  • 扩散模型
  • 视觉语言导航VLN
  • 强化学习 RL
  • 模仿学习 IL
  • 机器人
  • 开放词汇,检测分割

== LLM ==

标题: War and Peace (WarAgent): Large Language Model-based Multi-Agent Simulation of World Wars

作者: Wenyue Hua, Lizhou Fan, Lingyao Li

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2311.17227v2

GitHub: https://github.com/agiresearch/WarAgent|

中文摘要: 我们能在历史的十字路口避免战争吗?在整个人类历史中,个人、学者、决策者和组织都在探讨这个问题。在这项研究中,我们试图根据人工智能(AI)和大型语言模型(LLMs)的最新进展来回答这个问题。我们提出\textbf{WarAgent},一个LLM驱动的多智能体人工智能系统,来模拟历史上国际冲突中的参与国、他们的决策及其后果,包括第一次世界大战(WWI)、第二次世界大战(WWII)和中国古代战国时期(WSP)。通过评估模拟效果,我们检验了前沿人工智能系统在研究复杂的集体人类行为(如不同环境下的国际冲突)方面的进步和局限性。在这些模拟中,智能体之间涌现的交互也为检视导致战争的触发因素和条件提供了一个新的视角。我们的发现提供了数据驱动和人工智能增强的见解,可以重新定义我们处理冲突解决和维和策略的方式。其意义超越了历史分析,为使用人工智能理解人类历史并可能防止未来的国际冲突提供了蓝图。代码和数据可在\url{https://github.com/agiresearch/WarAgent}获得。

摘要: Can we avoid wars at the crossroads of history? This question has been pursued by individuals, scholars, policymakers, and organizations throughout human history. In this research, we attempt to answer the question based on the recent advances of Artificial Intelligence (AI) and Large Language Models (LLMs). We propose \textbf{WarAgent}, an LLM-powered multi-agent AI system, to simulate the participating countries, their decisions, and the consequences, in historical international conflicts, including the World War I (WWI), the World War II (WWII), and the Warring States Period (WSP) in Ancient China. By evaluating the simulation effectiveness, we examine the advancements and limitations of cutting-edge AI systems’ abilities in studying complex collective human behaviors such as international conflicts under diverse settings. In these simulations, the emergent interactions among agents also offer a novel perspective for examining the triggers and conditions that lead to war. Our findings offer data-driven and AI-augmented insights that can redefine how we approach conflict resolution and peacekeeping strategies. The implications stretch beyond historical analysis, offering a blueprint for using AI to understand human history and possibly prevent future international conflicts. Code and data are available at \url{https://github.com/agiresearch/WarAgent}.
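
As a rough, self-contained sketch of the kind of turn-based loop an LLM-driven country-agent simulation could use (my own illustration, not the WarAgent code: `query_llm`, `CountryAgent`, and the scenario strings below are placeholders):

```python
# Minimal multi-agent turn loop; query_llm is a placeholder for any chat backend.

def query_llm(prompt):
    """Placeholder for an LLM call; returns a canned action for illustration."""
    return "DECLARE_NEUTRALITY"

class CountryAgent:
    def __init__(self, name, profile):
        self.name = name
        self.profile = profile      # national profile / goals from the scenario board
        self.history = []           # the agent's own action log

    def act(self, world_state):
        prompt = (
            f"You are {self.name}. Profile: {self.profile}\n"
            f"World state: {world_state}\n"
            f"Your past actions: {self.history}\n"
            "Choose one diplomatic or military action and explain it briefly."
        )
        action = query_llm(prompt)
        self.history.append(action)
        return action

def simulate(agents, rounds=3):
    log = []
    world_state = "1914: rising tensions in Europe."
    for t in range(rounds):
        actions = {a.name: a.act(world_state) for a in agents}
        # A real system would resolve actions into a new world state;
        # here we just append them so the loop stays self-contained.
        world_state += f" | round {t}: {actions}"
        log.append(actions)
    return log

if __name__ == "__main__":
    agents = [CountryAgent("Britain", "maritime power, alliance with France"),
              CountryAgent("Germany", "continental power, alliance with Austria-Hungary")]
    print(simulate(agents))
```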


标题: Conditional and Modal Reasoning in Large Language Models

作者: Wesley H. Holliday, Matthew Mandelkern

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.17169v1

GitHub: https://github.com/wesholliday/llm-logic|

中文摘要: 大型语言模型(LLMs)的推理能力是人工智能和认知科学领域越来越多研究的主题。在本文中,我们探讨了十几个LLMs能够在多大程度上区分逻辑上正确的推论和逻辑上谬误的推论。我们专注于涉及条件句(例如,“如果安有一个皇后,那么鲍勃有一个杰克”)和认知情态动词(例如,“安可能有一张王牌”,“鲍勃一定有一个国王”)的推理模式。逻辑学家、哲学家和语言学家对这些推理模式特别感兴趣,因为它们很可能在人类推理中起着核心作用。因此,在这些推理模式上评估LLMs,与“LLMs的推理能力在多大程度上与人类的推理能力相匹配”这一问题高度相关。在我们测试的LLM中,除GPT-4之外,所有模型都经常在条件句上犯基本错误。此外,即使是GPT-4,在涉及认知情态动词的推理模式中也表现出逻辑上不一致的判断。

摘要: The reasoning abilities of large language models (LLMs) are the topic of a growing body of research in artificial intelligence and cognitive science. In this paper, we probe the extent to which a dozen LLMs are able to distinguish logically correct inferences from logically fallacious ones. We focus on inference patterns involving conditionals (e.g., ‘If Ann has a queen, then Bob has a jack’) and epistemic modals (e.g., ‘Ann might have an ace’, ‘Bob must have a king’). These inference patterns have been of special interest to logicians, philosophers, and linguists, since they plausibly play a central role in human reasoning. Assessing LLMs on these inference patterns is thus highly relevant to the question of how much the reasoning abilities of LLMs match those of humans. Among the LLMs we tested, all but GPT-4 often make basic mistakes with conditionals. Moreover, even GPT-4 displays logically inconsistent judgments across inference patterns involving epistemic modals.
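
For readers who want to build a similar probe, a toy harness might look like the following; the three patterns mirror the card examples in the abstract, while `ask_model` is a stand-in for an actual LLM call, not the authors' benchmark:

```python
# Toy harness: check whether a model's validity verdicts match logical ground truth.

PATTERNS = [
    # (premises, conclusion, logically_valid)
    (["If Ann has a queen, then Bob has a jack.", "Ann has a queen."],
     "Bob has a jack.", True),                      # modus ponens
    (["If Ann has a queen, then Bob has a jack.", "Bob has a jack."],
     "Ann has a queen.", False),                    # affirming the consequent
    (["If Ann has a queen, then Bob has a jack.", "Bob does not have a jack."],
     "Ann does not have a queen.", True),           # modus tollens
]

def ask_model(premises, conclusion):
    """Placeholder: a real harness would prompt an LLM and parse a yes/no answer."""
    return True  # canned verdict so the script runs end-to-end

def evaluate():
    correct = 0
    for premises, conclusion, valid in PATTERNS:
        predicted_valid = ask_model(premises, conclusion)
        correct += int(predicted_valid == valid)
    return correct / len(PATTERNS)

if __name__ == "__main__":
    print(f"accuracy on toy patterns: {evaluate():.2f}")
```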


标题: Augmenting Math Word Problems via Iterative Question Composing

作者: Haoxiong Liu, Yifan Zhang, Yifan Luo

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.09003v3

Project: https://huggingface.co/datasets/Vivacem/MMIQC|

GitHub: https://github.com/iiis-ai/IterativeQuestionComposing|

中文摘要: 尽管用于数学推理的大型语言模型(LLMs)取得了进步,但解决竞赛级别的数学问题仍然是一个重大挑战,特别是对于没有外部工具的开源LLMs。我们引入了MMIQC数据集,它由处理过的网络数据和合成的问题-回答对组成,旨在增强基础语言模型的数学推理能力。在各种模型规模下,在MMIQC上微调的模型在MATH基准测试中的性能始终超过同类模型。值得注意的是,Qwen-72B-MMIQC实现了45.0%的准确率,比之前的开源最先进水平高出8.2%,并超过了2023年发布的初始版本GPT-4。在匈牙利高中期末考试上的广泛评估结果表明,这种改进可以推广到未见过的数据。我们对MMIQC的消融研究表明,很大一部分改进可以归因于我们新的增强方法,即迭代问题合成(IQC),它使用一个LLM从种子问题迭代合成新问题,并通过另一个LLM进行拒绝采样。MMIQC数据集可在HuggingFace Hub获得:https://huggingface.co/datasets/Vivacem/MMIQC。我们的代码可从https://github.com/iiis-ai/IterativeQuestionComposing获得。

摘要: Despite the advancements in large language models (LLMs) for mathematical reasoning, solving competition-level math problems remains a significant challenge, especially for open-source LLMs without external tools. We introduce the MMIQC dataset, comprising a mixture of processed web data and synthetic question-response pairs, aimed at enhancing the mathematical reasoning capabilities of base language models. Models fine-tuned on MMIQC consistently surpass their counterparts in performance on the MATH benchmark across various model sizes. Notably, Qwen-72B-MMIQC achieves a 45.0% accuracy, exceeding the previous open-source state-of-the-art by 8.2% and outperforming the initial version GPT-4 released in 2023. Extensive evaluation results on Hungarian high school finals suggest that such improvement can generalize to unseen data. Our ablation study on MMIQC reveals that a large part of the improvement can be attributed to our novel augmentation method, Iterative Question Composing (IQC), which involves iteratively composing new questions from seed problems using an LLM and applying rejection sampling through another LLM. The MMIQC dataset is available on the HuggingFace hub at https://huggingface.co/datasets/Vivacem/MMIQC. Our code is available at https://github.com/iiis-ai/IterativeQuestionComposing.
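
A hedged sketch of the IQC loop as described in the abstract: one model composes a new question from a seed, a second model answers it, and samples that fail a consistency check are rejected. `compose_llm`, `answer_llm`, and `is_consistent` are placeholders, not the released pipeline:

```python
import random

def compose_llm(seed_question):
    """Placeholder composer: would prompt an LLM to write a new variant question."""
    return seed_question + " Now also report the result modulo 7."

def answer_llm(question):
    """Placeholder answerer: would prompt a second LLM for a solution."""
    return "42"

def is_consistent(question, answer):
    """Placeholder rejection check (e.g., answer verification or voting)."""
    return random.random() > 0.3

def iterative_question_composing(seed, iterations=3):
    dataset, current = [], seed
    for _ in range(iterations):
        current = compose_llm(current)          # compose a new question from the seed chain
        answer = answer_llm(current)
        if is_consistent(current, answer):      # rejection-sampling step
            dataset.append({"question": current, "answer": answer})
    return dataset

if __name__ == "__main__":
    print(iterative_question_composing("Compute 3 + 4 * 5."))
```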


标题: K-QA: A Real-World Medical Q&A Benchmark

作者: Itay Manes, Naama Ronn, David Cohen

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.14493v1

Project: https://huggingface.co/spaces/Itaykhealth/K-QA|

GitHub: https://github.com/Itaymanes/K-QA|

中文摘要: 确保大型语言模型(LLMs)提供的响应的准确性至关重要,尤其是在不正确的信息可能直接影响患者健康的临床环境中。为了应对这一挑战,我们构建了K-QA,这是一个包含1,212个患者问题的数据集,这些问题来自K Health(一个人工智能驱动的临床平台)上的真实世界对话。我们邀请了一组内部医生来回答问题,并手动将K-QA的一个子集分解成独立的陈述。此外,我们制定了两个近似召回率和精确率的基于NLI的评估指标:(1)全面性,测量生成的答案中包含的基本临床信息的百分比;(2)幻觉率,测量医生撰写的答案中被LLM答案所矛盾的陈述数量。最后,我们使用K-QA和这些指标来评估几个最先进的模型,以及作者开发的上下文学习和面向医学的检索增强方案的效果。我们的发现表明,上下文学习提高了模型的全面性,而检索增强在减少幻觉方面是有效的。我们向社区公开K-QA,以推动医学上准确的NLP应用研究。

摘要: Ensuring the accuracy of responses provided by large language models (LLMs) is crucial, particularly in clinical settings where incorrect information may directly impact patient health. To address this challenge, we construct K-QA, a dataset containing 1,212 patient questions originating from real-world conversations held on K Health (an AI-driven clinical platform). We employ a panel of in-house physicians to answer and manually decompose a subset of K-QA into self-contained statements. Additionally, we formulate two NLI-based evaluation metrics approximating recall and precision: (1) comprehensiveness, measuring the percentage of essential clinical information in the generated answer and (2) hallucination rate, measuring the number of statements from the physician-curated response contradicted by the LLM answer. Finally, we use K-QA along with these metrics to evaluate several state-of-the-art models, as well as the effect of in-context learning and medically-oriented augmented retrieval schemes developed by the authors. Our findings indicate that in-context learning improves the comprehensiveness of the models, and augmented retrieval is effective in reducing hallucinations. We make K-QA available to the community to spur research into medically accurate NLP applications.
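
The two metrics are straightforward to express once an NLI model is available. Below is an illustrative sketch (not the K-QA evaluation code); `nli` is a placeholder for any entailment/contradiction classifier:

```python
def nli(premise, hypothesis):
    """Placeholder NLI call; a real harness would run a trained NLI model
    and return one of 'entailment', 'contradiction', 'neutral'."""
    return "entailment"

def comprehensiveness(generated_answer, essential_statements):
    # fraction of physician-curated essential statements entailed by the answer
    covered = sum(nli(generated_answer, s) == "entailment" for s in essential_statements)
    return covered / len(essential_statements)

def hallucination_count(generated_answer, curated_statements):
    # number of curated statements that the generated answer contradicts
    return sum(nli(generated_answer, s) == "contradiction" for s in curated_statements)

if __name__ == "__main__":
    statements = ["Ibuprofen can upset the stomach.", "Take it with food."]
    answer = "Ibuprofen may irritate the stomach, so take it with food."
    print(comprehensiveness(answer, statements), hallucination_count(answer, statements))
```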


标题: Wordflow: Social Prompt Engineering for Large Language Models

作者: Zijie J. Wang, Aishwarya Chakravarthy, David Munechika

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.14447v1

Project: https://poloclub.github.io/wordflow|https://youtu.be/3dOcVuofGVo|

GitHub: https://github.com/poloclub/wordflow/|

中文摘要: 大型语言模型(LLMs)需要精心制作的提示才能有效使用。提示工程,即设计提示的过程,具有挑战性,特别是对于不太熟悉人工智能技术的非专家来说。虽然研究人员已经提出了帮助LLM用户进行提示设计的技术和工具,但这些工作主要针对人工智能应用程序开发人员,而不是非专家。为了填补这一研究空白,我们提出了社交提示工程,这是一种利用社会计算技术来促进协作式提示设计的新范式。为了研究社交提示工程,我们引入了Wordflow,这是一个开源的社交文本编辑器,使日常用户能够轻松地创建、运行、共享和发现LLM提示。此外,通过利用现代网络技术,Wordflow允许用户在他们的浏览器中本地且私密地运行LLMs。两个使用场景展示了社交提示工程和我们的工具如何增强普通用户与LLMs的交互。Wordflow可在https://poloclub.github.io/wordflow公开访问。

摘要: Large language models (LLMs) require well-crafted prompts for effective use. Prompt engineering, the process of designing prompts, is challenging, particularly for non-experts who are less familiar with AI technologies. While researchers have proposed techniques and tools to assist LLM users in prompt design, these works primarily target AI application developers rather than non-experts. To address this research gap, we propose social prompt engineering, a novel paradigm that leverages social computing techniques to facilitate collaborative prompt design. To investigate social prompt engineering, we introduce Wordflow, an open-source and social text editor that enables everyday users to easily create, run, share, and discover LLM prompts. Additionally, by leveraging modern web technologies, Wordflow allows users to run LLMs locally and privately in their browsers. Two usage scenarios highlight how social prompt engineering and our tool can enhance laypeople’s interaction with LLMs. Wordflow is publicly accessible at https://poloclub.github.io/wordflow.


标题: In-Context Language Learning: Architectures and Algorithms

作者: Ekin Akyürek, Bailin Wang, Yoon Kim

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.12973v2

摘要: Large-scale neural language models exhibit a remarkable capacity for in-context learning (ICL): they can infer novel functions from datasets provided as input. Most of our current understanding of when and how ICL arises comes from LMs trained on extremely simple learning problems like linear regression and associative recall. There remains a significant gap between these model problems and the “real” ICL exhibited by LMs trained on large text corpora, which involves not just retrieval and function approximation but free-form generation of language and other structured outputs. In this paper, we study ICL through the lens of a new family of model problems we term in context language learning (ICLL). In ICLL, LMs are presented with a set of strings from a formal language, and must generate additional strings from the same language. We focus on in-context learning of regular languages generated by random finite automata. We evaluate a diverse set of neural sequence models (including several RNNs, Transformers, and state-space model variants) on regular ICLL tasks, aiming to answer three questions: (1) Which model classes are empirically capable of ICLL? (2) What algorithmic solutions do successful models implement to perform ICLL? (3) What architectural changes can improve ICLL in less performant models? We first show that Transformers significantly outperform neural sequence models with recurrent or convolutional representations on ICLL tasks. Next, we provide evidence that their ability to do so relies on specialized “n-gram heads” (higher-order variants of induction heads) that compute input-conditional next-token distributions. Finally, we show that hard-wiring these heads into neural models improves performance not just on ICLL, but natural language modeling – improving the perplexity of 340M-parameter models by up to 1.14 points (6.7%) on the SlimPajama dataset.
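
To make the ICLL setup concrete, here is a small, self-contained sketch of one episode: sample a random DFA, draw strings it accepts, and pack them into a context that a sequence model must continue with more strings of the same language. This is my own illustration of the task format, not the paper's data generator:

```python
import random

def random_dfa(num_states=4, alphabet="ab", seed=0):
    # random transition table and a random non-empty set of accepting states
    rng = random.Random(seed)
    delta = {(s, c): rng.randrange(num_states) for s in range(num_states) for c in alphabet}
    accepting = {s for s in range(num_states) if rng.random() < 0.5} or {0}
    return delta, accepting

def sample_string(delta, accepting, alphabet="ab", max_len=8, tries=1000, rng=random):
    # rejection-sample a string the DFA accepts; give up gracefully if the
    # randomly drawn language happens to have no short members
    for _ in range(tries):
        s, state = "", 0
        for _ in range(rng.randrange(1, max_len + 1)):
            c = rng.choice(alphabet)
            s, state = s + c, delta[(state, c)]
        if state in accepting:
            return s
    return ""

def make_episode(num_strings=5):
    delta, accepting = random_dfa()
    strings = [sample_string(delta, accepting) for _ in range(num_strings)]
    return " | ".join(strings)  # the model sees this context and must emit more members

if __name__ == "__main__":
    print(make_episode())
```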


== VLM ==

标题: Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning

作者: Bang Yang, Yong Dai, Xuxin Cheng

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.17186v1

GitHub: https://github.com/yangbang18/CLFM|

中文摘要: 虽然视觉语言预训练模型(VL-PTMs)近年来推进了多模态研究,但它们对英语等少数语言的掌握限制了它们在更广泛社区中的适用性。为此,通过联合学习设置开发多语言VL模型的兴趣越来越大,然而,由于昂贵的成本和数据可用性,这可能是不现实的。在这项工作中,我们建议通过持续语言学习(CLL)来扩展VL-PTMs的语言能力,其中模型需要增量更新其语言知识,而不会遭受灾难性遗忘(CF)。我们通过引入一个被称为CLL-CLIP的模型开始我们的研究,该模型建立在CLIP的基础上,CLIP是一种流行的VL-PTM,已经获得了图像——英语文本对齐。具体来说,CLL-CLIP包含一个可扩展的令牌嵌入层来处理语言差异。它只训练标记嵌入以提高记忆稳定性,并在跨模态和跨语言目标下进行优化,以学习图像和多语言文本之间的对齐。为了缓解协变量移位和词汇重叠引起的CF,我们进一步提出了一种新的方法,确保初始化期间所有标记嵌入的相同分布,并在训练期间正则化标记嵌入学习。我们基于MSCOCO和XM3600数据集构建了一个涵盖36种语言的CLL基准,然后评估了多语言图像文本检索性能。大量的实验验证了CLL-CLIP的有效性,并表明我们的方法可以提高CLL-CLIP,例如,在XM3600上的文本到图像平均召回率@1提高6.7%,并持续改进各种最先进的方法。我们的代码和数据可在\url{https://github.com/yangbang18/CLFM}获得。

摘要: While vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, their mastery in a few languages like English restricts their applicability in broader communities. To this end, there is an increasing interest in developing multilingual VL models via a joint-learning setup, which, however, could be unrealistic due to expensive costs and data availability. In this work, we propose to extend VL-PTMs’ language capacity by continual language learning (CLL), where a model needs to update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF). We begin our study by introducing a model dubbed CLL-CLIP, which builds upon CLIP, a prevailing VL-PTM that has acquired image-English text alignment. Specifically, CLL-CLIP contains an expandable token embedding layer to handle linguistic differences. It solely trains token embeddings to improve memory stability and is optimized under cross-modal and cross-lingual objectives to learn the alignment between images and multilingual texts. To alleviate CF raised by covariate shift and lexical overlap, we further propose a novel approach that ensures the identical distribution of all token embeddings during initialization and regularizes token embedding learning during training. We construct a CLL benchmark covering 36 languages based on MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance. Extensive experiments verify the effectiveness of CLL-CLIP and show that our approach can boost CLL-CLIP, e.g., by 6.7% in text-to-image average Recall@1 on XM3600, and improve various state-of-the-art methods consistently. Our code and data are available at \url{https://github.com/yangbang18/CLFM}.
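
A minimal sketch of the "expandable token embedding" idea from the abstract, assuming a CLIP-like model whose other weights stay frozen while only the (old and newly appended) token embeddings are trainable; this is an illustration, not the released CLFM code:

```python
import torch
import torch.nn as nn

class ExpandableTokenEmbedding(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, dim) * 0.02)

    def expand(self, extra_tokens):
        # append rows for a new language's tokens, keeping existing rows and their values
        new_rows = torch.randn(extra_tokens, self.weight.shape[1]) * 0.02
        self.weight = nn.Parameter(torch.cat([self.weight.data, new_rows], dim=0))

    def forward(self, token_ids):
        return self.weight[token_ids]

if __name__ == "__main__":
    emb = ExpandableTokenEmbedding(vocab_size=100, dim=16)
    emb.expand(extra_tokens=20)                 # a new language is added
    ids = torch.tensor([[1, 5, 110]])           # mixes old and new token ids
    out = emb(ids)
    # only these token embeddings would receive gradients; a frozen text/image
    # encoder would sit on top of `out` in the full model
    print(out.shape)                            # torch.Size([1, 3, 16])
```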


标题: ViTree: Single-path Neural Tree for Step-wise Interpretable Fine-grained Visual Categorization

作者: Danning Lao, Qi Liu, Jiazi Bu

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.17050v1

GitHub: https://github.com/SJTU-DeepVisionLab/ViTree|

中文摘要: 随着计算机视觉的不断发展并在各个领域得到广泛应用,深度学习模型的可解释性需求变得至关重要。现有的方法通常求助于事后技术或原型来解释决策过程,这可能是间接的,缺乏内在的说明。在这项研究中,我们介绍了ViTree,这是一种用于细粒度视觉分类的新方法,它将流行的视觉Transformer作为特征提取主干,并与神经决策树相结合。通过遍历树路径,ViTree有效地从Transformer处理的特征中选择图像块(patch),以突出信息丰富的局部区域,从而以逐步的方式细化表示。与以前依赖于软分布或路径集成的基于树的模型不同,ViTree选择单一树路径,提供了更清晰和更简单的决策过程。这种图像块和路径的选择性增强了ViTree的模型可解释性,能够更好地洞察模型的内部工作机制。值得注意的是,大量实验证实,这种简化的方法超越了各种强大的竞争对手,实现了最先进的性能,同时保持了出色的可解释性,这一点已被多视角方法所证明。代码可在https://github.com/SJTU-DeepVisionLab/ViTree找到。

摘要: As computer vision continues to advance and finds widespread applications across various domains, the need for interpretability in deep learning models becomes paramount. Existing methods often resort to post-hoc techniques or prototypes to explain the decision-making process, which can be indirect and lack intrinsic illustration. In this research, we introduce ViTree, a novel approach for fine-grained visual categorization that combines the popular vision transformer as a feature extraction backbone with neural decision trees. By traversing the tree paths, ViTree effectively selects patches from transformer-processed features to highlight informative local regions, thereby refining representations in a step-wise manner. Unlike previous tree-based models that rely on soft distributions or ensembles of paths, ViTree selects a single tree path, offering a clearer and simpler decision-making process. This patch and path selectivity enhances model interpretability of ViTree, enabling better insights into the model’s inner workings. Remarkably, extensive experimentation validates that this streamlined approach surpasses various strong competitors and achieves state-of-the-art performance while maintaining exceptional interpretability which is proved by multi-perspective methods. Code can be found at https://github.com/SJTU-DeepVisionLab/ViTree.
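
A toy version of single-path routing over ViT patch tokens, in the spirit of the description above (a hard patch choice per node along one root-to-leaf path); the real ViTree architecture is more involved:

```python
import torch
import torch.nn as nn

class SinglePathTree(nn.Module):
    """Toy single-path routing: at each depth, pick the highest-scoring patch,
    add it to the running representation, then classify. Illustrative only."""
    def __init__(self, dim, depth, num_classes):
        super().__init__()
        self.scorers = nn.ModuleList([nn.Linear(dim, 1) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, dim), e.g. the output of a ViT backbone
        rep = patch_tokens.mean(dim=1)
        visited = []
        for scorer in self.scorers:
            scores = scorer(patch_tokens).squeeze(-1)                  # (batch, num_patches)
            idx = scores.argmax(dim=-1)                                # hard, single-patch choice
            picked = patch_tokens[torch.arange(patch_tokens.size(0)), idx]
            rep = rep + picked                                         # step-wise refinement
            visited.append(idx)                                        # the interpretable path
        return self.head(rep), visited

if __name__ == "__main__":
    tokens = torch.randn(2, 196, 64)                                   # fake ViT features
    logits, path = SinglePathTree(dim=64, depth=3, num_classes=10)(tokens)
    print(logits.shape, [p.tolist() for p in path])
```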


标题: Multi-granularity Correspondence Learning from Long-term Noisy Videos

作者: Yijie Lin, Jie Zhang, Zhenyu Huang

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.16702v1

Project: https://lin-yijie.github.io/projects/Norton|

中文摘要: 现有的视频语言研究主要集中在学习短视频剪辑上,由于建模长视频的计算成本过高,长期时间依赖性很少被探索。为了解决这个问题,一个可行的方案是学习视频剪辑和字幕之间的对应关系,但这不可避免地会遇到多粒度噪声对应(MNC)问题。具体来说,MNC指的是剪辑-字幕错位(粗粒度)和帧-词错位(细粒度),它们阻碍了时序学习和视频理解。在本文中,我们提出了噪声鲁棒时序最优传输(Norton),在统一的最优传输(OT)框架中解决MNC。简而言之,Norton采用视频-段落和剪辑-字幕对比损失来基于OT捕捉长期依赖关系。为了解决视频-段落对比中的粗粒度错位,Norton通过可对齐的提示桶过滤掉不相关的剪辑和字幕,并根据传输距离重新对齐异步的剪辑-字幕对。为了解决细粒度错位,Norton采用软最大(soft-maximum)算子来识别关键单词和关键帧。此外,Norton利用剪辑-字幕对比中潜在的错误负样本,通过OT分配校正对齐目标,以确保精确的时序建模。在视频检索、视频问答(videoQA)和动作分割上的大量实验验证了我们方法的有效性。代码可在https://lin-yijie.github.io/projects/Norton获得。

摘要: Existing video-language studies mainly focus on learning short video clips, leaving long-term temporal dependencies rarely explored due to over-high computational cost of modeling long videos. To address this issue, one feasible solution is learning the correspondence between video clips and captions, which however inevitably encounters the multi-granularity noisy correspondence (MNC) problem. To be specific, MNC refers to the clip-caption misalignment (coarse-grained) and frame-word misalignment (fine-grained), hindering temporal learning and video understanding. In this paper, we propose NOise Robust Temporal Optimal traNsport (Norton) that addresses MNC in a unified optimal transport (OT) framework. In brief, Norton employs video-paragraph and clip-caption contrastive losses to capture long-term dependencies based on OT. To address coarse-grained misalignment in video-paragraph contrast, Norton filters out the irrelevant clips and captions through an alignable prompt bucket and realigns asynchronous clip-caption pairs based on transport distance. To address the fine-grained misalignment, Norton incorporates a soft-maximum operator to identify crucial words and key frames. Additionally, Norton exploits the potential faulty negative samples in clip-caption contrast by rectifying the alignment target with OT assignment to ensure precise temporal modeling. Extensive experiments on video retrieval, videoQA, and action segmentation verify the effectiveness of our method. Code is available at https://lin-yijie.github.io/projects/Norton.
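
Since the method is built on optimal transport, a small entropic-OT (Sinkhorn) example for soft clip-caption alignment may help; this shows only the generic OT building block, not Norton's full objective (prompt bucket, soft-maximum operator, target rectification):

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost, eps=0.1, iters=50):
    # entropic OT with uniform marginals; returns a soft clip-to-caption assignment plan
    n, m = cost.shape
    K = torch.exp(-cost / eps)
    a = torch.full((n,), 1.0 / n)
    b = torch.full((m,), 1.0 / m)
    v = torch.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)   # diag(u) K diag(v)

if __name__ == "__main__":
    clips = torch.randn(4, 32)       # clip embeddings
    caps = torch.randn(5, 32)        # caption embeddings
    cost = 1 - F.cosine_similarity(clips.unsqueeze(1), caps.unsqueeze(0), dim=-1)
    plan = sinkhorn(cost)
    print(plan.shape, plan.sum())    # (4, 5), total mass ~1
```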


标题: Pixel-Wise Recognition for Holistic Surgical Scene Understanding

作者: Nicolás Ayobi, Santiago Rodríguez, Alejandra Pérez

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2401.11174v2

Project: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_42|https://ieeexplore.ieee.org/document/10230819|

GitHub: https://github.com/BCV-Uniandes/GraSP|

摘要: This paper presents the Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies (GraSP) dataset, a curated benchmark that models surgical scene understanding as a hierarchy of complementary tasks with varying levels of granularity. Our approach enables a multi-level comprehension of surgical activities, encompassing long-term tasks such as surgical phases and steps recognition and short-term tasks including surgical instrument segmentation and atomic visual actions detection. To exploit our proposed benchmark, we introduce the Transformers for Actions, Phases, Steps, and Instrument Segmentation (TAPIS) model, a general architecture that combines a global video feature extractor with localized region proposals from an instrument segmentation model to tackle the multi-granularity of our benchmark. Through extensive experimentation, we demonstrate the impact of including segmentation annotations in short-term recognition tasks, highlight the varying granularity requirements of each task, and establish TAPIS’s superiority over previously proposed baselines and conventional CNN-based models. Additionally, we validate the robustness of our method across multiple public benchmarks, confirming the reliability and applicability of our dataset. This work represents a significant step forward in Endoscopic Vision, offering a novel and comprehensive framework for future research towards a holistic understanding of surgical procedures.


标题: ReAlnet: Achieving More Human Brain-Like Vision via Human Neural Representational Alignment

作者: Zitong Lu, Yile Wang, Julie D. Golomb

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.17231v1

中文摘要: 尽管人工智能取得了显著的进步,但目前的物体识别模型在模拟人脑视觉信息处理机制方面仍然落后。最近的研究强调了使用神经数据模拟大脑处理的潜力;然而,这些研究通常依赖于来自非人类受试者的侵入性神经记录,这在我们对人类视觉感知的理解和更类人脑的视觉模型的开发之间留下了关键的空白。为了解决这一差距,我们首次提出了“Re(presentational)Al(ignment)net”,这是一种基于非侵入性EEG记录与人脑活动对齐的视觉模型,表现出与人脑表征显著更高的相似性。我们创新的图像到大脑多层编码对齐框架不仅优化了模型的多个层,标志着神经对齐方面的实质性飞跃,而且使模型能够跨对象类别和不同的神经数据模态有效地学习和模仿人脑的视觉表征模式。此外,我们发现与人脑表征的对齐提高了模型的对抗鲁棒性。我们的发现表明,ReAlnet在该领域开创了一个新的先例,弥合了人工视觉和人类视觉之间的差距,并为更类脑的人工智能系统铺平了道路。

摘要: Despite the remarkable strides made in artificial intelligence, current object recognition models still lag behind in emulating the mechanism of visual information processing in human brains. Recent studies have highlighted the potential of using neural data to mimic brain processing; however, these often rely on invasive neural recordings from non-human subjects, leaving a critical gap in our understanding of human visual perception and the development of more human brain-like vision models. Addressing this gap, we present, for the first time, “Re(presentational)Al(ignment)net”, a vision model aligned with human brain activity based on non-invasive EEG recordings, demonstrating a significantly higher similarity to human brain representations. Our innovative image-to-brain multi-layer encoding alignment framework not only optimizes multiple layers of the model, marking a substantial leap in neural alignment, but also enables the model to efficiently learn and mimic human brain’s visual representational patterns across object categories and different neural data modalities. Furthermore, we discover that alignment with human brain representations improves the model’s adversarial robustness. Our findings suggest that ReAlnet sets a new precedent in the field, bridging the gap between artificial and human vision, and paving the way for more brain-like artificial intelligence systems.
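
As a generic illustration of encoding-based alignment (not ReAlnet's exact framework), one can add a head that predicts paired brain responses from an intermediate layer and include its regression error in the training loss; the EEG tensor below is synthetic:

```python
import torch
import torch.nn as nn

class AlignedClassifier(nn.Module):
    def __init__(self, dim=128, num_classes=10, eeg_channels=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(3 * 32 * 32, dim), nn.ReLU())
        self.classifier = nn.Linear(dim, num_classes)
        self.encoder_head = nn.Linear(dim, eeg_channels)   # predicts brain responses

    def forward(self, images):
        feats = self.backbone(images.flatten(1))
        return self.classifier(feats), self.encoder_head(feats)

if __name__ == "__main__":
    model = AlignedClassifier()
    images = torch.randn(8, 3, 32, 32)
    labels = torch.randint(0, 10, (8,))
    eeg = torch.randn(8, 64)                               # paired (synthetic) EEG responses
    logits, eeg_pred = model(images)
    # task loss plus a neural-alignment regression term
    loss = nn.functional.cross_entropy(logits, labels) \
         + 0.5 * nn.functional.mse_loss(eeg_pred, eeg)
    loss.backward()
    print(float(loss))
```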


标题: MouSi: Poly-Visual-Expert Vision-Language Models

作者: Xiaoran Fan, Tao Ji, Changhao Jiang

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.17221v1

中文摘要: 当前的大型视觉语言模型(VLM)经常遇到挑战,例如单个视觉组件的能力不足和视觉标记过长。这些问题会限制模型在准确解释复杂的视觉信息和过长的上下文信息方面的有效性。应对这些挑战对于提高VLM的性能和适用性至关重要。本文提出使用集成专家技术来协同单个视觉编码器的能力,包括那些擅长图像——文本匹配、OCR、图像分割等的视觉编码器。该技术引入了融合网络来统一来自不同视觉专家的输出处理,同时弥合了图像编码器和预先训练的LLMs之间的差距。此外,我们探索了不同的位置编码方案,以缓解冗长的图像特征序列带来的位置编码浪费,有效地解决了位置溢出和长度限制的问题。例如,在我们的实现中,这种技术显著降低了SAM等模型中的位置占用率,从大量的4096降低到更高效、更易于管理的64,甚至降低到1。实验结果表明,具有多个专家的VLM始终表现出优于孤立的视觉编码器的性能,并且随着更多专家的集成,性能显著提升。我们已经开源了本报告中使用的培训代码。所有这些资源都可以在我们的项目网站上找到。

摘要: Current large vision-language models (VLMs) often encounter challenges such as insufficient capabilities of a single visual component and excessively long visual tokens. These issues can limit the model’s effectiveness in accurately interpreting complex visual information and over-lengthy contextual information. Addressing these challenges is crucial for enhancing the performance and applicability of VLMs. This paper proposes the use of ensemble experts technique to synergize the capabilities of individual visual encoders, including those skilled in image-text matching, OCR, image segmentation, etc. This technique introduces a fusion network to unify the processing of outputs from different visual experts, while bridging the gap between image encoders and pre-trained LLMs. In addition, we explore different positional encoding schemes to alleviate the waste of positional encoding caused by lengthy image feature sequences, effectively addressing the issue of position overflow and length limitations. For instance, in our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1. Experimental results demonstrate that VLMs with multiple experts exhibit consistently superior performance over isolated visual encoders and mark a significant performance boost as more experts are integrated. We have open-sourced the training code used in this report. All of these resources can be found on our project website.
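
A toy fusion-network sketch for pooling several visual experts' features into a fixed, small number of LLM-facing tokens, loosely following the abstract; the expert features here are random stand-ins rather than CLIP/SAM/OCR outputs:

```python
import torch
import torch.nn as nn

class PolyExpertFusion(nn.Module):
    def __init__(self, expert_dims, llm_dim=512, num_tokens=64):
        super().__init__()
        # one projection per expert, then learnable query tokens pool everything
        self.projs = nn.ModuleList([nn.Linear(d, llm_dim) for d in expert_dims])
        self.queries = nn.Parameter(torch.randn(num_tokens, llm_dim) * 0.02)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, expert_feats):
        # expert_feats: list of (batch, seq_i, dim_i) tensors, one per expert
        projected = torch.cat([p(f) for p, f in zip(self.projs, expert_feats)], dim=1)
        q = self.queries.unsqueeze(0).expand(projected.size(0), -1, -1)
        fused, _ = self.attn(q, projected, projected)   # (batch, num_tokens, llm_dim)
        return fused

if __name__ == "__main__":
    feats = [torch.randn(2, 196, 768), torch.randn(2, 256, 1024)]   # two experts
    tokens = PolyExpertFusion([768, 1024])(feats)
    print(tokens.shape)   # torch.Size([2, 64, 512]): 64 positions instead of thousands
```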


== diffusion model ==

标题: Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach

作者: Eloi Moliner, Filip Elvander, Vesa Välimäki

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2306.01433v2

Project: http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/|

中文摘要: 音频带宽扩展涉及从带限观测中真实地重建高频频谱。在低通退化未知的情况下,例如在修复历史音频录音时,这成为一个盲问题。本文介绍了一种称为BABE(盲音频带宽扩展)的新方法,该方法利用预训练的无条件扩散模型的生成先验,在零样本设置下解决盲问题。在推断过程中,BABE利用扩散后验采样的广义版本,其中退化算子未知,但被参数化并迭代推断。使用客观和主观指标对所提出方法的性能进行了评估,结果表明,在合成数据上测试时,BABE超过了最先进的盲带宽扩展基线,并取得了与知情(informed)方法相比具有竞争力的性能。此外,BABE在增强真实历史录音时表现出强大的泛化能力,能有效地重建缺失的高频内容,同时保持与原始录音的一致性。主观偏好测试证实,BABE显著提高了历史音乐录音的音频质量。使用所提出方法修复的历史录音示例可在配套网页上获得:(http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/)

摘要: Audio bandwidth extension involves the realistic reconstruction of high-frequency spectra from bandlimited observations. In cases where the lowpass degradation is unknown, such as in restoring historical audio recordings, this becomes a blind problem. This paper introduces a novel method called BABE (Blind Audio Bandwidth Extension) that addresses the blind problem in a zero-shot setting, leveraging the generative priors of a pre-trained unconditional diffusion model. During the inference process, BABE utilizes a generalized version of diffusion posterior sampling, where the degradation operator is unknown but parametrized and inferred iteratively. The performance of the proposed method is evaluated using objective and subjective metrics, and the results show that BABE surpasses state-of-the-art blind bandwidth extension baselines and achieves competitive performance compared to informed methods when tested with synthetic data. Moreover, BABE exhibits robust generalization capabilities when enhancing real historical recordings, effectively reconstructing the missing high-frequency content while maintaining coherence with the original recording. Subjective preference tests confirm that BABE significantly improves the audio quality of historical music recordings. Examples of historical recordings restored with the proposed method are available on the companion webpage: (http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/)
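
A schematic and heavily simplified sketch of the control flow described above: during sampling, the unknown lowpass degradation is parameterized (here by a single soft cutoff) and refined alongside the signal estimate. The denoiser is a placeholder, so this shows only the structure, not the BABE algorithm:

```python
import torch

def denoiser(x_noisy, t):
    """Placeholder for a pretrained unconditional diffusion denoiser."""
    return x_noisy * (1.0 - 0.1 * t)

def lowpass(x, cutoff):
    # toy differentiable "lowpass": soft spectral attenuation controlled by `cutoff`
    freqs = torch.fft.rfft(x)
    k = torch.arange(freqs.shape[-1], dtype=torch.float32)
    mask = torch.sigmoid(cutoff - k)            # soft cutoff so gradients flow
    return torch.fft.irfft(freqs * mask, n=x.shape[-1])

def blind_sampling(y_observed, steps=10, lr=0.5):
    x = torch.randn_like(y_observed)
    cutoff = torch.tensor(20.0, requires_grad=True)
    for t in range(steps, 0, -1):
        x0_hat = denoiser(x, t / steps)                           # current clean estimate
        loss = ((lowpass(x0_hat, cutoff) - y_observed) ** 2).mean()
        (grad_cutoff,) = torch.autograd.grad(loss, cutoff)
        with torch.no_grad():
            cutoff -= lr * grad_cutoff                            # infer the degradation parameter
        x = x0_hat + 0.1 * torch.randn_like(x)                    # crude re-noising for the next step
    return x, float(cutoff)

if __name__ == "__main__":
    y = torch.randn(256)
    x_rec, cutoff = blind_sampling(y)
    print(x_rec.shape, cutoff)
```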


标题: BlockFusion: Expandable 3D Scene Generation using Latent Tri-plane Extrapolation

作者: Zhennan Wu, Yang Li, Han Yan

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.17053v1

Project: https://www.youtube.com/watch?v=PxIBtd6G0mA|

中文摘要: 我们介绍了BlockFusion,这是一种基于扩散的模型,它以单元块的形式生成3D场景,并无缝地合并新块以扩展场景。BlockFusion使用从完整3D场景网格中随机裁剪的3D块数据集进行训练。通过逐块拟合,所有训练块被转换成混合神经场:一个包含几何特征的三平面(tri-plane),随后是用于解码符号距离值的多层感知器(MLP)。采用变分自编码器将三平面压缩到潜在三平面空间中,并在该空间上进行去噪扩散过程。应用于潜在表示的扩散允许高质量和多样化的3D场景生成。要在生成过程中扩展场景,只需附加与当前场景重叠的空块,并外推现有的潜在三平面以填充新块。外推是通过在去噪迭代期间用来自重叠三平面的特征样本调节生成过程来完成的。潜在三平面外推产生语义和几何上有意义的过渡,与现有场景和谐融合。2D布局调节机制用于控制场景元素的放置和排列。实验结果表明,BlockFusion能够在室内和室外场景中生成多样、几何一致且无边界的大型3D场景,并具有前所未有的高质量形状。

摘要: We present BlockFusion, a diffusion-based model that generates 3D scenes as unit blocks and seamlessly incorporates new blocks to extend the scene. BlockFusion is trained using datasets of 3D blocks that are randomly cropped from complete 3D scene meshes. Through per-block fitting, all training blocks are converted into the hybrid neural fields: with a tri-plane containing the geometry features, followed by a Multi-layer Perceptron (MLP) for decoding the signed distance values. A variational auto-encoder is employed to compress the tri-planes into the latent tri-plane space, on which the denoising diffusion process is performed. Diffusion applied to the latent representations allows for high-quality and diverse 3D scene generation. To expand a scene during generation, one needs only to append empty blocks to overlap with the current scene and extrapolate existing latent tri-planes to populate new blocks. The extrapolation is done by conditioning the generation process with the feature samples from the overlapping tri-planes during the denoising iterations. Latent tri-plane extrapolation produces semantically and geometrically meaningful transitions that harmoniously blend with the existing scene. A 2D layout conditioning mechanism is used to control the placement and arrangement of scene elements. Experimental results indicate that BlockFusion is capable of generating diverse, geometrically consistent and unbounded large 3D scenes with unprecedented high-quality shapes in both indoor and outdoor scenarios.


标题: Repositioning the Subject within Image

作者: Yikai Wang, Chenjie Cao, Qiaole Dong

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.16861v1

Project: https://yikai-wang.github.io/seele/|

GitHub: https://github.com/Yikai-Wang/ReS|

中文摘要: 当前的图像操纵主要集中在静态操作上,例如替换图像中的特定区域或改变其整体风格。在本文中,我们介绍了一个创新的动态操纵任务:主体重新定位。该任务旨在将用户指定的主体重新定位到期望的位置,同时保持图像的保真度。我们的研究表明,主体重新定位的基本子任务,包括填充主体移走后留下的空白、重建主体被遮挡的部分,以及将主体与周围区域融合以保持一致,可以有效地重新表述为一个统一的、提示引导的修复(inpainting)任务。因此,我们可以使用单个扩散生成模型来处理这些子任务,并使用通过我们提出的任务反演(task inversion)技术学习到的各种任务提示。此外,我们集成了预处理和后处理技术,以进一步提高主体重新定位的质量。这些元素共同构成了我们的分割-生成-混合(SEgment-gEnerate-and-bLEnd,SEELE)框架。为了评估SEELE在主体重新定位方面的有效性,我们构建了一个名为ReS的真实世界主体重新定位数据集。我们在ReS上的结果证明了重新定位图像生成的质量。

摘要: Current image manipulation primarily centers on static manipulation, such as replacing specific regions within an image or altering its overall style. In this paper, we introduce an innovative dynamic manipulation task, subject repositioning. This task involves relocating a user-specified subject to a desired position while preserving the image’s fidelity. Our research reveals that the fundamental sub-tasks of subject repositioning, which include filling the void left by the repositioned subject, reconstructing obscured portions of the subject and blending the subject to be consistent with surrounding areas, can be effectively reformulated as a unified, prompt-guided inpainting task. Consequently, we can employ a single diffusion generative model to address these sub-tasks using various task prompts learned through our proposed task inversion technique. Additionally, we integrate pre-processing and post-processing techniques to further enhance the quality of subject repositioning. These elements together form our SEgment-gEnerate-and-bLEnd (SEELE) framework. To assess SEELE’s effectiveness in subject repositioning, we assemble a real-world subject repositioning dataset called ReS. Our results on ReS demonstrate the quality of repositioned image generation.


标题: Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance

作者: Qingcheng Zhao, Pengyu Long, Qixuan Zhang

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.15687v2

Project: https://sites.google.com/view/media2face|

中文摘要: 从语音合成3D面部动画已经获得了相当大的关注。由于缺乏高质量的4D面部数据和注释良好且丰富的多模态标签,以前的方法往往真实感有限,且缺乏灵活的条件控制。我们通过三部曲来应对这一挑战。我们首先介绍了广义神经参数化面部资产(GNPFA),这是一种高效的变分自编码器,将面部几何和图像映射到高度泛化的表情潜在空间,从而解耦表情和身份。然后,我们利用GNPFA从大量视频中提取高质量的表情和准确的头部姿势。由此构建了M2F-D数据集,这是一个大型、多样化且达到扫描级质量的协同语音3D面部动画数据集,带有良好标注的情感和风格标签。最后,我们提出了Media2Face,这是一个在GNPFA潜在空间中用于协同语音面部动画生成的扩散模型,接受来自音频、文本和图像的丰富多模态指导。大量实验表明,该模型不仅在面部动画合成中实现了高保真度,而且拓宽了3D面部动画的表现力范围和风格适应性。

摘要: The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and well-annotated abundant multi-modality labels, previous methods often suffer from limited realism and a lack of flexible conditioning. We address this challenge through a trilogy. We first introduce Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space, decoupling expressions and identities. Then, we utilize GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. This presents the M2F-D dataset, a large, diverse, and scan-level co-speech 3D facial animation dataset with well-annotated emotional and style labels. Finally, we propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation, accepting rich multi-modality guidances from audio, text, and image. Extensive experiments demonstrate that our model not only achieves high fidelity in facial animation synthesis but also broadens the scope of expressiveness and style adaptability in 3D facial animation.


标题: Semantic Guidance Tuning for Text-To-Image Diffusion Models

作者: Hyun Kang, Dohae Lee, Myungjin Shin

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2312.15964v2

Project: https://korguy.github.io/|

中文摘要: 文本到图像(T2I)扩散模型的最新进展在生成具有零样本泛化能力的高质量图像方面取得了令人印象深刻的成功。然而,当前的模型很难严格遵循提示语义,经常歪曲或忽略特定的属性。为了解决这个问题,我们提出了一种简单的、无需训练的方法,在推理过程中调节扩散模型的引导方向。我们首先将提示语义分解为一组概念,并监控引导轨迹与每个概念的关系。我们的关键观察是,模型在遵循提示语义上的偏差,与引导相对于这些概念中一个或多个的偏离高度相关。基于这一观察,我们设计了一种技术,将引导方向引向模型所偏离的任何概念。大量实验证实,我们的方法改善了扩散模型根据提示生成的图像的语义对齐。项目页面可在:https://korguy.github.io/

摘要: Recent advancements in Text-to-Image (T2I) diffusion models have demonstrated impressive success in generating high-quality images with zero-shot generalization capabilities. Yet, current models struggle to closely adhere to prompt semantics, often misrepresenting or overlooking specific attributes. To address this, we propose a simple, training-free approach that modulates the guidance direction of diffusion models during inference. We first decompose the prompt semantics into a set of concepts, and monitor the guidance trajectory in relation to each concept. Our key observation is that deviations in model’s adherence to prompt semantics are highly correlated with divergence of the guidance from one or more of these concepts. Based on this observation, we devise a technique to steer the guidance direction towards any concept from which the model diverges. Extensive experimentation validates that our method improves the semantic alignment of images generated by diffusion models in response to prompts. Project page is available at: https://korguy.github.io/
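
A toy illustration of steering classifier-free guidance toward per-concept directions during sampling, in the spirit of the training-free method described above; `eps_model` is a stand-in for a text-conditioned noise predictor, and the steering rule is simplified:

```python
import torch

def eps_model(x, prompt_embedding):
    """Placeholder noise predictor conditioned on a prompt embedding."""
    return x * 0.1 + prompt_embedding.mean()

def cosine(a, b):
    a_flat, b_flat = a.flatten(), b.flatten()
    return (a_flat @ b_flat) / (a_flat.norm() * b_flat.norm() + 1e-8)

def guided_noise(x, full_prompt, concepts, uncond, scale=7.5, step_size=0.2, thresh=0.5):
    eps_uncond = eps_model(x, uncond)
    guidance = eps_model(x, full_prompt) - eps_uncond        # classifier-free guidance direction
    for c in concepts:
        concept_dir = eps_model(x, c) - eps_uncond
        # if the overall guidance has drifted away from this concept, nudge it back
        if cosine(guidance, concept_dir) < thresh:
            guidance = guidance + step_size * (concept_dir - guidance)
    return eps_uncond + scale * guidance

if __name__ == "__main__":
    x = torch.randn(1, 4, 8, 8)                              # latent at some denoising step
    full = torch.randn(77, 768)                              # embedding of the whole prompt
    concepts = [torch.randn(77, 768) for _ in range(2)]      # per-concept embeddings
    uncond = torch.zeros(77, 768)
    print(guided_noise(x, full, concepts, uncond).shape)
```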


标题: You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation

作者: Mehdi Noroozi, Isma Hadji, Brais Martinez

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.17258v1

中文摘要: 在本文中,我们介绍了YONOS-SR,这是一种新颖的基于稳定扩散(Stable Diffusion)的图像超分辨率方法,仅使用单个DDIM步骤即可产生最先进的结果。我们提出了一种新的尺度蒸馏(scale distillation)方法来训练我们的SR模型。我们不是直接在感兴趣的放大倍数上训练SR模型,而是先在较小的放大倍数上训练教师模型,从而使SR问题对教师来说更简单。然后,我们为更高的放大倍数训练学生模型,并在训练期间使用教师的预测作为目标。这个过程迭代重复,直到达到最终模型的目标放大倍数。我们的尺度蒸馏背后的基本原理是,教师通过以下方式帮助学生扩散模型的训练:i)提供适应当前噪声水平的目标,而不是对所有噪声水平都使用来自真实数据(ground truth)的同一目标;ii)由于教师要解决的任务更简单,因而能提供准确的目标。我们的经验表明,蒸馏得到的模型明显优于直接针对高放大倍数训练的模型,特别是在推理步骤很少时。拥有一个只需要一步的强扩散模型,使我们可以冻结U-Net,并在其上微调解码器。我们表明,空间蒸馏的U-Net与微调解码器的组合,仅用一步就优于需要200步的最先进方法。

摘要: In this paper, we introduce YONOS-SR, a novel stable diffusion-based approach for image super-resolution that yields state-of-the-art results using only a single DDIM step. We propose a novel scale distillation approach to train our SR model. Instead of directly training our SR model on the scale factor of interest, we start by training a teacher model on a smaller magnification scale, thereby making the SR problem simpler for the teacher. We then train a student model for a higher magnification scale, using the predictions of the teacher as a target during the training. This process is repeated iteratively until we reach the target scale factor of the final model. The rationale behind our scale distillation is that the teacher aids the student diffusion model training by i) providing a target adapted to the current noise level rather than using the same target coming from ground truth data for all noise levels and ii) providing an accurate target as the teacher has a simpler task to solve. We empirically show that the distilled model significantly outperforms the model trained for high scales directly, specifically with few steps during inference. Having a strong diffusion model that requires only one step allows us to freeze the U-Net and fine-tune the decoder on top of it. We show that the combination of spatially distilled U-Net and fine-tuned decoder outperforms state-of-the-art methods requiring 200 steps with only one single step.
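
A schematic scale-distillation training step, following the abstract at a high level: a teacher working at a smaller upscaling factor provides the target for a student trained at a larger factor. The models below are tiny stand-ins, not Stable Diffusion:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_sr_model():
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 3, 3, padding=1))

def distill_step(teacher, student, lr_image, scale_small=2, scale_large=4):
    # teacher works at the smaller scale; its (upsampled) output is the student's target
    with torch.no_grad():
        mid = F.interpolate(lr_image, scale_factor=scale_small, mode="bilinear",
                            align_corners=False)
        target = F.interpolate(teacher(mid), scale_factor=scale_large // scale_small,
                               mode="bilinear", align_corners=False)
    big = F.interpolate(lr_image, scale_factor=scale_large, mode="bilinear",
                        align_corners=False)
    pred = student(big)
    return F.mse_loss(pred, target)

if __name__ == "__main__":
    teacher, student = make_sr_model(), make_sr_model()
    lr_image = torch.rand(1, 3, 16, 16)
    loss = distill_step(teacher, student, lr_image)
    loss.backward()   # gradients flow into the student only
    print(float(loss))
```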


== VLN ==

标题: Long-Tailed 3D Detection via 2D Late Fusion

作者: Yechi Ma, Neehar Peri, Shuoquan Wei

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2312.10986v2

中文摘要: 为了安全导航,自动驾驶汽车(AVs)必须准确地检测普通和稀有类别的物体,这引发了长尾3D物体检测(LT3D)的问题。当代基于激光雷达的3D探测器在罕见的类别上表现不佳(例如,CenterPoint在婴儿车上仅实现5.1 AP),因为很难仅从稀疏的激光雷达点识别物体。RGB图像提供了视觉证据来帮助解决这种模糊性,激发了RGB-激光雷达融合的研究。在本文中,我们深入研究了一个简单的后期融合框架,该框架集成了独立训练的RGB和激光雷达探测器。与最近需要配对多模态训练数据的端到端方法不同,我们的后期融合方法可以轻松利用大规模单模态数据集,显著改善稀有类检测。特别是,我们从基本原理出发,研究了这一后期融合框架中的三个关键组成部分,包括是否训练2D或3D RGB检测器,是否在3D或投影2D图像平面中匹配RGB和激光雷达检测,以及如何融合匹配的检测。大量实验表明,2D RGB检测器比3D RGB检测器实现了更好的识别精度,2D图像平面上的匹配减轻了深度估计误差,并且将分数与校准概率融合导致了最先进的LT3D性能。我们的后期融合方法在已建立的nuScenes LT3D基准上实现了51.4 mAP,比以前的工作提高了5.9 mAP。

摘要: Autonomous vehicles (AVs) must accurately detect objects from both common and rare classes for safe navigation, motivating the problem of Long-Tailed 3D Object Detection (LT3D). Contemporary LiDAR-based 3D detectors perform poorly on rare classes (e.g., CenterPoint only achieves 5.1 AP on stroller) as it is difficult to recognize objects from sparse LiDAR points alone. RGB images provide visual evidence to help resolve such ambiguities, motivating the study of RGB-LiDAR fusion. In this paper, we delve into a simple late-fusion framework that ensembles independently trained RGB and LiDAR detectors. Unlike recent end-to-end methods which require paired multi-modal training data, our late-fusion approach can easily leverage large-scale uni-modal datasets, significantly improving rare class detection. In particular, we examine three critical components in this late-fusion framework from first principles, including whether to train 2D or 3D RGB detectors, whether to match RGB and LiDAR detections in 3D or the projected 2D image plane, and how to fuse matched detections. Extensive experiments reveal that 2D RGB detectors achieve better recognition accuracy than 3D RGB detectors, matching on the 2D image plane mitigates depth estimation errors, and fusing scores probabilistically with calibration leads to state-of-the-art LT3D performance. Our late-fusion approach achieves 51.4 mAP on the established nuScenes LT3D benchmark, improving over prior work by 5.9 mAP.
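
A simplified late-fusion sketch in the spirit of the pipeline above: LiDAR detections (assumed already projected to the image plane) are matched to 2D RGB detections by IoU, and their calibrated scores are fused probabilistically; matching and calibration are reduced to their bare essentials:

```python
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def fuse(rgb_dets, lidar_dets, iou_thresh=0.5):
    """rgb_dets / lidar_dets: lists of dicts with 'box' (2D, image plane) and 'score'."""
    fused = []
    for ld in lidar_dets:
        best = max(rgb_dets, key=lambda rd: iou(rd["box"], ld["box"]), default=None)
        if best is not None and iou(best["box"], ld["box"]) >= iou_thresh:
            # treat calibrated scores as independent probabilities (noisy-OR fusion)
            score = 1.0 - (1.0 - ld["score"]) * (1.0 - best["score"])
        else:
            score = ld["score"]      # unmatched LiDAR detections keep their own score
        fused.append({"box": ld["box"], "score": score})
    return fused

if __name__ == "__main__":
    rgb = [{"box": (10, 10, 50, 60), "score": 0.8}]
    lidar = [{"box": (12, 8, 48, 58), "score": 0.4}]
    print(fuse(rgb, lidar))
```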


