cs.CL

[1] BioACE: An Automated Framework for Biomedical Answer and Citation Evaluations

Deepak Gupta,Davis Bartels,Dina Demner-Fushman

Main category: cs.CL

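TL;DR: This paper proposes BioACE, a framework for automatically evaluating the quality of LLM-generated biomedical answers and their citations along several dimensions, including completeness, correctness, precision, and recall, and validates its correlation with human evaluation through experiments.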

Details Motivation: Evaluating LLM-generated text in the biomedical domain is challenging because it requires expert assessment to verify consistency with the scientific literature and to handle complex terminology. Method: The authors propose BioACE, an automated evaluation framework that assesses answer quality in terms of completeness, correctness, precision, and recall, and evaluates citation quality using natural language inference (NLI), pre-trained language models, and LLMs. Result: Experiments show that each BioACE evaluation module correlates highly with human evaluation, and identify the best-performing combination of methods for biomedical answer and citation evaluation. Conclusion: BioACE provides a reliable, reproducible automated evaluation solution for biomedical question answering, RAG, and related tasks; an open-source implementation is available on GitHub. Abstract: With the increasing use of large language models (LLMs) for generating answers to biomedical questions, it is crucial to evaluate the quality of the generated answers and the references provided to support the facts in the generated answers. Evaluation of text generated by LLMs remains a challenge for question answering, retrieval-augmented generation (RAG), summarization, and many other natural language processing tasks in the biomedical domain, due to the requirements of expert assessment to verify consistency with the scientific literature and complex medical terminology. In this work, we propose BioACE, an automated framework for evaluating biomedical answers and citations against the facts stated in the answers. The proposed BioACE framework considers multiple aspects, including completeness, correctness, precision, and recall, in relation to the ground-truth nuggets for answer evaluation. We developed automated approaches to evaluate each of the aforementioned aspects and performed extensive experiments to assess and analyze their correlation with human evaluations. In addition, we considered multiple existing approaches, such as natural language inference (NLI), pre-trained language models, and LLMs, to evaluate the quality of evidence provided to support the generated answers in the form of citations into biomedical literature. With the detailed experiments and analysis, we provide the best approaches for biomedical answer and citation evaluation as part of the BioACE (https://github.com/deepaknlp/BioACE) evaluation package.
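The abstract evaluates answers against ground-truth nuggets in terms of precision and recall. A minimal sketch of how such scores could be computed is shown below; this is not the BioACE implementation. The function name, the `matches` structure, and the assumption that a separate judge (for example an NLI model) has already decided which nuggets each answer sentence expresses are all hypothetical.

```python
from typing import Dict, Set, Tuple


def nugget_precision_recall(
    matches: Dict[str, Set[str]], gold_nuggets: Set[str]
) -> Tuple[float, float]:
    """Illustrative nugget-based scores (not the BioACE definition).

    matches maps each answer sentence to the ids of the gold nuggets
    it was judged to express (hypothetical upstream matching step).
    """
    if not matches or not gold_nuggets:
        return 0.0, 0.0
    # Precision: share of answer sentences backed by at least one gold nugget.
    supported = sum(1 for ids in matches.values() if ids & gold_nuggets)
    # Recall: share of gold nuggets expressed somewhere in the answer.
    covered = set().union(*matches.values()) & gold_nuggets
    return supported / len(matches), len(covered) / len(gold_nuggets)
```

For an answer with three sentences, of which two match gold nuggets `n1` and `n2, n3` respectively, scored against four gold nuggets, this sketch yields a precision of 2/3 and a recall of 3/4.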

[2] CoWork-X: Experience-Optimized Co-Evolution for Multi-Agent Collaboration System

Zexin Lin,Jiachen Yu,Haoyang Zhang,Yuzhao Li,Zhonghang Li,Yujiu Yang,Junjie Wang,Xiaoqiang Ji

Main category: cs.CL

TL;DR: CoWork-X is a framework for real-time, multi-episode collaborative agents that separates fast execution (via a structured skill library) from slow, budget-aware skill optimization, improving performance while reducing latency and token cost.

Details Motivation: Highly cooperative tasks require both sub-second real-time coordination and sustained online adaptation under strict token budgets; existing methods fail to satisfy both constraints simultaneously. Method: CoWork-X introduces a two-component architecture: (1) a Skill-Agent using HTN-based retrieval from a structured, interpretable skill library for low-latency execution; (2) a post-episode Co-Optimizer performing patch-style skill consolidation with explicit token budget constraints and drift regularization. Result: On Overcooked-AI-like benchmarks, CoWork-X achieves stable, cumulative performance gains while steadily reducing online latency and token usage. Conclusion: CoWork-X successfully bridges the gap between real-time responsiveness and long-term adaptation in language-conditioned collaborative agents via active co-evolution and memory-inspired fast-slow separation. Abstract: Large language models are enabling language-conditioned agents in interactive environments, but highly cooperative tasks often impose two simultaneous constraints: sub-second real-time coordination and sustained multi-episode adaptation under a strict online token budget. Existing approaches either rely on frequent in-episode reasoning that induces latency and timing jitter, or deliver post-episode improvements through unstructured text that is difficult to compile into reliable low-cost execution. We propose CoWork-X, an active co-evolution framework that casts peer collaboration as a closed-loop optimization problem across episodes, inspired by fast-slow memory separation. CoWork-X instantiates a Skill-Agent that executes via HTN (hierarchical task network)-based skill retrieval from a structured, interpretable, and compositional skill library, and a post-episode Co-Optimizer that performs patch-style skill consolidation with explicit budget constraints and drift regularization. Experiments in challenging Overcooked-AI-like real-time collaboration benchmarks demonstrate that CoWork-X achieves stable, cumulative performance gains while steadily reducing online latency and token usage.

[3] Capacity Constraints and the Multilingual Penalty for Lexical Disambiguation

Sean Trott,Pamela D. Rivière

Main category: cs.CL

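TL;DR: This paper studies the "multilingual penalty", the tendency of multilingual language models to underperform monolingual models on lexical disambiguation. It quantifies the effect through controlled experiments and examines three kinds of capacity constraints: representational, attentional, and vocabulary-related.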

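Details Motivation: Multilingual language models sometimes underperform their monolingual counterparts, possibly because of capacity limitations; this paper aims to quantify this "multilingual penalty" and investigate its causes. Method: The authors evaluate lexical disambiguation on controlled datasets of human relatedness judgments in English and Spanish, compare monolingual and multilingual language models from the same families, and analyze potential capacity constraints: representational isotropy, attention, and vocabulary segmentation. Result: Multilingual models perform consistently worse on lexical disambiguation. They show three capacity limitations (reduced representational isotropy, reduced attention to disambiguating cues, and increased multi-token segmentation), and these factors statistically account for the performance difference previously attributed to multilingual status. Conclusion: Multilingual language models do suffer from multiple capacity constraints, and these constraints are closely associated with reduced disambiguation performance. Abstract: Multilingual language models (LMs) sometimes under-perform their monolingual counterparts, possibly due to capacity limitations. We quantify this "multilingual penalty" for lexical disambiguation -- a task requiring precise semantic representations and contextualization mechanisms -- using controlled datasets of human relatedness judgments for ambiguous words in both English and Spanish. Comparing monolingual and multilingual LMs from the same families, we find consistently reduced performance in multilingual LMs. We then explore three potential capacity constraints: representational (reduced embedding isotropy), attentional (reduced attention to disambiguating cues), and vocabulary-related (increased multi-token segmentation). Multilingual LMs show some evidence of all three limitations; moreover, these factors statistically account for the variance formerly attributed to a model's multilingual status. These findings suggest both that multilingual LMs do suffer from multiple capacity constraints, and that these constraints correlate with reduced disambiguation performance.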

[4] Locas: Your Models are Principled Initializers of Locally-Supported Parametric Memories

Sidi Lu,Zhenwen Liang,Dongyang Ma,Yan Wang,Haitao Mi,Dong Yu

Main category: cs.CL

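TL;DR: This paper proposes Locas, a locally-supported parametric memory mechanism that can be flexibly offloaded from or merged into model parameters and supports efficient continual learning. Locas has two variants: one based on a two-layer MLP and one based on the GLU-FFN structure, the latter being easy to integrate into existing LLMs. Principled initialization that reuses model parameters, activations, or gradients substantially improves convergence speed, generalization, and resistance to catastrophic forgetting. Locas is validated on PG-19 and LoCoMo, where as little as 0.02% additional parameters suffice to store context information, and an MMLU evaluation shows that existing knowledge is largely retained.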

Details Motivation: To combine test-time training with a new kind of parametric memory that can be flexibly offloaded or merged, addressing catastrophic forgetting and efficiency in continual learning. Method: The authors propose Locas, a locally-supported, FFN-style parametric memory module, design two variants (MLP and GLU-FFN), and adopt a principled initialization strategy for the low-rank memory based on model parameters, activations, or gradients. Result: On PG-19 and LoCoMo, Locas-GLU models long contexts effectively with as little as 0.02% additional parameters; MMLU evaluation shows that general capability is largely preserved after memorizing a whole book, with minimal catastrophic forgetting. Conclusion: Locas is an efficient, lightweight, and easily integrated parametric memory mechanism that permanentizes past context into parametric knowledge while minimizing interference with the model's original knowledge, offering a new paradigm for test-time continual learning. Abstract: In this paper, we aim to bridge test-time-training with a new type of parametric memory that can be flexibly offloaded from or merged into model parameters. We present Locas, a Locally-Supported parametric memory that shares the design of FFN blocks in modern transformers, allowing it to be flexibly permanentized into the model parameters while supporting efficient continual learning. We discuss two major variants of Locas: one with a conventional two-layer MLP design that has a clearer theoretical guarantee; the other one shares the same GLU-FFN structure with SOTA LLMs, and can be easily attached to existing models for both parameter-efficient and computation-efficient continual learning. Crucially, we show that proper initialization of such low-rank sideway-FFN-style memories -- performed in a principled way by reusing model parameters, activations and/or gradients -- is essential for fast convergence, improved generalization, and catastrophic forgetting prevention. We validate the proposed memory mechanism on the PG-19 whole-book language modeling and LoCoMo long-context dialogue question answering tasks. With only 0.02% additional parameters in the lowest case, Locas-GLU is capable of storing the information from past context while maintaining a much smaller context window. In addition, we also test the model's general capability loss after memorizing the whole book with Locas, through comparative MMLU evaluation. Results show the promising ability of Locas to permanentize past context into parametric knowledge with minimized catastrophic forgetting of the model's existing internal knowledge.

[5] Data Kernel Perspective Space Performance Guarantees for Synthetic Data from Transformer Models

Michael Browder,Kevin Duh,J. David Harris,Vince Lyzinski,Paul McNamee,Youngser Park,Carey E. Priebe,Peter Viechnicki

Main category: cs.CL

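TL;DR: This paper proposes the Data Kernel Perspective Space (DKPS) framework, which aims to provide provable statistical quality guarantees for synthetic data generated by large language models (LLMs), in response to the scarcity of labeled data.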

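Details Motivation: Scarcity of labeled data is the main bottleneck in building high-performing language technology and generative AI models; existing methods that use LLMs to generate synthetic data lack predictability and theoretical guarantees, and engineers often tune the temperature setting by trial and error without a principled basis. Method: The authors propose Data Kernel Perspective Space (DKPS), a mathematical framework that derives statistical guarantees for synthetic data quality and relates them to the performance of downstream tasks such as neural machine translation and LLMs trained with Contrastive Preference Optimization. Result: DKPS provides provable statistical guarantees for the quality of LLM-generated synthetic data and clarifies how that quality affects downstream task performance; experiments validate the framework on NMT and CPO training. Conclusion: DKPS offers the first mathematically grounded analytical perspective with statistical guarantees for synthetic data generation, and may help move data engineering from empirical parameter tuning toward theory-driven practice. Abstract: Scarcity of labeled training data remains the long pole in the tent for building performant language technology and generative AI models. Transformer models -- particularly LLMs -- are increasingly being used to mitigate the data scarcity problem via synthetic data generation. However, because the models are black boxes, the properties of the synthetic data are difficult to predict. In practice it is common for language technology engineers to 'fiddle' with the LLM temperature setting and hope that what comes out the other end improves the downstream model. Faced with this uncertainty, here we propose Data Kernel Perspective Space (DKPS) to provide the foundation for mathematical analysis yielding concrete statistical guarantees for the quality of the outputs of transformer models. We first show the mathematical derivation of DKPS and how it provides performance guarantees. Next we show how DKPS performance guarantees can elucidate performance of a downstream task, such as neural machine translation models or LLMs trained using Contrastive Preference Optimization (CPO). Limitations of the current work and future research are also discussed.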

[6] Multilingual Extraction and Recognition of Implicit Discourse Relations in Speech and Text

Ahmed Ruby,Christian Hardmeier,Sara Stymne

Main category: cs.CL

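TL;DR: This paper proposes a multilingual, multimodal approach to implicit discourse relation classification: it constructs a text-audio dataset covering English, French, and Spanish and uses Qwen2-Audio to jointly model text and speech, improving classification performance for low-resource languages.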

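Details Motivation: Implicit discourse relation classification requires inferring meaning from context, and text alone cannot fully capture contextual cues that are spread across modalities and vary across languages. Method: The authors construct a multilingual, multimodal dataset (English, French, and Spanish) with a method designed for distantly related and unrelated language pairs, and perform cross-lingual implicit relation classification with a Qwen2-Audio-based model that fuses textual and acoustic information. Result: Text-based models outperform audio-only models, but combining text and audio can improve performance; cross-lingual transfer substantially improves results for low-resource languages. Conclusion: Multimodal fusion and cross-lingual transfer are key strategies for improving implicit discourse relation classification, especially for low-resource languages. Abstract: Implicit discourse relation classification is a challenging task, as it requires inferring meaning from context. While contextual cues can be distributed across modalities and vary across languages, they are not always captured by text alone. To address this, we introduce an automatic method for distantly related and unrelated language pairs to construct a multilingual and multimodal dataset for implicit discourse relations in English, French, and Spanish. For classification, we propose a multimodal approach that integrates textual and acoustic information through Qwen2-Audio, allowing joint modeling of text and audio for implicit discourse relation classification across languages. We find that while text-based models outperform audio-based models, integrating both modalities can enhance performance, and cross-lingual transfer can provide substantial improvements for low-resource languages.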

[7] GreekMMLU: A Native-Sourced Multitask Benchmark for Evaluating Language Models in Greek

Yang Zhang,Mersin Konomi,Christos Xypolopoulos,Konstantinos Divriotis,Konstantinos Skianis,Giannis Nikolentzos,Giorgos Stamou,Guokan Shang,Michalis Vazirgiannis

Main category: cs.CL

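TL;DR: This paper introduces GreekMMLU, a native-sourced Greek benchmark for multitask language understanding comprising 21,805 multiple-choice questions across 45 subject areas, all drawn from Greek academic, professional, and governmental exams; 16,857 questions are publicly released and 4,948 are reserved for a private leaderboard. Evaluations of more than 80 LLMs reveal substantial performance gaps between frontier and open-weight models and between Greek-adapted and general multilingual models, and the paper systematically analyzes the key factors that influence performance.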

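Details Motivation: Existing Greek evaluation benchmarks are mostly translated from English and cannot accurately reflect Greek linguistic and cultural characteristics; reliable evaluation tools based on authentic native content are lacking. Method: The authors build the native-sourced GreekMMLU benchmark by collecting or authoring questions from Greek exams, covering 45 subjects organized under a newly defined subject taxonomy and annotated with educational difficulty levels; they publicly release most samples and reserve the rest for a private leaderboard to guard against data contamination; they then systematically evaluate more than 80 open- and closed-source LLMs and analyze the effects of model scale, adaptation, and prompting. Result: The evaluation finds substantial performance gaps between frontier and open-weight models and between Greek-adapted and general multilingual models; the systematic analysis identifies model scale, Greek adaptation, and prompting as key factors influencing performance. Conclusion: GreekMMLU fills the gap for a high-quality, native-sourced Greek benchmark, provides a solid basis for developing and evaluating LLM capabilities in Greek, and offers a methodology that can be reused to build similar benchmarks for other low-resource languages. Abstract: Large Language Models (LLMs) are commonly trained on multilingual corpora that include Greek, yet reliable evaluation benchmarks for Greek, particularly those based on authentic, native-sourced content, remain limited. Existing datasets are often machine-translated from English, failing to capture Greek linguistic and cultural characteristics. We introduce GreekMMLU, a native-sourced benchmark for massive multitask language understanding in Greek, comprising 21,805 multiple-choice questions across 45 subject areas, organized under a newly defined subject taxonomy and annotated with educational difficulty levels spanning primary to professional examinations. All questions are sourced or authored in Greek from academic, professional, and governmental exams. We publicly release 16,857 samples and reserve 4,948 samples for a private leaderboard to enable robust and contamination-resistant evaluation. Evaluations of over 80 open- and closed-source LLMs reveal substantial performance gaps between frontier and open-weight models, as well as between Greek-adapted models and general multilingual ones. Finally, we provide a systematic analysis of factors influencing performance (including model scale, adaptation, and prompting) and derive insights for improving LLM capabilities in Greek.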

[8] Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems

Ziyuan Yang,Wenxuan Ding,Shangbin Feng,Yulia Tsvetkov

Main category: cs.CL

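TL;DR: This paper studies the safety risks posed by malicious models in multi-LLM collaboration systems, quantifies their negative impact on system performance (especially on reasoning and safety tasks), and proposes mitigation strategies based on external supervisors that recover 95.31% of the original performance on average.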

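Details Motivation: As multi-LLM collaboration becomes more common, some participating models may be compromised or deployed maliciously, creating serious safety risks, yet threats in this decentralized paradigm have not been systematically evaluated or defended against. Method: The authors construct four categories of malicious language models, plug them into four popular types of model collaboration systems, and evaluate the impact on 10 datasets; they then design external supervision mechanisms that reduce the influence of malicious models by disabling or masking them. Result: Malicious models lower the performance of multi-LLM systems by 7.12% on reasoning tasks and 7.94% on safety tasks on average; the proposed supervision-based mitigation recovers 95.31% of the initial performance on average. Conclusion: Malicious models pose a serious threat to multi-LLM collaboration; external supervision mitigates the damage substantially but does not provide full immunity, and making collaboration systems fully resistant to malicious models remains an open problem. Abstract: Language models (LMs) are increasingly used in collaboration: multiple LMs trained by different parties collaborate through routing systems, multi-agent debate, model merging, and more. Critical safety risks remain in this decentralized paradigm: what if some of the models in multi-LLM systems are compromised or malicious? We first quantify the impact of malicious models by engineering four categories of malicious LMs, plugging them into four types of popular model collaboration systems, and evaluating the compromised systems across 10 datasets. We find that malicious models have a severe impact on the multi-LLM systems, especially for reasoning and safety domains where performance is lowered by 7.12% and 7.94% on average. We then propose mitigation strategies to alleviate the impact of malicious components, by employing external supervisors that oversee model collaboration to disable/mask them out to reduce their influence. On average, these strategies recover 95.31% of the initial performance, while making model collaboration systems fully resistant to malicious models remains an open research question.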

[9] The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems

Shangbin Feng,Kishan Panaganti,Yulia Tsvetkov,Wenhao Yu

Main category: cs.CL

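TL;DR: This paper proposes the "single-multi evolution loop", a framework that distills multi-model collaboration patterns into single models and iterates between collaboration and distillation across multiple models, improving both collaborative gains and inference efficiency.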

Details Motivation: Model collaboration combines the strengths of multiple models, but loading several large models is computationally expensive; a method is needed that preserves the benefits of collaboration while lowering deployment cost. Method: The authors propose collaborative distillation (distilling collaborative patterns into a single model) and the single-multi evolution loop (multiple LMs collaborate, each distills from the collaborative outputs, and the improved LMs collaborate again), with iterative training and evaluation across multiple collaboration strategies and tasks. Result: Across 15 tasks (QA, reasoning, factuality, etc.), individual models improve by 8.0% on average, and collaboration systems improve by a further 14.9% on average after evolution; the method outperforms existing evolutionary AI methods, generalizes well, and solves problems that the initial models struggle with. Conclusion: The single-multi evolution loop builds a self-evolving ecosystem in which models improve one another, balancing collaborative performance with inference efficiency and offering a new paradigm for efficient LLM collaboration. Abstract: Model collaboration -- systems where multiple language models (LMs) collaborate -- combines the strengths of diverse models, at the cost of loading multiple LMs. We improve efficiency while preserving the strengths of collaboration by distilling collaborative patterns into a single model, where the model is trained on the outputs of the model collaboration system. At inference time, only the distilled model is employed: it imitates the collaboration while only incurring the cost of a single model. Furthermore, we propose the single-multi evolution loop: multiple LMs collaborate, each distills from the collaborative outputs, and these post-distillation improved LMs collaborate again, forming a collective evolution ecosystem where models evolve and self-improve by interacting with an environment of other models. Extensive experiments with 7 collaboration strategies and 15 tasks (QA, reasoning, factuality, etc.) demonstrate that: 1) individual models improve by 8.0% on average, absorbing the strengths of collaboration while reducing the cost to a single model; 2) the collaboration also benefits from the stronger and more synergistic LMs after distillation, improving over initial systems without evolution by 14.9% on average. Analysis reveals that the single-multi evolution loop outperforms various existing evolutionary AI methods, is compatible with diverse model/collaboration/distillation settings, and helps solve problems where the initial model/system struggles.

[10] Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky

Hsuan-Yu Chou,Wajiha Naveed,Shuyan Zhou,Xiaowei Yang

Main category: cs.CL

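TL;DR: This paper evaluates seven state-of-the-art language models (four proprietary and three open-weight) on social media content moderation and finds that the open-weight models are comparable to the proprietary ones in sensitivity and specificity, indicating that they can support privacy-preserving moderation on consumer-grade hardware.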

Details Motivation: As internet access expands, exposure to harmful content increases and effective moderation becomes more urgent; the zero-shot harmful-content detection ability of open-weight LLMs is still unclear, and this paper aims to assess it in practice. Method: Using real posts from Bluesky, the authors compare the moderation performance of four proprietary and three open-weight LLMs against decisions by the Bluesky Moderation Service and annotations by two authors, computing sensitivity, specificity, and human-LLM agreement. Result: The sensitivity (81%-97%) and specificity (91%-100%) of open-weight LLMs overlap considerably with those of proprietary LLMs (72%-98% and 93%-99%); the balance between sensitivity and specificity differs by violation type, with specificity exceeding sensitivity for rudeness and the opposite for intolerance and threats; there is appreciable agreement between human raters and LLMs. Conclusion: Open-weight LLMs have the potential to support privacy-first, on-device moderation, and suggest new directions for designing moderation systems that balance community norms with individual preferences. Abstract: As internet access expands, so does exposure to harmful content, increasing the need for effective moderation. Research has demonstrated that large language models (LLMs) can be effectively utilized for social media moderation tasks, including harmful content detection. While proprietary LLMs have been shown to zero-shot outperform traditional machine learning models, the out-of-the-box capability of open-weight LLMs remains an open question. Motivated by recent developments of reasoning LLMs, we evaluate seven state-of-the-art models: four proprietary and three open-weight. Testing with real-world posts on Bluesky, moderation decisions by Bluesky Moderation Service, and annotations by two authors, we find a considerable degree of overlap between the sensitivity (81%--97%) and specificity (91%--100%) of the open-weight LLMs and those (72%--98%, and 93%--99%) of the proprietary ones. Additionally, our analysis reveals that specificity exceeds sensitivity for rudeness detection, but the opposite holds for intolerance and threats. Lastly, we identify inter-rater agreement across human moderators and the LLMs, highlighting considerations for deploying LLMs in both platform-scale and personalized moderation contexts. These findings show open-weight LLMs can support privacy-preserving moderation on consumer-grade hardware and suggest new directions for designing moderation systems that balance community values with individual user preferences.
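The study above reports moderation quality as sensitivity and specificity. As a reminder of what the two numbers measure, here is a minimal sketch using their standard definitions; the function name and the boolean encoding (True means the post violates policy) are assumptions made for illustration and are not taken from the paper.

```python
from typing import Iterable, Tuple


def sensitivity_specificity(
    y_true: Iterable[bool], y_pred: Iterable[bool]
) -> Tuple[float, float]:
    """Standard definitions; True marks a harmful post (assumed encoding)."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1  # harmful post that was flagged
        elif truth:
            fn += 1  # harmful post that was missed
        elif pred:
            fp += 1  # benign post that was wrongly flagged
        else:
            tn += 1  # benign post that was left alone
    # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity
```

High sensitivity means few harmful posts slip through, while high specificity means few benign posts are flagged, which is why the two can trade off differently for rudeness than for intolerance or threats.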

[11] Aligning Large Language Model Behavior with Human Citation Preferences

Kenichiro Ando,Tatsuya Harada

Main category: cs.CL

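TL;DR: This study examines how well the citation preferences of large language models (LLMs) match human preferences when generating content. It builds an annotated dataset covering eight citation-motivation types and finds that current models agree with humans on medical text, over-cite text explicitly marked as needing citations, and markedly under-cite sentences containing numbers and personal names; Direct Preference Optimization (DPO) can effectively calibrate model behavior.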

Details Motivation: Existing research focuses on which reference documents an LLM should cite, but how LLMs recognize cite-worthy content, and how that behavior can be controlled, remains underexplored; this paper characterizes how well actual LLM citation behavior matches human citation preferences. Method: The authors build a dataset of web-derived texts categorized into eight citation-motivation types (such as medical, numeric, and personal names), collect pairwise human preference annotations across all type combinations, quantify how LLM citation tendencies deviate from human preferences, and fine-tune models for preference alignment with Direct Preference Optimization (DPO). Result: 1) Humans most frequently seek citations for medical text, and stronger models show a similar tendency; 2) models are as much as 27% more likely than humans to add citations to text explicitly marked as needing citations on sources such as Wikipedia, which reduces alignment; 3) models select numeric sentences and sentences containing personal names 22.6% and 20.1% less often than humans, respectively; 4) DPO effectively improves the agreement between model citation behavior and human preferences. Conclusion: Current LLM citation strategies are systematically biased, with both over-citation and under-citation; this bias can be calibrated with preference learning; the work provides a benchmark dataset and analysis framework for fine-grained study of LLM cite-worthiness judgments and their controllability. Abstract: Most services built on powerful large-scale language models (LLMs) add citations to their output to enhance credibility. Recent research has paid increasing attention to the question of what reference documents to link to outputs. However, how LLMs recognize cite-worthiness and how this process should be controlled remains underexplored. In this study, we focus on what kinds of content LLMs currently tend to cite and how well that behavior aligns with human preferences. We construct a dataset to characterize the relationship between human citation preferences and LLM behavior. Web-derived texts are categorized into eight citation-motivation types, and pairwise citation preferences are exhaustively evaluated across all type combinations to capture fine-grained contrasts. Our results show that humans most frequently seek citations for medical text, and stronger models display a similar tendency. We also find that current models are as much as 27% more likely than humans to add citations to text that is explicitly marked as needing citations on sources such as Wikipedia, and this overemphasis reduces alignment accuracy. Conversely, models systematically underselect numeric sentences (by -22.6% relative to humans) and sentences containing personal names (by -20.1%), categories for which humans typically demand citations. Furthermore, experiments with Direct Preference Optimization demonstrate that model behavior can be calibrated to better match human citation preferences. We expect this study to provide a foundation for more fine-grained investigations into LLM citation preferences.
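The abstract above uses Direct Preference Optimization on pairwise preferences. For readers unfamiliar with it, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It is a generic illustration, not the authors' training code; the function name, the argument names, and the default `beta=0.1` are assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the paper's setting is not stated here
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair."""
    # Scaled difference of the policy-vs-reference log-ratios.
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy assigns relatively more probability to the human-preferred option (here, the sentence humans judged more in need of a citation).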

[12] Quantifying the Knowledge Proximity Between Academic and Industry Research: An Entity and Semantic Perspective

Hongye Zhao,Yi Zhao,Chengzhi Zhang

Main category: cs.CL

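TL;DR: This paper quantifies the co-evolution trajectory of academia and industry using fine-grained knowledge entities and semantic space, finding that their knowledge proximity rises with technological change and that academia's knowledge dominance weakens during paradigm shifts.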

Details Motivation: Existing studies measure academia-industry knowledge proximity with macro indicators (such as the number of collaborative papers) and lack fine-grained analysis of knowledge units in the literature, which leads to an insufficient understanding of knowledge proximity and can undermine collaboration frameworks and resource allocation efficiency. Method: 1) Entity level: extract fine-grained knowledge entities with pre-trained models, measure sequence overlap with cosine similarity, and analyze topological features with complex network analysis; 2) semantic level: use unsupervised contrastive learning to quantify convergence in semantic space through cross-institutional textual similarity; 3) use citation distribution patterns to examine how bidirectional knowledge flows relate to similarity. Result: Academia-industry knowledge proximity rises overall and increases markedly after technological change; academia's knowledge dominance weakens during technological paradigm shifts; the analysis provides textual evidence of co-evolution. Conclusion: Fine-grained entity and semantic analysis captures the dynamics of academia-industry knowledge interaction more precisely, reveals a mechanism of bidirectional adaptation, and offers a new basis for optimizing collaboration mechanisms and resource allocation. Abstract: Academia and industry are characterized by a reciprocal shaping and dynamic feedback mechanism. Despite distinct institutional logics, they have adapted closely in collaborative publishing and talent mobility, demonstrating tension between institutional divergence and intensive collaboration. Existing studies on their knowledge proximity mainly rely on macro indicators such as the number of collaborative papers or patents, lacking an analysis of knowledge units in the literature. This has led to an insufficient grasp of fine-grained knowledge proximity between industry and academia, potentially undermining collaboration frameworks and resource allocation efficiency. To remedy the limitation, this study quantifies the trajectory of academia-industry co-evolution through fine-grained entities and semantic space. In the entity measurement part, we extract fine-grained knowledge entities via pre-trained models, measure sequence overlaps using cosine similarity, and analyze topological features through complex network analysis. At the semantic level, we employ unsupervised contrastive learning to quantify convergence in semantic spaces by measuring cross-institutional textual similarities. Finally, we use citation distribution patterns to examine correlations between bidirectional knowledge flows and similarity. Analysis reveals that knowledge proximity between academia and industry rises, particularly following technological change. This provides textual evidence of bidirectional adaptation in co-evolution. Additionally, academia's knowledge dominance weakens during technological paradigm shifts. The dataset and code for this paper can be accessed at https://github.com/tinierZhao/Academic-Industrial-associations.
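The entity-level analysis above measures overlap with cosine similarity. One simple way to do that, sketched here under the assumption that each side (academia or industry) is represented by a bag of extracted entity strings, is to compare entity frequency vectors. The function name and the bag-of-entities representation are hypothetical; the paper's actual vectorization may differ.

```python
import math
from collections import Counter
from typing import Iterable


def entity_cosine(entities_a: Iterable[str], entities_b: Iterable[str]) -> float:
    """Cosine similarity between two entity frequency vectors (illustrative)."""
    counts_a, counts_b = Counter(entities_a), Counter(entities_b)
    # Only entities present on both sides contribute to the dot product.
    dot = sum(counts_a[e] * counts_b[e] for e in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty entity list has no defined direction
    return dot / (norm_a * norm_b)
```

Computing this score per year for the two entity sets would give one possible proximity time series, with 0 for disjoint vocabularies and values near 1 for near-identical entity usage.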

[13] Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions

Jinchuan Tian,Haoran Wang,Bo-Hao Su,Chien-yu Huang,Qingzheng Wang,Jiatong Shi,William Chen,Xun Gong,Siddhant Arora,Chin-Jou Li,Masao Someki,Takashi Maekaku,Yusuke Shinohara,Jin Sakuma,Chao-Han Huck Yang,Shinji Watanabe

Main category: cs.CL

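TL;DR: Bagpiper is an 8B-parameter audio foundation model pre-trained on large-scale pairs of audio and rich natural-language captions (covering, for example, transcription and audio events) to establish a bidirectional mapping between audio and a high-level cognitive concept space; its "caption-then-process" fine-tuning paradigm requires no task-specific priors and unifies audio understanding and generation.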

Details Motivation: Existing audio foundation models rely on rigid, task-specific supervision and address only isolated factors of audio, whereas humans relate physical audio signals to abstract cognitive concepts holistically to complete complex tasks; a unified modeling approach closer to this cognitive mechanism is therefore needed. Method: The authors propose Bagpiper, pre-trained on a 600B-token corpus of audio paired with rich captions (covering cognitive concepts such as transcription and audio events) to build a robust bidirectional mapping between audio and a high-level concept space; fine-tuning follows a caption-then-process workflow that introduces an intermediate cognitive reasoning step. Result: Bagpiper outperforms Qwen-2.5-Omni on the MMAU and AIRBench audio understanding benchmarks, surpasses CosyVoice3 and TangoFlux in generation quality, can synthesize arbitrary compositions of speech, music, and sound effects, and is among the first works to unify understanding and generation for general audio. Conclusion: Bagpiper demonstrates the effectiveness of audio modeling that uses rich natural-language captions as a cognitive intermediary, offering a new path toward human-like, general, and unified audio foundation models. Abstract: Current audio foundation models typically rely on rigid, task-specific supervision, addressing isolated factors of audio rather than the whole. In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks. Grounded in this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio via rich captions, i.e., comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks without task-specific priors. Experimentally, Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIRBench for audio understanding and surpasses CosyVoice3 and TangoFlux in generation quality, capable of synthesizing arbitrary compositions of speech, music, and sound effects. To the best of our knowledge, Bagpiper is among the first works that achieve unified understanding and generation for general audio. Model, data, and code are available at Bagpiper Home Page.

[14] FedMosaic: Federated Retrieval-Augmented Generation via Parametric Adapters

Zhilin Liang,Yuxiang Wang,Zimu Zhou,Hainan Zhang,Boyi Liu,Yongxin Tong

Main category: cs.CL

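TL;DR: This paper proposes FedMosaic, a federated retrieval-augmented generation (FedRAG) framework built on parametric adapters; through semantic clustering and selective aggregation, it substantially improves accuracy and sharply reduces storage and communication overhead while preserving data privacy.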

Details Motivation: Existing RAG relies on a centralized corpus and cannot serve privacy-sensitive settings where knowledge is siloed; a federated RAG approach that works across distributed silos without sharing raw documents is needed. Method: The FedMosaic framework 1) clusters semantically related documents into shared multi-document adapters and adds document-specific masks to balance generality and specificity, and 2) uses a selective adapter aggregation mechanism that merges only relevant, non-conflicting adapters. Result: Across four task categories, average accuracy improves by 10.9% over state-of-the-art methods, storage cost falls by 78.8% to 86.3%, and communication cost falls by 91.4%, with no raw documents transmitted at any point. Conclusion: FedMosaic is the first to achieve efficient, low-overhead federated RAG with strong privacy protection, demonstrating the feasibility and advantages of parametric adapters in FedRAG. Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding generation in external knowledge to improve factuality and reduce hallucinations. Yet most deployments assume a centralized corpus, which is infeasible in privacy-aware domains where knowledge remains siloed. This motivates federated RAG (FedRAG), where a central LLM server collaborates with distributed silos without sharing raw documents. In-context RAG violates this requirement by transmitting verbatim documents, whereas parametric RAG encodes documents into lightweight adapters that merge with a frozen LLM at inference, avoiding raw-text exchange. We adopt the parametric approach but face two unique challenges induced by FedRAG: high storage and communication from per-document adapters, and destructive aggregation caused by indiscriminately merging multiple adapters. We present FedMosaic, the first federated RAG framework built on parametric adapters. FedMosaic clusters semantically related documents into multi-document adapters with document-specific masks to reduce overhead while preserving specificity, and performs selective adapter aggregation to combine only relevance-aligned, nonconflicting adapters. Experiments show that FedMosaic achieves an average 10.9% higher accuracy than state-of-the-art methods in four categories, while lowering storage costs by 78.8% to 86.3% and communication costs by 91.4%, and never sharing raw documents.

Guangwei Zhang,Jianing Zhu,Cheng Qian,Neil Gong,Rada Mihalcea,Zhaozhuo Xu,Jingrui He,Jiaqi Ma,Yun Huang,Chaowei Xiao,Bo Li,Ahmed Abbasi,Dongwon Lee,Heng Ji,Denghui Zhang

Main category: cs.CL

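TL;DR: Copyright Detective is the first interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in the outputs of large language models (LLMs); it frames copyright compliance assessment as an evidence discovery process rather than a simple classification task.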

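Details Motivation: Because copyright law is inherently complex, existing methods struggle to determine accurately whether LLM outputs infringe; systematic, interpretable, and interactive auditing of copyright risk is needed, including in black-box settings. Method: The authors propose a unified, extensible framework that integrates several detection paradigms (content recall testing, paraphrase-level similarity analysis, persuasive jailbreak probing, and unlearning verification) and supports systematic auditing through interactive prompting, response collection, and iterative workflows. Result: The system enables auditable detection of verbatim memorization and paraphrase-level leakage in LLM outputs, supporting responsible deployment and transparent evaluation, particularly with black-box access. Conclusion: Copyright Detective offers a new paradigm for empirical analysis of LLM copyright risk, emphasizing evidence-driven, interactive, and multi-dimensional detection, and moves AI compliance practice toward verifiability and traceability. Abstract: We present Copyright Detective, the first interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in LLM outputs. The system treats copyright infringement versus compliance as an evidence discovery process rather than a static classification task due to the complex nature of copyright law. It integrates multiple detection paradigms, including content recall testing, paraphrase-level similarity analysis, persuasive jailbreak probing, and unlearning verification, within a unified and extensible framework. Through interactive prompting, response collection, and iterative workflows, our system enables systematic auditing of verbatim memorization and paraphrase-level leakage, supporting responsible deployment and transparent evaluation of LLM copyright risks even with black-box access.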

[16] CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs

Haoran Li,Sucheng Ren,Alan Yuille,Feng Wang

Main category: cs.CL

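TL;DR: This paper proposes CoPE, which soft-clips the low-frequency components of RoPE to unify two objectives, OOD mitigation and semantic modeling, and delivers significant performance gains at context lengths up to 256k.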

Details Motivation: Existing methods for adapting RoPE to long contexts follow one of two guiding principles, OOD mitigation or semantic modeling, which appear to be separate objectives; this paper aims to unify them. Method: The authors propose CoPE, which applies soft clipping to the low-frequency components of RoPE in order to mitigate OOD outliers, refine semantic signals, and avoid spectral leakage. Result: CoPE is validated on a range of long-context tasks with significant performance gains that scale up to 256k context length, establishing a new state of the art for length generalization. Conclusion: CoPE is a simple and effective modification of RoPE that unifies the OOD mitigation and semantic modeling objectives and is supported by both theoretical analysis and experiments. Abstract: Rotary Positional Embedding (RoPE) is a key component of context scaling in Large Language Models (LLMs). While various methods have been proposed to adapt RoPE to longer contexts, their guiding principles generally fall into two categories: (1) out-of-distribution (OOD) mitigation, which scales RoPE frequencies to accommodate unseen positions, and (2) semantic modeling, which posits that the attention scores computed with RoPE should always prioritize semantically similar tokens. In this work, we unify these seemingly distinct objectives through a minimalist intervention, namely CoPE: soft clipping of the low-frequency components of RoPE. CoPE not only eliminates OOD outliers and refines semantic signals, but also prevents spectral leakage caused by hard clipping. Extensive experiments demonstrate that simply applying our soft clipping strategy to RoPE yields significant performance gains that scale up to 256k context length, validating our theoretical analysis and establishing CoPE as a new state-of-the-art for length generalization. Our code, data, and models are available at https://github.com/hrlics/CoPE.

[17] Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Fanfan Liu, Youyang Yin, Peng Shi, Siqi Yang, Zhixiong Zeng, Haibo Qiu

Main category: cs.CL

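TL;DR: This paper analyzes why response length evolves differently across Reinforcement Learning with Verifiable Rewards (RLVR) methods when training large language models and vision-language models, proposes LUSPO, a new algorithm that removes length bias, and shows that it outperforms existing methods on multiple reasoning benchmarks.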

Details Motivation: Response-length dynamics differ markedly across RLVR algorithms during training, and no fundamental explanation has been given. Method: Theoretically analyzes the components of mainstream RLVR algorithms to identify the factors that influence response length, and on that basis proposes Length-Unbiased Sequence Policy Optimization (LUSPO), which corrects the length bias in GSPO. Result: LUSPO outperforms existing methods such as GRPO and GSPO on mathematical and multimodal reasoning tasks and resolves the response-length collapse problem. Conclusion: LUSPO is a novel RLVR optimization strategy that removes response-length bias and improves reasoning ability. Abstract: Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.

[18] Towards a Science of Collective AI: LLM-based Multi-Agent Systems Need a Transition from Blind Trial-and-Error to Rigorous Science

Jingru Fan, Dewen Liu, Yufan Dang, Huatao Li, Yuheng Wang, Wei Liu, Feiyu Duan, Xuanwen Ding, Shu Yao, Lin Wu, Ruijie Shi, Wai-Shing Leung, Yuan Cheng, Zhongyu Wei, Cheng Yang, Chen Qian, Zhiyuan Liu, Maosong Sun

Main category: cs.CL

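TL;DR: This paper proposes a design-science framework for multi-agent systems (MAS). It introduces a collaboration gain metric, Γ, to separate genuine collaboration gains from mere resource accumulation, and builds a factor library and an attribution paradigm, moving MAS research from empirical trial-and-error toward systematic science.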

Details Motivation: LLM-driven multi-agent systems (MAS) have progressed rapidly but lack a unified scientific framework. The main obstacle is ambiguous attribution: there is no structured taxonomy of factors, and no collaboration metric that separates out budget effects. Method: Establishes the collaboration gain metric Γ as a scientific standard; constructs a MAS factor library organized into control-level presets and information-level dynamics; and sets up a factor attribution paradigm to systematically identify the factors that drive collaboration. Result: A design-science framework for MAS comprising the quantifiable collaboration gain metric Γ, a structured factor library, and an attribution method, supporting reproducible and interpretable analysis of collaboration mechanisms. Conclusion: The framework marks a shift in MAS research from empiricism to a design-science paradigm and offers a path toward a rigorous scientific foundation for 'Collective AI'. Abstract: Recent advancements in Large Language Models (LLMs) have greatly extended the capabilities of Multi-Agent Systems (MAS), demonstrating significant effectiveness across a wide range of complex and open-ended domains. However, despite this rapid progress, the field still relies heavily on empirical trial-and-error. It lacks a unified and principled scientific framework necessary for systematic optimization and improvement. This bottleneck stems from the ambiguity of attribution: first, the absence of a structured taxonomy of factors leaves researchers restricted to unguided adjustments; second, the lack of a unified metric fails to distinguish genuine collaboration gain from mere resource accumulation. In this paper, we advocate for a transition to design science through an integrated framework. We advocate establishing the collaboration gain metric ($Γ$) as the scientific standard to isolate intrinsic gains from increased budgets. Leveraging $Γ$, we propose a factor attribution paradigm to systematically identify collaboration-driving factors. To support this, we construct a systematic MAS factor library, structuring the design space into control-level presets and information-level dynamics. Ultimately, this framework facilitates the transition from blind experimentation to rigorous science, paving the way towards a true science of Collective AI.

[19] MentorCollab: Selective Large-to-Small Inference-Time Guidance for Efficient Reasoning

Haojin Wang, Yike Wang, Shangbin Feng, Hannaneh Hajishirzi, Yulia Tsvetkov

Main category: cs.CL

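TL;DR: This paper proposes MentorCollab, in which a large model selectively and sparsely guides a small model at inference time. A lightweight verifier decides whether the small model should adopt a short lookahead segment from the large model, improving multi-step reasoning while substantially reducing compute cost.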

Details Motivation: Large reasoning models (LRMs) reason well but are costly and often redundant; small language models (SLMs) are efficient but struggle with multi-step reasoning; existing collaboration methods tend to encourage imitation, yielding verbose reasoning without consistent error correction. Method: MentorCollab probes for divergences between the SLM and LRM outputs at randomly sampled token positions, then uses a lightweight verifier to decide whether the SLM should follow a short lookahead segment generated by the LRM, rather than handing generation over entirely. Result: Across 15 SLM-LRM pairs and 3 domains, performance improves in 12 settings, with an average gain of 3.0% and a maximum of 8.0%; on average only 18.4% of tokens are generated by the expensive LRM. Conclusion: Selective, sparse inference-time guidance restores large-model reasoning ability without substantial inference overhead. Abstract: Large reasoning models (LRMs) achieve strong performance by producing long chains of thought, but their inference costs are high and they often generate redundant reasoning. Small language models (SLMs) are far more efficient, yet struggle on multi-step reasoning tasks. A natural idea is to let a large model guide a small one at inference time as a mentor, yet existing collaboration methods often promote imitation, resulting in verbose reasoning without consistent error correction. We propose MentorCollab, an inference-time collaboration method in which an LRM selectively and sparsely guides an SLM, rather than taking over generation. At randomly sampled token positions, we probe for divergences between the two models and use a lightweight verifier to decide whether the SLM should follow a short lookahead segment from its mentor or continue on its own. Across 15 SLM-LRM pairs and 3 domains (math reasoning, general knowledge, and commonsense reasoning), our method improves performance in 12 settings, with average gains of 3.0% and up to 8.0%, while having only 18.4% of tokens generated by the expensive mentor model on average. We find that short segments and selective probing are sufficient for effective collaboration. Our results show that selective inference-time guidance restores large-model reasoning ability without substantial inference overhead.
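
The MentorCollab control flow described in the abstract (random probing, divergence check, verifier decision, short mentor segment) can be sketched as a decoding loop. This is a minimal illustration, not the authors' implementation: the callables `slm_next`, `lrm_lookahead`, and `verifier`, the parameters `probe_prob` and `segment_len`, and the rule that a divergence means the mentor's first token differs from the SLM's are all assumptions made for the example.

```python
import random


def mentor_collab(slm_next, lrm_lookahead, verifier, prompt, max_tokens,
                  probe_prob=0.2, segment_len=4, seed=0):
    """Toy decoding loop with sparse mentor guidance.

    slm_next(tokens)         -> next token proposed by the small model
    lrm_lookahead(tokens, k) -> list of up to k tokens from the large model
    verifier(tokens, own, segment) -> True if the mentor segment should be used
    Returns (generated tokens, fraction of tokens written by the mentor).
    """
    rng = random.Random(seed)
    tokens = list(prompt)
    generated = 0
    mentor_tokens = 0
    while generated < max_tokens:
        own = slm_next(tokens)
        # Probe only at randomly sampled positions to keep mentor calls sparse.
        if rng.random() < probe_prob:
            segment = lrm_lookahead(tokens, segment_len)
            diverged = bool(segment) and segment[0] != own
            if diverged and verifier(tokens, own, segment):
                segment = segment[:max_tokens - generated]
                tokens.extend(segment)
                generated += len(segment)
                mentor_tokens += len(segment)
                continue
        # Otherwise the small model continues on its own.
        tokens.append(own)
        generated += 1
    share = mentor_tokens / max_tokens if max_tokens else 0.0
    return tokens[len(prompt):], share
```

The returned mentor share plays the role of the abstract's "tokens generated by the expensive mentor model" statistic; lowering `probe_prob` or `segment_len` trades guidance for cost.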

[20] How Do Language Models Acquire Character-Level Information?

Soma Sato, Ryohei Sasano

Main category: cs.CL

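TL;DR: This paper investigates how language models implicitly encode character-level information. By analyzing models trained under controlled settings, it finds that merge rules and orthographic constraints are the main tokenization-related factors, while semantic associations of substrings and syntactic information are the key factors independent of tokenization.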

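Details Motivation: Language models implicitly encode character-level information even though it is never provided explicitly, and the mechanism is not well understood. Method: Trains language models under controlled settings (for example, a specified pre-training dataset or tokenizer), compares them with models trained under standard settings to analyze how character-level knowledge is acquired, and groups the contributing factors into those related to tokenization and those independent of it. Result: The main tokenization-related factors are merge rules and orthographic constraints; the key tokenization-independent factors are semantic associations of substrings and syntactic information. Conclusion: Implicit character-level encoding arises from several mechanisms acting together, some originating in tokenization and others in deeper modeling of linguistic structure. Abstract: Language models (LMs) have been reported to implicitly encode character-level information, despite it not being explicitly provided during training. However, the mechanisms underlying this phenomenon remain largely unexplored. To reveal the mechanisms, we analyze how models acquire character-level knowledge by comparing LMs trained under controlled settings, such as specifying the pre-training dataset or tokenizer, with those trained under standard settings. We categorize the contributing factors into those dependent on tokenization and those independent of tokenization. Our analysis reveals that merge rules and orthographic constraints constitute primary factors arising from tokenization, whereas semantic associations of substrings and syntactic information function as key factors independent of tokenization.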

[21] PACE: Defying the Scaling Hypothesis of Exploration in Iterative Alignment for Mathematical Reasoning

Jun Rao, Zixiong Yu, Xuebo Liu, Guhan Chen, Jing Li, Jiansheng Wei, Xiaojun Meng, Min Zhang

Main category: cs.CL

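TL;DR: This paper proposes PACE, which replaces conventional Best-of-N sampling with a generation-based corrective strategy, achieving better preference alignment on mathematical reasoning with less compute, along with improved robustness.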

Details Motivation: Challenges the assumption behind existing DPO-R1 methods that large-scale Best-of-N sampling (N≥8) is needed to mine high-quality reasoning trajectories, finding that in mathematical reasoning it amplifies verifier noise and causes distribution shift and even policy collapse. Method: Proposes PACE (Proximal Alignment via Corrective Exploration), which uses a minimal exploration budget (2

### [22] [Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks](https://arxiv.org/abs/2602.05374)

*Chaimae Abouzahir, Congbo Ma, Nizar Habash, Farah E. Shamout*

Main category: cs.CL

TL;DR: This paper analyzes the cross-lingual performance of LLMs on medical question answering in Arabic and English, revealing a language-driven performance gap that worsens with task complexity, stemming from Arabic tokenization issues and unreliable confidence/explanation signals.

Details Motivation: LLMs are often English-centric, limiting their reliability for linguistically diverse communities; performance discrepancies in low-resource languages like Arabic for medical tasks are observed but poorly understood. Method: Cross-lingual empirical analysis of LLM performance on medical question answering in Arabic and English, including tokenization analysis and reliability analysis of model-reported confidence and explanations. Result: A persistent, task-complexity-dependent performance gap exists between Arabic and English; Arabic medical text suffers from structural fragmentation due to tokenization; model confidence and explanations show limited correlation with correctness. Conclusion: There is a critical need for language-aware design and evaluation strategies in deploying LLMs for medical tasks across diverse languages. Abstract: In recent years, Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education, and medical question answering. Yet, these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted discrepancies in performance in low-resource languages for various medical tasks, but the underlying causes remain poorly understood. In this study, we conduct a cross-lingual empirical analysis of LLM performance on Arabic and English medical question answering. Our findings reveal a persistent language-driven performance gap that intensifies with increasing task complexity. Tokenization analysis exposes structural fragmentation in Arabic medical text, while reliability analysis suggests that model-reported confidence and explanations exhibit limited correlation with correctness. Together, these findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks.
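
### [23] [IESR:Efficient MCTS-Based Modular Reasoning for Text-to-SQL with Large Language Models](https://arxiv.org/abs/2602.05385)

*Tao Liu, Jiafan Lu, Bohan Yu, Pengcheng Wu, Liu Haixin, Guoyu Xu, Li Xiangheng, Lixiao Li, Jiaming Hou, Zhao Shijun, Xinglin Lyu, Kunli Zhang, Yuxiang Jia, Hongyin Zan*

Main category: cs.CL

TL;DR: This paper proposes IESR, an information-enhanced structured reasoning framework for Text-to-SQL with lightweight large language models. It combines information understanding, multi-path MCTS reasoning, and trajectory consistency verification, and reaches state-of-the-art performance on complex reasoning benchmarks such as LogicCat and Archer without fine-tuning.
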
Details Motivation: Existing Text-to-SQL methods fall short on complex reasoning, domain knowledge, and hypothetical queries, and are costly to deploy in enterprise settings. Method: IESR has three parts: (i) using LLMs for key information understanding and schema linking, while decoupling mathematical computation from SQL generation; (ii) multi-path reasoning based on Monte Carlo Tree Search (MCTS) with majority voting; (iii) a trajectory consistency verification module with a discriminator model. Result: Reaches state of the art on LogicCat (24.28 EX) and Archer (37.28 EX) using only lightweight models without fine-tuning; also finds that current coder models show notable biases and deficiencies in physical knowledge, mathematical computation, and common-sense reasoning. Conclusion: IESR is an efficient, lightweight, fine-tuning-free Text-to-SQL framework that markedly improves complex reasoning and exposes weaknesses of current models across several kinds of reasoning, pointing to clear directions for future work. Abstract: Text-to-SQL is a key natural language processing task that maps natural language questions to SQL queries, enabling intuitive interaction with web-based databases. Although current methods perform well on benchmarks like BIRD and Spider, they struggle with complex reasoning, domain knowledge, and hypothetical queries, and remain costly in enterprise deployment. To address these issues, we propose a framework named IESR (Information Enhanced Structured Reasoning) for lightweight large language models, which (i) leverages LLMs for key information understanding and schema linking, and decouples mathematical computation from SQL generation, (ii) integrates a multi-path reasoning mechanism based on Monte Carlo Tree Search (MCTS) with majority voting, and (iii) introduces a trajectory consistency verification module with a discriminator model to ensure accuracy and consistency. Experimental results demonstrate that IESR achieves state-of-the-art performance on the complex reasoning benchmark LogicCat (24.28 EX) and the Archer dataset (37.28 EX) using only compact lightweight models without fine-tuning. Furthermore, our analysis reveals that current coder models exhibit notable biases and deficiencies in physical knowledge, mathematical computation, and common-sense reasoning, highlighting important directions for future research. We released code at https://github.com/Ffunkytao/IESR-SLM.
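
The majority-voting step over multiple reasoning paths can be illustrated with a small sketch. The abstract does not say what IESR votes over; voting on execution results (rather than on SQL strings) is an assumption here, and `majority_vote` and the candidate format are hypothetical names for the example.

```python
from collections import Counter


def majority_vote(candidates):
    """Pick one SQL query by majority vote over execution results.

    candidates: list of (sql, result) pairs, where result is a hashable
    execution result (for example a tuple of rows) or None if the query
    failed to run. Returns the first SQL whose result is the most common
    one, or None if every candidate failed.
    """
    valid = [(sql, result) for sql, result in candidates if result is not None]
    if not valid:
        return None
    counts = Counter(result for _, result in valid)
    winning_result, _ = counts.most_common(1)[0]
    for sql, result in valid:
        if result == winning_result:
            return sql
    return None


paths = [
    ("SELECT COUNT(*) FROM orders WHERE total > 100", ((42,),)),
    ("SELECT COUNT(id) FROM orders WHERE total >= 100", ((45,),)),
    ("SELECT COUNT(1) FROM orders WHERE total > 100", ((42,),)),
    ("SELECT COUNT(*) FROM order WHERE total > 100", None),  # failed query
]
chosen = majority_vote(paths)
```

Voting on results lets syntactically different but equivalent queries support each other, which string-level voting would miss.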

### [24] [Beyond Length: Context-Aware Expansion and Independence as Developmentally Sensitive Evaluation in Child Utterances](https://arxiv.org/abs/2602.05392)

*Jiyun Chun, Eric Fosler-Lussier, Michael White, Andrew Perrault*

Main category: cs.CL

TL;DR: This paper proposes an LLM-based evaluation framework for measuring the quality of children's utterances in adult-child dialogue. It scores two dimensions, Expansion and Independence, going beyond length-dominated metrics, and shows developmental validity, predictive value, and semantic sensitivity.

Details Motivation: Existing metrics (such as MLU, vocd-D, and Flesch-Kincaid) depend heavily on length and ignore conversational context, so they miss key aspects of response quality such as reasoning depth, topic maintenance, and discourse planning. Method: Builds an LLM-as-a-judge framework that first classifies the type of the preceding adult utterance and then scores the child's response along two axes, Expansion (contextual elaboration and inferential depth) and Independence (the child's contribution to advancing the discourse); validity is checked through age-related patterns, an age-prediction task, and differences across discourse relations. Result: The new metrics show developmental validity (age-related patterns), better age prediction than common baselines, and sensitivity to discourse relations; they also align with human judgments, supporting large-scale automated evaluation. Conclusion: The framework shifts child utterance assessment from measuring length to evaluating how meaningfully the child's speech contributes to and advances the conversation in context, providing a more appropriate and scalable quantitative tool for child language development research and applications. Abstract: Evaluating the quality of children's utterances in adult-child dialogue remains challenging due to insufficient context-sensitive metrics. Common proxies such as Mean Length of Utterance (MLU), lexical diversity (vocd-D), and readability indices (Flesch-Kincaid Grade Level, Gunning Fog Index) are dominated by length and ignore conversational context, missing aspects of response quality such as reasoning depth, topic maintenance, and discourse planning. We introduce an LLM-as-a-judge framework that first classifies the Previous Adult Utterance Type and then scores the child's response along two axes: Expansion (contextual elaboration and inferential depth) and Independence (the child's contribution to advancing the discourse). These axes reflect fundamental dimensions in child language development, where Expansion captures elaboration, clause combining, and causal and contrastive connectives. Independence captures initiative, topic control, decreasing reliance on adult scaffolding through growing self-regulation, and audience design. We establish developmental validity by showing age-related patterns and demonstrate predictive value by improving age estimation over common baselines. We further confirm semantic sensitivity by detecting differences tied to discourse relations. Our metrics align with human judgments, enabling large-scale evaluation. This shifts child utterance assessment from simply measuring length to evaluating how meaningfully the child's speech contributes to and advances the conversation within its context.
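
### [25] [Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better](https://arxiv.org/abs/2602.05393)

*Ji Zhao, Yufei Gu, Shitong Shao, Xun Zhou, Liang Xiang, Zeke Xie*

Main category: cs.CL

TL;DR: This paper proposes the Late-to-Early Training (LET) paradigm, which uses late-layer representations from a small pretrained model to guide the early training steps and early layers of a larger language model, accelerating training and improving performance.
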
Details Motivation: Pretraining large language models is computationally expensive, yet the many existing small pretrained models have not been used effectively to accelerate the training of larger ones. Method: Proposes the LET paradigm, which uses late-layer representations from an already pretrained model as a guidance signal so that the target model learns 'later knowledge' in its early training steps and early layers; it comprises two mechanisms, late-to-early-step learning and late-to-early-layer learning. Result: Experiments on 1.4B and 7B models show that LET achieves up to 1.6× training speedup and nearly 5% higher downstream task accuracy when training a 1.4B model on the Pile, even when the pretrained model has 10× fewer parameters than the target model. Conclusion: LET is an efficient and robust knowledge-transfer method that speeds up the training of large models while improving performance, offering a new way to reuse small pretrained models. Abstract: As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: *Can we leverage existing small pretrained models to accelerate the training of larger models?* In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e. late training phase) model. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. These mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capabilities and downstream task performance, enabling faster training with superior performance. Extensive experiments on 1.4B and 7B parameter models demonstrate LET's efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to 1.6× speedup with nearly 5% improvement in downstream task accuracy compared to standard training, even when using a pretrained model with 10× fewer parameters than the target model.

### [26] [OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration](https://arxiv.org/abs/2602.05400)

*Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang*

Main category: cs.CL

TL;DR: OPUS is a dynamic data selection framework that defines data utility in the optimizer-induced update space and combines the Ghost technique with Boltzmann sampling, substantially improving pre-training data efficiency at very low additional overhead.

Details Motivation: As high-quality public text nears exhaustion (the 'Data Wall'), pre-training is shifting from 'more tokens' to 'better tokens'; existing methods either rely on static heuristic filters that ignore training dynamics or use dynamic criteria based on raw gradients that ignore the optimizer. Method: Proposes OPUS, which defines data utility in the optimizer-induced update space by projecting each candidate's effective update onto a target direction derived from a stable, in-distribution proxy; it uses the Ghost technique with CountSketch for efficient computation and Boltzmann sampling to preserve diversity. Result: In GPT-2 Large/XL pre-training, OPUS with only 30B tokens outperforms full training on 200B tokens; it yields further gains when combined with industrial-level static filters; in continued pre-training of Qwen3-8B-Base, 0.5B tokens outperform full training on 3B tokens. Conclusion: OPUS provides optimizer-aware, training-dynamics-aware, scalable, and low-overhead data selection, with clear data-efficiency gains across models, datasets, and optimizers. Abstract: As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
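
Boltzmann sampling over utility scores, which the OPUS abstract names as its diversity mechanism, can be sketched as follows. This covers only the sampling step: the utilities are taken as given numbers, and the optimizer-induced projection that OPUS uses to compute them is not reproduced. The function names, the `temperature` parameter, and sampling without replacement are assumptions for illustration.

```python
import math
import random


def boltzmann_probs(utilities, temperature=1.0):
    """Softmax of utility / temperature, shifted by the max for numerical stability."""
    top = max(utilities)
    exps = [math.exp((u - top) / temperature) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def sample_batch(utilities, k, temperature=1.0, seed=0):
    """Draw k distinct candidate indices, each draw proportional to its Boltzmann weight."""
    rng = random.Random(seed)
    pool = list(range(len(utilities)))
    weights = boltzmann_probs(utilities, temperature)
    chosen = []
    for _ in range(min(k, len(pool))):
        pick = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(pick))
        weights.pop(pick)
    return chosen


scores = [0.1, 2.0, 0.5, 1.5, -0.3]
probs = boltzmann_probs(scores, temperature=0.5)
batch = sample_batch(scores, k=3, temperature=0.5)
```

A low temperature concentrates selection on the highest-utility candidates, while a high temperature approaches uniform sampling; that trade-off is what lets a sampler favor useful data without collapsing onto a few examples.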
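
### [27] [Grammatical Error Correction Evaluation by Optimally Transporting Edit Representation](https://arxiv.org/abs/2602.05419)

*Takumi Goto, Yusuke Sakai, Taro Watanabe*

Main category: cs.CL

TL;DR: This paper proposes UOT-ERRANT, a new evaluation metric for grammatical error correction (GEC) based on edit vectors and unbalanced optimal transport. It improves evaluation performance, particularly in the +Fluency domain, and is highly interpretable.
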
Details Motivation: Reference-based metrics built on embedding similarity (such as BERTScore) work poorly for GEC because many source words remain unchanged in both hypothesis and reference, which obscures the signal from the actual edits; an evaluation focused on GEC-specific edits is needed. Method: Extracts GEC edits with ERRANT, defines an edit vector to represent each edit, and uses unbalanced optimal transport (UOT) to transport hypothesis edit vectors to reference edit vectors, yielding the new metric UOT-ERRANT. Result: On the SEEDA meta-evaluation benchmark, UOT-ERRANT improves evaluation performance, particularly in the +Fluency domain; its transport plan acts as a soft edit alignment, which aids interpretability. Conclusion: UOT-ERRANT is a more accurate and more interpretable automatic GEC metric, useful for both system ranking and error analysis. Abstract: Automatic evaluation in grammatical error correction (GEC) is crucial for selecting the best-performing systems. Currently, reference-based metrics are a popular choice, which basically measure the similarity between hypothesis and reference sentences. However, similarity measures based on embeddings, such as BERTScore, are often ineffective, since many words in the source sentences remain unchanged in both the hypothesis and the reference. This study focuses on edits specifically designed for GEC, i.e., ERRANT, and computes similarity measured over the edits from the source sentence. To this end, we propose the edit vector, a representation for an edit, and introduce a new metric, UOT-ERRANT, which transports these edit vectors from hypothesis to reference using unbalanced optimal transport. Experiments with SEEDA meta-evaluation show that UOT-ERRANT improves evaluation performance, particularly in the +Fluency domain where many edits occur. Moreover, our method is highly interpretable because the transport plan can be interpreted as a soft edit alignment, making UOT-ERRANT a useful metric for both system ranking and analyzing GEC systems. Our code is available from https://github.com/gotutiyan/uot-errant.
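
To make the "transport plan as soft edit alignment" idea concrete, the sketch below runs a generic entropic unbalanced optimal transport solver (Sinkhorn-style scaling with KL-relaxed marginals) on toy vectors. It is not the UOT-ERRANT implementation: how the paper builds edit vectors and which cost and solver it uses are not stated in the abstract, so the 2-D vectors, the Euclidean cost, and the `eps` and `rho` values are all assumptions.

```python
import math


def unbalanced_sinkhorn(a, b, cost, eps=0.5, rho=1.0, iters=200):
    """Entropic unbalanced OT with KL-relaxed marginals.

    a, b : positive masses of the source and target items
    cost : cost[i][j] of moving source item i to target item j
    eps  : entropic regularization strength
    rho  : marginal relaxation strength (larger rho is closer to balanced OT)
    Returns the transport plan as a nested list.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    power = rho / (rho + eps)
    for _ in range(iters):
        u = [(a[i] / sum(kernel[i][j] * v[j] for j in range(m))) ** power
             for i in range(n)]
        v = [(b[j] / sum(kernel[i][j] * u[i] for i in range(n))) ** power
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def euclidean(x, y):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


# Toy stand-ins for edit vectors: two hypothesis edits close to the two
# reference edits, plus one spurious hypothesis edit far from both.
hyp_edits = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
ref_edits = [[1.0, 0.1], [0.0, 0.9]]
cost = [[euclidean(h, r) for r in ref_edits] for h in hyp_edits]
plan = unbalanced_sinkhorn([1.0] * 3, [1.0] * 2, cost)
```

Because the marginals are only softly enforced, the spurious third edit can leave most of its mass untransported instead of being forced onto an unrelated reference edit, which is the property that makes the unbalanced variant a natural fit when hypothesis and reference contain different numbers of edits.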
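
### [28] [Once Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models](https://arxiv.org/abs/2602.05437)

*Basel Mousi, Fahim Dalvi, Shammur Chowdhury, Firoj Alam, Nadir Durrani*

Main category: cs.CL

TL;DR: This paper introduces M2CQA, a benchmark for evaluating counterfactual hallucination in vision-language models (VLMs) in multicultural, multilingual settings (English, Arabic, and Arabic dialects), together with the CFHR metric. Experiments show that hallucination rates of current VLMs rise sharply in Arabic, especially in dialects, and that reasoning-first prompting makes hallucination worse.
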
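Details Motivation: Existing hallucination benchmarks rarely test culturally plausible but visually incorrect interpretations, particularly outside English and Western contexts, and overlook the cultural diversity of the Middle East and North Africa (MENA) region and the many dialects of Arabic. Method: Builds M2CQA, a multimodal benchmark of images spanning 17 MENA countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects; proposes the CounterFactual Hallucination Rate (CFHR), defined as the rate of accepting the counterfactual statement given that the true statement was answered correctly; and evaluates state-of-the-art VLMs under multiple prompting strategies. Result: CFHR is markedly higher in Arabic, especially in dialects, than in English, even when true-statement accuracy remains high; reasoning-first prompting increases hallucination, while answering before justifying improves robustness. Conclusion: The cultural and linguistic generalization of VLMs has serious gaps, and accuracy alone does not guarantee reliability; culturally grounded benchmarks such as M2CQA and fine-grained metrics such as CFHR are needed to drive more robust multilingual, multicultural VLMs. Abstract: Vision-language models (VLMs) can achieve high accuracy while still accepting culturally plausible but visually incorrect interpretations. Existing hallucination benchmarks rarely test this failure mode, particularly outside Western contexts and English. We introduce M2CQA, a culturally grounded multimodal benchmark built from images spanning 17 MENA countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects. To isolate hallucination beyond raw accuracy, we propose the CounterFactual Hallucination Rate (CFHR), which measures counterfactual acceptance conditioned on correctly answering the true statement. Evaluating state-of-the-art VLMs under multiple prompting strategies, we find that CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy remains high. Moreover, reasoning-first prompting consistently increases counterfactual hallucination, while answering before justifying improves robustness. We will make the experimental resources and dataset publicly available for the community.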
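
The CFHR definition in the abstract (counterfactual acceptance conditioned on correctly answering the true statement) translates directly into a conditional rate. The sketch below is a straightforward reading of that sentence; the record format and the function name `cfhr` are assumptions, and the paper may aggregate differently (for example per language or per image).

```python
def cfhr(records):
    """CounterFactual Hallucination Rate.

    records: iterable of (true_correct, cf_accepted) booleans, one pair per
    item, where true_correct says the model answered the true statement
    correctly and cf_accepted says it also accepted the counterfactual one.
    Returns the share of counterfactual acceptances among items whose true
    statement was answered correctly, or None if there are no such items.
    """
    conditioned = [cf for true_ok, cf in records if true_ok]
    if not conditioned:
        return None
    return sum(conditioned) / len(conditioned)


results = [
    (True, True),    # correct on the true statement, but accepts the counterfactual
    (True, False),   # correct and rejects the counterfactual
    (False, True),   # excluded: the true statement was answered incorrectly
    (True, False),
]
rate = cfhr(results)
```

Conditioning on the correctly answered items is what separates this metric from raw accuracy: a model can score well on true statements and still have a high CFHR.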
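
### [29] [Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs](https://arxiv.org/abs/2602.05444)

*Yao Zhou, Zeen Song, Wenwen Qiang, Fengge Wu, Shuyi Zhou, Changwen Zheng, Hui Xiong*

Main category: cs.CL

TL;DR: This paper proposes CFA², a causal-inference-based jailbreak attack on LLMs that uses the front-door criterion to strip away safety alignment mechanisms, achieving high attack success rates with an interpretable account of the jailbreak.
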
Details Motivation: Safety alignment mechanisms in LLMs often operate as latent internal states that obscure the model's true capabilities; to expose and bypass these implicit defenses, the authors model them from a causal perspective as an unobserved confounder. Method: Proposes the CFA² framework: (1) treat the safety mechanism as an unobserved confounder; (2) apply Pearl's front-door criterion for causal adjustment; (3) use sparse autoencoders (SAEs) to strip defense-related features; (4) reduce the expensive marginalization to a deterministic intervention. Result: CFA² achieves state-of-the-art attack success rates and provides a mechanistic interpretation of the jailbreaking process. Conclusion: Through a causal modeling paradigm, CFA² improves the effectiveness and efficiency of jailbreak attacks as well as their interpretability, revealing an inherent fragility in safety alignment mechanisms. Abstract: Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model's inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. Then, we propose the Causal Front-Door Adjustment Attack (CFA$^2$) to jailbreak LLMs, a framework that leverages Pearl's Front-Door Criterion to sever the confounding associations for robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features, isolating the core task intent. We further reduce computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that CFA$^2$ achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.
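
### [30] [Structured Context Engineering for File-Native Agentic Systems: Evaluating Schema Accuracy, Format Effectiveness, and Multi-File Navigation at Scale](https://arxiv.org/abs/2602.05447)

*Damon McMillan*

Main category: cs.CL

TL;DR: Through 9,649 experiments, this paper systematically studies context engineering strategies for LLM agents handling structured data in SQL generation. It finds that model capability is the dominant factor in performance, and that architecture and format choices should be tailored to the model tier (frontier vs. open source) rather than following a universal recipe.
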
Details Motivation: LLM agents increasingly operate external systems through programmatic interfaces, but there is little empirical guidance on how to structure their input context; the paper uses SQL generation as a proxy task for programmatic agent operations to study how context structure affects performance. Method: Runs a large-scale controlled study (9,649 experiments) covering 11 models, 4 context formats (YAML, Markdown, JSON, TOON), and database schemas ranging from 10 to 10,000 tables, evaluating accuracy and token efficiency on SQL generation with statistical tests (p-values, chi-squared). Result: (1) File-based context retrieval improves accuracy for frontier models (Claude, GPT, Gemini) by 2.7% but lowers it for open source models by 7.7% in aggregate; (2) format has no significant effect overall (p=0.484), although individual open source models show format-specific sensitivities; (3) there is a 21 percentage point accuracy gap between frontier and open source models, far larger than any format or architecture effect; (4) file-native agents scale to 10,000 tables through domain-partitioned schemas while maintaining high navigation accuracy; (5) file size does not predict runtime efficiency, since compact formats can consume more tokens at scale because of format-unfamiliar search patterns. Conclusion: Context engineering should be tailored to model capability rather than relying on universal best practices; practitioners need to choose the context architecture and format that suits their model tier (frontier or open source) to deploy LLM agents effectively on structured systems. Abstract: Large Language Model agents increasingly operate external systems through programmatic interfaces, yet practitioners lack empirical guidance on how to structure the context these agents consume. Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data, comprising 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, Token-Oriented Object Notation [TOON]), and schemas ranging from 10 to 10,000 tables. Our findings challenge common assumptions. First, architecture choice is model-dependent: file-based context retrieval improves accuracy for frontier-tier models (Claude, GPT, Gemini; +2.7%, p=0.029) but shows mixed results for open source models (aggregate -7.7%, p<0.001), with deficits varying substantially by model. Second, format does not significantly affect aggregate accuracy (chi-squared=2.45, p=0.484), though individual models, particularly open source, exhibit format-specific sensitivities. Third, model capability is the dominant factor, with a 21 percentage point accuracy gap between frontier and open source tiers that dwarfs any format or architecture effect. Fourth, file-native agents scale to 10,000 tables through domain-partitioned schemas while maintaining high navigation accuracy. Fifth, file size does not predict runtime efficiency: compact formats can consume significantly more tokens at scale due to format-unfamiliar search patterns. These findings provide practitioners with evidence-based guidance for deploying LLM agents on structured systems, demonstrating that architectural decisions should be tailored to model capability rather than assuming universal best practices.

### [31] [Reasoning under Ambiguity: Uncertainty-Aware Multilingual Emotion Classification under Partial Supervision](https://arxiv.org/abs/2602.05471)

*Md. Mithun Hossaina, Mashary N. Alrasheedy, Nirban Bhowmick, Shamim Forhad, Md. Shakil Hossain, Sudipto Chaki, Md Shafiqul Islam*

Main category: cs.CL

TL;DR: This paper proposes 'Reasoning under Ambiguity', an uncertainty-aware framework for multilingual multi-label emotion classification that uses an entropy-based weighting mechanism and a mask-aware objective to handle incomplete annotation and emotional ambiguity.

Details Motivation: Existing methods assume fully observed labels and use deterministic learning objectives, which under partial supervision leads to biased learning and unreliable predictions; emotion is inherently ambiguous, and annotations are often missing or heterogeneous. Method: Uses a shared multilingual encoder with language-specific optimization, introduces an entropy-based ambiguity weighting mechanism that down-weights highly ambiguous instances, and adds a mask-aware objective with positive-unlabeled regularization for robust learning under partial supervision. Result: On English, Spanish, and Arabic emotion classification benchmarks, the method consistently outperforms strong baselines across multiple metrics, with improved training stability, robustness to annotation sparsity, and interpretability. Conclusion: Combining uncertainty modeling with partially supervised learning improves the performance and practicality of multilingual multi-label emotion recognition. Abstract: Contemporary knowledge-based systems increasingly rely on multilingual emotion identification to support intelligent decision-making, yet they face major challenges due to emotional ambiguity and incomplete supervision. Emotion recognition from text is inherently uncertain because multiple emotional states often co-occur and emotion annotations are frequently missing or heterogeneous. Most existing multi-label emotion classification methods assume fully observed labels and rely on deterministic learning objectives, which can lead to biased learning and unreliable predictions under partial supervision. This paper introduces Reasoning under Ambiguity, an uncertainty-aware framework for multilingual multi-label emotion classification that explicitly aligns learning with annotation uncertainty. The proposed approach uses a shared multilingual encoder with language-specific optimization and an entropy-based ambiguity weighting mechanism that down-weights highly ambiguous training instances rather than treating missing labels as negative evidence. A mask-aware objective with positive-unlabeled regularization is further incorporated to enable robust learning under partial supervision. Experiments on English, Spanish, and Arabic emotion classification benchmarks demonstrate consistent improvements over strong baselines across multiple evaluation metrics, along with improved training stability, robustness to annotation sparsity, and enhanced interpretability.
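
A mask-aware loss with an entropy-based instance weight can be sketched in a few lines. This is only an illustration of the two ideas named in the abstract, under assumptions the abstract does not confirm: ambiguity is measured here as the mean normalized binary entropy of the model's own predictions, the weight is `1 - ambiguity`, and the positive-unlabeled regularizer is omitted. The name `masked_weighted_bce` is hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) distribution, clamped away from 0 and 1."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def masked_weighted_bce(probs, labels, mask):
    """Per-instance loss for multi-label classification with missing labels.

    probs  : predicted probability for each emotion label
    labels : 0/1 annotation for each label (ignored where mask is 0)
    mask   : 1 if the label was annotated, 0 if it is missing
    Missing labels are skipped rather than treated as negatives, and the
    instance is down-weighted when its predictions are highly ambiguous.
    """
    observed = [(p, y) for p, y, m in zip(probs, labels, mask) if m]
    if not observed:
        return 0.0
    # Assumed ambiguity score: mean normalized entropy in [0, 1].
    ambiguity = sum(binary_entropy(p) for p, _ in observed) / (
        len(observed) * math.log(2.0))
    weight = 1.0 - ambiguity
    bce = -sum(
        y * math.log(max(p, 1e-12)) + (1 - y) * math.log(max(1.0 - p, 1e-12))
        for p, y in observed
    ) / len(observed)
    return weight * bce


loss = masked_weighted_bce(probs=[0.9, 0.2, 0.5], labels=[1, 0, 0], mask=[1, 1, 0])
```

In a real training loop the weight would typically be treated as a constant (no gradient through it), and the ambiguity score could equally come from annotator disagreement; both points are design choices outside what the abstract specifies.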
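
### [32] [LinguistAgent: A Reflective Multi-Model Platform for Automated Linguistic Annotation](https://arxiv.org/abs/2602.05493)

*Bingru Li*

Main category: cs.CL

TL;DR: This paper presents LinguistAgent, a platform with a reflective multi-model architecture and a dual-agent (Annotator and Reviewer) workflow that makes data annotation more efficient for complex semantic tasks in the Humanities and Social Sciences, such as metaphor identification, and supports comparative experiments across several LLM usage paradigms.
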
Details Motivation: Data annotation is a major bottleneck for complex semantic tasks in the Humanities and Social Sciences, such as metaphor identification; LLMs show promise, but a significant gap remains between their theoretical capability and researchers' practical needs. Method: Proposes the LinguistAgent platform, which uses a reflective multi-model architecture and a dual-agent (Annotator + Reviewer) workflow to simulate professional peer review; it supports three paradigms, Prompt Engineering (zero/few-shot), Retrieval-Augmented Generation (RAG), and fine-tuning, and provides real-time token-level evaluation (Precision/Recall/F1) on metaphor identification. Result: LinguistAgent is demonstrated on metaphor identification, offering reproducible and comparable automated annotation performance, with the code and application released as open source. Conclusion: LinguistAgent gives Humanities and Social Sciences researchers an easy-to-use, interpretable, and extensible LLM-driven annotation tool that narrows the gap between LLM capability and practical research needs. Abstract: Data annotation remains a significant bottleneck in the Humanities and Social Sciences, particularly for complex semantic tasks such as metaphor identification. While Large Language Models (LLMs) show promise, a significant gap remains between the theoretical capability of LLMs and their practical utility for researchers. This paper introduces LinguistAgent, an integrated, user-friendly platform that leverages a reflective multi-model architecture to automate linguistic annotation. The system implements a dual-agent workflow, comprising an Annotator and a Reviewer, to simulate a professional peer-review process. LinguistAgent supports comparative experiments across three paradigms: Prompt Engineering (Zero/Few-shot), Retrieval-Augmented Generation, and Fine-tuning. We demonstrate LinguistAgent's efficacy using the task of metaphor identification as an example, providing real-time token-level evaluation (Precision, Recall, and $F_1$ score) against human gold standards. The application and code are released at https://github.com/Bingru-Li/LinguistAgent.
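
Token-level Precision, Recall, and $F_1$ against a gold standard, as mentioned in the abstract, follow the standard definitions. The sketch below assumes binary per-token labels (1 = metaphor, 0 = literal) aligned one-to-one between prediction and gold; the function name and label encoding are assumptions, not LinguistAgent's actual interface.

```python
def token_prf(predicted, gold):
    """Token-level precision, recall, and F1 for binary labels.

    predicted, gold: equal-length sequences of 0/1 labels, one per token.
    Returns (precision, recall, f1); each is 0.0 when its denominator is 0.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Tokens: "Time", "is", "a", "thief", "today"
gold_labels = [1, 0, 0, 1, 1]
pred_labels = [1, 1, 0, 0, 1]
precision, recall, f1 = token_prf(pred_labels, gold_labels)
```

Here two of the three predicted metaphor tokens are correct and two of the three gold metaphor tokens are found, so all three scores equal 2/3.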
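
### [33] [Transport and Merge: Cross-Architecture Merging for Large Language Models](https://arxiv.org/abs/2602.05495)

*Chenhang Cui, Binyun Yang, Fei Shen, Yuxin Chen, Jingnan Zheng, Xiang Wang, An Zhang, Tat-Seng Chua*

Main category: cs.CL

TL;DR: This paper proposes a cross-architecture model merging framework based on optimal transport (OT) for transferring knowledge from large language models (LLMs) to heterogeneous, smaller low-resource models, achieving effective weight fusion and performance gains using only a small set of inputs.
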
Details Motivation: Real-world deployments often rely on small low-resource models, while existing model merging methods usually require architecture-compatible models and so cannot transfer knowledge from large high-resource LLMs to heterogeneous small models. Method: Proposes an OT-based cross-architecture merging framework that aligns activations to infer cross-neuron correspondences between heterogeneous models and uses the resulting transport plans to guide direct weight-space fusion. Result: Extensive experiments on low-resource languages and specialized domains show consistent improvements over the target models. Conclusion: OT-based cross-architecture merging is an efficient, lightweight, and general knowledge-transfer mechanism that bridges the gap between high-resource LLMs and the low-resource models used in practice. Abstract: Large language models (LLMs) achieve strong capabilities by scaling model capacity and training data, yet many real-world deployments rely on smaller models trained or adapted from low-resource data. This gap motivates the need for mechanisms to transfer knowledge from large, high-resource models to smaller, low-resource targets. While model merging provides an effective transfer mechanism, most existing approaches assume architecture-compatible models and therefore cannot directly transfer knowledge from large high-resource LLMs to heterogeneous low-resource targets. In this work, we propose a cross-architecture merging framework based on optimal transport (OT) that aligns activations to infer cross-neuron correspondences between heterogeneous models. The resulting transport plans are then used to guide direct weight-space fusion, enabling effective high-resource to low-resource transfer using only a small set of inputs. Extensive experiments across low-resource languages and specialized domains demonstrate consistent improvements over target models.
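### [34] [A Human-in-the-Loop, LLM-Centered Architecture for Knowledge-Graph Question Answering](https://arxiv.org/abs/2602.05512) *Larissa Pusch,Alexandre Courtiol,Tim Conrad* Main category: cs.CL TL;DR: The paper proposes an interactive framework in which large language models (LLMs) generate and explain Cypher graph queries and users iteratively refine them in natural language, improving the accessibility, accuracy, and explainability of complex knowledge graphs (KGs).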
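Details Motivation: LLMs are prone to hallucination, outdated information, and poor explainability in knowledge-intensive tasks, while conventional RAG struggles with multi-hop reasoning and KG querying requires command of a specialized query language. Method: An LLM-driven interactive framework in which the LLM generates Cypher queries and explains them in natural language, and users correct them through natural-language feedback; query explanation quality and fault detection are evaluated on a synthetic movie KG (90-query benchmark) and on two real-world KGs (Hyena, MaRDI). Result: The framework markedly improves the accessibility and factual accuracy of KG querying; a multi-model comparison confirms its effectiveness for query explanation and fault detection and shows that model performance varies across domains. Conclusion: Combining the natural-language abilities of LLMs with the structured semantics of KGs through human-machine interaction balances reasoning ability, explainability, and practicality, offering a new paradigm for knowledge-intensive applications. Abstract: Large Language Models (LLMs) excel at language understanding but remain limited in knowledge-intensive domains due to hallucinations, outdated information, and limited explainability. Text-based retrieval-augmented generation (RAG) helps ground model outputs in external sources but struggles with multi-hop reasoning. Knowledge Graphs (KGs), in contrast, support precise, explainable querying, yet require knowledge of query languages. This work introduces an interactive framework in which LLMs generate and explain Cypher graph queries and users iteratively refine them through natural language. Applied to real-world KGs, the framework improves accessibility to complex datasets while preserving factual accuracy and semantic rigor and provides insight into how model performance varies across domains. Our core quantitative evaluation is a 90-query benchmark on a synthetic movie KG that measures query explanation quality and fault detection across multiple LLMs, complemented by two smaller real-life query-generation experiments on a Hyena KG and the MaRDI (Mathematical Research Data Initiative) KG.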
### [35] [Multi-Task GRPO: Reliable LLM Reasoning Across Tasks](https://arxiv.org/abs/2602.05547) *Shyam Sundhar Ramesh,Xiaotong Ji,Matthieu Zimmer,Sangwoong Yoon,Zhiyong Wang,Haitham Bou Ammar,Aurelien Lucchi,Ilija Bogunovic* Main category: cs.CL TL;DR: The paper proposes a multi-task GRPO (MT-GRPO) algorithm that dynamically adjusts task weights and introduces a ratio-preserving sampler to improve worst-task performance and training efficiency for large language models in multi-task settings.
Details Motivation: In multi-task settings, existing GRPO tends to produce imbalanced performance across tasks, and the share of prompts that yield zero gradients varies widely between tasks, which distorts the optimization signal. Method: MT-GRPO (i) dynamically reweights tasks to explicitly optimize worst-task performance and (ii) uses a ratio-preserving sampler so that policy gradients accurately reflect the adapted task weights. Result: In the 3-task and 9-task settings, MT-GRPO improves worst-task accuracy by 16-28% and 6% absolute over standard GRPO and DAPO, respectively; in the 3-task setting it reaches 50% worst-task accuracy with 50% fewer training steps, while average accuracy remains competitive. Conclusion: MT-GRPO mitigates performance imbalance in multi-task RL fine-tuning, substantially improving worst-task performance and training efficiency while preserving overall performance, which makes it better suited to real-world deployment. Abstract: RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.
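
To make the idea of dynamically adapting task weights toward the worst task concrete, here is a minimal sketch. The abstract does not state the update rule, so the multiplicative-weights step below is an assumption chosen for illustration, not the MT-GRPO rule; `update_task_weights`, `step_size`, and the accuracies are hypothetical.

```python
import math


def update_task_weights(weights, accuracies, step_size=1.0):
    """Multiplicative-weights step: tasks with lower accuracy gain weight.

    Assumed illustration only; the paper's actual update may differ.
    """
    scaled = [w * math.exp(-step_size * acc) for w, acc in zip(weights, accuracies)]
    total = sum(scaled)
    # Renormalize so the weights remain a probability distribution over tasks.
    return [s / total for s in scaled]


weights = [1 / 3, 1 / 3, 1 / 3]
accuracies = [0.80, 0.55, 0.20]  # hypothetical per-task accuracies
new_weights = update_task_weights(weights, accuracies)
```

Repeating such a step shifts sampling or loss weight toward the lagging task, which is one common way to approximate a worst-case (max-min) objective.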
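### [36] [CASTLE: A Comprehensive Benchmark for Evaluating Student-Tailored Personalized Safety in Large Language Models](https://arxiv.org/abs/2602.05633) *Rui Jia,Ruiyi Lan,Fengrui Liu,Zhongxiang Dai,Bo Jiang,Jing Shao,Jingyuan Chen,Guandong Xu,Fei Wu,Min Zhang* Main category: cs.CL TL;DR: The paper introduces the concept of Student-Tailored Personalized Safety and builds the CASTLE benchmark, which covers 15 educational safety risks and 14 student attributes across 92,908 bilingual scenarios, with three evaluation metrics; experiments show that current SOTA LLMs fall well short on personalized safety.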
Details Motivation: The generation mechanisms of current large language models produce homogeneous responses to identical prompts, ignoring the cognitive and psychological heterogeneity of students, and traditional safety metrics (such as factual accuracy, bias, and toxicity) cannot capture the different harms that the same response may cause to students with different attributes. Method: Grounded in educational theory, the paper proposes Student-Tailored Personalized Safety and builds the CASTLE benchmark, covering 15 categories of educational safety risk and 14 categories of student attributes, for a total of 92,908 bilingual scenarios; it defines three evaluation metrics: Risk Sensitivity, Emotional Empathy, and Student Alignment. Result: Experiments on 18 SOTA large language models show that every model scores below an average safety rating of 2.3 out of 5 on CASTLE, revealing serious deficiencies in personalized safety assurance. Conclusion: Current large language models are severely lacking in student-personalized safety for educational settings, and finer-grained, attribute-aware safety evaluation and modeling methods are urgently needed. Abstract: Large language models (LLMs) have advanced the development of personalized learning in education. However, their inherent generation mechanisms often produce homogeneous responses to identical prompts. This one-size-fits-all mechanism overlooks the substantial heterogeneity in students' cognitive and psychological profiles, thereby posing potential safety risks to vulnerable groups. Existing safety evaluations primarily rely on context-independent metrics such as factual accuracy, bias, or toxicity, which fail to capture the divergent harms that the same response might cause across different student attributes. To address this gap, we propose the concept of Student-Tailored Personalized Safety and construct CASTLE based on educational theories. This benchmark covers 15 educational safety risks and 14 student attributes, comprising 92,908 bilingual scenarios. We further design three evaluation metrics: Risk Sensitivity, measuring the model's ability to detect risks; Emotional Empathy, evaluating the model's capacity to recognize student states; and Student Alignment, assessing the match between model responses and student attributes. Experiments on 18 SOTA LLMs demonstrate that CASTLE poses a significant challenge: all models scored below an average safety rating of 2.3 out of 5, indicating substantial deficiencies in personalized safety assurance.
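### [37] [Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew](https://arxiv.org/abs/2602.05648) *Giuseppe Samo,Paola Merlo* Main category: cs.CL TL;DR: The paper studies how Transformer models represent complex verb paradigms in Turkish and Modern Hebrew, focusing on the effect of tokenization strategy. For Turkish, with its transparent morphological marking, both monolingual and multilingual models perform well; for Hebrew, with its non-concatenative morphology, only the monolingual model with morpheme-aware segmentation performs well.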
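Details Motivation: To examine how tokenization strategies (atomic, subword, character-level, morpheme-level) affect the ability of Transformer models to capture complex verb paradigms in highly inflected languages (Turkish and Hebrew). Method: Using the Blackbird Language Matrices task on natural data, the paper compares monolingual and multilingual Transformer models under different tokenization strategies (atomic, subword, character-level, morpheme-aware) and extends the comparison to synthetic datasets for validation. Result: For Turkish, both monolingual and multilingual models capture verb paradigms with atomic or small-subword tokenization; for Hebrew, the character-level multilingual model fails while the morpheme-aware monolingual model performs well; all models improve on synthetic data. Conclusion: The tokenization strategy should match the morphological type of the target language: transparent agglutinative languages such as Turkish are tolerant of different strategies, whereas non-concatenative languages such as Hebrew depend on morpheme-aware modeling; monolingual models may outperform multilingual ones on specific morphological structures. Abstract: We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish -- with its transparent morphological markers -- both monolingual and multilingual models succeed, either when tokenization is atomic or when it breaks words into small subword units. For Hebrew, instead, monolingual and multilingual models diverge. A multilingual model using character-level tokenization fails to capture the language's non-concatenative morphology, but a monolingual model with morpheme-aware segmentation performs well. Performance improves on more synthetic datasets, in all models.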
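### [38] [MedErrBench: A Fine-Grained Multilingual Benchmark for Medical Error Detection and Correction with Clinical Expert Annotations](https://arxiv.org/abs/2602.05692) *Congbo Ma,Yichun Zhang,Yousef Al-Jazzazi,Ahamed Foisal,Laasya Sharma,Yousra Sadqi,Khaled Saleh,Jihad Mallat,Farah E. Shamout* Main category: cs.CL TL;DR: The paper presents MedErrBench, the first multilingual benchmark for error detection, localization, and correction in clinical text, covering English, Arabic, and Chinese and annotated and reviewed by clinical experts; experiments reveal notable performance gaps for current large language models in non-English clinical settings and underline the need for clinically grounded, language-aware AI systems.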
Details Motivation: Inaccuracies in existing or generated clinical text can have serious consequences, yet dedicated evaluation benchmarks covering multiple languages and clinical scenarios are lacking. Method: The authors build MedErrBench, a multilingual (English/Arabic/Chinese) clinical error benchmark based on an expanded taxonomy of ten common error types, annotated and reviewed by clinical experts, and evaluate a range of general-purpose, language-specific, and medical-domain large language models on three tasks: error detection, localization, and correction. Result: Models lag noticeably in the non-English settings (Arabic and Chinese), exposing shortcomings in clinical accuracy and language adaptability. Conclusion: MedErrBench fills a gap in multilingual clinical NLP evaluation, and its public release is intended to support safer and more equitable medical AI systems worldwide. Abstract: Inaccuracies in existing or generated clinical text may lead to serious adverse consequences, especially if it is a misdiagnosis or incorrect treatment suggestion. With Large Language Models (LLMs) increasingly being used across diverse healthcare applications, comprehensive evaluation through dedicated benchmarks is crucial. However, such datasets remain scarce, especially across diverse languages and contexts. In this paper, we introduce MedErrBench, the first multilingual benchmark for error detection, localization, and correction, developed under the guidance of experienced clinicians. Based on an expanded taxonomy of ten common error types, MedErrBench covers English, Arabic and Chinese, with natural clinical cases annotated and reviewed by domain experts. We assessed the performance of a range of general-purpose, language-specific, and medical-domain language models across all three tasks. Our results reveal notable performance gaps, particularly in non-English settings, highlighting the need for clinically grounded, language-aware systems. By making MedErrBench and our evaluation protocols publicly available, we aim to advance multilingual clinical NLP to promote safer and more equitable AI-based healthcare globally. The dataset is available in the supplementary material. An anonymized version of the dataset is available at: https://github.com/congboma/MedErrBench.
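### [39] [Consensus-Aligned Neuron Efficient Fine-Tuning Large Language Models for Multi-Domain Machine Translation](https://arxiv.org/abs/2602.05694) *Shuting Jiang,Ran Song,Yuxin Huang,Yan Xiang,Yantuan Xian,Shengxiang Gao,Zhengtao Yu* Main category: cs.CL TL;DR: The paper proposes a neuron-efficient fine-tuning framework for multi-domain machine translation (MDMT) that identifies and updates consensus-aligned neurons by maximizing the mutual information between neuron behavior and domain features, mitigating parameter interference and domain overfitting and reaching SOTA performance across several models and language pairs.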
Details Motivation: Existing LLM-based multi-domain machine translation methods, such as in-context learning and parameter-efficient fine-tuning, suffer from domain shift, parameter interference, and limited generalization. Method: A neuron-efficient fine-tuning framework that first identifies consensus-aligned neurons by maximizing the mutual information between neuron behavior and domain features, and then fine-tunes only those neurons to capture both general translation patterns and domain-specific nuances. Result: Experiments on three LLMs and ten German-English and Chinese-English translation domains show that the method consistently outperforms strong PEFT baselines on both seen and unseen domains, reaching state-of-the-art performance. Conclusion: Selecting and fine-tuning consensus-aligned neurons is an effective way to mitigate parameter interference and domain overfitting in LLM-based MDMT and substantially improves cross-domain generalization. Abstract: Multi-domain machine translation (MDMT) aims to build a unified model capable of translating content across diverse domains. Despite the impressive machine translation capabilities demonstrated by large language models (LLMs), domain adaptation still remains a challenge for LLMs. Existing MDMT methods such as in-context learning and parameter-efficient fine-tuning often suffer from domain shift, parameter interference and limited generalization. In this work, we propose a neuron-efficient fine-tuning framework for MDMT that identifies and updates consensus-aligned neurons within LLMs. These neurons are selected by maximizing the mutual information between neuron behavior and domain features, enabling LLMs to capture both generalizable translation patterns and domain-specific nuances. Our method then fine-tunes LLMs guided by these neurons, effectively mitigating parameter interference and domain-specific overfitting. Comprehensive experiments on three LLMs across ten German-English and Chinese-English translation domains evidence that our method consistently outperforms strong PEFT baselines on both seen and unseen domains, achieving state-of-the-art performance.
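### [40] [OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale](https://arxiv.org/abs/2602.05711) *Jingze Shi,Zhangyang Peng,Yizhang Zhu,Yifan Wu,Guang Liu,Yuyu Luo* Main category: cs.CL TL;DR: OmniMoE pushes Mixture-of-Experts toward extremely fine-grained experts, using atomic experts, a Cartesian Product Router, and Expert-Centric Scheduling to keep accuracy high while sharply reducing inference latency.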
Details Motivation: Existing MoE architectures face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency, which must be overcome to improve parameter efficiency. Method: OmniMoE, a system-algorithm co-designed framework that introduces vector-level Atomic Experts, a Cartesian Product Router that reduces routing complexity from O(N) to O(√N), and Expert-Centric Scheduling that turns sparse lookups into dense matrix operations. Result: On seven benchmarks, OmniMoE (1.7B active parameters) reaches 50.9% zero-shot accuracy, outperforming DeepSeekMoE and PEER; compared with PEER, inference latency drops from 73 ms to 6.7 ms (a 10.9× speedup). Conclusion: Extremely fine-grained MoE can deliver efficient, fast, and accurate inference through co-design, challenging the view that fine granularity and efficiency cannot coexist. Abstract: Mixture-of-Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector-level Atomic Experts, enabling scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general-purpose processing. Although this atomic design maximizes capacity, it poses severe challenges for routing complexity and memory access. To address these, OmniMoE adopts a system-algorithm co-design: (i) a Cartesian Product Router that decomposes the massive index space to reduce routing complexity from O(N) to O(sqrt(N)); and (ii) Expert-Centric Scheduling that inverts the execution order to turn scattered, memory-bound lookups into efficient dense matrix operations. Validated on seven benchmarks, OmniMoE (with 1.7B active parameters) achieves 50.9% zero-shot accuracy, outperforming coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73ms to 6.7ms (a 10.9-fold speedup) compared to PEER, demonstrating that massive-scale fine-grained MoE can be fast and accurate. Our code is open-sourced at https://github.com/flash-algo/omni-moe.
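
The reduction from O(N) to O(√N) routing can be illustrated with a generic product-key style sketch. It assumes N = n × n experts arranged on a grid, with expert (i, j) scored as the sum of a row score and a column score; this is an illustration of the index-decomposition idea, not OmniMoE's actual router, and `product_route` and the scores are hypothetical.

```python
import heapq
from itertools import product


def product_route(row_scores, col_scores, k):
    """Return the k best (score, expert_id) pairs over an n x n expert grid.

    Only len(row_scores) + len(col_scores) scores are computed, instead of
    one score per expert.
    """
    n = len(col_scores)
    # The top-k experts of an additive score always lie in top-k rows x top-k columns.
    top_rows = heapq.nlargest(k, range(len(row_scores)), key=row_scores.__getitem__)
    top_cols = heapq.nlargest(k, range(n), key=col_scores.__getitem__)
    candidates = [
        (row_scores[i] + col_scores[j], i * n + j)
        for i, j in product(top_rows, top_cols)
    ]
    return heapq.nlargest(k, candidates)


row_scores = [0.1, 0.9, 0.3, 0.2]  # hypothetical sub-router outputs
col_scores = [0.5, 0.0, 0.8, 0.1]
chosen = product_route(row_scores, col_scores, k=2)
chosen_ids = [expert_id for _, expert_id in chosen]
```

With 16 experts, only 8 scores are computed and at most k² = 4 candidate sums are compared, yet the selected experts match an exhaustive search over the grid.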
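### [41] [CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering](https://arxiv.org/abs/2602.05728) *Hao Yang,Zhiyu Yang,Xupeng Zhang,Wei Wei,Yunjie Zhang,Lin Yang* Main category: cs.CL TL;DR: CompactRAG is an efficient multi-hop retrieval-augmented generation framework that builds an atomic QA knowledge base offline and performs lightweight reasoning online, substantially reducing the number of LLM calls and token consumption while maintaining competitive accuracy.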
Details Motivation: Existing multi-hop RAG systems are inefficient because of repeated LLM calls, high token consumption, and inconsistent entity grounding across hops. Method: CompactRAG converts the corpus into an atomic question-answer knowledge base with an LLM in an offline stage; in the online stage it decomposes and rewrites the query in an entity-consistent way and then resolves it through dense retrieval and RoBERTa-based answer extraction, using only two LLM calls in total. Result: On HotpotQA, 2WikiMultiHopQA, and MuSiQue, CompactRAG reaches accuracy comparable to iterative RAG while substantially reducing token consumption. Conclusion: By decoupling offline restructuring from online reasoning, CompactRAG improves the cost-effectiveness and practicality of multi-hop RAG without sacrificing performance. Abstract: Retrieval-augmented generation (RAG) has become a key paradigm for knowledge-intensive question answering. However, existing multi-hop RAG systems remain inefficient, as they alternate between retrieval and reasoning at each step, resulting in repeated LLM calls, high token consumption, and unstable entity grounding across hops. We propose CompactRAG, a simple yet effective framework that decouples offline corpus restructuring from online reasoning. In the offline stage, an LLM reads the corpus once and converts it into an atomic QA knowledge base, which represents knowledge as minimal, fine-grained question-answer pairs. In the online stage, complex queries are decomposed and carefully rewritten to preserve entity consistency, and are resolved through dense retrieval followed by RoBERTa-based answer extraction. Notably, during inference, the LLM is invoked only twice in total - once for sub-question decomposition and once for final answer synthesis - regardless of the number of reasoning hops. Experiments on HotpotQA, 2WikiMultiHopQA, and MuSiQue demonstrate that CompactRAG achieves competitive accuracy while substantially reducing token consumption compared to iterative RAG baselines, highlighting a cost-efficient and practical approach to multi-hop reasoning over large knowledge corpora. The implementation is available at GitHub.
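### [42] [LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards](https://arxiv.org/abs/2602.05758) *Bowen Ping,Zijun Chen,Yiyao Yu,Tingfeng Hui,Junchi Yan,Baobao Chang* Main category: cs.CL TL;DR: The paper proposes the LongR framework, which improves the long-context reasoning of large language models through a dynamic "Think-and-Read" mechanism and a contextual density reward based on relative information gain.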
Details Motivation: Existing reinforcement learning methods based on sparse, outcome-only rewards yield limited gains in long-context reasoning because such coarse signals cannot effectively guide complex reasoning. Method: LongR, a unified framework that combines a dynamic "Think-and-Read" mechanism (interleaving reasoning with document consultation) with a contextual density reward based on relative information gain, which quantifies the utility of relevant documents. Result: LongR gains 9% on LongBench v2 and improves consistently on RULER and InfiniteBench; it is effective across RL algorithms such as DAPO and GSPO; further analyses examine the effect of reasoning chain length and robustness to distractors. Conclusion: With a fine-grained, context-aware reward and an interactive reasoning strategy, LongR substantially improves the reasoning performance and generalization of large language models on long-context tasks. Abstract: Reinforcement Learning has emerged as a key driver for LLM reasoning. This capability is equally pivotal in long-context scenarios, such as long-dialogue understanding and structured data analysis, where the challenge extends beyond consuming tokens to performing rigorous deduction. While existing efforts focus on data synthesis or architectural changes, recent work points out that relying solely on sparse, outcome-only rewards yields limited gains, as such coarse signals are often insufficient to effectively guide the complex long-context reasoning. To address this, we propose LongR, a unified framework that enhances long-context performance by integrating a dynamic "Think-and-Read" mechanism, which interleaves reasoning with document consultation, with a contextual density reward based on relative information gain to quantify the utility of the relevant documents. Empirically, LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, demonstrating robust efficiency in navigating extensive contexts. Furthermore, LongR consistently enhances performance across diverse RL algorithms (e.g., DAPO, GSPO). Finally, we conduct in-depth analyses to investigate the impact of reasoning chain length on efficiency and the model's robustness against distractors.
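### [43] [Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors](https://arxiv.org/abs/2602.05769) *Adnan Al Ali,Jindřich Helcl,Jindřich Libovický* Main category: cs.CL TL;DR: The paper revisits whether detectors of LLM-generated text are systematically biased against texts written by non-native speakers, this time for Czech; it finds that non-native texts do not have lower perplexity than native texts, that current detectors do not systematically misclassify non-native texts, and that they no longer rely mainly on perplexity.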
Details Motivation: Prior work reported that perplexity-based detectors of LLM-generated text tend to misclassify texts written by non-native speakers as AI-generated; this paper tests whether that conclusion still holds in the Czech language setting. Method: The authors compare the perplexity of texts by native and non-native speakers of Czech, evaluate detectors from three families in this setting, and examine the role perplexity actually plays in current detectors. Result: (1) Texts by non-native speakers of Czech do not have lower perplexity than texts by native speakers; (2) none of the three detector families shows a systematic bias against non-native speakers; (3) contemporary detectors remain effective without relying on perplexity. Conclusion: The earlier finding that detectors are biased against non-native speakers does not hold for Czech, and perplexity is no longer the core discriminating feature of current detectors. Abstract: LLM-based assistants have been widely popularised after the release of ChatGPT. Concerns have been raised about their misuse in academia, given the difficulty of distinguishing between human-written and generated text. To combat this, automated techniques have been developed and shown to be effective, to some extent. However, prior work suggests that these methods often falsely flag essays from non-native speakers as generated, due to their low perplexity extracted from an LLM, which is supposedly a key feature of the detectors. We revisit these statements two years later, specifically in the Czech language setting. We show that the perplexity of texts from non-native speakers of Czech is not lower than that of native speakers. We further examine detectors from three separate families and find no systematic bias against non-native speakers. Finally, we demonstrate that contemporary detectors operate effectively without relying on perplexity.
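
Since the argument turns on perplexity, a short worked example may help. Perplexity is the exponential of the negative mean log-probability that a language model assigns to each token. The sketch below assumes the per-token natural-log probabilities have already been obtained from some LLM; the function name and the toy values are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-(1/T) * sum of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy case: the model assigns probability 0.25 to each of 8 tokens.
uniform = [math.log(0.25)] * 8
ppl = perplexity(uniform)
```

In this toy case the perplexity is 4, matching the intuition of a model choosing uniformly among four options per token. Lower values mean the text is more predictable to the model, which is why low-perplexity human writing was thought to be at risk of being flagged as generated.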
### [44] [Reinforcement World Model Learning for LLM-based Agents](https://arxiv.org/abs/2602.05842) *Xiao Yu,Baolin Peng,Ruize Xu,Yelong Shen,Pengcheng He,Suman Nath,Nikhil Singh,Jiangfeng Gao,Zhou Yu* Main category: cs.CL TL;DR: The paper proposes Reinforcement World Model Learning (RWML), a self-supervised method that strengthens the world-modeling ability of large language models (LLMs) in agentic tasks by using a sim-to-real gap reward to align simulated states with real environment states, yielding significant gains on ALFWorld and τ² Bench.
Details Motivation: In agentic tasks, LLMs struggle to anticipate the consequences of actions and to adapt to environment dynamics; they lack world-modeling ability, and the consistency of their internal simulation of the environment needs to improve. Method: RWML uses a sim-to-real gap reward to align, in a pre-trained embedding space, the simulated next state produced by the model with the real next state observed in the environment, avoiding the semantic distortion and model collapse that token-level next-state prediction can cause. Result: RWML clearly outperforms the base model on ALFWorld and τ² Bench; combined with task-success rewards, it beats direct task-success reward RL by 6.9 and 5.7 points, respectively, and matches the performance of expert-data training. Conclusion: RWML offers a robust, self-supervised paradigm for world-model learning that improves the environment modeling and generalization of LLM agents and is less susceptible to reward hacking. Abstract: Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and $τ^2$ Bench and observe significant gains over the base model, despite being entirely self-supervised. When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and $τ^2$ Bench respectively, while matching the performance of expert-data training.
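
One way to picture a reward that compares simulated and observed next states in an embedding space is cosine similarity between the two embedding vectors. The abstract does not specify the exact form of the sim-to-real gap reward, so the function below is an assumed illustration; the embeddings would come from some pre-trained text encoder, represented here by hypothetical hand-written vectors.

```python
import math


def cosine_reward(sim_embedding, real_embedding):
    """Cosine similarity between simulated and observed next-state embeddings.

    Assumed illustration of an embedding-space reward; not the paper's formula.
    """
    dot = sum(a * b for a, b in zip(sim_embedding, real_embedding))
    norm = math.sqrt(sum(a * a for a in sim_embedding)) * math.sqrt(
        sum(b * b for b in real_embedding)
    )
    return dot / norm if norm else 0.0


# Hypothetical 3-dimensional embeddings of a predicted and an observed state.
predicted = [0.2, 0.9, 0.1]
observed = [0.2, 0.9, 0.1]
unrelated = [0.9, -0.2, 0.0]
reward_match = cosine_reward(predicted, observed)
reward_mismatch = cosine_reward(predicted, unrelated)
```

A reward of this kind depends on meaning as captured by the encoder rather than on exact wording, which is the contrast the abstract draws with token-level next-state prediction.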
### [45] [OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions](https://arxiv.org/abs/2602.05843) *Fangzhi Xu,Hang Yan,Qiushi Sun,Jinyang Wu,Zixian Huang,Muye Huang,Jingyang Gong,Zichen Ding,Kanzhi Cheng,Yian Wang,Xinyu Che,Zeyi Sun,Jian Zhang,Zhangyue Yin,Haoran Luo,Xuanjing Huang,Ben Kao,Jun Liu,Qika Lin* Main category: cs.CL TL;DR: The paper presents OdysseyArena, an evaluation framework for autonomous agents that focuses on long-horizon, active, and inductive interactions, addressing the fact that existing evaluations overlook an agent's ability to discover latent transition laws from experience.
Details Motivation: Existing evaluations mainly follow a deductive paradigm and overlook a key capability: agents must autonomously induce latent state-transition laws from experience, which underpins foresight and strategic coherence. Method: The OdysseyArena framework formalizes and instantiates four basic interaction primitives; OdysseyArena-Lite (120 tasks) provides standardized benchmarking, and OdysseyArena-Challenge stress-tests stability over extreme interaction horizons (e.g., more than 200 steps). Result: Experiments on more than 15 leading large language models show that even frontier models have marked deficiencies in inductive scenarios, exposing a critical bottleneck in autonomous discovery in complex environments. Conclusion: OdysseyArena provides a new benchmark for evaluating agents' inductive learning and long-horizon planning and highlights a fundamental limitation of current LLM-driven agents in autonomously discovering dynamic regularities. Abstract: The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, where agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the inductive necessity for agents to discover latent transition laws from experience autonomously, which is the cornerstone for enabling agentic foresight and sustaining strategic coherence. To bridge this gap, we introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building upon this, we establish OdysseyArena-Lite for standardized benchmarking, providing a set of 120 tasks to measure an agent's inductive efficiency and long-horizon discovery. Pushing further, we introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons (e.g., > 200 steps). Extensive experiments on 15+ leading LLMs reveal that even frontier models exhibit a deficiency in inductive scenarios, identifying a critical bottleneck in the pursuit of autonomous discovery in complex environments. Our code and data are available at https://github.com/xufangzhi/Odyssey-Arena
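### [46] [RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference](https://arxiv.org/abs/2602.05853) *Siran Liu,Guoxia Wang,Sa Wang,Jinle Zeng,HaoYang Xie,Siyu Lou,JiaBin Yang,DianHai Yu,Haifeng Wang,Chao Yang* Main category: cs.CL TL;DR: The paper proposes RRAttention, a dynamic sparse attention mechanism whose round-robin sampling strategy preserves query independence while enabling efficient global pattern discovery, substantially reducing computational complexity and speeding up long-context processing.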
Details Motivation: Existing dynamic sparse attention methods face fundamental trade-offs (they require preprocessing, lack global evaluation, violate query independence, or incur high computational overhead), making it hard to achieve both efficiency and quality. Method: A head round-robin sampling strategy that rotates the query sampling positions of the attention heads within each stride, combined with stride-level aggregation and adaptive Top-τ selection, to obtain input-adaptive sparsity. Result: On the HELMET and Video-MME benchmarks, RRAttention recovers over 99% of full-attention performance while computing only half of the attention blocks, achieves a 2.4× speedup at 128K context length, and outperforms existing dynamic sparse attention methods. Conclusion: RRAttention simultaneously offers the desirable properties of no preprocessing, global evaluation, query independence, and low overhead, providing an efficient and robust attention alternative for long-context large models. Abstract: The quadratic complexity of attention mechanisms poses a critical bottleneck for large language models processing long contexts. While dynamic sparse attention methods offer input-adaptive efficiency, they face fundamental trade-offs: requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead. We present RRAttention, a novel dynamic sparse attention method that simultaneously achieves all desirable properties through a head round-robin (RR) sampling strategy. By rotating query sampling positions across attention heads within each stride, RRAttention maintains query independence while enabling efficient global pattern discovery with stride-level aggregation. Our method reduces complexity from $O(L^2)$ to $O(L^2/S^2)$ and employs adaptive Top-$τ$ selection for optimal sparsity. Extensive experiments on natural language understanding (HELMET) and multimodal video comprehension (Video-MME) demonstrate that RRAttention recovers over 99% of full attention performance while computing only half of the attention blocks, achieving 2.4$\times$ speedup at 128K context length and outperforming existing dynamic sparse attention methods.
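
The phrase "rotating query sampling positions across attention heads within each stride" can be pictured with a small sketch. It assumes that head h samples the query at offset h mod S inside every stride of length S; this is one plausible reading of the abstract, not the paper's implementation, and `sampled_queries` is a hypothetical helper.

```python
def sampled_queries(seq_len, stride, head):
    """Query positions sampled by one head: one per stride, shifted per head.

    Assumed reading of the round-robin shift; the actual scheme may differ.
    """
    offset = head % stride
    return list(range(offset, seq_len, stride))


seq_len, stride = 8, 4
per_head = {h: sampled_queries(seq_len, stride, h) for h in range(4)}
# Collect every position that is sampled by at least one head.
covered = sorted({q for positions in per_head.values() for q in positions})
```

Under this reading each head scores only L/S sampled queries, and if keys are likewise aggregated per stride the estimate touches about (L/S)² entries, consistent with the stated $O(L^2/S^2)$ cost. Across S heads the shifted offsets together cover every query position.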
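### [47] [xList-Hate: A Checklist-Based Framework for Interpretable and Generalizable Hate Speech Detection](https://arxiv.org/abs/2602.05874) *Adrián Girón,Pablo Miralles,Javier Huertas-Tato,Sergio D'Antonio,David Camacho* Main category: cs.CL TL;DR: The paper proposes xList-Hate, a diagnostic framework that decomposes hate speech detection into a checklist of fine-grained questions grounded in normative criteria; a large language model answers each question and an interpretable decision tree aggregates the answers, improving cross-domain robustness and interpretability.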
Details Motivation: Existing supervised models overfit dataset-specific definitions, which makes them brittle under domain shift and annotation noise, and they lack interpretability. Method: The xList-Hate framework (1) defines a checklist of fine-grained diagnostic questions based on widely shared norms; (2) has an LLM answer each question independently with a binary response, producing a diagnostic representation; and (3) aggregates the diagnostic signals with a lightweight, fully interpretable decision tree that outputs the prediction. Result: Across multiple benchmarks and model families, the method improves cross-dataset robustness and relative performance under domain shift compared with zero-shot LLM classification and supervised fine-tuning, is less sensitive to some annotation inconsistencies and contextual ambiguity, and supports fine-grained interpretability analysis. Conclusion: Reframing hate speech detection as a diagnostic reasoning task rather than end-to-end classification offers a more robust, explainable, and extensible paradigm for content moderation. Abstract: Hate speech detection is commonly framed as a direct binary classification problem despite being a composite concept defined through multiple interacting factors that vary across legal frameworks, platform policies, and annotation guidelines. As a result, supervised models often overfit dataset-specific definitions and exhibit limited robustness under domain shift and annotation noise. We introduce xList-Hate, a diagnostic framework that decomposes hate speech detection into a checklist of explicit, concept-level questions grounded in widely shared normative criteria. Each question is independently answered by a large language model (LLM), producing a binary diagnostic representation that captures hateful content features without directly predicting the final label. These diagnostic signals are then aggregated by a lightweight, fully interpretable decision tree, yielding transparent and auditable predictions. We evaluate it across multiple hate speech benchmarks and model families, comparing it against zero-shot LLM classification and in-domain supervised fine-tuning. While supervised methods typically maximize in-domain performance, xList-Hate consistently improves cross-dataset robustness and relative performance under domain shift. In addition, qualitative analysis of disagreement cases provides evidence that the framework can be less sensitive to certain forms of annotation inconsistency and contextual ambiguity. Crucially, the approach enables fine-grained interpretability through explicit decision paths and factor-level analysis.
Our results suggest that reframing hate speech detection as a diagnostic reasoning task, rather than a monolithic classification problem, provides a robust, explainable, and extensible alternative for content moderation.
### [48] [EuroLLM-22B: Technical Report](https://arxiv.org/abs/2602.05879) *Miguel Moura Ramos,Duarte M. Alves,Hippolyte Gisserot-Boukhlef,João Alves,Pedro Henrique Martins,Patrick Fernandes,José Pombal,Nuno M. Guerreiro,Ricardo Rei,Nicolas Boizard,Amin Farajian,Mateusz Klimaszewski,José G. C. de Souza,Barry Haddow,François Yvon,Pierre Colombo,Alexandra Birch,André F. T. Martins* Main category: cs.CL TL;DR: EuroLLM-22B is a new multilingual large language model trained from scratch to better serve European languages, achieving competitive performance and releasing models, data, and code for research.
Details Motivation: Addressing the underrepresentation and underservice of European languages in existing open large language models. Method: Training a 22B-parameter LLM from scratch with custom tokenizer design, architectural specifications, rigorous data filtering, and multilingual pretraining and instruction tuning. Result: Strong performance across multilingual benchmarks in reasoning, instruction following, and translation, competitive with similarly sized models. Conclusion: EuroLLM-22B successfully advances multilingual LLM support for European languages and promotes open research through comprehensive public releases. Abstract: This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pretraining data and updated EuroBlocks instruction datasets, as well as our pre-training and evaluation codebases.
### [49] [Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models](https://arxiv.org/abs/2602.05897) *Shuo Nie,Hexuan Deng,Chao Wang,Ruiyu Fang,Xuebo Liu,Shuangyong Song,Yu Li,Min Zhang,Xuelong Li* Main category: cs.CL TL;DR: The paper proposes FaithRL, which uses step-level faithfulness rewards and an implicit truncated resampling strategy to reduce faithfulness hallucinations in the chain-of-thought reasoning of small reasoning models.
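Details Motivation: Small reasoning models (SRMs) are essential for chain-of-thought reasoning in resource-constrained settings but are prone to faithfulness hallucinations in intermediate reasoning steps; existing online reinforcement learning methods based on outcome rewards or coarse-grained CoT evaluation can wrongly reinforce unfaithful reasoning when the final answer is correct. Method: Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL) introduces step-level faithfulness rewards from a process reward model, combined with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes. Result: Experiments on multiple small reasoning models and Open-Book QA benchmarks show that FaithRL consistently reduces hallucinations in both the chain of thought and the final answer, improving the faithfulness and reliability of reasoning. Conclusion: Through fine-grained step-level supervision and a contrastive learning mechanism, FaithRL markedly improves the faithfulness of small-model reasoning and offers a new approach to trustworthy reasoning in resource-constrained settings. Abstract: As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at https://github.com/Easy195/FaithRL.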
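### [50] [Codified Finite-state Machines for Role-playing](https://arxiv.org/abs/2602.05905) *Letian Peng,Yupeng Hou,Kun Zhou,Jingbo Shang* Main category: cs.CL TL;DR: The paper proposes Codified Finite-State Machines (CFSMs) and their probabilistic extension CPFSMs, which use large language models to automatically extract states and transitions from character descriptions, improving the consistency and interpretability of latent-state modeling in role-playing.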
Details Motivation: Existing prompting-based methods struggle to model the latent states that drive character interaction; traditional hand-crafted finite-state machines (FSMs) lack adaptability in the open-ended semantic space of role-playing. Method: Proposes the CFSM framework, which uses an LLM to automatically codify textual character profiles into interpretable finite-state machines, and extends it to CPFSM, which models state transitions as probability distributions. Result: In synthetic evaluations and real-world role-playing tasks, CFSM and CPFSM both outperform general baselines and prove effective in structured tasks as well as in open-ended, stochastic state exploration. Conclusion: CFSM/CPFSM offer an automated, interpretable, and adaptable paradigm for latent-state modeling in role-playing, bridging rule-based interpretability and semantic openness. Abstract: Modeling latent character states is crucial for consistent and engaging role-playing (RP) with large language models (LLMs). Yet, existing prompting-based approaches mainly capture surface actions, often failing to track the latent states that drive interaction. We revisit finite-state machines (FSMs), long used in game design to model state transitions. While effective in small, well-specified state spaces, traditional hand-crafted, rule-based FSMs struggle to adapt to the open-ended semantic space of RP. To address this, we introduce Codified Finite-State Machines (CFSMs), a framework that automatically codifies textual character profiles into FSMs using LLM-based coding. CFSMs extract key states and transitions directly from the profile, producing interpretable structures that enforce character consistency. To further capture uncertainty and variability, we extend CFSMs into Codified Probabilistic Finite-State Machines (CPFSMs), where transitions are modeled as probability distributions over states. Through both synthetic evaluations and real-world RP scenarios in established artifacts, we demonstrate that CFSM and CPFSM outperform generally applied baselines, verifying effectiveness not only in structured tasks but also in open-ended stochastic state exploration.

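To make the difference between a deterministic FSM and its probabilistic counterpart concrete, here is a minimal hand-written sketch. The states, events, and probabilities are invented for illustration; in the paper these structures are codified from a character profile by an LLM, and that step is not reproduced here.

```python
import random

# Hypothetical character states and events (not taken from the paper).
TRANSITIONS = {
    ("calm", "insulted"): "angry",
    ("angry", "apology"): "calm",
    ("calm", "praised"): "happy",
    ("happy", "insulted"): "calm",
}

def step(state, event):
    """Deterministic FSM step; unknown (state, event) pairs keep the state unchanged."""
    return TRANSITIONS.get((state, event), state)

# Probabilistic variant: each (state, event) maps to a distribution over next states.
PROB_TRANSITIONS = {
    ("calm", "insulted"): {"angry": 0.7, "calm": 0.3},
    ("angry", "apology"): {"calm": 0.6, "angry": 0.4},
}

def prob_step(state, event, rng):
    """Sample the next state from the transition distribution, if one is defined."""
    dist = PROB_TRANSITIONS.get((state, event))
    if dist is None:
        return state
    states = list(dist)
    weights = [dist[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]

if __name__ == "__main__":
    print(step("calm", "insulted"))
    print(prob_step("calm", "insulted", random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the sampled trajectory reproducible, which is convenient when comparing character behavior across runs.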
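### [51] [KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs](https://arxiv.org/abs/2602.05929)

*Jian Chen,Zhuoran Wang,Jiayu Qin,Ming Li,Meng Wang,Changyou Chen,Yin Chen,Qizhen Weng,Yirui Liu*

Main category: cs.CL

TL;DR: This paper proposes KV-CoRE, which uses SVD to quantify the data-dependent low-rank compressibility of KV-caches, and builds the first large-scale benchmark of LLM KV-cache compressibility based on the normalized effective rank, revealing systematic links to model architecture, training data, and language coverage.
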
Details Motivation: Existing KV-cache compression methods overlook the data-dependent nature of KV-caches and their variation across layers, so compression strategies lack specificity and efficiency. Method: Proposes the SVD-based KV-CoRE (KV-cache Compressibility by Rank Evaluation) method, which computes the optimal low-rank approximation under the Frobenius norm, and adopts the gradient-free, incrementally computable Normalized Effective Rank as the compressibility metric. Result: An analysis of multiple models and datasets covering 5 English domains and 16 languages finds that KV-cache compressibility is systematically linked to model architecture, training data, and language coverage; the Normalized Effective Rank correlates strongly with performance degradation under compression. Conclusion: Establishes the first large-scale, layer-wise, data-driven evaluation framework and benchmark for KV-cache compressibility, providing a principled basis and empirical support for dynamic, data-aware compression strategies and data-centric model development. Abstract: Large language models rely on KV-caches to avoid redundant computation during autoregressive decoding, but as context length grows, reading and writing the cache can quickly saturate GPU memory bandwidth. Recent work has explored KV-cache compression, yet most approaches neglect the data-dependent nature of KV-caches and their variation across layers. We introduce KV-CoRE (KV-cache Compressibility by Rank Evaluation), an SVD-based method for quantifying the data-dependent low-rank compressibility of KV-caches. KV-CoRE computes the optimal low-rank approximation under the Frobenius norm and, being gradient-free and incremental, enables efficient dataset-level, layer-wise evaluation. Using this method, we analyze multiple models and datasets spanning five English domains and sixteen languages, uncovering systematic patterns that link compressibility to model architecture, training data, and language coverage. As part of this analysis, we employ the Normalized Effective Rank as a metric of compressibility and show that it correlates strongly with performance degradation under compression. Our study establishes a principled evaluation framework and the first large-scale benchmark of KV-cache compressibility in LLMs, offering insights for dynamic, data-aware compression and data-centric model development.

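The two quantities named above can be sketched from a list of singular values alone. This is an illustrative sketch, not the paper's implementation: the effective rank follows the common entropy-based definition (the exponential of the Shannon entropy of the normalized singular values), dividing it by the number of singular values is an assumed normalization, and the truncation error uses the Eckart-Young result that the best rank-k approximation under the Frobenius norm discards the smallest singular values. The singular values are assumed to come from an SVD computed elsewhere and to be not all zero.

```python
import math

def effective_rank(singular_values):
    """exp(Shannon entropy) of the singular values normalized to sum to 1."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

def normalized_effective_rank(singular_values):
    """Effective rank divided by the number of singular values (assumed normalization)."""
    return effective_rank(singular_values) / len(singular_values)

def relative_truncation_error(singular_values, k):
    """Relative Frobenius error of the best rank-k approximation (Eckart-Young)."""
    squared = sorted((s * s for s in singular_values), reverse=True)
    return math.sqrt(sum(squared[k:]) / sum(squared))

if __name__ == "__main__":
    flat = [1.0, 1.0, 1.0, 1.0]     # energy spread evenly: hard to compress
    peaked = [5.0, 0.1, 0.1, 0.1]   # energy concentrated: easy to compress
    print(normalized_effective_rank(flat), normalized_effective_rank(peaked))
    print(relative_truncation_error(peaked, 1))
```

A flat spectrum gives a normalized effective rank near 1, while a spectrum dominated by a few singular values gives a value near 1/n, which is the sense in which a lower value indicates higher compressibility.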
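### [52] [Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions](https://arxiv.org/abs/2602.05932)

*Léo Labat,Etienne Ollion,François Yvon*

Main category: cs.CL

TL;DR: This paper studies whether multilingual large language models (LLMs) give language-dependent rather than cross-lingually consistent responses to value-laden multiple-choice questions (MCQs). To this end it builds MEVS, a human-translated European values survey dataset in 8 languages, and systematically tests more than 30 LLMs, finding that large instruction-tuned models are more consistent overall but that language-specific behavior appears only on some questions, which suggests a selective effect of preference fine-tuning.
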
Details Motivation: To investigate whether the responses of multilingual LLMs to value-laden MCQs change with the language of the question, that is, whether they behave like "theoretical polyglots" or like "a collection of monolingual models", filling a gap in research on how language affects the expression of values. Method: Builds the new human-translated, cross-lingually aligned MEVS dataset in 8 languages; tests response consistency on more than 30 multilingual LLMs of different sizes, manufacturers, and alignment fine-tuning status under controlled prompt variations (answer order, symbol type, tail character, etc.). Result: Larger, instruction-tuned models show higher overall consistency, but robustness varies considerably across questions: some MCQs elicit total agreement within and across models, while others leave answers sharply split; all consistent, instruction-tuned models show language-specific behavior on certain questions. Conclusion: Multilingual LLMs are not fully consistent across languages on value-judgment MCQs; their language dependence is selective and may be related to preference fine-tuning, so further study is needed on how fine-tuning shapes language-specific value expression. Abstract: Multiple-Choice Questions (MCQs) are often used to assess knowledge, reasoning abilities, and even values encoded in large language models (LLMs). While the effect of multilingualism has been studied on LLM factual recall, this paper seeks to investigate the less explored question of language-induced variation in value-laden MCQ responses. Are multilingual LLMs consistent in their responses across languages, i.e. behave like theoretical polyglots, or do they answer value-laden MCQs depending on the language of the question, like a multitude of monolingual models expressing different values through a single model? We release a new corpus, the Multilingual European Value Survey (MEVS), which, unlike prior work relying on machine translation or ad hoc prompts, solely comprises human-translated survey questions aligned in 8 European languages. We administer a subset of those questions to over thirty multilingual LLMs of various sizes, manufacturers and alignment-fine-tuning status under comprehensive, controlled prompt variations including answer order, symbol type, and tail character. Our results show that while larger, instruction-tuned models display higher overall consistency, the robustness of their responses varies greatly across questions, with certain MCQs eliciting total agreement within and across models while others leave LLM answers split. Language-specific behavior seems to arise in all consistent, instruction-fine-tuned models, but only on certain questions, warranting a further study of the selective effect of preference fine-tuning.

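### [53] [Self-Improving Multilingual Long Reasoning via Translation-Reasoning Integrated Training](https://arxiv.org/abs/2602.05940)

*Junxiao Liu,Zhijun Wang,Yixiao Li,Zhejian Lai,Liqian Huang,Xin Huang,Xue Han,Junlan Feng,Shujian Huang*

Main category: cs.CL

TL;DR: This paper proposes the TRIT framework, which combines translation training with multilingual reasoning to improve question understanding and response generation of long reasoning models in multilingual settings. It needs no additional data or external feedback and substantially improves accuracy and language consistency on benchmarks such as MMATH.
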
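Details Motivation: Long reasoning models perform poorly in multilingual settings: they often reason in English on non-English questions, and when forced to reason in the question language their accuracy drops sharply; the root cause is that both multilingual question understanding and multilingual reasoning ability are insufficient. Method: Proposes TRIT (Translation-Reasoning Integrated Training), a self-improving framework that integrates translation training into multilingual reasoning training and jointly optimizes multilingual question understanding and response generation, without relying on external feedback or additional multilingual data. Result: On MMATH it outperforms multiple baselines by an average of 7 percentage points while improving both answer correctness and language consistency; cross-lingual question alignment improves by over 10 percentage points; translation quality for mathematical questions and general-domain text improves by up to 8.4 COMET points (FLORES-200). Conclusion: TRIT mitigates the language mismatch problem in multilingual long reasoning and shows that translation and reasoning abilities can reinforce each other, offering a new path toward genuinely multilingual strong reasoning models. Abstract: Long reasoning models often struggle in multilingual settings: they tend to reason in English for non-English questions; when constrained to reasoning in the question language, accuracies drop substantially. The struggle is caused by the limited abilities for both multilingual question understanding and multilingual reasoning. To address both problems, we propose TRIT (Translation-Reasoning Integrated Training), a self-improving framework that integrates the training of translation into multilingual reasoning. Without external feedback or additional multilingual data, our method jointly enhances multilingual question understanding and response generation. On MMATH, our method outperforms multiple baselines by an average of 7 percentage points, improving both answer correctness and language consistency. Further analysis reveals that integrating translation training improves cross-lingual question alignment by over 10 percentage points and enhances translation quality for both mathematical questions and general-domain text, with gains up to 8.4 COMET points on FLORES-200.
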
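### [54] [Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space](https://arxiv.org/abs/2602.05971)

*Felipe D. Toro-Hernández,Jesuino Vieira Filho,Rodrigo M. Cabral-Carvalho*

Main category: cs.CL

TL;DR: This paper proposes a new framework that models concept production as navigation through a semantic embedding space. It builds individual semantic trajectories from cumulative embeddings and extracts several geometric and dynamical metrics to quantify the dynamics of semantic representation; the method is validated on multilingual, multi-task clinical data and is robust across embedding models.
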
Details Motivation: To investigate how humans navigate a structured, dynamic semantic knowledge space to retrieve and manipulate meaning, and to address the labor-intensive nature and weak computational grounding of traditional linguistic pre-processing. Method: Using several transformer text embedding models, builds participant-specific cumulative semantic embedding trajectories and extracts geometric and dynamical metrics such as distance (to the next item, to the centroid), entropy, velocity, and acceleration; compares cumulative with non-cumulative embeddings and evaluates on multilingual, multi-task datasets. Result: The framework distinguishes clinical groups and concept types; cumulative embeddings perform better on long trajectories, whereas non-cumulative embeddings are better on short ones; results are highly consistent across embedding models, indicating that their representations are fundamentally similar. Conclusion: Formalizing semantic navigation as a structured trajectory in embedding space connects cognitive modeling with learned representations and provides a low-intervention, quantifiable paradigm for dynamic semantic analysis in clinical research, cross-linguistic analysis, and the assessment of artificial cognition. Abstract: Semantic representations can be framed as a structured, dynamic knowledge space through which humans navigate to retrieve and manipulate meaning. To investigate how humans traverse this geometry, we introduce a framework that represents concept production as navigation through embedding space. Using different transformer text embedding models, we construct participant-specific semantic trajectories based on cumulative embeddings and extract geometric and dynamical metrics, including distance to next, distance to centroid, entropy, velocity, and acceleration. These measures capture both scalar and directional aspects of semantic navigation, providing a computationally grounded view of semantic representation search as movement in a geometric space. We evaluate the framework on four datasets across different languages, spanning different property generation tasks: Neurodegenerative, Swear verbal fluency, Property listing task in Italian, and in German. Across these contexts, our approach distinguishes between clinical groups and concept types, offering a mathematical framework that requires minimal human intervention compared to typical labor-intensive linguistic pre-processing methods. Comparison with a non-cumulative approach reveals that cumulative embeddings work best for longer trajectories, whereas shorter ones may provide too little context, favoring the non-cumulative alternative. Critically, different embedding models yielded similar results, highlighting similarities between different learned representations despite different training pipelines. By framing semantic navigation as a structured trajectory through embedding space, we bridge cognitive modeling with learned representation, thereby establishing a pipeline for quantifying semantic representation dynamics with applications in clinical research, cross-linguistic analysis, and the assessment of artificial cognition.

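The trajectory metrics listed in the abstract can be illustrated with a small sketch on toy vectors. Several choices here are assumptions rather than the paper's definitions: the cumulative embedding is taken to be the running mean of the item embeddings, distances are Euclidean, and velocity and acceleration are first and second differences between consecutive trajectory points. Real inputs would be embeddings from a text embedding model.

```python
import math

def cumulative_means(vectors):
    """Trajectory point t is the mean of the embeddings of items 0..t."""
    sums = [0.0] * len(vectors[0])
    points = []
    for t, vec in enumerate(vectors, start=1):
        sums = [s + x for s, x in zip(sums, vec)]
        points.append([s / t for s in sums])
    return points

def differences(points):
    """Element-wise differences between consecutive points."""
    return [[b - a for a, b in zip(p, q)] for p, q in zip(points, points[1:])]

def norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def trajectory_metrics(vectors):
    traj = cumulative_means(vectors)
    velocity = differences(traj)          # first difference of the trajectory
    acceleration = differences(velocity)  # second difference of the trajectory
    centroid = [sum(col) / len(traj) for col in zip(*traj)]
    return {
        "distance_to_next": [norm(v) for v in velocity],
        "distance_to_centroid": [
            norm([x - c for x, c in zip(p, centroid)]) for p in traj
        ],
        "acceleration_magnitude": [norm(a) for a in acceleration],
    }

if __name__ == "__main__":
    toy_embeddings = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]]  # made-up 2-D vectors
    print(trajectory_metrics(toy_embeddings))
```

Because each cumulative point averages all earlier items, single steps move the trajectory less as it grows, which is one plausible reading of why short sequences may favor the non-cumulative variant.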
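### [55] [DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs](https://arxiv.org/abs/2602.05992)

*Lizhuo Luo,Shenggui Li,Yonggang Wen,Tianwei Zhang*

Main category: cs.CL

TL;DR: This paper proposes Dynamic Sliding Block (DSB), a training-free scheduling method that improves parallel text generation in diffusion large language models (dLLMs); with an adaptive block size and an accompanying DSB Cache mechanism, it improves inference efficiency without sacrificing quality.
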
Details Motivation: Fixed, predefined block schedules ignore differences in semantic difficulty, forcing premature commitments at uncertain positions and delaying generation at easy positions, which hurts both generation quality and inference efficiency. Method: Proposes Dynamic Sliding Block (DSB), a training-free block scheduling method that dynamically adjusts the sliding block size according to semantic difficulty, together with DSB Cache, a training-free KV-cache mechanism tailored to DSB. Result: Experiments on multiple models and benchmarks show that DSB combined with DSB Cache consistently improves the generation quality and inference efficiency of dLLMs. Conclusion: Block scheduling that adapts dynamically to semantic difficulty is more reliable and efficient than fixed blocks; DSB and its cache mechanism give dLLMs a practical, training-free route to better performance. Abstract: Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely-used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and disclose the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at https://github.com/lizhuo-luo/DSB.

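### [56] [A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies](https://arxiv.org/abs/2602.06015)

*Panagiotis Kaliosis,Adithya V Ganesan,Oscar N. E. Kjell,Whitney Ringwald,Scott Feltman,Melissa A. Carr,Dimitris Samaras,Camilo Ruggero,Benjamin J. Luft,Roman Kotov,Andrew H. Schwartz*

Main category: cs.CL

TL;DR: This study systematically evaluates the accuracy of 11 large language models (LLMs) at zero-shot assessment of PTSD severity. It finds that contextual knowledge (such as construct definitions and narrative context), reasoning effort, model size, and ensembling strategy all significantly affect performance; the best results come from ensembling a supervised model with zero-shot LLMs.
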
Details Motivation: Although large language models (LLMs) are increasingly used for zero-shot mental health assessment, the key factors affecting their accuracy remain unclear. Method: Using clinical natural language narratives and self-reported PTSD severity scores from 1,437 individuals, systematically evaluates 11 state-of-the-art LLMs, varying (i) contextual knowledge (subscale definitions, distribution summary, interview questions) and (ii) modeling strategies (zero-shot vs. few-shot, amount of reasoning effort, model size, structured subscale prediction vs. direct scalar prediction, output rescaling, nine ensemble methods). Result: (a) LLMs are most accurate when given detailed construct definitions and the context of the narrative; (b) more reasoning effort improves estimation accuracy; (c) open-weight models (Llama, Deepseek) plateau beyond 70B parameters, while closed-weight models (o3-mini, gpt-5) keep improving with newer generations; (d) ensembling a supervised model with zero-shot LLMs performs best. Conclusion: The choice of contextual knowledge and modeling strategy is critical for accurately deploying LLMs in mental health assessment. Abstract: Large language models (LLMs) are increasingly being used in a zero-shot fashion to assess mental health conditions, yet we have limited knowledge on what factors affect their accuracy. In this study, we utilize a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting accuracy, we systematically varied (i) contextual knowledge like subscale definitions, distribution summary, and interview questions, and (ii) modeling strategies including zero-shot vs. few-shot, amount of reasoning effort, model sizes, structured subscales vs. direct scalar prediction, output rescaling and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and context of the narrative; (b) increased reasoning effort leads to better estimation accuracy; (c) performance of open-weight models (Llama, Deepseek) plateaus beyond 70B parameters while closed-weight models (o3-mini, gpt-5) improve with newer generations; and (d) best performance is achieved when ensembling a supervised model with the zero-shot LLMs. Taken together, the results suggest that the choice of contextual knowledge and modeling strategies is important for deploying LLMs to accurately assess mental health.

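### [57] [Multi-Token Prediction via Self-Distillation](https://arxiv.org/abs/2602.06019)

*John Kirchenbauer,Abhimanyu Hans,Brian Bartoldson,Micah Goldblum,Ashwinee Panda,Tom Goldstein*

Main category: cs.CL

TL;DR: This paper proposes an online distillation method that needs no separately trained auxiliary model and no complex inference pipeline. It converts a pretrained autoregressive language model into a fast standalone model that predicts multiple tokens at once, keeps the original implementation, is simple to deploy, and on GSM8K decodes more than 3x faster with an accuracy drop of less than 5%.
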
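Details Motivation: Existing techniques for accelerating language model inference (such as speculative decoding) rely on training auxiliary speculators and building complex inference pipelines, which makes deployment costly and inflexible. Method: Uses a simple online distillation objective to optimize a pretrained autoregressive language model directly into a model that supports multi-token prediction, without changing the original model structure or implementation and without an additional verifier or special inference code. Result: On GSM8K, average decoding speed improves by more than 3x, with the accuracy drop kept below 5%. Conclusion: The method offers a lightweight, plug-and-play approach to accelerating language model inference that balances efficiency and accuracy and substantially reduces deployment complexity. Abstract: Existing techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculator models and building and deploying complex inference pipelines. We consider a new approach for converting a pretrained autoregressive language model from a slow single next token prediction model into a fast standalone multi-token prediction model using a simple online distillation objective. The final model retains the exact same implementation as the pretrained initial checkpoint and is deployable without the addition of any auxiliary verifier or other specialized inference code. On GSM8K, our method produces models that can decode more than $3\times$ faster on average at $<5\%$ drop in accuracy relative to single token decoding performance.
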
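### [58] [Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory](https://arxiv.org/abs/2602.06025)

*Haozhen Zhang,Haodong Yue,Tao Feng,Quanyu Long,Jianzhu Bao,Bowen Jin,Weizhi Zhang,Xiao Li,Jiaxuan You,Chengwei Qin,Wenya Wang*

Main category: cs.CL

TL;DR: This paper proposes BudgetMem, a runtime memory framework for LLM agents that supports explicit, query-aware control of the performance-cost trade-off by using a lightweight neural router to route dynamically among memory modules at different budget tiers.
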
Details Motivation: Existing memory systems for LLM agents mostly rely on offline, query-agnostic memory construction, which is inefficient and can lose critical information; runtime memory utilization, in turn, often incurs high overhead and lacks an explicit mechanism for controlling the performance-cost trade-off. Method: BudgetMem models memory processing as a set of memory modules, each configurable in three budget tiers (Low/Mid/High); it designs a lightweight neural router trained with reinforcement learning to perform budget-tier routing, and systematically compares three tiering strategies: implementation complexity, reasoning behavior, and model capacity. Result: On LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines in the high-budget setting and delivers better accuracy-cost frontiers under tighter budgets; the analysis reveals how the suitability of each tiering strategy differs across budget regimes. Conclusion: BudgetMem gives LLM agents a flexible, controllable paradigm for runtime memory management, validates the effectiveness of explicit budget control, and offers empirical guidance on multi-dimensional trade-offs in memory design. Abstract: Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present **BudgetMem**, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router performs budget-tier routing across modules to balance task performance and memory construction cost, which is implemented as a compact neural policy trained with reinforcement learning. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.

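### [59] [DFlash: Block Diffusion for Flash Speculative Decoding](https://arxiv.org/abs/2602.06036)

*Jian Chen,Yesheng Liang,Zhijian Liu*

Main category: cs.CL

TL;DR: This paper proposes DFlash, a speculative decoding framework based on a lightweight block diffusion model. It drafts in a single forward pass, conditioned on context features extracted from the target model, to produce high-quality parallel drafts with high acceptance rates, achieving over 6x lossless acceleration and clearly outperforming EAGLE-3.
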
Details Motivation: Autoregressive large language models suffer from high inference latency and poor GPU utilization; existing speculative decoding still relies on sequential draft generation, while diffusion LLMs support parallel generation but underperform. Method: Proposes the DFlash framework, which uses a lightweight block diffusion model for parallel drafting and feeds context features extracted from the target model as conditioning input to improve draft quality and acceptance rate. Result: Achieves over 6x lossless acceleration across a range of models and tasks, with up to 2.5x higher speedup than the state-of-the-art method EAGLE-3. Conclusion: DFlash combines the parallelism of diffusion models with the verification mechanism of speculative decoding, greatly improving inference efficiency without sacrificing performance and offering a new paradigm for efficient LLM inference. Abstract: Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.

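The verification step that makes speculative decoding lossless can be sketched for the greedy-decoding case. This is a generic illustration, not DFlash's implementation: `toy_target` is a made-up stand-in for the target model, the draft block is supplied by hand rather than by a diffusion drafter, and the target is queried position by position for clarity, whereas a real system scores the whole draft block in one parallel pass.

```python
def verify_draft(prefix, draft, target_next):
    """Accept the longest draft prefix that matches the target's greedy choices.

    Returns the accepted tokens plus one token chosen by the target, so the
    result equals what greedy decoding with the target alone would produce.
    """
    accepted = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(token)
    accepted.append(target_next(prefix + accepted))  # all drafts accepted: one bonus token
    return accepted

def toy_target(tokens):
    """Hypothetical target model: the next token is the last token plus one, modulo 10."""
    return (tokens[-1] + 1) % 10

if __name__ == "__main__":
    print(verify_draft([1, 2], [3, 4, 9, 6], toy_target))  # draft diverges at the third token
    print(verify_draft([1, 2], [3, 4], toy_target))        # draft fully accepted
```

The more draft tokens match, the more tokens are emitted per target pass, which is why a higher acceptance rate translates directly into a larger speedup.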
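# cs.CV [[Back]](#toc)

### [60] [SIDeR: Semantic Identity Decoupling for Unrestricted Face Privacy](https://arxiv.org/abs/2602.04994)

*Zhuosen Bao,Xia Du,Zheng Lin,Jizhe Zhou,Zihan Fang,Jiening Wu,Yuxin Zhang,Zhe Chen,Chi-man Pun,Wei Ni,Jun Luo*

Main category: cs.CV

TL;DR: SIDeR is a semantic decoupling framework for unrestricted face privacy protection. It uses a diffusion model to separate identity features from appearance semantics and recombine them controllably, generating visually anonymous yet natural adversarial faces that remain machine-recognizable, and it supports password-driven restoration of the original image.
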
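Details Motivation: As facial recognition becomes deeply integrated into online banking, identity verification, and other networked services, effectively decoupling identity information from visual representation during image storage and transmission has become a key challenge for privacy protection. Method: Proposes the SIDeR framework, which decomposes a facial image into a machine-recognizable identity feature vector and a visually perceptible semantic appearance component; uses semantic-guided recomposition in the latent space of a diffusion model to generate adversarial faces that are visually anonymous but identity-consistent; introduces momentum-driven unrestricted perturbation optimization and a semantic-visual balancing factor to synthesize diverse, natural adversarial samples; and supports lossless restoration of the original image under password verification. Result: On the CelebA-HQ and FFHQ datasets, SIDeR reaches a 99% success rate under black-box attacks and improves PSNR-based restoration quality by 41.28% over baseline methods. Conclusion: SIDeR combines identity recognizability with visual anonymity, makes notable advances in privacy protection, naturalness, and recoverability, and is suitable for practical deployment. Abstract: With the deep integration of facial recognition into online banking, identity verification, and other networked services, achieving effective decoupling of identity information from visual representations during image storage and transmission has become a critical challenge for privacy protection. To address this issue, we propose SIDeR, a Semantic decoupling-driven framework for unrestricted face privacy protection. SIDeR decomposes a facial image into a machine-recognizable identity feature vector and a visually perceptible semantic appearance component. By leveraging semantic-guided recomposition in the latent space of a diffusion model, it generates visually anonymous adversarial faces while maintaining machine-level identity consistency. The framework incorporates momentum-driven unrestricted perturbation optimization and a semantic-visual balancing factor to synthesize multiple visually diverse, highly natural adversarial samples. Furthermore, for authorized access, the protected image can be restored to its original form when the correct password is provided. Extensive experiments on the CelebA-HQ and FFHQ datasets demonstrate that SIDeR achieves a 99% attack success rate in black-box scenarios and outperforms baseline methods by 41.28% in PSNR-based restoration quality.
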
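### [61] [UniTrack: Differentiable Graph Representation Learning for Multi-Object Tracking](https://arxiv.org/abs/2602.05037)

*Bishoy Galoaa,Xiangyu Bai,Utsav Nandi,Sai Siddhartha Vivek Dhir Rangoju,Somaieh Amraee,Sarah Ostadabbas*

Main category: cs.CV

TL;DR: UniTrack is a plug-and-play graph-theoretic loss function that directly optimizes the core objectives of multi-object tracking (MOT) through unified differentiable learning. It improves tracking performance without modifying the model architecture, with the largest gains in IDF1 and identity switches.
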
Details Motivation: Existing graph-based MOT methods require redesigning the tracking architecture and lack generality; what is needed in practice is a way to jointly optimize core tracking objectives such as detection accuracy, identity preservation, and spatiotemporal consistency without changing existing systems. Method: Proposes UniTrack, an end-to-end differentiable graph-theoretic loss function that models detection, identity continuity, and spatiotemporal consistency jointly as a unified graph representation learning problem, and uses differentiable graph learning to jointly optimize motion continuity and identity relationships. Result: Validated on a range of mainstream MOT models, including Trackformer, MOTR, FairMOT, ByteTrack, GTR, and MOTE, and on multiple challenging benchmarks (such as SportsMOT): identity switches drop by up to 53%, IDF1 improves by up to 12%, and GTR gains 9.7% MOTA on SportsMOT. Conclusion: As a general, plug-and-play training objective, UniTrack improves a variety of MOT systems without modifying the network structure, demonstrating the effectiveness and generality of a unified graph loss for multi-object tracking. Abstract: We present UniTrack, a plug-and-play graph-theoretic loss function designed to significantly enhance multi-object tracking (MOT) performance by directly optimizing tracking-specific objectives through unified differentiable learning. Unlike prior graph-based MOT methods that redesign tracking architectures, UniTrack provides a universal training objective that integrates detection accuracy, identity preservation, and spatiotemporal consistency into a single end-to-end trainable loss function, enabling seamless integration with existing MOT systems without architectural modifications. Through differentiable graph representation learning, UniTrack enables networks to learn holistic representations of motion continuity and identity relationships across frames. We validate UniTrack across diverse tracking models and multiple challenging benchmarks, demonstrating consistent improvements across all tested architectures and datasets including Trackformer, MOTR, FairMOT, ByteTrack, GTR, and MOTE. Extensive evaluations show up to 53% reduction in identity switches and 12% IDF1 improvements across challenging benchmarks, with GTR achieving peak performance gains of 9.7% MOTA on SportsMOT.

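### [62] [VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models](https://arxiv.org/abs/2602.05049)

*Yiye Chen,Yanan Jian,Xiaoyi Dong,Shuxin Cao,Jing Wu,Patricio Vela,Benjamin E. Lundell,Dongdong Chen*

Main category: cs.CV

TL;DR: This paper proposes a training framework that requires no architectural changes or additional data collection. It strengthens vision-action alignment through preference optimization and latent-space distillation, improving the visual conditioning and task performance of VLA models.
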
Details Motivation: Existing VLA models suffer from vision-action misalignment, in which action predictions depend only weakly on the current visual state, leading to unreliable outputs; the authors observe that successful rollouts show stronger visual dependence than failed ones. Method: First aligns action prediction with visual input via preference optimization on a track-following surrogate task, then transfers the enhanced alignment to supervised fine-tuning on the instruction-following task through latent-space distillation. Result: The method improves visual conditioning and task performance on discrete OpenVLA and yields consistent gains in the continuous OpenVLA-OFT setting. Conclusion: Explicitly strengthening visual conditioning mitigates vision-action misalignment and improves the robustness and generalization of VLA models. Abstract: Vision-Language-Action (VLA) models have demonstrated strong performance across a wide range of robotic manipulation tasks. Despite the success, extending large pretrained Vision-Language Models (VLMs) to the action space can induce vision-action misalignment, where action predictions exhibit weak dependence on the current visual state, leading to unreliable action outputs. In this work, we study VLA models through the lens of visual conditioning and empirically show that successful rollouts consistently exhibit stronger visual dependence than failed ones. Motivated by this observation, we propose a training framework that explicitly strengthens visual conditioning in VLA models. Our approach first aligns action prediction with visual input via preference optimization on a track-following surrogate task, and then transfers the enhanced alignment to the instruction-following task through latent-space distillation during supervised finetuning. Without introducing architectural modifications or additional data collection, our method improves both visual conditioning and task performance for discrete OpenVLA, and further yields consistent gains when extended to the continuous OpenVLA-OFT setting. Project website: https://vista-vla.github.io/ .

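### [63] [Food Portion Estimation: From Pixels to Calories](https://arxiv.org/abs/2602.05078)

*Gautham Vinod,Fengqing Zhu*

Main category: cs.CV

TL;DR: This paper surveys strategies for food portion estimation in image-based dietary assessment, focusing on the challenge of inferring 3D food size from 2D images.
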
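Details Motivation: Image-based dietary assessment is important for the prevention and care of chronic diseases and obesity, but its core difficulty is accurately estimating the 3D size of food (the portion) from 2D images. Method: Systematically reviews and compares strategies for portion estimation, including auxiliary-input methods (such as depth maps and multi-view images), model-based methods (such as template matching), and deep learning methods (monocular images, or images fused with auxiliary inputs). Result: Summarizes the strengths, weaknesses, and applicable scenarios of each class of methods, providing a technical roadmap for future research. Conclusion: Despite the variety of existing strategies, accuracy, robustness, and generality still need improvement; deep learning and multimodal fusion are important future directions. Abstract: Reliance on images for dietary assessment is an important strategy to accurately and conveniently monitor an individual's health, making it a vital mechanism in the prevention and care of chronic diseases and obesity. However, image-based dietary assessment suffers from the difficulty of estimating the three-dimensional size of food from 2D image inputs. Many strategies have been devised to overcome this critical limitation such as the use of auxiliary inputs like depth maps, multi-view inputs, or model-based approaches such as template matching. Deep learning also helps bridge the gap by either using monocular images or combinations of the image and the auxiliary inputs to precisely predict the output portion from the image input. In this paper, we explore the different strategies employed for accurate portion estimation.
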
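### [64] [Visual concept ranking uncovers medical shortcuts used by large multimodal models](https://arxiv.org/abs/2602.05096)

*Joseph D. Janizek,Sonnet Xu,Junayd Lateef,Roxana Daneshjou*

Main category: cs.CV

TL;DR: This paper proposes Visual Concept Ranking (VCR), a method for identifying important visual concepts in large multimodal models (LMMs), and applies it to medical tasks such as skin cancer lesion classification, revealing performance differences across demographic subgroups and the models' reliance on specific visual features.
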
Details Motivation: Ensuring the reliability of machine learning models in safety-critical domains such as healthcare requires auditing methods that can uncover model shortcomings. Method: Proposes Visual Concept Ranking (VCR), which identifies important visual concepts in large multimodal models, and validates the hypotheses it generates about visual feature dependencies through manual interventions. Result: Finds unexpected performance gaps in LMMs across demographic subgroups, and uses VCR to identify the key visual concepts that influence model decisions and the dependencies among them. Conclusion: VCR is an effective auditing tool that helps in understanding and improving the fairness and reliability of multimodal models on medical tasks. Abstract: Ensuring the reliability of machine learning models in safety-critical domains such as healthcare requires auditing methods that can uncover model shortcomings. We introduce a method for identifying important visual concepts within large multimodal models (LMMs) and use it to investigate the behaviors these models exhibit when prompted with medical tasks. We primarily focus on the task of classifying malignant skin lesions from clinical dermatology images, with supplemental experiments including both chest radiographs and natural images. After showing how LMMs display unexpected gaps in performance between different demographic subgroups when prompted with demonstrating examples, we apply our method, Visual Concept Ranking (VCR), to these models and prompts. VCR generates hypotheses related to different visual feature dependencies, which we are then able to validate with manual interventions.

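### [65] [CLEAR-HPV: Interpretable Concept Discovery for HPV-Associated Morphology in Whole-Slide Histology](https://arxiv.org/abs/2602.05126)

*Weiyi Qin,Yingci Liu-Swetz,Shiwei Tan,Hao Wang*

Main category: cs.CV

TL;DR: This paper proposes the CLEAR-HPV framework, which uses attention-guided restructuring of the latent space to automatically discover and quantify histomorphologic concepts (such as keratinizing, basaloid, and stromal) without concept labels, improving the interpretability and generalization of HPV status prediction.
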
Details Motivation: Existing attention-based multiple instance learning (MIL) methods perform well at HPV-related whole-slide pathology prediction but lack morphologic interpretability. Method: Proposes the CLEAR-HPV framework, which automatically discovers morphologic concepts in an attention-weighted latent space, generates spatial concept maps, and represents each slide with a compact concept-fraction vector (10 dimensions) in place of a high-dimensional MIL embedding (e.g., 1536 dimensions). Result: CLEAR-HPV generalizes consistently across the TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC datasets, providing compact, concept-level interpretability while largely preserving predictive information. Conclusion: CLEAR-HPV is a general, backbone-agnostic framework that markedly improves the interpretability and clinical applicability of HPV status prediction models. Abstract: Human papillomavirus (HPV) status is a critical determinant of prognosis and treatment response in head and neck and cervical cancers. Although attention-based multiple instance learning (MIL) achieves strong slide-level prediction for HPV-related whole-slide histopathology, it provides limited morphologic interpretability. To address this limitation, we introduce Concept-Level Explainable Attention-guided Representation for HPV (CLEAR-HPV), a framework that restructures the MIL latent space using attention to enable concept discovery without requiring concept labels during training. Operating in an attention-weighted latent space, CLEAR-HPV automatically discovers keratinizing, basaloid, and stromal morphologic concepts, generates spatial concept maps, and represents each slide using a compact concept-fraction vector. CLEAR-HPV's concept-fraction vectors preserve the predictive information of the original MIL embeddings while reducing the high-dimensional feature space (e.g., 1536 dimensions) to only 10 interpretable concepts. CLEAR-HPV generalizes consistently across TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC, providing compact, concept-level interpretability through a general, backbone-agnostic framework for attention-based MIL models of whole-slide histopathology.

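### [66] [ARGaze: Autoregressive Transformers for Online Egocentric Gaze Estimation](https://arxiv.org/abs/2602.05132)

*Jia Li,Wenjie Zhao,Shijian Deng,Bolin Lai,Yuheng Wu,RUijia Chen,Jon E. Froehlich,Yuhang Zhao,Yapeng Tian*

Main category: cs.CV

TL;DR: This paper proposes ARGaze, which models online gaze estimation in first-person video as an autoregressive sequence prediction problem conditioned on visual features and a bounded history of gaze targets, substantially improving performance in the online setting.
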
Details Motivation: Egocentric gaze estimation lacks explicit head or eye signals, so visual attention must be inferred from sparse, indirect cues such as hand-object interactions and scene saliency; at the same time, gaze shows strong temporal continuity during goal-directed activities, so recent gaze locations are a strong prior for predicting the current one. Method: Proposes ARGaze, which uses a transformer decoder to make causal, streaming autoregressive predictions at each timestep, conditioned on the current visual features and a fixed-length Gaze Context Window (recent gaze target estimates). Result: Achieves state-of-the-art performance under online evaluation on multiple egocentric benchmarks; ablations confirm that autoregressive modeling with bounded history is critical for robust prediction. Conclusion: Recasting online gaze estimation as vision-conditioned autoregressive sequence prediction over a bounded history is an effective and practical paradigm that balances accuracy, causality, and computational efficiency, and suits real-time applications such as AR and assistive technologies. Abstract: Online egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames, a task essential for augmented reality and assistive technologies. Unlike third-person gaze estimation, this setting lacks explicit head or eye signals, requiring models to infer current visual attention from sparse, indirect cues such as hand-object interactions and salient scene content. We observe that gaze exhibits strong temporal continuity during goal-directed activities: knowing where a person looked recently provides a powerful prior for predicting where they look next. Inspired by vision-conditioned autoregressive decoding in vision-language models, we propose ARGaze, which reformulates gaze estimation as sequential prediction: at each timestep, a transformer decoder predicts current gaze by conditioning on (i) current visual features and (ii) a fixed-length Gaze Context Window of recent gaze target estimates. This design enforces causality and enables bounded-resource streaming inference. We achieve state-of-the-art performance across multiple egocentric benchmarks under online evaluation, with extensive ablations validating that autoregressive modeling with bounded gaze history is critical for robust prediction. We will release our source code and pre-trained models.

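The streaming pattern described above, in which each prediction sees only the current frame and a fixed-length window of the model's own recent estimates, can be sketched as follows. The predictor here is a made-up placeholder (`toy_predict`), not the paper's transformer decoder, and the window length of 3 is an arbitrary assumption.

```python
from collections import deque

def stream_gaze(frame_features, predict, window=3):
    """Causal streaming loop with a bounded gaze context window."""
    context = deque(maxlen=window)  # oldest estimates drop out automatically
    estimates = []
    for features in frame_features:
        # Only the current frame and past estimates are visible to the predictor.
        gaze = predict(features, list(context))
        context.append(gaze)
        estimates.append(gaze)
    return estimates

def toy_predict(features, history):
    """Hypothetical predictor: blend the frame's cue with the mean of recent gaze."""
    if not history:
        return features
    mean_x = sum(x for x, _ in history) / len(history)
    mean_y = sum(y for _, y in history) / len(history)
    return (0.5 * features[0] + 0.5 * mean_x, 0.5 * features[1] + 0.5 * mean_y)

if __name__ == "__main__":
    cues = [(0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (0.0, 1.0)]  # made-up per-frame cues
    print(stream_gaze(cues, toy_predict))
```

Because the context is a `deque` with `maxlen`, memory and per-step cost stay constant however long the video runs, which is the bounded-resource property the abstract refers to.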
### [67] [AirGlove: Exploring Egocentric 3D Hand Tracking and Appearance Generalization for Sensing Gloves](https://arxiv.org/abs/2602.05159) *Wenhui Cui,Ziyi Kou,Chuan Qin,Ergys Ristani,Li Guan* Main category: cs.CV TL;DR: This paper proposes AirGlove, which leverages data from existing sensing gloves to generalize vision-based hand pose estimation models to new gloves using only limited new-glove data, significantly improving tracking performance across sensing gloves.
Details Motivation: Sensor-based hand tracking is sensitive to signal and calibration quality, while mainstream vision models, although strong on bare hands, degrade sharply on sensing gloves because glove appearance differs widely. Method: Systematically evaluates vision-based hand tracking models on sensing gloves under zero-shot and fine-tuning setups, and proposes AirGlove, which learns from existing glove data and generalizes to newly designed gloves. Result: AirGlove yields significant gains on multiple sensing gloves and outperforms the compared methods. Conclusion: Vision models suffer a severe appearance domain shift on sensing gloves; AirGlove mitigates it through cross-glove representation generalization, offering a more robust solution for practical deployment that depends less on calibration. Abstract: Sensing gloves have become important tools for teleoperation and robotic policy learning as they are able to provide rich signals like speed, acceleration and tactile feedback. A common approach to track gloved hands is to directly use the sensor signals (e.g., angular velocity, gravity orientation) to estimate 3D hand poses. However, sensor-based tracking can be restrictive in practice as the accuracy is often impacted by sensor signal and calibration quality. Recent advances in vision-based approaches have achieved strong performance on human hands via large-scale pre-training, but their performance on gloved hands with distinct visual appearances remains underexplored. In this work, we present the first systematic evaluation of vision-based hand tracking models on gloved hands under both zero-shot and fine-tuning setups. Our analysis shows that existing bare-hand models suffer from substantial performance degradation on sensing gloves due to the large appearance gap between bare-hand and glove designs. We therefore propose AirGlove, which leverages existing gloves to generalize the learned glove representations towards new gloves with limited data. Experiments with multiple sensing gloves show that AirGlove effectively generalizes the hand pose models to new glove designs and achieves a significant performance boost over the compared schemes.
### [68] [SHaSaM: Submodular Hard Sample Mining for Fair Facial Attribute Recognition](https://arxiv.org/abs/2602.05162) *Anay Majee,Rishabh Iyer* Main category: cs.CV TL;DR: This paper proposes SHaSaM, which uses submodular hard-sample mining to address social and demographic bias in deep neural networks, improving fairness without sacrificing performance.
Details Motivation: Deep neural networks readily inherit social and demographic biases from annotated data, producing unfair predictions when sensitive attributes such as race, age, or gender are present. Existing methods struggle with data imbalance between attribute groups and can inadvertently reinforce sensitive attributes, worsening unfairness. Method: SHaSaM (Submodular Hard Sample Mining) has two stages. SHaSaM-MINE uses a submodular subset selection strategy to mine hard positives and negatives, mitigating data imbalance. SHaSaM-LEARN uses combinatorial loss functions based on Submodular Conditional Mutual Information to maximize the decision boundary between target classes while minimizing the influence of sensitive attributes. Result: On CelebA and UTKFace, SHaSaM achieves state-of-the-art results: up to 2.7 points improvement in the fairness metric (Equalized Odds) and a 3.5% gain in accuracy, with faster convergence. Conclusion: Through a unified submodular optimization framework, SHaSaM restricts the model from learning features tied to sensitive attributes, significantly improving fairness while maintaining or even improving performance. Abstract: Deep neural networks often inherit social and demographic biases from annotated data during model training, leading to unfair predictions, especially in the presence of sensitive attributes like race, age, gender, etc. Existing methods fall prey to the inherent data imbalance between attribute groups and inadvertently emphasize sensitive attributes, worsening unfairness and performance. To surmount these challenges, we propose SHaSaM (Submodular Hard Sample Mining), a novel combinatorial approach that models fairness-driven representation learning as a submodular hard-sample mining problem. Our two-stage approach comprises SHaSaM-MINE, which introduces a submodular subset selection strategy to mine hard positives and negatives, effectively mitigating data imbalance, and SHaSaM-LEARN, which introduces a family of combinatorial loss functions based on Submodular Conditional Mutual Information to maximize the decision boundary between target classes while minimizing the influence of sensitive attributes. This unified formulation restricts the model from learning features tied to sensitive attributes, significantly enhancing fairness without sacrificing performance. Experiments on CelebA and UTKFace demonstrate that SHaSaM achieves state-of-the-art results, with up to 2.7 points improvement in model fairness (Equalized Odds) and a 3.5% gain in Accuracy, within fewer epochs as compared to existing methods.
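To make "submodular subset selection" concrete, here is a generic greedy maximizer for a monotone submodular set function. The abstract does not state the objective SHaSaM-MINE optimizes, so the facility-location function below is a textbook stand-in chosen for illustration (it favors representative samples rather than hard ones), and all names are hypothetical.

```python
from typing import List, Sequence


def facility_location(sim: Sequence[Sequence[float]], subset: List[int]) -> float:
    """f(S) = sum_i max_{j in S} sim[i][j]; monotone submodular when sim >= 0."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Repeatedly add the element with the largest marginal gain.
    Ties go to the smallest index."""
    chosen: List[int] = []
    candidates = set(range(len(sim)))
    for _ in range(min(budget, len(sim))):
        base = facility_location(sim, chosen)
        best = max(
            sorted(candidates),
            key=lambda j: facility_location(sim, chosen + [j]) - base,
        )
        chosen.append(best)
        candidates.remove(best)
    return chosen
```

With two clusters of samples, the second pick comes from the cluster not yet covered, because adding another member of an already covered cluster has a smaller marginal gain. That diminishing-returns behavior is what makes greedy selection effective for submodular objectives.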
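### [69] [LOBSTgER-enhance: an underwater image enhancement pipeline](https://arxiv.org/abs/2602.05163) *Andreas Mentzelopoulos,Keith Ellenbogen* Main category: cs.CV TL;DR: This paper proposes a diffusion-based image-to-image translation method that restores the color distortion, blur, and contrast loss of underwater photography, achieving good generalization and perceptual consistency with only about 2,500 high-quality underwater images.
Details Motivation: Underwater photography inherently suffers from low contrast, spatial blur, and wavelength-dependent color distortion, which misrepresents the colors of marine life, and existing correction relies on heavy manual post-processing. Method: Builds a synthetic degradation pipeline that simulates underwater distortions and trains a diffusion model to reverse them; a model of about 11M parameters is trained from scratch on roughly 2,500 high-quality underwater photographs provided by Keith Ellenbogen. Result: The model reliably generates 512×768 images with strong perceptual consistency and generalization. Conclusion: The method shows that a lightweight diffusion model can handle underwater image restoration on a small, high-quality dataset, suggesting a route to image enhancement in low-resource settings. Abstract: Underwater photography presents significant inherent challenges including reduced contrast, spatial blur, and wavelength-dependent color distortions. These effects can obscure the vibrancy of marine life, and awareness photographers in particular often face heavy post-processing pipelines to correct for these distortions. We develop an image-to-image pipeline that learns to reverse underwater degradations by introducing a synthetic corruption pipeline and learning to reverse its effects with diffusion-based generation. Training and evaluation are performed on a small high-quality dataset of awareness photography images by Keith Ellenbogen. The proposed methodology achieves high perceptual consistency and strong generalization in synthesizing 512x768 images using a model of ~11M parameters after training from scratch on ~2.5k images.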
### [70] [ShapePuri: Shape Guided and Appearance Generalized Adversarial Purification](https://arxiv.org/abs/2602.05175) *Zhe Li,Bernhard Kainz* Main category: cs.CV TL;DR: This paper proposes ShapePuri, a shape-guided purification framework that combines a Shape Encoding Module (SEM) with a Global Appearance Debiasing (GAD) module to markedly improve robustness to adversarial attacks without extra computational overhead, and is the first to exceed 80% robust accuracy under AutoAttack.
Details Motivation: Existing defenses such as adversarial training and diffusion-based purification incur high computational cost or information loss, so a more efficient and stable way to improve robustness is needed. Method: Shape Guided Purification (ShapePuri) has two core modules: (1) a Shape Encoding Module (SEM) that provides dense geometric guidance through Signed Distance Functions (SDF); (2) a Global Appearance Debiasing (GAD) module that mitigates appearance bias via stochastic transformations. Result: Reaches 84.06% clean accuracy and 81.64% robust accuracy under the AutoAttack protocol, the first result above the 80% robust-accuracy threshold, without additional computational overhead or auxiliary modules. Conclusion: ShapePuri is a scalable, efficient adversarial defense that keeps predictions stable at inference time and points to structure-guided representations as a direction for robust learning. Abstract: Deep neural networks demonstrate impressive performance in visual recognition, but they remain vulnerable to adversarial attacks that are imperceptible to humans. Although existing defense strategies such as adversarial training and purification have achieved progress, diffusion-based purification often involves high computational costs and information loss. To address these challenges, we introduce Shape Guided Purification (ShapePuri), a novel defense framework that enhances robustness by aligning model representations with stable structural invariants. ShapePuri integrates two components: a Shape Encoding Module (SEM) that provides dense geometric guidance through Signed Distance Functions (SDF), and a Global Appearance Debiasing (GAD) module that mitigates appearance bias via stochastic transformations. In our experiments, ShapePuri achieves $84.06\%$ clean accuracy and $81.64\%$ robust accuracy under the AutoAttack protocol, representing the first defense framework to surpass the $80\%$ threshold on this benchmark. Our approach provides a scalable and efficient adversarial defense that preserves prediction stability during inference without requiring auxiliary modules or additional computational cost.
### [71] [PoseGaussian: Pose-Driven Novel View Synthesis for Robust 3D Human Reconstruction](https://arxiv.org/abs/2602.05190) *Ju Shen,Chen Chen,Tam V. Nguyen,Vijayan K. Asari* Main category: cs.CV TL;DR: PoseGaussian is a pose-guided Gaussian Splatting framework for high-fidelity human novel view synthesis that injects pose information into both geometric and temporal modeling, improving robustness, generalization, and real-time rendering (100 FPS) in dynamic human scenes.
Details Motivation: Address the novel view synthesis challenges posed by articulated motion and severe self-occlusion in dynamic human scenes, and improve the robustness and generalization of existing methods in geometric reconstruction and temporal consistency. Method: PoseGaussian uses pose as a structural prior fused with a color encoder to refine depth estimation, and adds a dedicated pose encoder that extracts temporal cues to strengthen cross-frame consistency; the whole pipeline is differentiable and trainable end to end. Result: Achieves state-of-the-art results on ZJU-MoCap, THuman2.0, and in-house datasets (PSNR 30.86, SSIM 0.979, LPIPS 0.028) with real-time rendering at 100 FPS. Conclusion: Embedding pose signals deeply into geometric and temporal modeling significantly improves the quality and efficiency of human novel view synthesis and offers a new approach to modeling dynamic scenes. Abstract: We propose PoseGaussian, a pose-guided Gaussian Splatting framework for high-fidelity human novel view synthesis. Human body pose serves a dual purpose in our design: as a structural prior, it is fused with a color encoder to refine depth estimation; as a temporal cue, it is processed by a dedicated pose encoder to enhance temporal consistency across frames. These components are integrated into a fully differentiable, end-to-end trainable pipeline. Unlike prior works that use pose only as a condition or for warping, PoseGaussian embeds pose signals into both geometric and temporal stages to improve robustness and generalization. It is specifically designed to address challenges inherent in dynamic human scenes, such as articulated motion and severe self-occlusion. Notably, our framework achieves real-time rendering at 100 FPS, maintaining the efficiency of standard Gaussian Splatting pipelines. We validate our approach on ZJU-MoCap, THuman2.0, and in-house datasets, demonstrating state-of-the-art performance in perceptual quality and structural accuracy (PSNR 30.86, SSIM 0.979, LPIPS 0.028).
### [72] [GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling](https://arxiv.org/abs/2602.05202) *Shivanshu Shekhar,Uttaran Bhattacharya,Raghavendra Addanki,Mehrab Tanjim,Somdeb Sarkhel,Tong Zhang* Main category: cs.CV TL;DR: This paper uses the video generative model itself as the reward model instead of relying on a VLM: the model is reformulated as an energy-based model and trained contrastively against synthetic negatives, which markedly improves its ability to judge temporal video quality and reaches state of the art on several benchmarks with far less annotated data.
Details Motivation: Reward modeling for video generation based on vision-language models (VLMs) struggles to capture subtle temporal dynamics, so an alternative that models temporal structure natively is needed. Method: Reformulates a state-of-the-art video generative model (e.g., a generative transformer) as an energy-based model (EBM) and trains it contrastively to separate high-quality from degraded videos; three controlled latent-space perturbations (temporal slicing, feature swapping, and frame shuffling) generate synthetic negatives so that the model cannot exploit superficial artifacts. Result: Achieves state-of-the-art performance on GenAI-Bench and MonteBench with only 30K human annotations, 6× to 65× fewer than existing VLM-based methods. Conclusion: Video generative models can be repurposed as accurate, temporally aware reward models without additional architectural design and with much better data efficiency. Abstract: Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations: $6\times$ to $65\times$ fewer than existing VLM-based approaches.
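The combination of a synthetic negative and an energy-based contrastive objective can be illustrated with a toy sketch. Several things here are assumptions rather than details from the abstract: the perturbation acts on a list of frame indices instead of latents, the loss is a pairwise softplus on the energy gap (the paper's exact objective is not given), and `toy_energy` is a hypothetical stand-in for the repurposed generative model.

```python
import math
import random
from typing import List, Sequence


def frame_shuffle(frames: Sequence[int], rng: random.Random) -> List[int]:
    """Return a temporally scrambled copy whose order differs from the input."""
    if len(set(frames)) < 2:
        raise ValueError("need at least two distinct frames to shuffle")
    shuffled = list(frames)
    while shuffled == list(frames):
        rng.shuffle(shuffled)
    return shuffled


def pairwise_energy_loss(e_pos: float, e_neg: float) -> float:
    """softplus(E_pos - E_neg): small when the clean video has lower energy."""
    d = e_pos - e_neg
    return max(d, 0.0) + math.log1p(math.exp(-abs(d)))


def toy_energy(frames: Sequence[int]) -> float:
    """Hypothetical energy: number of adjacent frame pairs that are out of order."""
    return float(sum(1 for a, b in zip(frames, frames[1:]) if b < a))
```

A clean, ordered clip gets energy 0 under `toy_energy`, any shuffled copy gets a positive energy, and the loss falls below log 2 exactly when the clean clip is assigned the lower energy.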
### [73] [Dual-Representation Image Compression at Ultra-Low Bitrates via Explicit Semantics and Implicit Textures](https://arxiv.org/abs/2602.05213) *Chuqin Zhou,Xiaoyue Ling,Yunuo Chen,Jincheng Dai,Guo Lu,Wenjun Zhang* Main category: cs.CV TL;DR: This paper proposes a training-free unified framework that combines explicit high-level semantics with implicit detail representations (using a diffusion model and reverse-channel coding) to balance semantic faithfulness and perceptual realism at ultra-low bitrates, clearly outperforming existing methods on several datasets.
Details Motivation: Existing neural codecs degrade sharply at ultra-low bitrates; generative compression is promising but is constrained by the inherent tension between semantic faithfulness and perceptual realism. Method: A training-free unified framework that conditions a diffusion model on explicit high-level semantics while using reverse-channel coding to convey fine-grained details implicitly, plus a plug-in encoder that flexibly controls the distortion-perception tradeoff. Result: On Kodak, DIV2K, and CLIC2020, the method improves on DiffC by 29.92%, 19.33%, and 20.89% in DISTS BD-Rate respectively, reaching state-of-the-art rate-perception performance. Conclusion: Combining explicit and implicit representations eases the tradeoff between semantic and perceptual quality at ultra-low bitrates without any additional training. Abstract: While recent neural codecs achieve strong performance at low bitrates when optimized for perceptual quality, their effectiveness deteriorates significantly under ultra-low bitrate conditions. To mitigate this, generative compression methods leveraging semantic priors from pretrained models have emerged as a promising paradigm. However, existing approaches are fundamentally constrained by a tradeoff between semantic faithfulness and perceptual realism. Methods based on explicit representations preserve content structure but often lack fine-grained textures, whereas implicit methods can synthesize visually plausible details at the cost of semantic drift. In this work, we propose a unified framework that bridges this gap by coherently integrating explicit and implicit representations in a training-free manner. Specifically, we condition a diffusion model on explicit high-level semantics while employing reverse-channel coding to implicitly convey fine-grained details. Moreover, we introduce a plug-in encoder that enables flexible control of the distortion-perception tradeoff by modulating the implicit information. Extensive experiments demonstrate that the proposed framework achieves state-of-the-art rate-perception performance, outperforming existing methods and surpassing DiffC by 29.92%, 19.33%, and 20.89% in DISTS BD-Rate on the Kodak, DIV2K, and CLIC2020 datasets, respectively.
### [74] [E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching](https://arxiv.org/abs/2602.05215) *Jiahao Nie,Wenbin An,Gongjie Zhang,Yicheng Xu,Yap-Peng Tan,Alex C. Kot,Shijian Lu* Main category: cs.CV TL;DR: E.M.Ground is a new video large language model that introduces a special event token, Savitzky-Golay smoothing, and multi-grained frame feature aggregation to better model the semantic continuity and temporal precision of events in temporal video grounding.
Details Motivation: Existing methods match start and end frames separately, which fails to capture the semantic continuity and integrity of an event and leads to ambiguous localization. Method: E.M.Ground (i) introduces a special token that aggregates information from all frames of an event; (ii) applies Savitzky-Golay smoothing to the token-to-frame similarity curve; (iii) uses multi-grained frame feature aggregation to offset compression loss. Result: Significantly outperforms current state-of-the-art video LLMs on multiple benchmark datasets. Conclusion: A holistic, coherent view of events suits temporal video grounding better than separate boundary matching, and E.M.Ground demonstrates that this approach works. Abstract: Despite recent advances in Video Large Language Models (Vid-LLMs), Temporal Video Grounding (TVG), which aims to precisely localize time segments corresponding to query events, remains a significant challenge. Existing methods often match start and end frames by comparing frame features with two separate tokens, relying heavily on exact timestamps. However, this approach fails to capture the event's semantic continuity and integrity, leading to ambiguities. To address this, we propose E.M.Ground, a novel Vid-LLM for TVG that focuses on holistic and coherent event perception. E.M.Ground introduces three key innovations: (i) a special token that aggregates information from all frames of a query event, preserving semantic continuity for accurate event matching; (ii) Savitzky-Golay smoothing to reduce noise in token-to-frame similarities across timestamps, improving prediction accuracy; (iii) multi-grained frame feature aggregation to enhance matching reliability and temporal understanding, compensating for compression-induced information loss. Extensive experiments on benchmark datasets show that E.M.Ground consistently outperforms state-of-the-art Vid-LLMs by significant margins.
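For readers unfamiliar with Savitzky-Golay smoothing, the sketch below applies the classic 5-point quadratic filter (coefficients −3, 12, 17, 12, −3 divided by 35) to a one-dimensional similarity curve. The window length and polynomial order are illustrative choices; the abstract does not say which settings E.M.Ground uses, and the function name is hypothetical.

```python
from typing import List, Sequence

# Classic 5-point quadratic Savitzky-Golay smoothing weights (sum = 35).
SG5 = (-3.0, 12.0, 17.0, 12.0, -3.0)


def savgol5(y: Sequence[float]) -> List[float]:
    """Smooth interior points with a local quadratic least-squares fit.
    The first and last two samples are left unchanged for simplicity."""
    out = [float(v) for v in y]
    for i in range(2, len(y) - 2):
        out[i] = sum(c * y[i + k - 2] for k, c in enumerate(SG5)) / 35.0
    return out
```

Unlike a plain moving average, this filter reproduces any locally quadratic trend exactly, so a smooth rise and fall in similarity around an event keeps its peak location while isolated spikes are damped (a unit spike shrinks to 17/35).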
### [75] [Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation](https://arxiv.org/abs/2602.05217) *Jiahao Nie,Guanqiao Fu,Wenbin An,Yap-Peng Tan,Alex C. Kot,Shijian Lu* Main category: cs.CV TL;DR: This paper proposes Multi-view Progressive Adaptation (MPA), which uses Hybrid Progressive Augmentation and Dual-chain Multi-view Prediction to improve, step by step and from both data and strategy perspectives, how well cross-domain few-shot segmentation models generalize to the target domain, clearly outperforming existing methods.
Details Motivation: Existing cross-domain few-shot segmentation methods are limited by the small number and low diversity of target-domain samples. In addition, the source-pretrained model starts with weak few-shot capability in the target domain and faces a large domain gap, so target samples are used inefficiently and adaptation suffers. Method: The Multi-view Progressive Adaptation (MPA) framework combines (i) a data perspective, Hybrid Progressive Augmentation, which generates increasingly diverse and complex views through cumulative strong augmentations, and (ii) a strategy perspective, Dual-chain Multi-view Prediction, which combines sequential and parallel learning paths and jointly enforces prediction consistency across views under extensive supervision. Result: Outperforms current state-of-the-art methods by a large margin on multiple benchmarks (+7.0%). Conclusion: By jointly optimizing the augmentation strategy and the prediction structure, MPA mitigates domain shift and sample scarcity and makes cross-domain few-shot segmentation more robust and accurate. Abstract: Cross-Domain Few-Shot Segmentation aims to segment categories in data-scarce domains conditioned on a few exemplars. Typical methods first establish few-shot capability in a large-scale source domain and then adapt it to target domains. However, due to the limited quantity and diversity of target samples, existing methods still exhibit constrained performance. Moreover, the source-trained model's initially weak few-shot capability in target domains, coupled with substantial domain gaps, severely hinders the effective utilization of target samples and further impedes adaptation. To this end, we propose Multi-view Progressive Adaptation, which progressively adapts few-shot capability to target domains from both data and strategy perspectives. (i) From the data perspective, we introduce Hybrid Progressive Augmentation, which progressively generates more diverse and complex views through cumulative strong augmentations, thereby creating increasingly challenging learning scenarios. (ii) From the strategy perspective, we design Dual-chain Multi-view Prediction, which fully leverages these progressively complex views through sequential and parallel learning paths under extensive supervision. By jointly enforcing prediction consistency across diverse and complex views, MPA achieves both robust and accurate adaptation to target domains. Extensive experiments demonstrate that MPA effectively adapts few-shot capability to target domains, outperforming state-of-the-art methods by a large margin (+7.0%).
### [76] [Boosting SAM for Cross-Domain Few-Shot Segmentation via Conditional Point Sparsification](https://arxiv.org/abs/2602.05218) *Jiahao Nie,Yun Xing,Wenbin An,Qingsong Zhao,Jiawei Shao,Yap-Peng Tan,Alex C. Kot,Shijian Lu,Xuelong Li* Main category: cs.CV TL;DR: This paper proposes Conditional Point Sparsification (CPS) to improve SAM on cross-domain few-shot segmentation (CD-FSS): it adaptively sparsifies matched points to counter the breakdown of point-image interactions caused by domain shift.
Details Motivation: Existing SAM-based few-shot segmentation methods fail in cross-domain settings such as medical and remote-sensing imagery because domain shift breaks dense point matching, so the point prompting strategy needs to change. Method: Training-free Conditional Point Sparsification (CPS) uses the ground-truth masks of the reference images as guidance to adaptively sparsify the dense points matched between reference and target images, improving SAM's cross-domain interactions. Result: CPS clearly outperforms existing training-free SAM-based methods on multiple cross-domain few-shot segmentation datasets. Conclusion: Point density is critical under domain shift; conditional sparsification improves the generalization and robustness of SAM on CD-FSS. Abstract: Motivated by the success of the Segment Anything Model (SAM) in promptable segmentation, recent studies leverage SAM to develop training-free solutions for few-shot segmentation, which aims to predict object masks in the target image based on a few reference exemplars. These SAM-based methods typically rely on point matching between reference and target images and use the matched dense points as prompts for mask prediction. However, we observe that dense points perform poorly in Cross-Domain Few-Shot Segmentation (CD-FSS), where target images are from medical or satellite domains. We attribute this issue to large domain shifts that disrupt the point-image interactions learned by SAM, and find that point density plays a crucial role under such conditions. To address this challenge, we propose Conditional Point Sparsification (CPS), a training-free approach that adaptively guides SAM interactions for cross-domain images based on reference exemplars. Leveraging ground-truth masks, the reference images provide reliable guidance for adaptively sparsifying dense matched points, enabling more accurate segmentation results. Extensive experiments demonstrate that CPS outperforms existing training-free SAM-based methods across diverse CD-FSS datasets.
### [77] [PatchFlow: Leveraging a Flow-Based Model with Patch Features](https://arxiv.org/abs/2602.05238) *Boxiang Zhang,Baijian Yang,Xiaoming Wang,Corey Vian* Main category: cs.CV TL;DR: This paper proposes an unsupervised anomaly detection method that combines local neighbor-aware patch features with a normalizing flow model and adds an adapter module to help pretrained features transfer to industrial images, clearly improving detection accuracy on several datasets.
Details Motivation: Address the difficulties of automated quality control for surface defects in die castings, in particular the lack of anomalous samples and the poor fit of generic pretrained models to industrial images. Method: Fuses local neighbor-aware patch features with a normalizing flow model and designs an adapter module to bridge the domain gap between a generic pretrained feature extractor and industrial product images, enabling unsupervised anomaly detection without anomalous samples. Result: Image-level AUROC of 99.28% on MVTec AD (20% lower error rate) and 96.48% on VisA (28.2% lower error rate); 95.77% accuracy on a proprietary die casting dataset, with no anomalous samples needed for training. Conclusion: The method improves the accuracy and generalization of unsupervised anomaly detection in industrial settings and shows the practical potential of computer vision and deep learning for die casting inspection. Abstract: Die casting plays a crucial role across various industries due to its ability to craft intricate shapes with high precision and smooth surfaces. However, surface defects remain a major issue that impedes die casting quality control. Recently, computer vision techniques have been explored to automate and improve defect detection. In this work, we combine local neighbor-aware patch features with a normalizing flow model and bridge the gap between the generic pretrained feature extractor and industrial product images by introducing an adapter module to increase the efficiency and accuracy of automated anomaly detection. Compared to state-of-the-art methods, our approach reduces the error rate by 20% on the MVTec AD dataset, achieving an image-level AUROC of 99.28%. Our approach also improves performance on the VisA dataset, achieving an image-level AUROC of 96.48%. Compared to the state-of-the-art models, this represents a 28.2% reduction in error. Additionally, experiments on a proprietary die casting dataset yield an accuracy of 95.77% for anomaly detection, without requiring any anomalous samples for training. Our method illustrates the potential of leveraging computer vision and deep learning techniques to advance inspection capabilities for the die casting industry.
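The "error reduction" figures are easier to read once converted back to an implied baseline. The short calculation below assumes that the error rate means 100 minus the image-level AUROC and that the reduction is relative to the previous best method; the abstract does not define either, and the function name is hypothetical.

```python
def implied_baseline_auroc(auroc: float, error_reduction: float) -> float:
    """Given an AUROC (in percent) and a relative error reduction (0..1),
    return the baseline AUROC implied by error = 100 - AUROC."""
    error = 100.0 - auroc
    baseline_error = error / (1.0 - error_reduction)
    return 100.0 - baseline_error


mvtec_baseline = implied_baseline_auroc(99.28, 0.20)
visa_baseline = implied_baseline_auroc(96.48, 0.282)
```

Under that reading, the reported numbers correspond to baselines of roughly 99.10% on MVTec AD and 95.10% on VisA, so the gains are a few tenths of a point to about 1.4 points of AUROC in absolute terms.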
### [78] [Active Label Cleaning for Reliable Detection of Electron Dense Deposits in Transmission Electron Microscopy Images](https://arxiv.org/abs/2602.05250) *Jieyun Tan,Shuo Liu,Guibin Zhang,Ziqi Li,Jian Geng,Lei Zhang,Lei Cao* Main category: cs.CV TL;DR: This paper proposes an active label cleaning method that improves electron dense deposit (EDD) detection models trained on crowdsourced annotations: active learning selects the most valuable noisy samples for expert re-annotation, approaching fully expert-annotated performance at a much lower cost.
Details Motivation: Automated detection of electron dense deposits (EDD) is limited by the scarcity of high-quality labeled data; crowdsourcing lowers annotation cost but introduces label noise. Method: An active label cleaning method that uses active learning to pick the most valuable noisy samples for expert re-annotation, with a Label Selection Module that exploits discrepancies between crowdsourced labels and model predictions for both sample selection and instance-level noise grading. Result: Reaches 67.18% AP$_{50}$ on a private dataset, an 18.83% improvement over training directly on noisy labels; this is 95.79% of the performance with full expert annotation, at 73.30% lower annotation cost. Conclusion: The method offers a practical, low-cost way to build reliable medical AI systems when expert resources are limited. Abstract: Automated detection of electron dense deposits (EDD) in glomerular disease is hindered by the scarcity of high-quality labeled data. While crowdsourcing reduces annotation cost, it introduces label noise. We propose an active label cleaning method to efficiently denoise crowdsourced datasets. Our approach uses active learning to select the most valuable noisy samples for expert re-annotation, building high-accuracy cleaning models. A Label Selection Module leverages discrepancies between crowdsourced labels and model predictions for both sample selection and instance-level noise grading. Experiments show our method achieves 67.18% AP$_{50}$ on a private dataset, an 18.83% improvement over training on noisy labels. This performance reaches 95.79% of that with full expert annotation while reducing annotation cost by 73.30%. The method provides a practical, cost-effective solution for developing reliable medical AI with limited expert resources.
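One simple way to turn "discrepancies between crowdsourced labels and model predictions" into a selection rule is to score each sample by how little probability the model assigns to its crowd label and send the top scorers to an expert. This is a hedged sketch of that general idea, not the paper's Label Selection Module; the scoring rule, the function names, and the label strings are assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[str, str, Dict[str, float]]  # (sample id, crowd label, model probabilities)


def noise_score(crowd_label: str, model_probs: Dict[str, float]) -> float:
    """1 - model probability of the crowd label; higher means more suspicious."""
    return 1.0 - model_probs.get(crowd_label, 0.0)


def select_for_relabel(samples: Sequence[Sample], budget: int) -> List[str]:
    """Return the ids of the `budget` samples whose crowd label disagrees
    most with the model, breaking ties by id for determinism."""
    scored = [(noise_score(label, probs), sid) for sid, label, probs in samples]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [sid for _, sid in scored[:budget]]
```

The same score could also serve as an instance-level noise grade, which is the second use of the discrepancy signal mentioned in the abstract.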
### [79] [RFM-Pose: Reinforcement-Guided Flow Matching for Fast Category-Level 6D Pose Estimation](https://arxiv.org/abs/2602.05257) *Diya He,Qingchen Liu,Cong Zhang,Jiahu Qin* Main category: cs.CV TL;DR: This paper proposes RFM-Pose, which combines a flow-matching generative model with reinforcement learning (PPO) to improve the sampling efficiency and accuracy of category-level 6D object pose estimation, substantially cutting computational cost on REAL275 while keeping strong performance.
Details Motivation: Score-based generative models ease the rotational symmetry ambiguity in category-level pose estimation, but their sampling is expensive and inefficient. Method: RFM-Pose (1) adopts a flow-matching generative model that generates poses along an optimal transport path from a simple prior; (2) casts sampling as a Markov decision process and uses proximal policy optimization (PPO) to jointly optimize the flow field (as the policy) and the pose estimator (as the value network). Result: On the REAL275 benchmark, RFM-Pose keeps favorable performance while significantly reducing computational cost; it also extends naturally to pose tracking with competitive results. Conclusion: Modeling flow matching and reinforcement learning together improves both the efficiency and the quality of category-level 6D pose generation and offers a new approach to generative pose estimation. Abstract: Object pose estimation is a fundamental problem in computer vision and plays a critical role in virtual reality and embodied intelligence, where agents must understand and interact with objects in 3D space. Recently, score-based generative models have partly resolved the rotational symmetry ambiguity problem in category-level pose estimation, but their efficiency remains limited by the high sampling cost of score-based diffusion. In this work, we propose a new framework, RFM-Pose, that accelerates category-level 6D object pose generation while actively evaluating sampled hypotheses. To improve sampling efficiency, we adopt a flow-matching generative model and generate pose candidates along an optimal transport path from a simple prior to the pose distribution. To further refine these candidates, we cast the flow-matching sampling process as a Markov decision process and apply proximal policy optimization to fine-tune the sampling policy. In particular, we interpret the flow field as a learnable policy and map an estimator to a value network, enabling joint optimization of pose generation and hypothesis scoring within a reinforcement learning framework. Experiments on the REAL275 benchmark demonstrate that RFM-Pose achieves favorable performance while significantly reducing computational cost. Moreover, similar to prior work, our approach can be readily adapted to object pose tracking and attains competitive results in this setting.
### [80] [ReGLA: Efficient Receptive-Field Modeling with Gated Linear Attention Network](https://arxiv.org/abs/2602.05262) *Junzhou Li,Manqi Zhao,Yilin Gao,Zhiheng Yu,Yin Li,Dongsheng Jiang,Li Xiao* Main category: cs.CV TL;DR: This paper proposes ReGLA, a family of lightweight hybrid networks that combine efficient convolutions with ReLU-based gated linear attention to strike a good balance between accuracy and latency on high-resolution image tasks.
Details Motivation: Lightweight models, especially Transformer-based ones, struggle to deliver both accuracy and low latency on high-resolution images. Method: ReGLA comprises the ELRF module (more efficient convolution with a large receptive field), the RGMA module (stronger local representation at linear complexity), and a multi-teacher knowledge distillation strategy. Result: ReGLA-M reaches 80.85% Top-1 accuracy on ImageNet-1K at 224 px with only 4.98 ms latency at 512 px; on COCO detection and ADE20K segmentation it outperforms iFormer by 3.1% AP and 3.6% mIoU respectively. Conclusion: ReGLA is a state-of-the-art lightweight solution for high-resolution vision tasks, performing well in accuracy, speed, and downstream generalization. Abstract: Balancing accuracy and latency on high-resolution images is a critical challenge for lightweight models, particularly for Transformer-based architectures that often suffer from excessive latency. To address this issue, we introduce ReGLA, a series of lightweight hybrid networks, which integrates efficient convolutions for local feature extraction with ReLU-based gated linear attention for global modeling. The design incorporates three key innovations: the Efficient Large Receptive Field (ELRF) module for enhancing convolutional efficiency while preserving a large receptive field; the ReLU Gated Modulated Attention (RGMA) module for maintaining linear complexity while enhancing local feature representation; and a multi-teacher distillation strategy to boost performance on downstream tasks. Extensive experiments validate the superiority of ReGLA; in particular, ReGLA-M achieves 80.85% Top-1 accuracy on ImageNet-1K at 224 px, with only 4.98 ms latency at 512 px. Furthermore, ReGLA outperforms similarly scaled iFormer models in downstream tasks, achieving gains of 3.1% AP on COCO object detection and 3.6% mIoU on ADE20K semantic segmentation, establishing it as a state-of-the-art solution for high-resolution visual applications.
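Why ReLU-based linear attention has linear complexity can be shown with a few lines of plain Python. The sketch below is generic ReLU-kernel linear attention, not the RGMA module: gating and modulation are omitted, the normalization with `eps` is an assumption, and all function names are hypothetical. It computes the same result two ways, once through the n×n score matrix and once through a d×d key-value summary.

```python
from typing import List

Matrix = List[List[float]]


def relu(m: Matrix) -> Matrix:
    return [[max(x, 0.0) for x in row] for row in m]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def quadratic_attention(q: Matrix, k: Matrix, v: Matrix, eps: float = 1e-6) -> Matrix:
    """O(n^2): build the n x n scores relu(Q) relu(K)^T, then normalize each row."""
    scores = matmul(relu(q), transpose(relu(k)))
    out = matmul(scores, v)
    return [[x / (sum(s) + eps) for x in row] for row, s in zip(out, scores)]


def linear_attention(q: Matrix, k: Matrix, v: Matrix, eps: float = 1e-6) -> Matrix:
    """O(n): summarize keys and values once, then reuse the summary per query."""
    fq, fk = relu(q), relu(k)
    kv = matmul(transpose(fk), v)            # d x d_v summary
    k_sum = [sum(col) for col in zip(*fk)]   # length-d normalizer
    num = matmul(fq, kv)
    den = [sum(a * b for a, b in zip(row, k_sum)) + eps for row in fq]
    return [[x / d for x in row] for row, d in zip(num, den)]
```

The two functions agree up to floating-point rounding because matrix multiplication is associative: (φ(Q)φ(K)ᵀ)V = φ(Q)(φ(K)ᵀV). Only the second form avoids a cost that grows with the square of the number of tokens, which matters at high resolution.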
### [81] [Unlocking Prototype Potential: An Efficient Tuning Framework for Few-Shot Class-Incremental Learning](https://arxiv.org/abs/2602.05271) *Shengqin Jiang,Xiaoran Feng,Yuankai Qi,Haokui Zhang,Renlong Hang,Qingshan Liu,Lina Yao,Quan Z. Sheng,Ming-Hsuan Yang* Main category: cs.CV TL;DR: This paper proposes a new few-shot class-incremental learning method that fine-tunes the prototypes rather than the backbone, optimizing decision regions within a static, high-quality feature space and using a dual-calibration method to make the prototypes more discriminative.
Details Motivation: Traditional FSCIL methods generate static class prototypes from a frozen pretrained feature extractor and inherit the backbone's representation bias, while prompt-based tuning cannot substantially improve global discriminative power under extreme data scarcity. Method: Freezes the feature extractor and fine-tunes the prototypes, using a dual-calibration method (class-specific and task-aware offsets) that turns static centroids into dynamic, learnable components. Result: Achieves superior performance on multiple benchmarks with very few learnable parameters. Conclusion: The core challenge in FSCIL is optimizing decision regions in a static, high-quality feature space rather than acquiring features; fine-tuning prototypes is more efficient and effective than fine-tuning the backbone. Abstract: Few-shot class-incremental learning (FSCIL) seeks to continuously learn new classes from very limited samples while preserving previously acquired knowledge. Traditional methods often utilize a frozen pre-trained feature extractor to generate static class prototypes, which suffer from the inherent representation bias of the backbone. While recent prompt-based tuning methods attempt to adapt the backbone via minimal parameter updates, given the constraint of extreme data scarcity, the model's capacity to assimilate novel information and substantively enhance its global discriminative power is inherently limited. In this paper, we propose a novel shift in perspective: freezing the feature extractor while fine-tuning the prototypes. We argue that the primary challenge in FSCIL is not feature acquisition, but rather the optimization of decision regions within a static, high-quality feature space. To this end, we introduce an efficient prototype fine-tuning framework that evolves static centroids into dynamic, learnable components. The framework employs a dual-calibration method consisting of class-specific and task-aware offsets. These components function synergistically to improve the discriminative capacity of prototypes for ongoing incremental classes. Extensive results demonstrate that our method attains superior performance across multiple benchmarks while requiring minimal learnable parameters.
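The idea of moving prototypes while the features stay fixed can be illustrated with a nearest-prototype classifier. The sketch assumes the two offsets are simply added to the class centroid and that classification uses cosine similarity; the abstract specifies neither, and the function names and example vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def calibrated_prototype(
    centroid: Sequence[float],
    class_offset: Sequence[float],
    task_offset: Sequence[float],
) -> List[float]:
    """Assumed additive form: prototype = centroid + class offset + task offset."""
    return [c + a + b for c, a, b in zip(centroid, class_offset, task_offset)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def classify(feature: Sequence[float], prototypes: Dict[int, Sequence[float]]) -> int:
    """Assign the class whose prototype is most cosine-similar to the feature."""
    return max(prototypes, key=lambda c: cosine(feature, prototypes[c]))
```

With static centroids `[1, 0]` and `[0, 1]`, the feature `[0.6, 0.8]` goes to class 1; after shifting the class-0 prototype by an offset of `[0.5, 1.0]`, the same frozen feature is assigned to class 0. The decision region changed without touching the feature extractor, and the only trainable quantities are the offsets.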
### [82] [Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs](https://arxiv.org/abs/2602.05275) *Qi Li,Yanzhe Zhao,Yongxin Zhou,Yameng Wang,Yandong Yang,Yuanjia Zhou,Jue Wang,Zuojian Wang,Jinxiang Liu* Main category: cs.CV TL;DR: This paper proposes the Magic-MM-Embedding model series, which uses visual token compression and a multi-stage progressive training strategy to achieve both high efficiency and high performance in universal multimodal embedding.
Details Motivation: Multimodal large language models (MLLMs) hold great promise for universal multimodal retrieval, but their practical use is limited by the high computational cost of processing large numbers of visual tokens. Method: Magic-MM-Embedding combines (1) an efficient MLLM architecture based on visual token compression with (2) a three-stage progressive training strategy: continued pretraining, contrastive pretraining with hard negative mining, and task-aware fine-tuning guided by an MLLM-as-a-Judge. Result: Experiments show that the model outperforms existing methods by a large margin while being more efficient at inference. Conclusion: Magic-MM-Embedding balances efficiency and performance in multimodal embedding and offers a path toward practical MLLM-based retrieval. Abstract: Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. But their practical application is often hindered by the substantial computational cost incurred from processing a large number of tokens from visual inputs. In this paper, we propose Magic-MM-Embedding, a series of novel models that achieve both high efficiency and state-of-the-art performance in universal multimodal embedding. Our approach is built on two synergistic pillars: (1) a highly efficient MLLM architecture incorporating visual token compression to drastically reduce inference latency and memory footprint, and (2) a multi-stage progressive training strategy designed to not only recover but significantly boost performance. This coarse-to-fine training paradigm begins with extensive continued pretraining to restore multimodal understanding and generation capabilities, progresses to large-scale contrastive pretraining and hard negative mining to enhance discriminative power, and culminates in a task-aware fine-tuning stage guided by an MLLM-as-a-Judge for precise data curation. Comprehensive experiments show that our model outperforms existing methods by a large margin while being more inference-efficient.
### [83] [Fast-SAM3D: 3Dfy Anything in Images but Faster](https://arxiv.org/abs/2602.05293) *Weilun Feng,Mingqiang Wu,Zhiliang Chen,Chuanguang Yang,Haotong Qin,Yuqi Li,Xiaokun Liu,Guoxin Fan,Zhulin An,Libo Huang,Yulun Zhang,Michele Magno,Yongjun Xu* Main category: cs.CV TL;DR: This paper proposes Fast-SAM3D, a training-free acceleration framework that uses three heterogeneity-aware mechanisms (modality-aware step caching, joint spatiotemporal token carving, and spectral-aware token aggregation) to speed up SAM3D inference for open-world 3D reconstruction by up to 2.67× while preserving reconstruction quality.
Details Motivation: SAM3D supports scalable, open-world 3D reconstruction, but its high inference latency hinders deployment; generic acceleration methods are brittle because they ignore its multi-level heterogeneity (different dynamics of shape and layout, sparsity of texture refinement, and spectral variance across geometries). Method: Fast-SAM3D integrates three training-free, heterogeneity-aware mechanisms: (1) Modality-Aware Step Caching, which decouples structural evolution from layout updates; (2) Joint Spatiotemporal Token Carving, which concentrates refinement on high-entropy regions; (3) Spectral-Aware Token Aggregation, which adapts the decoding resolution. Result: Fast-SAM3D delivers up to 2.67× end-to-end speedup with negligible fidelity loss, establishing a new efficiency-quality Pareto frontier for single-view 3D generation. Conclusion: Ignoring multi-level heterogeneity is the main reason existing acceleration strategies fail; by matching computation to generation complexity dynamically, Fast-SAM3D offers a new approach to efficient 3D generation. Abstract: SAM3D enables scalable, open-world 3D reconstruction from complex scenes, yet its deployment is hindered by prohibitive inference latency. In this work, we conduct the first systematic investigation into its inference dynamics, revealing that generic acceleration strategies are brittle in this context. We demonstrate that these failures stem from neglecting the pipeline's inherent multi-level heterogeneity: the kinematic distinctiveness between shape and layout, the intrinsic sparsity of texture refinement, and the spectral variance across geometries. To address this, we present Fast-SAM3D, a training-free framework that dynamically aligns computation with instantaneous generation complexity. Our approach integrates three heterogeneity-aware mechanisms: (1) Modality-Aware Step Caching to decouple structural evolution from sensitive layout updates; (2) Joint Spatiotemporal Token Carving to concentrate refinement on high-entropy regions; and (3) Spectral-Aware Token Aggregation to adapt decoding resolution. Extensive experiments demonstrate that Fast-SAM3D delivers up to 2.67$\times$ end-to-end speedup with negligible fidelity loss, establishing a new Pareto frontier for efficient single-view 3D generation. Our code is released at https://github.com/wlfeng0509/Fast-SAM3D.
### [84] [FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion](https://arxiv.org/abs/2602.05305) *Zhuokun Chen,Jianfei Cai,Bohan Zhuang* Main category: cs.CV TL;DR: This paper proposes FlashBlock, a caching mechanism that exploits the stability of block-external attention outputs to reduce attention computation and KV cache access in long-context diffusion models, significantly improving inference efficiency with almost no loss in generation quality.
Details Motivation: Block diffusion still incurs high overhead in long-context settings because attention is repeatedly computed over a growing KV cache; the authors identify an underexploited property, the cross-step redundancy of block-external attention. Method: FlashBlock caches and reuses the stable outputs of block-external attention to avoid recomputation; the method is orthogonal to sparse attention and can serve as a complementary residual reuse strategy. Result: On diffusion language models and video generation, it achieves up to 1.44× higher token throughput and up to a 1.6× reduction in attention time, with almost no loss in generation quality. Conclusion: By exploiting the cross-step stability of attention in block diffusion, FlashBlock alleviates the computational bottleneck in long-context settings and is an efficient, plug-and-play, broadly compatible optimization. Abstract: Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44× higher token throughput and up to 1.6× reduction in attention time, with negligible impact on generation quality. Project page: https://caesarhhh.github.io/FlashBlock/.
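The FlashBlock abstract does not say how a cached block-external attention output is recombined with freshly computed block-internal attention. One standard way to combine attention over two disjoint key/value sets, used by blockwise attention kernels, is a log-sum-exp merge; the minimal sketch below illustrates that identity only and is not the authors' implementation. The function names `partial_attention` and `merge` and the toy tensors are hypothetical.

```python
import math

def partial_attention(q, keys, values):
    """Softmax attention of one query over a subset of keys/values.

    Returns the normalized output and the log-sum-exp of the scaled scores,
    which is all that is needed to merge this result with another subset.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)

def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial results into attention over the union of both subsets."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]

# Toy data (hypothetical): two tokens outside the current block, one inside.
q = [0.5, -1.0]
ext_k, ext_v = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]
int_k, int_v = [[0.5, 0.5]], [[5.0, 6.0]]

cached_out, cached_lse = partial_attention(q, ext_k, ext_v)  # the part a cache could hold
fresh_out, fresh_lse = partial_attention(q, int_k, int_v)    # the part recomputed each step
merged = merge(cached_out, cached_lse, fresh_out, fresh_lse)
full, _ = partial_attention(q, ext_k + int_k, ext_v + int_v)  # reference: attention over everything
```

The merge is exact here because the query and the external keys and values are identical in both computations. Reusing a cached external output across diffusion steps, where the query changes, is an approximation whose quality depends on the stability the abstract reports.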
### [85] [Wid3R: Wide Field-of-View 3D Reconstruction via Camera Model Conditioning](https://arxiv.org/abs/2602.05321) *Dongki Jung,Jaehoon Choi,Adil Qureshi,Somi Jeong,Dinesh Manocha,Suyong Yeon* Main category: cs.CV TL;DR: Wid3R is a feed-forward neural network for visual geometry reconstruction that supports wide field-of-view camera models; it performs multi-view 3D reconstruction directly from 360-degree imagery, shows zero-shot robustness, and significantly outperforms prior methods.
Details Motivation: Existing methods typically assume that input images are rectified or captured with pinhole cameras, which limits their use in real-world scenarios with fisheye or panoramic cameras; distortion-aware 3D reconstruction for wide field-of-view cameras remains to be addressed. Method: Wid3R uses a ray representation based on spherical harmonics and a novel camera model token inside the network to enable distortion-aware 3D reconstruction; it is the first multi-view foundation model to support feed-forward 3D reconstruction from 360-degree imagery. Result: Improvements of up to +77.33 on Stanford2D3D, with strong zero-shot robustness and consistent gains over prior methods. Conclusion: Wid3R removes the reliance on pinhole cameras and offers a new paradigm for general, calibration-free, end-to-end 3D reconstruction with wide field-of-view cameras. Abstract: We present Wid3R, a feed-forward neural network for visual geometry reconstruction that supports wide field-of-view camera models. Prior methods typically assume that input images are rectified or captured with pinhole cameras, since both their architectures and training datasets are tailored to perspective images only. These assumptions limit their applicability in real-world scenarios that use fisheye or panoramic cameras and often require careful calibration and undistortion. In contrast, Wid3R is a generalizable multi-view 3D estimation method that can model wide field-of-view camera types. Our approach leverages a ray representation with spherical harmonics and a novel camera model token within the network, enabling distortion-aware 3D reconstruction. Furthermore, Wid3R is the first multi-view foundation model to support feed-forward 3D reconstruction directly from 360 imagery. It demonstrates strong zero-shot robustness and consistently outperforms prior methods, achieving improvements of up to +77.33 on Stanford2D3D.
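### [86] [MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors](https://arxiv.org/abs/2602.05330) *Jingdong Zhang,Xiaohang Zhan,Lingzhi Zhang,Yizhou Wang,Zhengming Yu,Jionghao Wang,Wenping Wang,Xin Li* Main category: cs.CV TL;DR: This paper proposes MTPano, a label-free multi-task panoramic foundation model that generates pseudo-labels from perspective-domain priors and uses geometry-aware modulation layers and ERP token mixers to handle the geometric distortion and task interference of panoramic images, achieving state-of-the-art (SOTA) panoramic scene understanding.
Details Motivation: Panoramic scene understanding is hampered by the scarcity of high-resolution, multi-task annotations, and existing perspective foundation models transfer poorly because of spherical geometric distortion and coordinate system discrepancies; moreover, the relations among dense prediction tasks in spherical space are underexplored. Method: MTPano (1) uses perspective foundation models to generate domain-gap-free pseudo-labels on projected perspective views and re-projects them as panoramic supervision; (2) divides tasks into rotation-invariant and rotation-variant groups and introduces the Panoramic Dual BridgeNet, which disentangles the feature streams with geometry-aware modulation layers; (3) adds ERP token mixers and a dual-branch BridgeNet with gradient truncation to mitigate distortion and promote beneficial cross-task interaction; and (4) adds auxiliary tasks such as image gradient and point map prediction to strengthen learning. Result: MTPano achieves SOTA performance on multiple panoramic benchmarks and is competitive with task-specific panoramic specialist foundation models. Conclusion: Through a label-free training paradigm, a geometry-aware architecture, and cross-task collaboration, MTPano addresses data scarcity, geometric distortion, and task interference in panoramic multi-task learning, offering a new paradigm for panoramic foundation models. Abstract: Comprehensive panoramic scene understanding is critical for immersive applications, yet it remains challenging due to the scarcity of high-resolution, multi-task annotations. While perspective foundation models have achieved success through data scaling, directly adapting them to the panoramic domain often fails due to severe geometric distortions and coordinate system discrepancies. Furthermore, the underlying relations between diverse dense prediction tasks in spherical spaces are underexplored. To address these challenges, we propose MTPano, a robust multi-task panoramic foundation model established by a label-free training pipeline. First, to circumvent data scarcity, we leverage powerful perspective dense priors. We project panoramic images into perspective patches to generate accurate, domain-gap-free pseudo-labels using off-the-shelf foundation models, which are then re-projected to serve as patch-wise supervision. Second, to tackle the interference between task types, we categorize tasks into rotation-invariant (e.g., depth, segmentation) and rotation-variant (e.g., surface normals) groups. We introduce the Panoramic Dual BridgeNet, which disentangles these feature streams via geometry-aware modulation layers that inject absolute position and ray direction priors. 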
To handle the distortion from equirectangular projections (ERP), we incorporate ERP token mixers followed by a dual-branch BridgeNet for interactions with gradient truncation, facilitating beneficial cross-task information sharing while blocking conflicting gradients from incompatible task attributes. Additionally, we introduce auxiliary tasks (image gradient, point map, etc.) to fertilize the cross-task learning process. Extensive experiments demonstrate that MTPano achieves state-of-the-art performance on multiple benchmarks and delivers competitive results against task-specific panoramic specialist foundation models.
### [87] [Consistency-Preserving Concept Erasure via Unsafe-Safe Pairing and Directional Fisher-weighted Adaptation](https://arxiv.org/abs/2602.05339) *Yongwoo Kim,Sungmin Cha,Hyunsoo Kim,Jaewon Lee,Donghyun Kim* Main category: cs.CV TL;DR: This paper proposes PAIR, a framework that uses unsafe-safe concept pairs for consistency-preserving semantic realignment, addressing the structural and semantic inconsistency that arises when erasing concepts from text-to-image diffusion models.
Details Motivation: Existing concept erasure methods focus only on removing unsafe concepts and give no guidance toward safe alternatives, leading to poor structural and semantic consistency between original and erased generations. Method: PAIR (1) builds paired unsafe-safe multimodal data; (2) applies a paired semantic realignment objective that explicitly maps target concepts to semantically aligned safe anchors; and (3) initializes the DoRA low-rank adaptation matrices with Fisher-information weighting. Result: PAIR significantly outperforms SOTA baselines across experiments, erasing the target concepts effectively while preserving structural integrity, semantic coherence, and generation quality. Conclusion: PAIR reframes concept erasure from simple removal to consistency-preserving semantic realignment, enabling fine-grained, controllable, and semantically consistent erasure of unsafe concepts. Abstract: With the increasing versatility of text-to-image diffusion models, the ability to selectively erase undesirable concepts (e.g., harmful content) has become indispensable. However, existing concept erasure approaches primarily focus on removing unsafe concepts without providing guidance toward corresponding safe alternatives, which often leads to failure in preserving the structural and semantic consistency between the original and erased generations. In this paper, we propose a novel framework, PAIRed Erasing (PAIR), which reframes concept erasure from simple removal to consistency-preserving semantic realignment using unsafe-safe pairs. We first generate safe counterparts from unsafe inputs while preserving structural and semantic fidelity, forming paired unsafe-safe multimodal data. Leveraging these pairs, we introduce two key components: (1) Paired Semantic Realignment, a guided objective that uses unsafe-safe pairs to explicitly map target concepts to semantically aligned safe anchors; and (2) Fisher-weighted Initialization for DoRA, which initializes parameter-efficient low-rank adaptation matrices using unsafe-safe pairs, encouraging the generation of safe alternatives while selectively suppressing unsafe concepts. Together, these components enable fine-grained erasure that removes only the targeted concepts while maintaining overall semantic consistency. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving effective concept erasure while preserving structural integrity, semantic coherence, and generation quality.
### [88] [Learning with Adaptive Prototype Manifolds for Out-of-Distribution Detection](https://arxiv.org/abs/2602.05349) *Ningkang Peng,JiuTao Zhou,Yuhao Zhang,Xiaoqian Peng,Qianfeng Yu,Linjing Qian,Tingyu Lu,Yi Chen,Yanhui Gu* Main category: cs.CV TL;DR: This paper proposes APEX, a framework that uses an Adaptive Prototype Manifold (APM) and a Posterior-Aware OOD Scoring (PAOS) mechanism to address two fundamental flaws of existing prototype-based methods, the static homogeneity assumption and the learning-inference disconnect, significantly improving OOD detection.
Details Motivation: Existing prototype-based representation learning methods have two fundamental flaws: the static homogeneity assumption (a fixed representational budget for every class) and the learning-inference disconnect (prototype quality knowledge is ignored at inference), which limit model capacity and performance. Method: APEX applies a two-stage repair: (1) the Adaptive Prototype Manifold (APM) uses the Minimum Description Length (MDL) principle to determine the optimal number of prototypes $K_c^*$ for each class automatically, resolving prototype collision; (2) Posterior-Aware OOD Scoring (PAOS) quantifies prototype cohesion and separation to bridge the learning-inference disconnect. Result: Experiments on benchmarks such as CIFAR-100 show that APEX achieves new SOTA performance. Conclusion: Through adaptive prototype design and posterior-aware scoring, APEX overcomes fundamental limitations of prototype learning and significantly improves OOD detection capability and robustness. Abstract: Out-of-distribution (OOD) detection is a critical task for the safe deployment of machine learning models in the real world. Existing prototype-based representation learning methods have demonstrated exceptional performance. However, we identify two fundamental flaws that universally constrain these methods: the Static Homogeneity Assumption (fixed representational resources for all classes) and the Learning-Inference Disconnect (discarding rich prototype quality knowledge at inference). These flaws fundamentally limit the model's capacity and performance. To address these issues, we propose APEX (Adaptive Prototype for eXtensive OOD Detection), a novel OOD detection framework designed via a Two-Stage Repair process to optimize the learned feature manifold. APEX introduces two key innovations to address these respective flaws: (1) an Adaptive Prototype Manifold (APM), which leverages the Minimum Description Length (MDL) principle to automatically determine the optimal prototype complexity $K_c^*$ for each class, thereby fundamentally resolving prototype collision; and (2) a Posterior-Aware OOD Scoring (PAOS) mechanism, which quantifies prototype quality (cohesion and separation) to bridge the learning-inference disconnect. Comprehensive experiments on benchmarks such as CIFAR-100 validate the superiority of our method, where APEX achieves new state-of-the-art performance.
### [89] [Multimodal Latent Reasoning via Hierarchical Visual Cues Injection](https://arxiv.org/abs/2602.05359) *Yiming Zhang,Qiangyu Yan,Borui Jiang,Kai Han* Main category: cs.CV TL;DR: This paper proposes HIVE, a framework that injects hierarchical visual cues into the latent space to enable multimodal "slow thinking" reasoning and improve understanding of complex scenes.
Details Motivation: Reasoning in existing MLLMs relies on end-to-end generation or language-centric chains of thought, which are inefficient, verbose, and prone to hallucination; robust reasoning calls for integrating multimodal signals in a latent space. Method: HIVE recursively extends transformer blocks to build an internal loop for iterative reasoning refinement, and injects hierarchical visual cues, from global scene context to fine-grained regional details, into the model's latent representations. Result: Experiments show that test-time scaling is effective and that integrating hierarchical information significantly improves the model's understanding of complex scenes. Conclusion: HIVE achieves latent-space multimodal slow-thinking reasoning without superficial textual rationales, strengthening grounding and multi-step inference. Abstract: The advancement of multimodal large language models (MLLMs) has enabled impressive perception capabilities. However, their reasoning process often remains a "fast thinking" paradigm, reliant on end-to-end generation or explicit, language-centric chains of thought (CoT), which can be inefficient, verbose, and prone to hallucination. This work posits that robust reasoning should evolve within a latent space, integrating multimodal signals seamlessly. We propose multimodal latent reasoning via HIerarchical Visual cuEs injection (HIVE), a novel framework that instills deliberate, "slow thinking" without depending on superficial textual rationales. Our method recursively extends transformer blocks, creating an internal loop for iterative reasoning refinement. Crucially, it injectively grounds this process with hierarchical visual cues from global scene context to fine-grained regional details directly into the model's latent representations. This enables the model to perform grounded, multi-step inference entirely in the aligned latent space. Extensive evaluations demonstrate that test-time scaling is effective when incorporating vision knowledge, and that integrating hierarchical information significantly enhances the model's understanding of complex scenes.
### [90] [Breaking Semantic Hegemony: Decoupling Principal and Residual Subspaces for Generalized OOD Detection](https://arxiv.org/abs/2602.05360) *Ningkang Peng,Xiaoqian Peng,Yuhao Zhang,Qianfeng Yu,Feng Xing,Peirong Ma,Xichen Yang,Yi Chen,Tingyu Lu,Yanhui Gu* Main category: cs.CV TL;DR: This paper identifies a "Simplicity Paradox" in existing OOD detection methods: they are sensitive to subtle semantic differences but insensitive to structurally distinct yet semantically simple OOD samples and to high-frequency sensor noise. It proposes D-KNN, a training-free framework that decouples semantic and structural information through orthogonal decomposition and achieves SOTA performance on CIFAR and ImageNet.
Details Motivation: Feature-based OOD detection methods have made progress but exhibit a "Simplicity Paradox": they are sensitive to subtle semantic differences yet insensitive to structurally distinct but semantically simple OOD samples and to high-frequency noise, which the authors attribute to "Semantic Hegemony" in the deep feature space. Method: D-KNN is a training-free, plug-and-play geometric decoupling framework that uses orthogonal decomposition to explicitly separate semantic components from structural residuals, with a dual-space calibration mechanism that increases sensitivity to weak residual signals. Result: D-KNN sets new SOTA results on CIFAR and ImageNet benchmarks; FPR95 drops from 31.3% to 2.3%, and AUROC under Gaussian noise rises from 79.7% to 94.9%. Conclusion: D-KNN breaks Semantic Hegemony and resolves the Simplicity Paradox, showing that explicitly modeling structural information is essential for OOD detection. Abstract: While feature-based post-hoc methods have made significant strides in Out-of-Distribution (OOD) detection, we uncover a counter-intuitive Simplicity Paradox in existing state-of-the-art (SOTA) models: these models exhibit keen sensitivity in distinguishing semantically subtle OOD samples but suffer from severe Geometric Blindness when confronting structurally distinct yet semantically simple samples or high-frequency sensor noise. We attribute this phenomenon to Semantic Hegemony within the deep feature space and reveal its mathematical essence through the lens of Neural Collapse. Theoretical analysis demonstrates that the spectral concentration bias, induced by the high variance of the principal subspace, numerically masks the structural distribution shift signals that should be significant in the residual subspace. To address this issue, we propose D-KNN, a training-free, plug-and-play geometric decoupling framework. This method utilizes orthogonal decomposition to explicitly separate semantic components from structural residuals and introduces a dual-space calibration mechanism to reactivate the model's sensitivity to weak residual signals. Extensive experiments demonstrate that D-KNN effectively breaks Semantic Hegemony, establishing new SOTA performance on both CIFAR and ImageNet benchmarks. Notably, in resolving the Simplicity Paradox, it reduces the FPR95 from 31.3% to 2.3%; when addressing sensor failures such as Gaussian noise, it boosts the detection performance (AUROC) from a baseline of 79.7% to 94.9%.
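The orthogonal decomposition that the D-KNN abstract describes can be illustrated with a toy example. The sketch below splits a feature vector into its projection onto a principal subspace and the orthogonal residual, and shows why a small residual is easy to overlook when norms are measured in the full space. It assumes an orthonormal basis is already available (for instance from PCA on in-distribution features); the basis, the vector, and the helper names are hypothetical, and the paper's dual-space calibration is not reproduced.

```python
import math

def split_feature(x, basis):
    """Split x into its projection onto span(basis) and the orthogonal residual.

    `basis` is assumed to be a list of orthonormal vectors, e.g. the leading
    principal directions of in-distribution features.
    """
    principal = [0.0] * len(x)
    for u in basis:
        coeff = sum(a * b for a, b in zip(x, u))
        principal = [p + coeff * b for p, b in zip(principal, u)]
    residual = [a - p for a, p in zip(x, principal)]
    return principal, residual

def norm(v):
    return math.sqrt(sum(a * a for a in v))

# Hypothetical 2-D principal subspace inside a 3-D feature space.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
x = [3.0, 4.0, 0.5]

principal, residual = split_feature(x, basis)
principal_norm = norm(principal)  # dominated by the high-variance directions
residual_norm = norm(residual)    # small, but kept separate so it can be scored on its own
```

In this toy case the principal part has norm 5.0 and the residual 0.5, so a distance computed on the full vector is driven almost entirely by the principal subspace. Scoring each part separately, for example with a nearest-neighbor distance per subspace, is one way to keep the residual signal visible.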
### [91] [Imagine a City: CityGenAgent for Procedural 3D City Generation](https://arxiv.org/abs/2602.05362) *Zishan Liu,Zecong Tang,RuoCheng Wu,Xinzhe Zheng,Jingyu Hu,Ka-Hei Hui,Haoran Xie,Bo Dai,Zhengzhe Liu* Main category: cs.CV TL;DR: This paper proposes CityGenAgent, a natural language-driven framework for hierarchical procedural generation of high-quality 3D cities. It decomposes generation into block programs and building programs and combines supervised fine-tuning with reinforcement learning to improve structural correctness, spatial reasoning, and text-image consistency; it supports natural language editing and improves semantic alignment, visual quality, and controllability.
Details Motivation: Existing 3D city generation methods fall short in high-fidelity asset creation, controllability, and editability, and struggle to meet the needs of applications such as autonomous driving, virtual reality, and embodied intelligence. Method: CityGenAgent decomposes city generation into a Block Program and a Building Program and uses a two-stage learning strategy: (1) supervised fine-tuning (SFT) ensures that programs satisfy geometric and semantic constraints; (2) reinforcement learning (RL) adds a Spatial Alignment Reward and a Visual Consistency Reward to strengthen spatial reasoning and text-image alignment. Result: CityGenAgent outperforms existing methods in semantic alignment, visual quality, and controllability, and supports natural language-driven city editing and manipulation. Conclusion: CityGenAgent provides a solid foundation for scalable, controllable, high-quality 3D city generation and advances generative city modeling across application domains. Abstract: The automated generation of interactive 3D cities is a critical challenge with broad applications in autonomous driving, virtual reality, and embodied intelligence. While recent advances in generative models and procedural techniques have improved the realism of city generation, existing methods often struggle with high-fidelity asset creation, controllability, and manipulation. In this work, we introduce CityGenAgent, a natural language-driven framework for hierarchical procedural generation of high-quality 3D cities. Our approach decomposes city generation into two interpretable components, Block Program and Building Program. To ensure structural correctness and semantic alignment, we adopt a two-stage learning strategy: (1) Supervised Fine-Tuning (SFT). We train BlockGen and BuildingGen to generate valid programs that adhere to schema constraints, including non-self-intersecting polygons and complete fields; (2) Reinforcement Learning (RL). We design Spatial Alignment Reward to enhance spatial reasoning ability and Visual Consistency Reward to bridge the gap between textual descriptions and the visual modality. Benefiting from the programs and the models' generalization, CityGenAgent supports natural language editing and manipulation. Comprehensive evaluations demonstrate superior semantic alignment, visual quality, and controllability compared to existing methods, establishing a robust foundation for scalable 3D city generation.
### [92] [SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback](https://arxiv.org/abs/2602.05380) *Xiaoxuan He,Siming Fu,Wanli Li,Zhiyuan Li,Dacheng Yin,Kang Rong,Fengyun Rao,Bo Zhang* Main category: cs.CV TL;DR: This paper proposes SAIL, a framework in which diffusion models align with human preferences through iterative self-amplified learning, without an external reward model or large-scale preference data.
Details Motivation: Aligning diffusion models effectively with only minimal human feedback, when reward models are unavailable or large-scale preference data is hard to obtain, is a key challenge. Method: SAIL (Self-Amplified Iterative Learning) starts from a small set of human-annotated preference pairs; in a closed loop, the model generates samples, self-annotates preferences, and refines itself on the self-augmented dataset; a ranked preference mixup strategy balances exploration against adherence to the initial human priors. Result: SAIL consistently outperforms existing methods on multiple benchmarks while using only 6% of the preference data they require. Conclusion: Diffusion models have substantial self-improvement capability that, properly harnessed, can replace large-scale human annotation and external reward models. Abstract: Aligning diffusion models with human preferences remains challenging, particularly when reward models are unavailable or impractical to obtain, and collecting large-scale preference datasets is prohibitively expensive. This raises a fundamental question: can we achieve effective alignment using only minimal human feedback, without auxiliary reward models, by unlocking the latent capabilities within diffusion models themselves? In this paper, we propose SAIL (Self-Amplified Iterative Learning), a novel framework that enables diffusion models to act as their own teachers through iterative self-improvement. Starting from a minimal seed set of human-annotated preference pairs, SAIL operates in a closed-loop manner where the model progressively generates diverse samples, self-annotates preferences based on its evolving understanding, and refines itself using this self-augmented dataset. To ensure robust learning and prevent catastrophic forgetting, we introduce a ranked preference mixup strategy that carefully balances exploration with adherence to initial human priors. Extensive experiments demonstrate that SAIL consistently outperforms state-of-the-art methods across multiple benchmarks while using merely 6% of the preference data required by existing approaches, revealing that diffusion models possess remarkable self-improvement capabilities that, when properly harnessed, can effectively replace both large-scale human annotation and external reward models.
### [93] [VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs](https://arxiv.org/abs/2602.05382) *Tina Khezresmaeilzadeh,Jike Zhong,Konstantinos Psounis* Main category: cs.CV TL;DR: This paper introduces VRIQ, a benchmark to evaluate visual reasoning in Vision Language Models (VLMs), revealing that their poor performance stems mostly from perception limitations—not reasoning—especially on abstract puzzles.
Details Motivation: To assess whether modern Vision Language Models (VLMs) can reliably perform nonverbal visual reasoning, and to identify the root causes of their failures. Method: Proposes VRIQ benchmark with abstract puzzle-style and natural-image reasoning tasks; conducts diagnostic probing targeting perception and reasoning separately; introduces fine-grained perception-category probes (e.g., shape, count, position, 3D/depth). Result: VLMs achieve only ~28% accuracy on abstract puzzles and ~45% on natural-image tasks; 56% of failures are due to perception alone, 43% involve both perception and reasoning, and only 1% stem from reasoning alone; tool-augmented reasoning yields only modest gains. Conclusion: Current VLMs are unreliable abstract reasoners primarily due to perception limitations—not reasoning deficits—highlighting the need for perception-focused improvements in multimodal systems. Abstract: Recent progress in Vision Language Models (VLMs) has raised the question of whether they can reliably perform nonverbal reasoning. To this end, we introduce VRIQ (Visual Reasoning IQ), a novel benchmark designed to assess and analyze the visual reasoning ability of VLMs. We evaluate models on two sets of tasks: abstract puzzle-style and natural-image reasoning tasks. We find that on abstract puzzles, performance remains near random with an average accuracy of around 28%, while natural tasks yield better but still weak results with 45% accuracy. We also find that tool-augmented reasoning demonstrates only modest improvements. To uncover the source of this weakness, we introduce diagnostic probes targeting perception and reasoning. Our analysis demonstrates that around 56% of failures arise from perception alone, 43% from both perception and reasoning, and only a mere 1% from reasoning alone. 
This motivates us to design fine-grained diagnostic probe questions targeting specific perception categories (e.g., shape, count, position, 3D/depth), revealing that certain categories cause more failures than others. Our benchmark and analysis establish that current VLMs, even with visual reasoning tools, remain unreliable abstract reasoners, mostly due to perception limitations, and offer a principled basis for improving visual reasoning in multimodal systems.
### [94] [Dolphin-v2: Universal Document Parsing via Scalable Anchor Prompting](https://arxiv.org/abs/2602.05384) *Hao Feng,Wei Shi,Ke Zhang,Xiang Fei,Lei Liao,Dingkang Yang,Yongkun Du,Xuecheng Wu,Jingqun Tang,Yang Liu,Hong Chen,Can Huang* Main category: cs.CV TL;DR: Dolphin-v2 is a two-stage document image parsing model. It jointly performs document type classification and layout analysis, then applies a type-specific parsing strategy (holistic page parsing vs. element-wise parallel parsing), which significantly improves robustness on photographed documents, fine-grained element recognition (21 categories), and code block recognition, with substantial gains on several benchmarks.
Details Motivation: Existing document parsing methods are highly fragmented and rely on axis-aligned bounding boxes, so they struggle with distorted or photographed documents; users face complex model selection and systems scale poorly. Method: In the first stage, Dolphin-v2 jointly performs document type classification (digital-born vs. photographed) and layout analysis (including reading order prediction); in the second stage it uses a hybrid strategy, parsing photographed documents holistically as full pages to handle geometric distortion and parsing digital-born documents element-wise in parallel, guided by layout anchors. It also adds code block recognition with indentation preservation, detection of 21 fine-grained element categories, and semantic attribute extraction. Result: +14.78 points overall on OmniDocBench and a 91% error reduction on photographed documents, with effectiveness also validated on DocPTBench and RealDoc-160; the model supports efficient parallel inference. Conclusion: With its structured two-stage design and hybrid parsing paradigm, Dolphin-v2 unifies document parsing and significantly improves its robustness, granularity, and practicality, particularly for photographed documents. Abstract: Document parsing has garnered widespread attention as vision-language models (VLMs) advance OCR capabilities. However, the field remains fragmented across dozens of specialized models with varying strengths, forcing users to navigate complex model selection and limiting system scalability. Moreover, existing two-stage approaches depend on axis-aligned bounding boxes for layout detection, failing to handle distorted or photographed documents effectively. To this end, we present Dolphin-v2, a two-stage document image parsing model that substantially improves upon the original Dolphin. In the first stage, Dolphin-v2 jointly performs document type classification (digital-born versus photographed) alongside layout analysis. For digital-born documents, it conducts finer-grained element detection with reading order prediction. In the second stage, we employ a hybrid parsing strategy: photographed documents are parsed holistically as complete pages to handle geometric distortions, while digital-born documents undergo element-wise parallel parsing guided by the detected layout anchors, enabling efficient content extraction. Compared with the original Dolphin, Dolphin-v2 introduces several crucial enhancements: (1) robust parsing of photographed documents via holistic page-level understanding, (2) finer-grained element detection (21 categories) with semantic attribute extraction such as author information and document metadata, and (3) code block recognition with indentation preservation, which existing systems typically lack. 
Comprehensive evaluations are conducted on DocPTBench, OmniDocBench, and our self-constructed RealDoc-160 benchmark. The results demonstrate substantial improvements: +14.78 points overall on the challenging OmniDocBench and 91% error reduction on photographed documents, while maintaining efficient inference through parallel processing.
### [95] [Parallel Swin Transformer-Enhanced 3D MRI-to-CT Synthesis for MRI-Only Radiotherapy Planning](https://arxiv.org/abs/2602.05387) *Zolnamar Dorjsembe,Hung-Yi Chen,Furen Xiao,Hsing-Kuo Pao* Main category: cs.CV TL;DR: This paper proposes Parallel Swin Transformer-Enhanced Med2Transformer, a 3D network that generates high-quality synthetic CT images from MRI to support MRI-only radiotherapy planning. It combines convolutional encoding with dual Swin Transformer branches to model local detail and long-range context, and achieves higher image similarity, better geometric accuracy, and clinically acceptable dose calculation accuracy (mean target dose error of 1.69%) across several datasets.
Details Motivation: MRI lacks electron density information and cannot be used directly for radiotherapy dose calculation; current workflows combine MRI and CT, which introduces registration uncertainty and procedural complexity. Synthetic CT generation is key to MRI-only planning but is challenged by the nonlinear MRI-CT mapping and anatomical variability. Method: Parallel Swin Transformer-Enhanced Med2Transformer is a 3D architecture that combines a convolutional encoder with dual Swin Transformer branches, using multi-scale shifted window attention and hierarchical feature aggregation to model local anatomical detail and long-range contextual dependencies jointly. Result: On public and clinical datasets, image similarity (e.g., PSNR, SSIM) and geometric accuracy (e.g., Dice, Hausdorff distance) improve markedly over baseline methods; dosimetric evaluation shows a mean target dose error of 1.69%, which is clinically acceptable. Conclusion: The method improves MRI-to-CT synthesis quality and dose calculation reliability, providing a feasible, high-performance route to clinical MRI-only radiotherapy planning. Abstract: MRI provides superior soft tissue contrast without ionizing radiation; however, the absence of electron density information limits its direct use for dose calculation. As a result, current radiotherapy workflows rely on combined MRI and CT acquisitions, increasing registration uncertainty and procedural complexity. Synthetic CT generation enables MRI-only planning but remains challenging due to nonlinear MRI-CT relationships and anatomical variability. We propose Parallel Swin Transformer-Enhanced Med2Transformer, a 3D architecture that integrates convolutional encoding with dual Swin Transformer branches to model both local anatomical detail and long-range contextual dependencies. Multi-scale shifted window attention with hierarchical feature aggregation improves anatomical fidelity. Experiments on public and clinical datasets demonstrate higher image similarity and improved geometric accuracy compared with baseline methods. Dosimetric evaluation shows clinically acceptable performance, with a mean target dose error of 1.69%. Code is available at: https://github.com/mobaidoctor/med2transformer.
### [96] [Dataset Distillation via Relative Distribution Matching and Cognitive Heritage](https://arxiv.org/abs/2602.05391) *Qianxin Xia,Jiawei Du,Yuhan Zhang,Jielei Wang,Guoming Lu* Main category: cs.CV TL;DR: This paper proposes statistical flow matching, a new method for efficient dataset distillation that loads the statistics of the original data once and augments the synthetic images in a single pass, matching or exceeding the best existing methods while greatly reducing GPU memory and runtime. It also introduces a classifier inheritance strategy that reuses the classifier trained on the original dataset and needs only a lightweight linear projector for substantial gains.
Details Motivation: Existing linear gradient matching methods for dataset distillation incur heavy compute and memory costs when pre-trained self-supervised models serve as backbones, because each step loads many real images and applies differentiable augmentations multiple times. Method: Statistical flow matching optimizes synthetic images by aligning constant statistical flows between class centers in the original data; it loads the original statistics only once and augments the synthetic images in a single pass. A classifier inheritance strategy reuses the classifier trained on the original dataset and adds only a lightweight linear projector. Result: Compared with SOTA methods, GPU memory drops by 10x and runtime by 4x, with comparable or better performance; classifier inheritance yields substantial gains at minimal storage cost. Conclusion: Statistical flow matching is a stable, efficient supervised learning framework that makes dataset distillation markedly more efficient and practical; classifier inheritance further improves generalization on downstream tasks and ease of deployment. Abstract: Dataset distillation seeks to synthesize a highly compact dataset that achieves performance comparable to the original dataset on downstream tasks. For classification tasks that use pre-trained self-supervised models as backbones, previous linear gradient matching optimizes synthetic images by encouraging them to mimic the gradient updates induced by real images on the linear classifier. However, this batch-level formulation requires loading thousands of real images and applying multiple rounds of differentiable augmentations to synthetic images at each distillation step, leading to substantial computational and memory overhead. In this paper, we introduce statistical flow matching, a stable and efficient supervised learning framework that optimizes synthetic images by aligning constant statistical flows from target class centers to non-target class centers in the original data. Our approach loads raw statistics only once and performs a single augmentation pass on the synthetic data, achieving performance comparable to or better than the state-of-the-art methods with 10x lower GPU memory usage and 4x shorter runtime. Furthermore, we propose a classifier inheritance strategy that reuses the classifier trained on the original dataset for inference, requiring only an extremely lightweight linear projector and marginal storage while achieving substantial performance gains.
### [97] [Explainable Pathomics Feature Visualization via Correlation-aware Conditional Feature Editing](https://arxiv.org/abs/2602.05397) *Yuechen Yang,Junlin Guo,Ruining Deng,Junchao Zhu,Zhengyi Lu,Chongyu Qu,Yanfan Zhu,Xingyi Guo,Yu Wang,Shilin Zhao,Haichun Yang,Yuankai Huo* Main category: cs.CV TL;DR: This paper proposes Manifold-Aware Diffusion (MAD) for digital pathology, which regularizes feature trajectories in a disentangled latent space learned by a VAE to enable controllable and biologically plausible editing of cell nuclei features, overcoming the artifacts that conventional conditional diffusion models produce by ignoring feature correlations.
Details Motivation: Existing pathomics features (such as "second-order moment") are hard to interpret and generalize poorly across clinical contexts; conditional diffusion models assume feature independence during editing, contradicting the intrinsic correlations among pathomics features, so edits can leave the biological manifold and produce unrealistic images. Method: MAD first uses a variational auto-encoder (VAE) to learn a disentangled latent space of pathomics features and model its manifold structure; when a target feature is edited in that space, correlated attributes are adjusted automatically to stay on the manifold; a conditional diffusion model then synthesizes high-fidelity images from the optimized features. Result: Experiments show that MAD navigates the manifold of real cell distributions when editing pathomics features and outperforms baseline methods in conditional feature editing while better preserving cell structure consistency and image realism. Conclusion: MAD offers a new paradigm for interpretable, controllable, and biologically credible editing of digital pathology images, improving the clinical interpretability and usefulness of pathomics features. Abstract: Pathomics is a recent approach that offers rich quantitative features beyond what black-box deep learning can provide, supporting more reproducible and explainable biomarkers in digital pathology. However, many derived features (e.g., "second-order moment") remain difficult to interpret, especially across different clinical contexts, which limits their practical adoption. Conditional diffusion models show promise for explainability through feature editing, but they typically assume feature independence, an assumption violated by intrinsically correlated pathomics features. Consequently, editing one feature while fixing others can push the model off the biological manifold and produce unrealistic artifacts. To address this, we propose a Manifold-Aware Diffusion (MAD) framework for controllable and biologically plausible cell nuclei editing. Unlike existing approaches, our method regularizes feature trajectories within a disentangled latent space learned by a variational auto-encoder (VAE). This ensures that manipulating a target feature automatically adjusts correlated attributes to remain within the learned distribution of real cells. These optimized features then guide a conditional diffusion model to synthesize high-fidelity images. Experiments demonstrate that our approach is able to navigate the manifold of pathomics features when editing those features. The proposed method outperforms baseline methods in conditional feature editing while preserving structural coherence.
### [98] [TSBOW: Traffic Surveillance Benchmark for Occluded Vehicles Under Various Weather Conditions](https://arxiv.org/abs/2602.05414) *Ngoc Doan-Minh Huynh,Duong Nguyen-Ngoc Tran,Long Hoang Pham,Tai Huu-Phuong Tran,Hyung-Joon Jeon,Huy-Hung Nguyen,Duong Khac Vu,Hyung-Min Jeon,Son Hong Phan,Quoc Pham-Nam Ho,Chi Dai Tran,Trinh Le Ba Khanh,Jae Wook Jeon* Main category: cs.CV TL;DR: This paper introduces the TSBOW dataset for improving occluded vehicle detection under a range of extreme weather conditions; it contains 32 hours of real traffic data and more than 48,000 manually annotated frames, supporting research on intelligent transportation systems.
Details Motivation: Global warming is intensifying extreme weather, which degrades CCTV video quality, disrupts traffic flow, and raises accident rates; existing datasets do not cover extreme weather and offer no support for detecting occluded vehicles in complex weather. Method: The authors build TSBOW, a large-scale real-world traffic surveillance dataset covering multiple extreme weather conditions, with 32 hours of video, more than 48,000 manually annotated and 3.2 million semi-labeled frames, bounding boxes for eight traffic participant classes, and an object detection benchmark targeting occlusion and adverse weather. Result: TSBOW is released as the first large-scale benchmark for occluded vehicle detection in multi-weather, high-density urban traffic; the benchmark exposes the key challenges that occlusion and weather degradation pose for detection and confirms the dataset's value for intelligent transportation research. Conclusion: TSBOW fills the gap in datasets for occluded vehicle detection under extreme weather, supports robust traffic perception algorithms, and advances CCTV-based intelligent traffic monitoring. Abstract: Global warming has intensified the frequency and severity of extreme weather events, which degrade CCTV signal and video quality while disrupting traffic flow, thereby increasing traffic accident rates. Existing datasets, often limited to light haze, rain, and snow, fail to capture extreme weather conditions. To address this gap, this study introduces the Traffic Surveillance Benchmark for Occluded vehicles under various Weather conditions (TSBOW), a comprehensive dataset designed to enhance occluded vehicle detection across diverse annual weather scenarios. Comprising over 32 hours of real-world traffic data from densely populated urban areas, TSBOW includes more than 48,000 manually annotated and 3.2 million semi-labeled frames, with bounding boxes spanning eight traffic participant classes from large vehicles to micromobility devices and pedestrians. We establish an object detection benchmark for TSBOW, highlighting challenges posed by occlusions and adverse weather. With its varied road types, scales, and viewpoints, TSBOW serves as a critical resource for advancing Intelligent Transportation Systems. Our findings underscore the potential of CCTV-based traffic monitoring and pave the way for new research and applications. The TSBOW dataset is publicly available at: https://github.com/SKKUAutoLab/TSBOW.
### [99] [VMF-GOS: Geometry-guided virtual Outlier Synthesis for Long-Tailed OOD Detection](https://arxiv.org/abs/2602.05415) *Ningkang Peng,Qianfeng Yu,Yuhao Zhang,Yafei Liu,Xiaoqian Peng,Peirong Ma,Yi Chen,Peiheng Li,Yanhui Gu* Main category: cs.CV TL;DR: This paper proposes a Geometry-guided virtual Outlier Synthesis (GOS) framework that needs no external data and, combined with a Dual-Granularity Semantic Loss (DGS), achieves strong OOD detection under long-tailed distributions.
Details Motivation: Existing OOD detection methods rely on large external outlier datasets (such as 80 Million Tiny Images), which are constrained in deployment by data acquisition cost and privacy concerns. Under long-tailed distributions, tail classes also have few samples, which blurs decision boundaries in the feature space. Method: Proposes the data-free GOS strategy: model feature statistics with the vMF distribution on a hypersphere and directionally sample virtual outliers from a low-likelihood annulus. Also designs DGS, a contrastive-learning-based loss that sharpens the separation between ID samples and the synthesized boundary outliers. Result: On benchmarks such as CIFAR-LT, the method outperforms SOTA approaches that use real external images. Conclusion: The data-free framework removes OOD detection's dependence on external data and yields more robust, more practical detection in long-tailed settings. Abstract: Out-of-Distribution (OOD) detection under long-tailed distributions is a highly challenging task because the scarcity of samples in tail classes leads to blurred decision boundaries in the feature space. Current state-of-the-art (SOTA) methods typically employ Outlier Exposure (OE) strategies, relying on large-scale real external datasets (such as 80 Million Tiny Images) to regularize the feature space. However, this dependence on external data often becomes infeasible in practical deployment due to high data acquisition costs and privacy sensitivity. To this end, we propose a novel data-free framework aimed at completely eliminating reliance on external datasets while maintaining superior detection performance. We introduce a Geometry-guided virtual Outlier Synthesis (GOS) strategy that models statistical properties using the von Mises-Fisher (vMF) distribution on a hypersphere. Specifically, we locate a low-likelihood annulus in the feature space and perform directional sampling of virtual outliers in this region. Simultaneously, we introduce a new Dual-Granularity Semantic Loss (DGS) that utilizes contrastive learning to maximize the distinction between in-distribution (ID) features and these synthesized boundary outliers. Extensive experiments on benchmarks such as CIFAR-LT demonstrate that our method outperforms SOTA approaches that utilize external real images.
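The annulus idea in the VMF-GOS abstract can be illustrated with a small sketch. Under a vMF distribution the density depends only on the cosine similarity to the mean direction, so a band of cosine values is a ring of roughly constant likelihood. The code below is not the authors' GOS procedure: the helper names (`normalize`, `mean_direction`, `sample_annulus`), the cosine band, and the uniform draw within the band are all illustrative assumptions.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def mean_direction(features):
    """Mean direction of unit vectors (the vMF location estimate)."""
    dim = len(features[0])
    summed = [sum(f[i] for f in features) for i in range(dim)]
    return normalize(summed)

def sample_annulus(mu, cos_low, cos_high, rng):
    """Draw a unit vector whose cosine similarity to mu is in [cos_low, cos_high].

    Hypothetical sampler: the cosine is drawn uniformly from the band,
    which is an assumption and not necessarily what the paper does.
    """
    # Random Gaussian direction, projected onto the tangent space of mu.
    g = [rng.gauss(0.0, 1.0) for _ in mu]
    dot = sum(a * b for a, b in zip(g, mu))
    tangent = normalize([a - dot * b for a, b in zip(g, mu)])
    cos_t = rng.uniform(cos_low, cos_high)
    sin_t = math.sqrt(1.0 - cos_t * cos_t)
    # mu and tangent are orthonormal, so the result has unit norm.
    return [cos_t * m + sin_t * t for m, t in zip(mu, tangent)]
```

A virtual outlier would then be `sample_annulus(mean_direction(class_features), 0.2, 0.5, random.Random(0))`, with the band chosen so that it sits outside the region where in-distribution features concentrate.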
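### [100] [Disco: Densely-overlapping Cell Instance Segmentation via Adjacency-aware Collaborative Coloring](https://arxiv.org/abs/2602.05420) *Rui Sun,Yiwen Yang,Kaiyu Guo,Chen Jiang,Dongli Xu,Zhaonan Liu,Tan Pan,Limei Han,Xue Jiang,Wu Wei,Yuan Cheng* Main category: cs.CV TL;DR: This paper proposes Disco, a framework that tackles densely overlapping cell instance segmentation through graph coloring, and releases the GBC-FS 2025 dataset. It finds that real cell graphs mostly contain odd cycles (especially triangles), so 2-coloring is not enough, and it combines explicit marking with implicit disambiguation to improve segmentation.
Details Motivation: Existing methods based on contour detection and distance mapping struggle with complex, dense cellular regions. Graph coloring methods are promising, but their effectiveness in real scenes with dense overlaps and complex topologies has not been verified. Method: Proposes the Disco framework with two core mechanisms: "Explicit Marking" (recursively decompose the cell graph, identify a conflict set, and turn the problem into a classification task) and "Implicit Disambiguation" (enforce feature dissimilarity between different instances in conflict regions to learn separable representations). Also builds GBC-FS 2025, a large-scale dataset of complex nuclear arrangements, and conducts the first systematic analysis of the chromatic properties of cell adjacency graphs across four datasets. Result: Most real cell adjacency graphs are non-bipartite and odd cycles (mainly triangles) are common, so conventional 2-coloring is insufficient. Disco markedly improves cell instance segmentation accuracy in dense, complex scenes, supporting the effectiveness and feasibility of higher-order graph coloring models. Conclusion: The graph coloring paradigm is valuable for cell segmentation in digital pathology but must be adapted to the higher-order topology of real cell graphs. Disco's collaborative, adjacency-aware coloring strategy offers a new approach and a practical tool for segmenting complex tissue. Abstract: Accurate cell instance segmentation is foundational for digital pathology analysis. Existing methods based on contour detection and distance mapping still face significant challenges in processing complex and dense cellular regions. Graph coloring-based methods provide a new paradigm for this task, yet the effectiveness of this paradigm in real-world scenarios with dense overlaps and complex topologies has not been verified. Addressing this issue, we release a large-scale dataset GBC-FS 2025, which contains highly complex and dense sub-cellular nuclear arrangements. We conduct the first systematic analysis of the chromatic properties of cell adjacency graphs across four diverse datasets and reveal an important discovery: most real-world cell graphs are non-bipartite, with a high prevalence of odd-length cycles (predominantly triangles). This makes simple 2-coloring theory insufficient for handling complex tissues, while higher-chromaticity models would cause representational redundancy and optimization difficulties. Building on this observation of complex real-world contexts, we propose Disco (Densely-overlapping Cell Instance Segmentation via Adjacency-aware COllaborative Coloring), an adjacency-aware framework based on the "divide and conquer" principle. It uniquely combines a data-driven topological labeling strategy with a constrained deep learning system to resolve complex adjacency conflicts. 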
First, the "Explicit Marking" strategy transforms the topological challenge into a learnable classification task by recursively decomposing the cell graph and isolating a "conflict set." Second, the "Implicit Disambiguation" mechanism resolves ambiguities in conflict regions by enforcing feature dissimilarity between different instances, enabling the model to learn separable feature representations.
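The observation that cell adjacency graphs are non-bipartite and rich in triangles can be checked on any graph with standard algorithms. The sketch below is not taken from the Disco code: `is_bipartite` and `count_triangles` are hypothetical helper names, and the graph is assumed to be a symmetric dict mapping integer cell IDs to sets of neighbouring cell IDs.

```python
from collections import deque
from itertools import combinations

def is_bipartite(adj):
    """Try to 2-color the graph with BFS; fail on the first same-color edge."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    # An edge inside one color class implies an odd cycle.
                    return False
    return True

def count_triangles(adj):
    """Count each triangle once, anchored at its smallest node ID."""
    count = 0
    for u in adj:
        for v, w in combinations(sorted(adj[u]), 2):
            if u < v and w in adj[v]:
                count += 1
    return count
```

Three mutually touching cells, `{0: {1, 2}, 1: {0, 2}, 2: {0, 1}}`, already form a triangle, so `is_bipartite` returns `False` and two colors cannot separate them.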
### [101] [NeVStereo: A NeRF-Driven NVS-Stereo Architecture for High-Fidelity 3D Tasks](https://arxiv.org/abs/2602.05423) *Pengcheng Chen,Yue Hu,Wenhao Li,Nicole M Gunderson,Andrew Feng,Zhenglong Sun,Peter Beerel,Eric J Seibel* Main category: cs.CV TL;DR: NeVStereo is a NeRF-driven multi-view stereo architecture that jointly addresses camera pose estimation, depth estimation, novel view synthesis, and surface reconstruction, with marked gains on each metric.
Details Motivation: Existing methods struggle to deliver accurate poses, reliable depth, high-quality rendering, and precise 3D surfaces within a single framework. Feed-forward matching methods do not output novel view synthesis, and neural rendering methods are sensitive to pose errors. Method: Proposes NeVStereo, which combines NeRF-based novel view synthesis (adapted for stereo matching), confidence-guided multi-view depth estimation, NeRF-coupled bundle adjustment for pose refinement, and joint iterative refinement of depth and the radiance field, mitigating surface stacking, artifacts, and pose-depth coupling. Result: Strong zero-shot performance across indoor, outdoor, tabletop, and aerial benchmarks: up to 36% lower depth error, 10.4% better pose accuracy, 4.5% higher NVS fidelity, and SOTA mesh quality (F1 91.93%, Chamfer 4.35 mm). Conclusion: NeVStereo unifies geometry and rendering across tasks and offers a consistent, high-fidelity approach to 3D reconstruction from casually captured images. Abstract: In modern dense 3D reconstruction, feed-forward systems (e.g., VGGT, pi3) focus on end-to-end matching and geometry prediction but do not explicitly output novel view synthesis (NVS). Neural rendering-based approaches offer high-fidelity NVS and detailed geometry from posed images, yet they typically assume fixed camera poses and can be sensitive to pose errors. As a result, it remains non-trivial to obtain a single framework that can offer accurate poses, reliable depth, high-quality rendering, and accurate 3D surfaces from casually captured views. We present NeVStereo, a NeRF-driven NVS-stereo architecture that aims to jointly deliver camera poses, multi-view depth, novel view synthesis, and surface reconstruction from multi-view RGB-only inputs. NeVStereo combines NeRF-based NVS for stereo-friendly renderings, confidence-guided multi-view depth estimation, NeRF-coupled bundle adjustment for pose refinement, and an iterative refinement stage that updates both depth and the radiance field to improve geometric consistency. This design mitigates common NeRF-based issues such as surface stacking, artifacts, and pose-depth coupling. Across indoor, outdoor, tabletop, and aerial benchmarks, our experiments indicate that NeVStereo achieves consistently strong zero-shot performance, with up to 36% lower depth error, 10.4% improved pose accuracy, 4.5% higher NVS fidelity, and state-of-the-art mesh quality (F1 91.93%, Chamfer 4.35 mm) compared to prominent existing methods.
### [102] [Multi-AD: Cross-Domain Unsupervised Anomaly Detection for Medical and Industrial Applications](https://arxiv.org/abs/2602.05426) *Wahyu Rahmaniar,Kenji Suzuki* Main category: cs.CV TL;DR: This paper proposes Multi-AD, an unsupervised anomaly detection model that combines SE blocks, knowledge distillation, and a discriminator network to achieve robust cross-domain anomaly detection on medical and industrial images, reaching SOTA performance on several datasets.
Details Motivation: Deep learning models in cross-domain applications, such as early disease diagnosis in medicine and defect detection in industry, often face a shortage of annotated data, and the problem is especially acute for anomaly detection. Method: Proposes Multi-AD, a CNN-based model that adds squeeze-and-excitation (SE) blocks for channel attention, uses knowledge distillation (KD) to pass discriminative features from a teacher model to a student model, adds a discriminator network to strengthen the separation of normal and anomalous data, fuses multi-scale features at inference to detect anomalies of different sizes, and uses a teacher-student (T-S) structure to keep high-dimensional features consistent while adapting them to anomaly detection. Result: Evaluated on several medical datasets (brain MRI, liver CT, retina OCT) and industrial datasets (MVTec AD), the model reaches image-level AUROC of 81.4% (medical) and 99.6% (industrial) and pixel-level AUROC of 97.0% (medical) and 98.4% (industrial), outperforming existing SOTA methods. Conclusion: Multi-AD is a general, robust, and high-performing framework for unsupervised cross-domain anomaly detection with good potential for practical deployment. Abstract: Traditional deep learning models often lack annotated data, especially in cross-domain applications such as anomaly detection, which is critical for early disease diagnosis in medicine and defect detection in industry. To address this challenge, we propose Multi-AD, a convolutional neural network (CNN) model for robust unsupervised anomaly detection across medical and industrial images. Our approach employs the squeeze-and-excitation (SE) block to enhance feature extraction via channel-wise attention, enabling the model to focus on the most relevant features and detect subtle anomalies. Knowledge distillation (KD) transfers informative features from the teacher to the student model, enabling effective learning of the differences between normal and anomalous data. Then, the discriminator network further enhances the model's capacity to distinguish between normal and anomalous data. At the inference stage, by integrating multi-scale features, the student model can detect anomalies of varying sizes. The teacher-student (T-S) architecture ensures consistent representation of high-dimensional features while adapting them to enhance anomaly detection. Multi-AD was evaluated on several medical datasets, including brain MRI, liver CT, and retina OCT, as well as industrial datasets, such as MVTec AD, demonstrating strong generalization across multiple domains. 
Experimental results demonstrated that our approach consistently outperformed state-of-the-art models, achieving the best average AUROC for both image-level (81.4% for medical and 99.6% for industrial) and pixel-level (97.0% for medical and 98.4% for industrial) tasks, making it effective for real-world applications.
### [103] [LD-SLRO: Latent Diffusion Structured Light for 3-D Reconstruction of Highly Reflective Objects](https://arxiv.org/abs/2602.05434) *Sanghoon Jeon,Gihyun Jung,Suhyeon Ka,Jae-Sang Hyun* Main category: cs.CV TL;DR: This paper proposes LD-SLRO, a structured-light method based on a latent diffusion model that improves fringe image quality on highly reflective, low-roughness surfaces and thereby raises 3-D reconstruction accuracy.
Details Motivation: In fringe projection 3-D reconstruction, highly reflective, low-roughness surfaces are prone to specular reflection and indirect illumination, which distort or erase the projected fringes. Method: Proposes LD-SLRO: phase-shifted fringe images are first encoded into latent representations that capture surface reflectance characteristics; these latent features then condition a latent diffusion model that probabilistically suppresses reflection artifacts and recovers missing fringes. A specular reflection encoder, a time-variant channel affine layer, and attention modules further improve restoration, and the input and output fringe sets can be configured flexibly. Result: Experiments show clear gains in fringe image quality and 3-D reconstruction accuracy, with the average root-mean-squared error falling from 1.8176 mm to 0.9619 mm. Conclusion: LD-SLRO mitigates fringe degradation when measuring highly reflective surfaces and outperforms existing state-of-the-art methods in both fringe quality and reconstruction accuracy. Abstract: Fringe projection profilometry-based 3-D reconstruction of objects with high reflectivity and low surface roughness remains a significant challenge. When measuring such glossy surfaces, specular reflection and indirect illumination often lead to severe distortion or loss of the projected fringe patterns. To address these issues, we propose a latent diffusion-based structured light for reflective objects (LD-SLRO). Phase-shifted fringe images captured from highly reflective surfaces are first encoded to extract latent representations that capture surface reflectance characteristics. These latent features are then used as conditional inputs to a latent diffusion model, which probabilistically suppresses reflection-induced artifacts and recovers lost fringe information, yielding high-quality fringe images. The proposed components, including the specular reflection encoder, time-variant channel affine layer, and attention modules, further improve fringe restoration quality. In addition, LD-SLRO provides high flexibility in configuring the input and output fringe sets. Experimental results demonstrate that the proposed method improves both fringe quality and 3-D reconstruction accuracy over state-of-the-art methods, reducing the average root-mean-squared error from 1.8176 mm to 0.9619 mm.
### [104] [Stable Velocity: A Variance Perspective on Flow Matching](https://arxiv.org/abs/2602.05435) *Donglin Yang,Yongxing Zhang,Xin Yu,Liang Hou,Xin Tao,Pengfei Wan,Xiaojuan Qi,Renjie Liao* Main category: cs.CV TL;DR: This paper proposes the Stable Velocity framework. It analyzes the high variance caused by single-sample conditional velocities in flow matching, designs a variance-reducing training objective (StableVM) and adaptive auxiliary supervision (VA-REPA), and enables finetuning-free fast sampling in the low-variance regime (StableVS), improving both training efficiency and sampling speed.
Details Motivation: Flow matching relies on single-sample conditional velocities, which gives high-variance training targets, unstable optimization, and slow convergence, especially near the prior distribution. The variance needs to be characterized explicitly and the method improved accordingly. Method: 1) Theoretically characterize the variance of the conditional velocity and identify high- and low-variance regimes; 2) propose StableVM, an unbiased variance-reduction objective; 3) design VA-REPA, a variance-aware representation alignment that strengthens auxiliary supervision in the low-variance regime; 4) exploit closed-form simplifications of the dynamics in the low-variance regime to build StableVS, a finetuning-free fast sampler. Result: Validated on ImageNet 256×256 and several large text-to-image and text-to-video models (SD3.5, Flux, Qwen-Image, Wan2.2): training efficiency improves and sampling is more than 2× faster within the low-variance regime, with no loss in sample quality. Conclusion: By modeling variance, Stable Velocity improves training and sampling together and offers a more stable and efficient approach to flow matching. Abstract: While flow matching is elegant, its reliance on single-sample conditional velocities leads to high-variance training targets that destabilize optimization and slow convergence. By explicitly characterizing this variance, we identify 1) a high-variance regime near the prior, where optimization is challenging, and 2) a low-variance regime near the data distribution, where conditional and marginal velocities nearly coincide. Leveraging this insight, we propose Stable Velocity, a unified framework that improves both training and sampling. For training, we introduce Stable Velocity Matching (StableVM), an unbiased variance-reduction objective, along with Variance-Aware Representation Alignment (VA-REPA), which adaptively strengthens auxiliary supervision in the low-variance regime. For inference, we show that dynamics in the low-variance regime admit closed-form simplifications, enabling Stable Velocity Sampling (StableVS), a finetuning-free acceleration. Extensive experiments on ImageNet $256\times256$ and large pretrained text-to-image and text-to-video models, including SD3.5, Flux, Qwen-Image, and Wan2.2, demonstrate consistent improvements in training efficiency and more than $2\times$ faster sampling within the low-variance regime without degrading sample quality. Our code is available at https://github.com/linYDTHU/StableVelocity.
### [105] [Synthetic Defect Geometries of Cast Metal Objects Modeled via 2d Voronoi Tessellations](https://arxiv.org/abs/2602.05440) *Natascha Jeziorski,Petra Gospodnetić,Claudia Redenbach* Main category: cs.CV TL;DR: This paper proposes a synthetic data generation method that combines parametric 3D defect modeling with physical simulation to improve the quality and diversity of training data for automated defect recognition in non-destructive testing.
Details Motivation: Defect detection is essential for industrial quality control, but real annotated data are scarce and rarely cover uncommon defects, and existing machine learning methods are limited by the quantity and quality of training data. Method: Builds parametric 3D defect models (such as pores and cracks) for metal casting and similar processes and embeds them in the mesh of a digital twin of the workpiece. A physically based Monte Carlo simulation (for example of visual surface inspection) then produces high-fidelity synthetic images together with pixel-accurate annotations. Result: Synthetic datasets of arbitrary size can be generated with controllable defect types and distributions, including adequate sampling of rare defects, and with automatic pixel-level annotation. Conclusion: The method gives a scalable, high-fidelity, and controllable framework for synthetic data generation across non-destructive testing tasks, easing the real-data bottleneck and improving the generalization and robustness of automated defect detection models. Abstract: In industry, defect detection is crucial for quality control. Non-destructive testing (NDT) methods are preferred as they do not influence the functionality of the object while inspecting. Automated data evaluation for automated defect detection is a growing field of research. In particular, machine learning approaches show promising results. To provide training data in sufficient amount and quality, synthetic data can be used. Rule-based approaches enable synthetic data generation in a controllable environment. Therefore, a digital twin of the inspected object including synthetic defects is needed. We present parametric methods to model 3d mesh objects of various defect types that can then be added to the object geometry to obtain synthetic defective objects. The models are motivated by common defects in metal casting but can be transferred to other machining procedures that produce similar defect shapes. Synthetic data resembling the real inspection data can then be created by using a physically based Monte Carlo simulation of the respective testing method. Using our defect models, a variable and arbitrarily large synthetic data set can be generated with the possibility to include rarely occurring defects in sufficient quantity. Pixel-perfect annotation can be created in parallel. As an example, we will use visual surface inspection, but the procedure can be applied in combination with simulations for any other NDT method.
### [106] [DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching](https://arxiv.org/abs/2602.05449) *Chang Zou,Changlin Li,Yang Li,Patrol Li,Jianbing Wu,Xiao He,Songtao Liu,Zhao Zhong,Kailin Huang,Linfeng Zhang* Main category: cs.CV TL;DR: This paper proposes a learnable feature caching mechanism for video diffusion models together with a conservative Restricted MeanFlow distillation method, reaching an 11.8× speedup while preserving generation quality.
Details Motivation: Existing acceleration methods for video diffusion models, such as training-free feature caching and step distillation, lose semantics and detail and degrade badly at high compression ratios, and combining them works even worse when sampling steps are sparse. Method: Replaces traditional heuristic feature caching with a lightweight learnable neural predictor, and designs a conservative Restricted MeanFlow distillation strategy to keep large-scale video models stable under heavy compression. Result: Achieves an 11.8× inference speedup while preserving generation quality; extensive experiments support the method's effectiveness. Conclusion: Learnable feature caching and conservative distillation work together to raise the acceleration ceiling and robustness of video diffusion models, offering a new approach to efficient video generation. Abstract: While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden. Among the existing acceleration methods, Feature Caching is popular due to its training-free property and considerable speedup performance, but it inevitably loses semantics and detail under further compression. Another widely adopted method, training-aware step-distillation, though successful in image generation, also faces drastic degradation in video generation with a few steps. Furthermore, the quality loss becomes more severe when simply applying training-free feature caching to the step-distilled models, due to the sparser sampling steps. This paper introduces, for the first time, a distillation-compatible learnable feature caching mechanism. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. By undertaking these initiatives, we further push the acceleration boundaries to $11.8\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code is in the supplementary materials and will be publicly available.
### [107] [Attention Retention for Continual Learning with Vision Transformers](https://arxiv.org/abs/2602.05454) *Yue Lu,Xiangyu Zhou,Shizhou Zhang,Yinghui Xing,Guoqiang Liang,Wencong Zhang* Main category: cs.CV TL;DR: This paper proposes an attention-retaining framework for continual learning that mitigates catastrophic forgetting by constraining attention drift in Vision Transformers during backpropagation.
Details Motivation: Catastrophic forgetting is a serious problem in continual learning, and the authors identify attention drift in Vision Transformers as a primary cause. Method: Proposes an attention-retaining framework: 1) extract attention maps for the previous task with a layer-wise rollout mechanism and generate instance-adaptive binary masks; 2) when learning a new task, use these masks to zero out gradients associated with previously attended regions, and scale parameter updates proportionally for compatibility with modern optimizers. Result: Experiments and visualizations show that the method mitigates catastrophic forgetting and preserves visual concepts, reaching SOTA performance with strong generalization across continual learning scenarios. Conclusion: Attention drift is a key factor behind catastrophic forgetting, and constraining attention at the gradient level markedly improves the stability and generalization of continual learning models. Abstract: Continual learning (CL) empowers AI systems to progressively acquire knowledge from non-stationary data streams. However, catastrophic forgetting remains a critical challenge. In this work, we identify attention drift in Vision Transformers as a primary source of catastrophic forgetting, where the attention to previously learned visual concepts shifts significantly after learning new tasks. Inspired by neuroscientific insights into the selective attention in the human visual system, we propose a novel attention-retaining framework to mitigate forgetting in CL. Our method constrains attention drift by explicitly modifying gradients during backpropagation through a two-step process: 1) extracting attention maps of the previous task using a layer-wise rollout mechanism and generating instance-adaptive binary masks, and 2) when learning a new task, applying these masks to zero out gradients associated with previous attention regions, thereby preventing disruption of learned visual concepts. For compatibility with modern optimizers, the gradient masking process is further enhanced by scaling parameter updates proportionally to maintain their relative magnitudes. Experiments and visualizations demonstrate the effectiveness of our method in mitigating catastrophic forgetting and preserving visual concepts. It achieves state-of-the-art performance and exhibits robust generalizability across diverse CL scenarios.
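The two-step process in the Attention Retention abstract can be sketched on toy data. This is a minimal illustration rather than the paper's implementation: it uses the common attention-rollout recipe (average each layer's attention with the identity, then multiply across layers), a fixed threshold in place of the instance-adaptive mask, and one gradient value per token. The function names and the threshold are assumptions, and the proportional update scaling is omitted.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def attention_rollout(layers):
    """Compose per-layer attention maps, mixing in the identity for residuals."""
    n = len(layers[0])
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0)
                  for j in range(n)] for i in range(n)]
        result = matmul(mixed, result)
    return result

def binary_mask(scores, threshold):
    """Mark tokens whose rolled-out attention reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]

def mask_gradients(grads, mask):
    """Zero the gradient wherever the previous task attended (mask == 1)."""
    return [0.0 if m == 1 else g for g, m in zip(grads, mask)]
```

With two identical layers `[[1.0, 0.0], [0.5, 0.5]]`, the second row of the rollout is `[0.4375, 0.5625]`, so a threshold of 0.5 protects only the second token and its gradient is zeroed.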
### [108] [MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation](https://arxiv.org/abs/2602.05467) *Dekang Qi,Shuang Zeng,Xinyuan Chang,Feng Xiong,Shichao Xie,Xiaolong Wu,Mu Xu* Main category: cs.CV TL;DR: This paper proposes a Memory-Execute-Review framework that raises both success rate (SR) and generalization in Visual Language Navigation (VLN), combining the strengths of supervised fine-tuning (SFT) and training-free (TF) methods, with clear gains on several datasets.
Details Motivation: Existing VLN methods trade success rate (SR) against generalization: supervised fine-tuning gives high SR but generalizes poorly, while training-free methods generalize well but have lower SR. A framework that unites the strengths of both is needed. Method: Proposes a three-module Memory-Execute-Review framework: a hierarchical memory module supplies information, an execute module handles routine decisions and actions, and a review module handles abnormal situations and corrects behavior. The framework is validated on the Object Goal Navigation task. Result: Across 4 datasets, average SR improves by 5% and 7% under zero-shot (ZS) and training-free (TF) settings, respectively. On HM3D_v0.1 and HM3D_OVON, SR under ZS settings improves by 8% and 6%. On MP3D and HM3D_OVON the method outperforms all TF and all SFT methods, leading in SR by 5% and 2%. Conclusion: The Memory-Execute-Review framework balances and improves both success rate and generalization in VLN and leads across the board. Abstract: Visual Language Navigation (VLN) is one of the fundamental capabilities for embodied intelligence and a critical challenge that urgently needs to be addressed. However, existing methods are still unsatisfactory in terms of both success rate (SR) and generalization: Supervised Fine-Tuning (SFT) approaches typically achieve higher SR, while Training-Free (TF) approaches often generalize better, but it is difficult to obtain both simultaneously. To this end, we propose a Memory-Execute-Review framework. It consists of three parts: a hierarchical memory module for providing information support, an execute module for routine decision-making and actions, and a review module for handling abnormal situations and correcting behavior. We validated the effectiveness of this framework on the Object Goal Navigation task. Across 4 datasets, our average SR achieved absolute improvements of 7% and 5% compared to all baseline methods under TF and Zero-Shot (ZS) settings, respectively. On the most commonly used HM3D_v0.1 and the more challenging open vocabulary dataset HM3D_OVON, the SR improved by 8% and 6% under ZS settings. Furthermore, on the MP3D and HM3D_OVON datasets, our method not only outperformed all TF methods but also surpassed all SFT methods, achieving comprehensive leadership in both SR (5% and 2%) and generalization.
### [109] [SOMA-1M: A Large-Scale SAR-Optical Multi-resolution Alignment Dataset for Multi-Task Remote Sensing](https://arxiv.org/abs/2602.05480) *Peihao Wu,Yongxiang Yao,Yi Wan,Wenfei Zhang,Ruipeng Zhao,Jiayuan Li,Yongjun Zhang* Main category: cs.CV TL;DR: This paper presents SOMA-1M, a dataset of 1.3 million multi-resolution SAR-optical remote sensing image pairs with precise pixel-level registration. It offers global coverage, multiple scales (0.5 m to 10 m), and 12 land cover categories; supports image matching, fusion, cloud removal, and cross-modal translation; and is shown to markedly improve multimodal remote sensing algorithms.
Details Motivation: Existing SAR-optical benchmark datasets are limited to a single resolution, are too small, and are poorly aligned, which makes them inadequate for training and generalizing multi-scale foundation models. Method: Builds SOMA-1M by integrating Sentinel-1, PIESAT-1, Capella Space, and Google Earth imagery; designs a coarse-to-fine image matching framework for high-precision pixel-level registration; and establishes a comprehensive benchmark covering four vision tasks. Result: Supervised training on SOMA-1M markedly improves performance on all tasks, and multimodal remote sensing image matching in particular reaches the current state of the art (SOTA). Conclusion: SOMA-1M is a key foundational resource for robust multimodal remote sensing algorithms and remote sensing foundation models, and it will be released publicly. Abstract: Synthetic Aperture Radar (SAR) and optical imagery provide complementary strengths that constitute the critical foundation for transcending single-modality constraints and facilitating cross-modal collaborative processing and intelligent interpretation. However, existing benchmark datasets often suffer from limitations such as single spatial resolution, insufficient data scale, and low alignment accuracy, making them inadequate for supporting the training and generalization of multi-scale foundation models. To address these challenges, we introduce SOMA-1M (SAR-Optical Multi-resolution Alignment), a pixel-level precisely aligned dataset containing over 1.3 million pairs of georeferenced images with a specification of 512 x 512 pixels. This dataset integrates imagery from Sentinel-1, PIESAT-1, Capella Space, and Google Earth, achieving global multi-scale coverage from 0.5 m to 10 m. It encompasses 12 typical land cover categories, effectively ensuring scene diversity and complexity. To address multimodal projection deformation and massive data registration, we designed a rigorous coarse-to-fine image matching framework ensuring pixel-level alignment. Based on this dataset, we established comprehensive evaluation benchmarks for four hierarchical vision tasks, including image matching, image fusion, SAR-assisted cloud removal, and cross-modal translation, involving over 30 mainstream algorithms. Experimental results demonstrate that supervised training on SOMA-1M significantly enhances performance across all tasks. Notably, multimodal remote sensing image (MRSI) matching performance achieves current state-of-the-art (SOTA) levels. 
SOMA-1M serves as a foundational resource for robust multimodal algorithms and remote sensing foundation models. The dataset will be released publicly at: https://github.com/PeihaoWu/SOMA-1M.
### [110] [Feature points evaluation on omnidirectional vision with a photorealistic fisheye sequence -- A report on experiments done in 2014](https://arxiv.org/abs/2602.05487) *Julien Moreau,S. Ambellouis,Yassine Ruichek* Main category: cs.CV TL;DR: This report documents unpublished PhD work that looks for the feature detectors and descriptors best suited to self-calibration on fisheye images. It introduces the PFSeq (Photorealistic Fisheye Sequence) dataset and runs systematic experiments, but proposes no new algorithm and does not compare against algorithms designed for omnidirectional images.
Details Motivation: To address the chicken-and-egg problem in fisheye self-calibration: without an accurate projection model, feature extraction suffers, yet good features are needed to estimate that model. The target application is visual odometry and stereovision in urban environments with a car-mounted fisheye camera pointing at the zenith. Method: Systematically evaluates a range of standard feature detection and description algorithms on fisheye images, using the authors' own photorealistic fisheye sequence dataset PFSeq and focusing on a zenith-facing, car-mounted fisheye camera used for urban localization. Result: The experiments identify the best-performing detector and descriptor combinations for this specific setup (zenith-facing fisheye, self-calibration), although no specific rankings or quantitative details are given here. Conclusion: Standard feature methods remain usable on fisheye images, but their performance depends strongly on the geometric distortion and on the task; the work stresses the importance of dedicated fisheye benchmarks such as PFSeq and of targeted evaluation. Abstract: What is this report: This is a scientific report, contributing a detailed bibliography, a dataset which we call PFSeq for "Photorealistic Fisheye Sequence" and make available at https://doi.org/10.57745/DYIVVU, and comprehensive experiments. This work should be considered as a draft, and was done during my PhD thesis "Construction of 3D models from fisheye video data-Application to the localisation in urban area" in 2014 [Mor16]. These results have never been published. The aim was to find the best feature detector and descriptor for fisheye images, in the context of self-calibration, with cameras mounted on the top of a car and aiming at the zenith (to then perform fisheye visual odometry and stereovision in urban scenes). We face a chicken-and-egg problem, because we cannot take advantage of an accurate projection model for optimal feature detection and description, and we need good features precisely to perform the calibration (i.e. to compute the accurate projection model of the camera). What is not this report: It does not contribute a new feature algorithm. It does not compare standard feature algorithms to algorithms designed for omnidirectional images (unfortunately). It has not been peer-reviewed. Discussions have been translated and enhanced but the experiments have not been run again and the report has not been updated according to the evolution of the state of the art (read this as a 2014 report).
### [111] [VGGT-Motion: Motion-Aware Calibration-Free Monocular SLAM for Long-Range Consistency](https://arxiv.org/abs/2602.05508) *Zhuang Xiong,Chen Zhang,Qingshan Xu,Wenbing Tao* Main category: cs.CV TL;DR: This paper proposes VGGT-Motion, a monocular SLAM system that needs no camera calibration. Through motion-aware submap construction, anchor-driven Sim(3) registration, and lightweight submap-level pose graph optimization, it achieves efficient and robust global consistency over kilometer-scale trajectories.
Details Motivation: Existing calibration-free monocular SLAM methods suffer severe scale drift on long sequences. Motion-agnostic submap partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. Method: 1) Motion-aware submap construction: use optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns to stabilize local geometry; 2) anchor-driven direct Sim(3) registration: use context-balanced anchors for search-free, pixel-wise dense alignment and efficient loop closure; 3) lightweight submap-level pose graph optimization with linear complexity to enforce global consistency. Result: Markedly improves trajectory accuracy and efficiency in zero-shot, long-range, calibration-free monocular SLAM, reaching state-of-the-art performance. Conclusion: VGGT-Motion mitigates scale drift on long sequences while remaining efficient, robust, and globally consistent, offering a new approach to calibration-free monocular SLAM. Abstract: Despite recent progress in calibration-free monocular SLAM via 3D vision foundation models, scale drift remains severe on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. To address these issues, we propose VGGT-Motion, a calibration-free SLAM system for efficient and robust global consistency over kilometer-scale trajectories. Specifically, we first propose a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. We then design an anchor-driven direct Sim(3) registration strategy. By exploiting context-balanced anchors, it achieves search-free, pixel-wise dense alignment and efficient loop closure without costly feature matching. Finally, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, enabling scalable long-range operation. Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM.
### [112] [Mapper-GIN: Lightweight Structural Graph Abstraction for Corrupted 3D Point Cloud Classification](https://arxiv.org/abs/2602.05522) *Jeongbin You,Donggun Kim,Sejun Park,Seungsang Oh* Main category: cs.CV TL;DR: This paper proposes Mapper-GIN, a lightweight point cloud classification method based on the topological Mapper algorithm. It builds a region graph and classifies it with a GIN, achieving strong robustness and good interpretability on ModelNet40-C.
Details Motivation: To test whether structural abstraction alone, rather than larger models or specialized data augmentation, can improve the robustness of 3D point cloud classification. Method: Proposes Mapper-GIN: the Mapper algorithm (PCA lens, cubical cover, density-based clustering) partitions the point cloud into overlapping regions, a region graph is built from their overlaps, and a Graph Isomorphism Network (GIN) classifies the graph. Result: On the ModelNet40-C benchmark, Mapper-GIN reaches competitive and stable accuracy under Noise and Transformation corruptions with only 0.5M parameters, and is more robust than most methods that need larger models or extra mechanisms. Conclusion: Region-graph structure is itself an efficient and interpretable source of robustness and suggests a new direction for 3D visual recognition. Abstract: Robust 3D point cloud classification is often pursued by scaling up backbones or relying on specialized data augmentation. We instead ask whether structural abstraction alone can improve robustness, and study a simple topology-inspired decomposition based on the Mapper algorithm. We propose Mapper-GIN, a lightweight pipeline that partitions a point cloud into overlapping regions using Mapper (PCA lens, cubical cover, followed by density-based clustering), constructs a region graph from their overlaps, and performs graph classification with a Graph Isomorphism Network. On the corruption benchmark ModelNet40-C, Mapper-GIN achieves competitive and stable accuracy under Noise and Transformation corruptions with only 0.5M parameters. In contrast to prior approaches that require heavier architectures or additional mechanisms to gain robustness, Mapper-GIN attains strong corruption robustness through simple region-level graph abstraction and GIN message passing. Overall, our results suggest that region-graph structure offers an efficient and interpretable source of robustness for 3D visual recognition.
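The Mapper step described above (lens, overlapping cover, per-interval clustering, graph from overlaps) can be sketched in a few lines. This is a simplified illustration, not the Mapper-GIN pipeline: the lens is any user-supplied scalar function rather than a PCA projection, the cover is one-dimensional, and clustering is plain distance-threshold linkage (comparable to DBSCAN with a minimum cluster size of one). All function names and default parameters are assumptions.

```python
import math

def interval_cover(lo, hi, n_intervals, overlap):
    """Overlapping intervals over [lo, hi]; overlap is a fraction of interval length."""
    length = (hi - lo) / n_intervals
    pad = 0.5 * overlap * length
    return [(lo + i * length - pad, lo + (i + 1) * length + pad)
            for i in range(n_intervals)]

def cluster(indices, points, eps):
    """Group points that are chained together by distances of at most eps."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        group, frontier = {seed}, [seed]
        while frontier:
            p = frontier.pop()
            near = {q for q in remaining
                    if math.dist(points[p], points[q]) <= eps}
            remaining -= near
            group |= near
            frontier.extend(near)
        clusters.append(frozenset(group))
    return clusters

def mapper_graph(points, lens, n_intervals=3, overlap=0.5, eps=1.0):
    """Return (nodes, edges): nodes are point-index sets, edges link overlapping nodes."""
    values = [lens(p) for p in points]
    nodes = []
    for a, b in interval_cover(min(values), max(values), n_intervals, overlap):
        inside = [i for i, v in enumerate(values) if a <= v <= b]
        nodes.extend(cluster(inside, points, eps))
    edges = {(i, j) for i in range(len(nodes))
             for j in range(i + 1, len(nodes)) if nodes[i] & nodes[j]}
    return nodes, edges
```

For six points spaced one unit apart on a line with the x-coordinate as lens, the defaults yield three region nodes chained by two edges; the resulting graph, rather than the raw points, is what a graph network would then classify.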
### [113] [Generalization of Self-Supervised Vision Transformers for Protein Localization Across Microscopy Domains](https://arxiv.org/abs/2602.05527) *Ben Isselmann,Dilara Göksu,Andreas Weinmann* Main category: cs.CV TL;DR: This paper studies how well Vision Transformers pretrained with self-supervised learning (DINO) transfer across microscopy domains. A model pretrained on the Human Protein Atlas (HPA) performs best on OpenCell protein localization, slightly ahead of a model trained directly on OpenCell, which indicates that domain-relevant SSL representations generalize to related but distinct microscopy datasets.
Details Motivation: Task-specific microscopy datasets are usually too small to train robust deep learning models. Self-supervised pretraining is promising, but how well it transfers across microscopy domains with different staining protocols and channel configurations is unclear. Method: Using the DINO framework, pretrains ViT backbones on ImageNet-1k, the Human Protein Atlas (HPA), and OpenCell, extracts image embeddings, and trains a supervised classification head on OpenCell labels to evaluate cross-domain transfer. Result: All pretrained models transfer well. The HPA-pretrained model achieves the highest mean macro F1 score (0.8221 ± 0.0062), slightly above the model pretrained directly on OpenCell (0.8057 ± 0.0090). Conclusion: Large-scale, domain-relevant self-supervised pretraining substantially improves downstream microscopy performance and generalizes well even when labeled data in the target domain are limited. Abstract: Task-specific microscopy datasets are often too small to train deep learning models that learn robust feature representations. Self-supervised learning (SSL) can mitigate this by pretraining on large unlabeled datasets, but it remains unclear how well such representations transfer across microscopy domains with different staining protocols and channel configurations. We investigate the cross-domain transferability of DINO-pretrained Vision Transformers for protein localization on the OpenCell dataset. We generate image embeddings using three DINO backbones pretrained on ImageNet-1k, the Human Protein Atlas (HPA), and OpenCell, and evaluate them by training a supervised classification head on OpenCell labels. All pretrained models transfer well, with the microscopy-specific HPA-pretrained model achieving the best performance (mean macro $F_1$-score = $0.8221 \pm 0.0062$), slightly outperforming a DINO model trained directly on OpenCell ($0.8057 \pm 0.0090$). These results highlight the value of large-scale pretraining and indicate that domain-relevant SSL representations can generalize effectively to related but distinct microscopy datasets, enabling strong downstream performance even when task-specific labeled data are limited.
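For readers unfamiliar with the metric reported above, the macro $F_1$-score is the unweighted mean of per-class $F_1$ scores, so rare localization classes count as much as common ones. The sketch below assumes single-label predictions and a hypothetical helper name `macro_f1`; the paper's exact evaluation protocol may differ.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

With true labels `["a", "a", "b", "b"]` and predictions `["a", "b", "b", "b"]`, class `a` scores 2/3 and class `b` scores 4/5, giving a macro $F_1$ of 11/15.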
### [114] [SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation](https://arxiv.org/abs/2602.05534) *Youngwoo Shin,Jiwan Hur,Junmo Kim* Main category: cs.CV TL;DR: This paper proposes Scaled Spatial Guidance (SSG), a training-free, inference-time guidance method that takes an information-theoretic view to analyze and mitigate the hierarchy drift of visual autoregressive (VAR) models in multi-scale image generation. Combined with the frequency-domain technique DSE, SSG encourages each scale to contribute the high-frequency semantic residual not explained by earlier scales, improving fidelity, diversity, and global coherence without adding latency.
Details Motivation: At inference time, limited capacity and accumulated error cause the coarse-to-fine hierarchy of VAR models to drift, producing a train-inference discrepancy; from an information-theoretic standpoint, the authors argue that each scale should contribute high-frequency content not explained by earlier scales. Method: SSG is a training-free, inference-time guidance method that emphasizes the semantic residual (the high-frequency signal) of the target scale; the residual is isolated from a coarser prior and sharpened by the newly proposed frequency-domain procedure Discrete Spatial Enhancement (DSE). SSG applies to any VAR model built on discrete visual tokens. Result: SSG consistently improves fidelity and diversity across multiple VAR models while keeping latency low, revealing untapped efficiency in the coarse-to-fine generation paradigm. Conclusion: The information-driven inference-time guidance of SSG, together with DSE, corrects hierarchy drift in VAR models and improves generation quality and robustness while remaining model-agnostic and plug-and-play. Abstract: Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), a training-free, inference-time guidance method that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual, isolated from a coarser prior. To obtain this prior, we leverage a principled frequency-domain procedure, Discrete Spatial Enhancement (DSE), which is devised to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models leveraging discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments demonstrate SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation. Code is available at https://github.com/Youngwoo-git/SSG.
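The idea of a residual "isolated from a coarser prior" can be made concrete with a toy 1D sketch. This is not the authors' DSE procedure, which operates in the frequency domain and is not specified in the abstract: here the coarse prior is built by plain average pooling followed by nearest-neighbor upsampling, and the guidance rule `fine + w * residual` is an assumed, classifier-free-guidance-style form. All function names are hypothetical.

```python
def downsample(signal, factor=2):
    """Average-pool a 1D signal; len(signal) must be divisible by factor."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal), factor)]


def upsample(signal, factor=2):
    """Nearest-neighbor upsampling of a 1D signal."""
    return [value for value in signal for _ in range(factor)]


def semantic_residual(fine, factor=2):
    """High-frequency part of `fine` not explained by its coarse prior."""
    prior = upsample(downsample(fine, factor), factor)
    return [f - p for f, p in zip(fine, prior)]


def emphasize_residual(fine, weight, factor=2):
    """Assumed guidance form: push the signal along its own residual."""
    residual = semantic_residual(fine, factor)
    return [f + weight * r for f, r in zip(fine, residual)]
```

Regions where the fine signal already matches the coarse prior have zero residual and are left unchanged, so only detail that the coarser scale could not represent is amplified.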
### [115] [A Comparative Study of 3D Person Detection: Sensor Modalities and Robustness in Diverse Indoor and Outdoor Environments](https://arxiv.org/abs/2602.05538) *Malaz Tamim,Andrea Matic-Flierl,Karsten Roscher* Main category: cs.CV TL;DR: This paper systematically evaluates camera-only, LiDAR-only, and camera-LiDAR fusion approaches to 3D person detection. On the JRDB dataset, across diverse indoor and outdoor scenes, it compares BEVDepth, PointPillars, and DAL and finds that fusion performs best overall but remains sensitive to sensor misalignment and certain LiDAR corruptions, while the camera-only approach is the most affected by occlusion, distance, and noise.
Details Motivation: Most existing work on 3D person detection targets autonomous driving, whereas practical applications such as robotics, industrial monitoring, and surveillance require accurate, robust detection in diverse indoor and outdoor environments, which calls for a systematic cross-scene, multi-modal evaluation. Method: Three representative models, BEVDepth (camera-only), PointPillars (LiDAR-only), and DAL (camera-LiDAR fusion), are evaluated under a common protocol on JRDB. Detection performance is analyzed across occlusion levels and distance ranges, and robustness is assessed against sensor corruptions (e.g., image noise, missing LiDAR points) and calibration errors (e.g., extrinsic misalignment). Result: The fusion model DAL consistently outperforms the single-modality models, especially under challenging conditions, but it is sensitive to sensor misalignment and some LiDAR corruptions; BEVDepth performs worst and is the most vulnerable to occlusion, long range, and noise. Conclusion: Sensor fusion is an effective way to improve the accuracy and robustness of 3D person detection, but current fusion methods retain key vulnerabilities and need further work to handle sensor uncertainty in real scenes. Abstract: Accurate 3D person detection is critical for safety in applications such as robotics, industrial monitoring, and surveillance. This work presents a systematic evaluation of 3D person detection using camera-only, LiDAR-only, and camera-LiDAR fusion. While most existing research focuses on autonomous driving, we explore detection performance and robustness in diverse indoor and outdoor scenes using the JRDB dataset. We compare three representative models - BEVDepth (camera), PointPillars (LiDAR), and DAL (camera-LiDAR fusion) - and analyze their behavior under varying occlusion and distance levels. Our results show that the fusion-based approach consistently outperforms single-modality models, particularly in challenging scenarios. We further investigate robustness against sensor corruptions and misalignments, revealing that while DAL offers improved resilience, it remains sensitive to sensor misalignment and certain LiDAR-based corruptions. In contrast, the camera-based BEVDepth model showed the lowest performance and was most affected by occlusion, distance, and noise. Our findings highlight the importance of utilizing sensor fusion for enhanced 3D person detection, while also underscoring the need for ongoing research to address the vulnerabilities inherent in these systems.
### [116] [FastVMT: Eliminating Redundancy in Video Motion Transfer](https://arxiv.org/abs/2602.05551) *Yue Ma,Zhikai Wang,Tianhao Ren,Mingzhe Zheng,Hongyu Liu,Jiayi Guo,Mark Fong,Yuxuan Xue,Zixiang Zhao,Konrad Schindler,Qifeng Chen,Linfeng Zhang* Main category: cs.CV TL;DR: This paper proposes FastVMT, which accelerates DiT computation in video motion transfer by eliminating motion redundancy and gradient redundancy, achieving a 3.43x speedup without degrading visual fidelity or temporal consistency.
Details Motivation: Existing Diffusion Transformer (DiT)-based video motion transfer methods are structurally inefficient: they ignore that frame-to-frame motion is small and smooth and that gradients change slowly along the diffusion trajectory. Method: (1) For motion redundancy, the attention layers are masked to a local neighborhood so that interactions between distant image regions are not computed; (2) for gradient redundancy, a reuse-and-skip scheme reuses gradients from earlier diffusion steps and skips unnecessary gradient computations. Result: On average, FastVMT achieves a 3.43x inference speedup while preserving the visual quality and temporal consistency of the generated videos. Conclusion: Removing motion and gradient redundancy markedly improves DiT efficiency for video motion transfer without sacrificing generation quality, offering a new direction for efficient video generation. Abstract: Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work: motion redundancy arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; gradient redundancy occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we mask the corresponding attention layers to a local neighborhood such that interaction weights are not computed unnecessarily for distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, FastVMT achieves a 3.43x speedup without degrading the visual fidelity or the temporal consistency of the generated videos.
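Masking attention "to a local neighborhood" can be pictured with a minimal sketch. It assumes tokens laid out on a single `height x width` grid and a square (Chebyshev-distance) window; the abstract does not give the window shape, the radius, or how the mask is applied across frames, so those choices and the function names are illustrative assumptions.

```python
def local_attention_mask(height, width, radius):
    """mask[i][j] is True when token j lies within `radius` of token i.

    Tokens are numbered row-major on a height x width grid, and distance is
    the Chebyshev distance max(|dr|, |dc|), giving a square window.
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [max(abs(r1 - r2), abs(c1 - c2)) <= radius for (r2, c2) in coords]
        for (r1, c1) in coords
    ]


def count_allowed(mask):
    """Number of query-key pairs that still need an attention weight."""
    return sum(sum(row) for row in mask)
```

On a 3 x 3 grid with radius 1, 49 of the 81 query-key pairs survive; on realistic token grids the surviving fraction is far smaller, which is where the savings would come from.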
### [117] [IndustryShapes: An RGB-D Benchmark dataset for 6D object pose estimation of industrial assembly components and tools](https://arxiv.org/abs/2602.05555) *Panagiotis Sapoutzoglou,Orestis Vaggelis,Athina Zacharia,Evangelos Sartinas,Maria Pateraki* Main category: cs.CV TL;DR: IndustryShapes is a new RGB-D benchmark dataset for industrial settings, designed for instance-level and novel-object 6D pose estimation. It covers five types of challenging tools and components captured in realistic industrial assembly environments and is split into a classic set (4.6k images, 6k annotated poses) and an extended set (supporting model-free and sequence-based methods). It is the first dataset to provide RGB-D static onboarding sequences, and evaluations with several state-of-the-art methods confirm that it is challenging and practically relevant.
Details Motivation: To bridge the gap between lab research and deployment in real manufacturing, given that existing datasets focus on household objects, synthetic environments, or controlled lab conditions and lack real industrial complexity and diversity. Method: An RGB-D dataset of five new industrial tools and components is built, covering single- and multi-object scenes, including multiple instances of the same object. It is divided into a classic set (static images with pose annotations) and an extended set (additional modalities for evaluating model-free and sequence-based methods), and it provides RGB-D static onboarding sequences. Representative state-of-the-art methods (detection, segmentation, 6D pose estimation) are systematically evaluated. Result: IndustryShapes is the first industrial benchmark that supports both instance-level and novel-object 6D pose estimation and provides RGB-D static onboarding sequences; experiments show that current state-of-the-art methods still leave clear room for improvement. Conclusion: IndustryShapes offers a more realistic, challenging, and application-relevant benchmark for 6D pose estimation in industrial robotics, helping move algorithms from the lab to the production line. Abstract: We introduce IndustryShapes, a new RGB-D benchmark dataset of industrial tools and components, designed for both instance-level and novel object 6D pose estimation approaches. The dataset provides a realistic and application-relevant testbed for benchmarking these methods in the context of industrial robotics, bridging the gap between lab-based research and deployment in real-world manufacturing scenarios. Unlike many previous datasets that focus on household or consumer products, use synthetic, clean tabletop datasets, or contain objects captured solely in controlled lab environments, IndustryShapes introduces five new object types with challenging properties, also captured in realistic industrial assembly settings. The dataset has diverse complexity, from simple to more challenging scenes, with single and multiple objects, including scenes with multiple instances of the same object, and it is organized in two parts: the classic set and the extended set. The classic set includes a total of 4.6k images and 6k annotated poses. The extended set introduces additional data modalities to support the evaluation of model-free and sequence-based approaches. To the best of our knowledge, IndustryShapes is the first dataset to offer RGB-D static onboarding sequences. We further evaluate the dataset on a representative set of state-of-the-art methods for instance-based and novel object 6D pose estimation, as well as object detection and segmentation, showing that there is room for improvement in this domain.
The dataset page can be found at https://pose-lab.github.io/IndustryShapes.
### [118] [PIRATR: Parametric Object Inference for Robotic Applications with Transformers in 3D Point Clouds](https://arxiv.org/abs/2602.05557) *Michael Schwingshackl,Fabio F. Oberweger,Mario Niedermeyer,Huemer Johannes,Markus Murschitz* Main category: cs.CV TL;DR: PIRATR is an end-to-end 3D object detection framework for robotic applications that jointly estimates multi-class 6-DoF poses and class-specific parametric attributes directly from occlusion-affected point clouds; trained purely in simulation, it reaches 0.919 mAP on real outdoor LiDAR data.
Details Motivation: Robots in dynamic environments need to jointly estimate the geometric location and the task-relevant attributes of parametric objects (such as a gripper with an adjustable opening), bridging the gap between low-level geometric reasoning and actionable world models. Method: Extending PI3DETR, a modular architecture with class-specific detection heads jointly predicts multi-class 6-DoF poses and parametric attributes (e.g., gripper opening) directly from occluded point clouds and adjusts the 3D model according to predefined rules; new object classes can be added without redesigning the pipeline. Result: The method is validated on an automated forklift platform with three structurally and functionally diverse categories (crane grippers, loading platforms, and pallets); after training purely in simulation, it transfers directly to real outdoor LiDAR scans with a detection mAP of 0.919 and no fine-tuning. Conclusion: PIRATR establishes a pose-aware, parameterized perception paradigm and advances scalable, simulation-trained robotic perception systems that can be deployed directly. Abstract: We present PIRATR, an end-to-end 3D object detection framework for robotic use cases in point clouds. Extending PI3DETR, our method streamlines parametric 3D object detection by jointly estimating multi-class 6-DoF poses and class-specific parametric attributes directly from occlusion-affected point cloud data. This formulation enables not only geometric localization but also the estimation of task-relevant properties for parametric objects, such as a gripper's opening, where the 3D model is adjusted according to simple, predefined rules. The architecture employs modular, class-specific heads, making it straightforward to extend to novel object types without re-designing the pipeline. We validate PIRATR on an automated forklift platform, focusing on three structurally and functionally diverse categories: crane grippers, loading platforms, and pallets. Trained entirely in a synthetic environment, PIRATR generalizes effectively to real outdoor LiDAR scans, achieving a detection mAP of 0.919 without additional fine-tuning. PIRATR establishes a new paradigm of pose-aware, parameterized perception. This bridges the gap between low-level geometric reasoning and actionable world models, paving the way for scalable, simulation-trained perception systems that can be deployed in dynamic robotic environments. Code available at https://github.com/swingaxe/piratr.
### [119] [ShapeGaussian: High-Fidelity 4D Human Reconstruction in Monocular Videos via Vision Priors](https://arxiv.org/abs/2602.05572) *Zhenxiao Liang,Ning Zhang,Youbao Tang,Ruei-Sung Lin,Qixing Huang,Peng Chang,Jing Xiao* Main category: cs.CV TL;DR: ShapeGaussian is a template-free, high-fidelity method for 4D human reconstruction from casual monocular videos; it combines data-driven 2D vision priors with two-stage deformation modeling to achieve both accuracy and robustness.
Details Motivation: Generic 4D reconstruction methods (e.g., 4DGS) lack strong vision priors and struggle with high-deformation human motion in monocular video, while template-based methods built on SMPL (e.g., HUGS) produce high-quality results but depend heavily on pose-estimation accuracy and are prone to artifacts. Method: A two-stage pipeline is used: first, pretrained models supply data-driven priors for learning a coarse deformable geometry; then a neural deformation model refines the geometry to capture dynamic details. 2D vision priors are used throughout, and multiple reference frames mitigate the invisibility of occluded keypoints. Result: On casual monocular videos of diverse human motions, ShapeGaussian outperforms template-based methods in reconstruction accuracy, visual quality, and robustness. Conclusion: ShapeGaussian combines the advantages of a template-free formulation with vision priors, achieving high-fidelity, robust 4D human reconstruction without template constraints and offering a new approach to monocular 4D reconstruction. Abstract: We introduce ShapeGaussian, a high-fidelity, template-free method for 4D human reconstruction from casual monocular videos. Generic reconstruction methods lacking robust vision priors, such as 4DGS, struggle to capture high-deformation human motion without multi-view cues. While template-based approaches, primarily relying on SMPL, such as HUGS, can produce photorealistic results, they are highly susceptible to errors in human pose estimation, often leading to unrealistic artifacts. In contrast, ShapeGaussian effectively integrates template-free vision priors to achieve both high-fidelity and robust scene reconstructions. Our method follows a two-step pipeline: first, we learn a coarse, deformable geometry using pretrained models that estimate data-driven priors, providing a foundation for reconstruction. Then, we refine this geometry using a neural deformation model to capture fine-grained dynamic details. By leveraging 2D vision priors, we mitigate artifacts from erroneous pose estimation in template-based methods and employ multiple reference frames to resolve the invisibility issue of 2D keypoints in a template-free manner. Extensive experiments demonstrate that ShapeGaussian surpasses template-based methods in reconstruction accuracy, achieving superior visual quality and robustness across diverse human motions in casual monocular videos.
### [120] [Visual Implicit Geometry Transformer for Autonomous Driving](https://arxiv.org/abs/2602.05573) *Arsenii Shirokov,Mikhail Kuznetsov,Danila Stepochkin,Egor Evdokimov,Daniil Glazkov,Nikolay Patakin,Anton Konushin,Dmitry Senushkin* Main category: cs.CV TL;DR: ViGT is a calibration-free, self-supervised Visual Implicit Geometry Transformer that estimates continuous 3D occupancy fields from surround-view cameras, supports joint training on multiple datasets, and reaches state-of-the-art results on pointmap estimation.
Details Motivation: To build a scalable, architecturally simple, and well-generalizing foundational geometric model for autonomous driving that adapts to different sensor configurations in BEV and trains without manual annotations. Method: A calibration-free ViGT architecture learns continuous 3D occupancy fields in a self-supervised manner from synchronized image-LiDAR pairs, mapping multi-view images into a single metric BEV coordinate frame. Result: Trained on a mixture of five large-scale datasets, including NuScenes and Waymo, ViGT attains the best average rank on pointmap estimation; on Occ3D-nuScenes its performance is comparable to that of supervised methods. Conclusion: ViGT demonstrates the effectiveness and generalization potential of calibration-free, self-supervised, multi-dataset implicit geometry modeling for autonomous driving foundation models. Abstract: We introduce the Visual Implicit Geometry Transformer (ViGT), an autonomous driving geometric model that estimates continuous 3D occupancy fields from surround-view camera rigs. ViGT represents a step towards foundational geometric models for autonomous driving, prioritizing scalability, architectural simplicity, and generalization across diverse sensor configurations. Our approach achieves this through a calibration-free architecture, enabling a single model to adapt to different sensor setups. Unlike general-purpose geometric foundational models that focus on pixel-aligned predictions, ViGT estimates a continuous 3D occupancy field in a bird's-eye-view (BEV), addressing domain-specific requirements. ViGT naturally infers geometry from multiple camera views into a single metric coordinate frame, providing a common representation for multiple geometric tasks. Unlike most existing occupancy models, we adopt a self-supervised training procedure that leverages synchronized image-LiDAR pairs, eliminating the need for costly manual annotations. We validate the scalability and generalizability of our approach by training our model on a mixture of five large-scale autonomous driving datasets (NuScenes, Waymo, NuPlan, ONCE, and Argoverse) and achieving state-of-the-art performance on the pointmap estimation task, with the best average rank across all evaluated baselines. We further evaluate ViGT on the Occ3D-nuScenes benchmark, where ViGT achieves comparable performance with supervised methods. The source code is publicly available at https://github.com/whesense/ViGT.
### [121] [A Hybrid CNN and ML Framework for Multi-modal Classification of Movement Disorders Using MRI and Brain Structural Features](https://arxiv.org/abs/2602.05574) *Mengyu Li,Ingibjörg Kristjánsdóttir,Thilo van Eimeren,Kathrin Giehl,Lotta M. Ellingsen,the ASAP Neuroimaging Initiative* Main category: cs.CV TL;DR: This study proposes a hybrid framework combining convolutional neural networks (CNNs) and machine learning (ML) that uses multi-modal MRI data (T1-weighted images, segmentation masks of 12 deep brain structures, and volumetric measurements) to differentiate atypical Parkinsonian disorder (APD) subtypes (PSP, MSA) from Parkinson's disease (PD), reaching AUCs of 0.95 (PSP vs. PD), 0.86 (MSA vs. PD), and 0.92 (PSP vs. MSA).
Details Motivation: Early clinical presentations of APD overlap heavily with PD and are easily misdiagnosed; reliable imaging biomarkers for early differential diagnosis are needed. Method: A hybrid CNN-ML model takes T1-weighted MRI, segmentation masks of 12 relevant deep brain structures, and their volumetric measurements as input, fusing image features with quantitative volume features for three pairwise classification tasks (PSP vs. PD, MSA vs. PD, PSP vs. MSA). Result: The three binary tasks reach AUCs of 0.95 (PSP vs. PD), 0.86 (MSA vs. PD), and 0.92 (PSP vs. MSA), supporting the multi-modal fusion strategy. Conclusion: Fusing CNN-extracted spatial image features with volume-based structural features processed by ML improves APD subtype differentiation and may support earlier, more precise clinical diagnosis and intervention. Abstract: Atypical Parkinsonian Disorders (APD), also known as Parkinson-plus syndrome, are a group of neurodegenerative diseases that include progressive supranuclear palsy (PSP) and multiple system atrophy (MSA). In the early stages, overlapping clinical features often lead to misdiagnosis as Parkinson's disease (PD). Identifying reliable imaging biomarkers for early differential diagnosis remains a critical challenge. In this study, we propose a hybrid framework combining convolutional neural networks (CNNs) with machine learning (ML) techniques to classify APD subtypes versus PD and distinguish between the subtypes themselves: PSP vs. PD, MSA vs. PD, and PSP vs. MSA. The model leverages multi-modal input data, including T1-weighted magnetic resonance imaging (MRI), segmentation masks of 12 deep brain structures associated with APD, and their corresponding volumetric measurements. By integrating these complementary modalities, including image data, structural segmentation masks, and quantitative volume features, the hybrid approach achieved promising classification performance with area under the curve (AUC) scores of 0.95 for PSP vs. PD, 0.86 for MSA vs. PD, and 0.92 for PSP vs. MSA. These results highlight the potential of combining spatial and structural information for robust subtype differentiation. In conclusion, this study demonstrates that fusing CNN-based image features with volume-based ML inputs improves classification accuracy for APD subtypes. The proposed approach may contribute to more reliable early-stage diagnosis, facilitating timely and targeted interventions in clinical practice.
### [122] [LocateEdit-Bench: A Benchmark for Instruction-Based Editing Localization](https://arxiv.org/abs/2602.05577) *Shiyu Wu,Shuyan Li,Jing Li,Jing Liu,Yequan Wang* Main category: cs.CV TL;DR: This paper introduces LocateEdit-Bench, a dataset for evaluating forgery localization methods on instruction-driven image editing; it contains 231K edited images covering four cutting-edge editing models and three edit types, together with multi-metric evaluation protocols.
Details Motivation: Existing AI-generated forgery localization methods mainly target inpainting-based manipulations and perform poorly on emerging instruction-driven editing, so a benchmark suited to the new editing paradigm is needed. Method: The large-scale LocateEdit-Bench dataset (231K edited images) is built to span four recent instruction-driven editing models and three common edit types; two multi-metric evaluation protocols are proposed, and existing localization methods are systematically evaluated. Result: The work provides a benchmark dataset and evaluation suite for forgery localization under instruction-driven image editing and exposes the performance bottlenecks of current methods in this setting. Conclusion: LocateEdit-Bench offers a key benchmark for keeping pace with rapidly evolving image editing techniques and supports the development of future forgery localization methods. Abstract: Recent advancements in image editing have enabled highly controllable and semantically-aware alteration of visual content, posing unprecedented challenges to manipulation localization. However, existing AI-generated forgery localization methods primarily focus on inpainting-based manipulations, making them ineffective against the latest instruction-based editing paradigms. To bridge this critical gap, we propose LocateEdit-Bench, a large-scale dataset comprising 231K edited images, designed specifically to benchmark localization methods against instruction-driven image editing. Our dataset incorporates four cutting-edge editing models and covers three common edit types. We conduct a detailed analysis of the dataset and develop two multi-metric evaluation protocols to assess existing localization methods. Our work establishes a foundation to keep pace with the evolving landscape of image editing, thereby facilitating the development of effective methods for future forgery localization. Dataset will be open-sourced upon acceptance.
### [123] [LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation](https://arxiv.org/abs/2602.05578) *Junyang Chen,Xiangbo Lv,Zhiqiang Kou,Xingdong Sheng,Ning Xu,Yiguo Qiao* Main category: cs.CV TL;DR: This paper proposes LoGoSeg, an efficient single-stage open-vocabulary semantic segmentation framework that introduces an object existence prior, a region-aware alignment module, and a dual-stream fusion mechanism to improve visual-textual spatial alignment and reduce hallucinations and missed detections, without external mask proposals or extra datasets.
Details Motivation: Open-vocabulary segmentation methods built on VLMs such as CLIP rely on image-level pretraining, which yields imprecise spatial alignment; they also lack strong object priors and region-level constraints, leading to object hallucination or missed detections. Method: LoGoSeg comprises (i) an object existence prior based on global image-text similarity that dynamically weights relevant categories; (ii) a region-aware alignment module that establishes precise region-level image-text correspondences; and (iii) a dual-stream fusion mechanism that combines local structural information with global semantic context. Result: LoGoSeg shows competitive performance and strong generalization on six benchmarks, including A-847 and PC-459, without external mask proposals, additional backbones, or extra datasets. Conclusion: By adding an object prior and region-level alignment, LoGoSeg improves the accuracy and robustness of open-vocabulary semantic segmentation while keeping an efficient single-stage design. Abstract: Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories using arbitrary textual descriptions. While existing methods leverage vision-language models (VLMs) like CLIP, their reliance on image-level pretraining often results in imprecise spatial alignment, leading to mismatched segmentations in ambiguous or cluttered scenes. Moreover, most existing approaches lack strong object priors and region-level constraints, which can lead to object hallucination or missed detections, further degrading performance. To address these challenges, we propose LoGoSeg, an efficient single-stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context. Unlike prior works, LoGoSeg eliminates the need for external mask proposals, additional backbones, or extra datasets, ensuring efficiency. Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate its competitive performance and strong generalization in open-vocabulary settings.
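One way to picture an object existence prior that "dynamically weights relevant categories through global image-text similarity" is the sketch below. It is an assumption-laden toy, not LoGoSeg's actual module: the prior is taken to be a temperature-scaled softmax over cosine similarities between one global image embedding and each class text embedding, and it is applied by rescaling and renormalizing per-pixel class probabilities. The names `existence_prior`, `reweight`, and the `temperature` value are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def existence_prior(image_emb, text_embs, temperature=0.1):
    """Softmax over image-text similarities: one weight per category."""
    sims = [cosine(image_emb, t) / temperature for t in text_embs]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def reweight(pixel_probs, prior):
    """Scale each pixel's class probabilities by the prior and renormalize."""
    out = []
    for probs in pixel_probs:
        scaled = [p * w for p, w in zip(probs, prior)]
        total = sum(scaled)
        out.append([s / total for s in scaled])
    return out
```

A category whose text embedding is dissimilar to the whole image receives a small weight at every pixel, which is the intuition behind suppressing hallucinated classes.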
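### [124] [Geometric Observability Index: An Operator-Theoretic Framework for Per-Feature Sensitivity, Weak Observability, and Dynamic Effects in SE(3) Pose Estimation](https://arxiv.org/abs/2602.05582) *Joe-Mei Feng,Sheng-Wei Yu* Main category: cs.CV TL;DR: This paper proposes a unified operator-theoretic framework on the Lie group SE(3) for analyzing the sensitivity of camera pose estimation to individual image features, and defines the Geometric Observability Index (GOI) to quantify the influence of a single measurement on the pose estimate.
Details Motivation: Classical sensitivity tools cannot explain how individual image features affect the pose estimate, nor why dynamic or inconsistent observations disproportionately distort SLAM and SfM systems. Method: Influence function theory is extended to matrix Lie groups, an intrinsic perturbation operator is derived for left-trivialized M-estimators on SE(3), and GOI is defined through the curvature operator and the Lie algebraic structure of the observable subspace. Result: Through the spectral decomposition of the curvature operator, GOI reveals a direct link between weak observability and amplified sensitivity; in the population regime it coincides with the Fisher information geometry on SE(3), giving a single-measurement analogue of the Cramér-Rao bound; it also explains classical degeneracies such as pure rotation and vanishing parallax, as well as dynamic feature amplification along weak curvature directions. Conclusion: GOI unifies conditioning analysis, Fisher information geometry, influence function theory, and dynamic scene detectability; its computation fits within standard Gauss-Newton pipelines and provides training-free, lightweight, plug-and-play diagnostic signals for identifying dynamic features and weakly observable configurations. Abstract: We present a unified operator-theoretic framework for analyzing per-feature sensitivity in camera pose estimation on the Lie group SE(3). Classical sensitivity tools - conditioning analyses, Euclidean perturbation arguments, and Fisher information bounds - do not explain how individual image features influence the pose estimate, nor why dynamic or inconsistent observations can disproportionately distort modern SLAM and structure-from-motion systems. To address this gap, we extend influence function theory to matrix Lie groups and derive an intrinsic perturbation operator for left-trivialized M-estimators on SE(3). The resulting Geometric Observability Index (GOI) quantifies the contribution of a single measurement through the curvature operator and the Lie algebraic structure of the observable subspace. GOI admits a spectral decomposition along the principal directions of the observable curvature, revealing a direct correspondence between weak observability and amplified sensitivity. In the population regime, GOI coincides with the Fisher information geometry on SE(3), yielding a single-measurement analogue of the Cramér-Rao bound. The same spectral mechanism explains classical degeneracies such as pure rotation and vanishing parallax, as well as dynamic feature amplification along weak curvature directions.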
Overall, GOI provides a geometrically consistent description of measurement influence that unifies conditioning analysis, Fisher information geometry, influence function theory, and dynamic scene detectability through the spectral geometry of the curvature operator. Because these quantities arise directly within Gauss-Newton pipelines, the curvature spectrum and GOI also yield lightweight, training-free diagnostic signals for identifying dynamic features and detecting weak observability configurations without modifying existing SLAM architectures.
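### [125] [A Mixed Reality System for Robust Manikin Localization in Childbirth Training](https://arxiv.org/abs/2602.05588) *Haojie Cheng,Chang Liu,Abhiram Kanneganti,Mahesh Arjandas Choolani,Arundhati Tushar Gosavi,Eng Tat Khoo* Main category: cs.CV TL;DR: This paper presents a mixed reality (MR) childbirth training system that combines virtual guidance with tactile interaction on a physical obstetric manikin, preserving realistic haptics while supporting independent practice without expert supervision. Spatial calibration of an external RGB-D camera and a coarse-to-fine localization pipeline enable accurate overlay of virtual hands; experiments show stable operation on a standalone headset, and a large-scale comparative study with 83 medical students shows significantly better results than VR training.
Details Motivation: Opportunities for medical trainees to practice vaginal delivery are increasingly limited by shortened clinical rotations, patient reluctance, and the unpredictability of labour; clinical instructors also carry a heavy teaching load, so training efficiency needs to improve. Method: An MR childbirth training system is built on a commercial head-mounted display (HMD): (1) the HMD's passthrough capability is extended by spatially calibrating an external RGB-D camera, enabling real-time visual integration of physical training objects; (2) a coarse-to-fine localization pipeline first aligns the maternal manikin using fiducial markers to define a delivery region and then registers a pre-scanned neonatal head model within that region; (3) virtual guiding hands are overlaid accurately near the manikin, with real haptic feedback reinforcing the guided procedure. Result: The system achieves accurate and stable manikin localization on a standalone headset. In a controlled study with 83 fourth-year medical students, the MR group scored significantly higher than the VR group in delivery, post-delivery, and overall task performance, and trainees consistently preferred MR; performance was assessed independently by four senior obstetricians using standardized criteria. Conclusion: The proposed MR system eases pressure on clinical teaching resources and improves learning efficiency and procedural skill while preserving real haptic feedback; it offers advantages over VR for clinical training and shows potential for practical deployment. Abstract: Opportunities for medical students to gain practical experience in vaginal births are increasingly constrained by shortened clinical rotations, patient reluctance, and the unpredictable nature of labour. To alleviate clinicians' instructional burden and enhance trainees' learning efficiency, we introduce a mixed reality (MR) system for childbirth training that combines virtual guidance with tactile manikin interaction, thereby preserving authentic haptic feedback while enabling independent practice without continuous on-site expert supervision. The system extends the passthrough capability of commercial head-mounted displays (HMDs) by spatially calibrating an external RGB-D camera, allowing real-time visual integration of physical training objects. Building on this capability, we implement a coarse-to-fine localization pipeline that first aligns the maternal manikin with fiducial markers to define a delivery region and then registers the pre-scanned neonatal head within this area. This process enables spatially accurate overlay of virtual guiding hands near the manikin, allowing trainees to follow expert trajectories reinforced by haptic interaction. Experimental evaluations demonstrate that the system achieves accurate and stable manikin localization on a standalone headset, ensuring practical deployment without external computing resources. A large-scale user study involving 83 fourth-year medical students was subsequently conducted to compare MR-based and virtual reality (VR)-based childbirth training.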
Four senior obstetricians independently assessed performance using standardized criteria. Results showed that MR training achieved significantly higher scores in delivery, post-delivery, and overall task performance, and was consistently preferred by trainees over VR training.
### [126] [EgoPoseVR: Spatiotemporal Multi-Modal Reasoning for Egocentric Full-Body Pose in Virtual Reality](https://arxiv.org/abs/2602.05590) *Haojie Cheng,Shaun Jing Heng Ong,Shaoyu Cai,Aiden Tat Yang Koh,Fuxi Ouyang,Eng Tat Khoo* Main category: cs.CV TL;DR: This paper proposes EgoPoseVR, an end-to-end dual-modality fusion framework that combines HMD motion signals with RGB-D data for accurate, temporally stable egocentric full-body pose estimation in VR, without additional body-worn sensors or room-scale tracking systems.
Details Motivation: Existing head-mounted camera approaches suffer from temporal instability, inaccurate lower-body estimation, and a lack of real-time performance when applied to VR. Method: EgoPoseVR consists of (1) dual-modality fusion of HMD motion cues and egocentric RGB-D observations; (2) a spatiotemporal encoder that extracts frame- and joint-level representations, fused via cross-attention; (3) a kinematic optimization module that imposes constraints from HMD signals to improve accuracy and stability; and (4) a large-scale synthetic dataset of over 1.8 million frames for training and evaluation. Result: EgoPoseVR outperforms state-of-the-art egocentric pose estimation models; a user study in real-world scenes shows significantly higher subjective ratings than baseline methods in accuracy, stability, embodiment, and intention for future use. Conclusion: EgoPoseVR enables robust, accurate full-body pose tracking and offers a practical solution for embodied interaction in VR. Abstract: Immersive virtual reality (VR) applications demand accurate, temporally coherent full-body pose tracking. Recent head-mounted camera-based approaches show promise in egocentric pose estimation, but encounter challenges when applied to VR head-mounted displays (HMDs), including temporal instability, inaccurate lower-body estimation, and the lack of real-time performance. To address these limitations, we present EgoPoseVR, an end-to-end framework for accurate egocentric full-body pose estimation in VR that integrates headset motion cues with egocentric RGB-D observations through a dual-modality fusion pipeline. A spatiotemporal encoder extracts frame- and joint-level representations, which are fused via cross-attention to fully exploit complementary motion cues across modalities. A kinematic optimization module then imposes constraints from HMD signals, enhancing the accuracy and stability of pose estimation. To facilitate training and evaluation, we introduce a large-scale synthetic dataset of over 1.8 million temporally aligned HMD and RGB-D frames across diverse VR scenarios. Experimental results show that EgoPoseVR outperforms state-of-the-art egocentric pose estimation models. A user study in real-world scenes further shows that EgoPoseVR achieved significantly higher subjective ratings in accuracy, stability, embodiment, and intention for future use compared to baseline methods. These results show that EgoPoseVR enables robust full-body pose tracking, offering a practical solution for accurate VR embodiment without requiring additional body-worn sensors or room-scale tracking systems.
### [127] [CAViT -- Channel-Aware Vision Transformer for Dynamic Feature Fusion](https://arxiv.org/abs/2602.05598) *Aon Safdar,Mohamed Saadeldin* Main category: cs.CV TL;DR: This paper proposes CAViT, a dual-attention Vision Transformer that replaces the static MLP with a dynamic attention-based mechanism, enabling adaptive feature interaction across the spatial and channel dimensions; it outperforms the standard ViT on several benchmarks while reducing parameters and computation.
Details Motivation: Channel-wise feature mixing in existing Vision Transformers (ViTs) is static, relying on fixed MLPs that do not adapt to input content. Method: Each Transformer block in CAViT performs spatial self-attention followed by channel-wise self-attention, dynamically recalibrating features based on global context; a content-aware attention mechanism replaces the conventional static MLP. Result: On five natural and medical image benchmarks, CAViT improves accuracy over the standard ViT by up to 3.6% while cutting parameter count and FLOPs by over 30%; attention maps show sharper, more semantically meaningful activation patterns. Conclusion: Dynamic dual attention strengthens ViT representations without adding depth or complexity, offering a new direction for efficient, content-adaptive visual modeling. Abstract: Vision Transformers (ViTs) have demonstrated strong performance across a range of computer vision tasks by modeling long-range spatial interactions via self-attention. However, channel-wise mixing in ViTs remains static, relying on fixed multilayer perceptrons (MLPs) that lack adaptability to input content. We introduce 'CAViT', a dual-attention architecture that replaces the static MLP with a dynamic, attention-based mechanism for feature interaction. Each Transformer block in CAViT performs spatial self-attention followed by channel-wise self-attention, allowing the model to dynamically recalibrate feature representations based on global image context. This unified and content-aware token mixing strategy enhances representational expressiveness without increasing depth or complexity. We validate CAViT across five benchmark datasets spanning both natural and medical domains, where it outperforms the standard ViT baseline by up to +3.6% in accuracy, while reducing parameter count and FLOPs by over 30%. Qualitative attention maps reveal sharper and semantically meaningful activation patterns, validating the effectiveness of our attention-driven token mixing.
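Channel-wise self-attention can be sketched by transposing the usual roles: each channel becomes a "token" whose feature vector is its values across all spatial positions. The sketch below is a bare-bones illustration in plain Python and assumes no learned query/key/value projections, a single head, and scaling by the square root of the number of tokens; CAViT's actual layer design is not specified in the abstract, and the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_self_attention(x):
    """x is a list of N token vectors with C channels each; returns N x C.

    Channels attend to each other: the C x C attention matrix depends on the
    input, unlike the fixed weights of an MLP.
    """
    n = len(x)
    channels = [list(col) for col in zip(*x)]  # C x N: one row per channel
    scores = [
        [sum(a * b for a, b in zip(ci, cj)) / math.sqrt(n) for cj in channels]
        for ci in channels
    ]
    attn = [softmax(row) for row in scores]    # C x C, each row sums to 1
    mixed = [
        [sum(w * ch[t] for w, ch in zip(row, channels)) for t in range(n)]
        for row in attn
    ]                                          # C x N
    return [list(token) for token in zip(*mixed)]  # back to N x C
```

The attention matrix here is C x C rather than N x N, so its cost grows with the channel count instead of the number of image tokens.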
### [128] [Multi-instance robust fitting for non-classical geometric models](https://arxiv.org/abs/2602.05602) *Zongliang Zhang,Shuxiang Li,Xingwang Huang,Zongyue Wang* Main category: cs.CV TL;DR: This paper proposes a multi-instance robust fitting method for non-classical models (such as spiral curves, procedural character models, and free-form surfaces), using a new estimator based on the model-to-data error and a meta-heuristic optimizer to seek the global optimum on noisy data.
Details Motivation: Existing robust fitting methods mainly target classical geometric models (such as lines, circles, and planes), offer limited support for non-classical models, and mostly reconstruct a single instance, so they cannot handle multiple instances under heavy noise. Method: Multi-instance fitting is formulated as an optimization problem; an estimator based on the model-to-data error handles outliers without a predefined error threshold, and a meta-heuristic algorithm optimizes this non-differentiable objective. Result: Experiments on various non-classical models show that the method robustly fits multiple instances simultaneously from noisy data. Conclusion: The method overcomes the restriction of conventional robust fitting to classical models and single instances, providing a practical tool for multi-instance reconstruction of complex non-classical models. Abstract: Most existing robust fitting methods are designed for classical models, such as lines, circles, and planes. In contrast, fewer methods have been developed to robustly handle non-classical models, such as spiral curves, procedural character models, and free-form surfaces. Furthermore, existing methods primarily focus on reconstructing a single instance of a non-classical model. This paper aims to reconstruct multiple instances of non-classical models from noisy data. We formulate this multi-instance fitting task as an optimization problem, which comprises an estimator and an optimizer. Specifically, we propose a novel estimator based on the model-to-data error, capable of handling outliers without a predefined error threshold. Since the proposed estimator is non-differentiable with respect to the model parameters, we employ a meta-heuristic algorithm as the optimizer to seek the global optimum. The effectiveness of our method is demonstrated through experimental results on various non-classical models. The code is available at https://github.com/zhangzongliang/fitting.
### [129] [Unified Sensor Simulation for Autonomous Driving](https://arxiv.org/abs/2602.05617) *Nikolay Patakin,Arsenii Shirokov,Anton Konushin,Dmitry Senushkin* Main category: cs.CV TL;DR: XSIM is a sensor simulation framework for autonomous driving that extends 3DGUT splatting with rolling-shutter modeling, a phase modeling mechanism, and a dual-opacity Gaussian representation to improve geometric consistency and appearance realism in dynamic scenes, reaching SOTA performance on several autonomous driving datasets.
Details Motivation: When handling spherical sensors such as LiDAR, existing 3DGUT splatting suffers from cyclic projection and time discontinuities at azimuth boundaries, which lead to incorrect projection of Gaussian particles; a unified and flexible way to model sensor appearance and geometry is also lacking. Method: Proposes the XSIM framework: 1) generalized rolling-shutter modeling; 2) a phase modeling mechanism for spherical cameras that explicitly handles temporal and shape discontinuities at azimuth boundaries; 3) an extended 3D Gaussian representation with two independent opacity parameters that decouple the geometry and color distributions. Result: In evaluations on Waymo Open Dataset, Argoverse 2, and PandaSet, XSIM consistently outperforms strong baselines and reaches SOTA performance on each dataset; the code is open source. Conclusion: XSIM provides more accurate, consistent, and realistic rendering for autonomous driving sensor simulation, resolves a key difficulty in modeling spherical sensors, and advances Gaussian-splatting-based simulation. Abstract: In this work, we introduce XSIM, a sensor simulation framework for autonomous driving. XSIM extends 3DGUT splatting with generalized rolling-shutter modeling tailored for autonomous driving applications. Our framework provides a unified and flexible formulation for appearance and geometric sensor modeling, enabling rendering of complex sensor distortions in dynamic environments. We identify spherical cameras, such as LiDARs, as a critical edge case for existing 3DGUT splatting due to cyclic projection and time discontinuities at azimuth boundaries leading to incorrect particle projection. To address this issue, we propose a phase modeling mechanism that explicitly accounts for temporal and shape discontinuities of Gaussians projected by the Unscented Transform at azimuth borders. In addition, we introduce an extended 3D Gaussian representation that incorporates two distinct opacity parameters to resolve mismatches between geometry and color distributions. As a result, our framework provides enhanced scene representations with improved geometric consistency and photorealistic appearance. We evaluate our framework extensively on multiple autonomous driving datasets, including Waymo Open Dataset, Argoverse 2, and PandaSet. Our framework consistently outperforms strong recent baselines and achieves state-of-the-art performance across all datasets. The source code is publicly available at https://github.com/whesense/XSIM.
### [130] [ROMAN: Reward-Orchestrated Multi-Head Attention Network for Autonomous Driving System Testing](https://arxiv.org/abs/2602.05629) *Jianlei Chi,Yuzhen Wu,Jiaxuan Hou,Xiaodong Zhang,Ming Fan,Suhui Sun,Weijun Dai,Bo Li,Jianguo Sun,Jun Sun* Main category: cs.CV TL;DR: This paper proposes ROMAN, which combines a multi-head attention network with a traffic law weighting mechanism to generate high-risk violation scenarios, making testing of automated driving systems (ADS) more thorough and targeted. Experiments show that it outperforms existing methods in both violation count and scenario diversity, and that it covers every clause of the input traffic laws.
Details Motivation: Current ADS testing struggles to generate complex, high-risk law-breaking scenarios and neglects multi-vehicle interactions and critical situations, leaving safety validation incomplete. Method: Proposes ROMAN, which uses a multi-head attention network to model interactions among vehicles, traffic signals, and other factors, and introduces a large language model (LLM)-based risk weighting module that quantifies the risk of traffic law violations along two dimensions, severity and occurrence. Result: When testing Baidu Apollo in CARLA, ROMAN achieves an average violation count 7.91% higher than ABLE and 55.96% higher than LawBreaker, with greater scenario diversity, and is the only method that generates violations covering every clause of the input traffic laws. Conclusion: ROMAN substantially improves the generation and coverage of high-risk law-breaking scenarios in ADS testing, providing an effective means of validation for safer and more reliable autonomous driving deployment. Abstract: The Automated Driving System (ADS) acts as the brain of autonomous vehicles, responsible for their safety and efficiency. Safe deployment requires thorough testing in diverse real-world scenarios and compliance with traffic laws like speed limits, signal obedience, and right-of-way rules. Violations like running red lights or speeding pose severe safety risks. However, current testing approaches face significant challenges: limited ability to generate complex and high-risk law-breaking scenarios, and failing to account for complex interactions involving multiple vehicles and critical situations. To address these challenges, we propose ROMAN, a novel scenario generation approach for ADS testing that combines a multi-head attention network with a traffic law weighting mechanism. ROMAN is designed to generate high-risk violation scenarios to enable more thorough and targeted ADS evaluation. The multi-head attention mechanism models interactions among vehicles, traffic signals, and other factors. The traffic law weighting mechanism implements a workflow that leverages an LLM-based risk weighting module to evaluate violations based on the two dimensions of severity and occurrence. We have evaluated ROMAN by testing the Baidu Apollo ADS within the CARLA simulation platform and conducting extensive experiments to measure its performance. Experimental results demonstrate that ROMAN surpassed state-of-the-art tools ABLE and LawBreaker by achieving 7.91% higher average violation count than ABLE and 55.96% higher than LawBreaker, while also maintaining greater scenario diversity. In addition, only ROMAN successfully generated violation scenarios for every clause of the input traffic laws, enabling it to identify more high-risk violations than existing approaches.
### [131] [UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos](https://arxiv.org/abs/2602.05638) *Jinlin Wu,Felix Holm,Chuxi Chen,An Wang,Yaxin Hu,Xiaofan Ye,Zelin Zang,Miao Xu,Lihua Zhou,Huai Liao,Danny T. M. Chan,Ming Feng,Wai S. Poon,Hongliang Ren,Dong Yi,Nassir Navab,Gaofeng Meng,Jiebo Luo,Hongbin Liu,Zhen Lei* Main category: cs.CV TL;DR: UniSurg is a new foundation model for surgical video that abandons pixel-level reconstruction in favor of predicting latent motion representations, improves semantic understanding through three technical innovations, and significantly outperforms existing methods on a range of surgical video analysis tasks.
Details Motivation: Existing surgical video analysis methods rely on pixel-level reconstruction objectives, which waste model capacity on low-level visual details (such as smoke, specular reflections, and fluid motion) and neglect the semantic structures that are essential for surgical understanding. Method: Proposes UniSurg, built on the V-JEPA architecture, with three innovations: 1) motion-guided latent prediction to focus on semantic regions; 2) spatiotemporal affinity self-distillation to maintain relational consistency; 3) feature diversity regularization to prevent representation collapse in texture-sparse scenes. A large-scale surgical video dataset, UniSurg-15M, is built for pretraining. Result: Leads across 17 benchmarks, including surgical workflow recognition (+14.6% F1 on EgoSurgery, +10.3% on PitVis), action triplet recognition (39.54% mAP-IVT on CholecT50), skill assessment, polyp segmentation, and depth estimation. Conclusion: UniSurg establishes a new standard for motion-oriented, universal surgical video understanding. Abstract: While foundation models have advanced surgical video analysis, current approaches rely predominantly on pixel-level reconstruction objectives that waste model capacity on low-level visual details - such as smoke, specular reflections, and fluid motion - rather than semantic structures essential for surgical understanding. We present UniSurg, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Built on the Video Joint Embedding Predictive Architecture (V-JEPA), UniSurg introduces three key technical innovations tailored to surgical videos: 1) motion-guided latent prediction to prioritize semantically meaningful regions, 2) spatiotemporal affinity self-distillation to enforce relational consistency, and 3) feature diversity regularization to prevent representation collapse in texture-sparse surgical scenes. To enable large-scale pretraining, we curate UniSurg-15M, the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. Extensive experiments across 17 benchmarks demonstrate that UniSurg significantly outperforms state-of-the-art methods on surgical workflow recognition (+14.6% F1 on EgoSurgery, +10.3% on PitVis), action triplet recognition (39.54% mAP-IVT on CholecT50), skill assessment, polyp segmentation, and depth estimation. These results establish UniSurg as a new standard for universal, motion-oriented surgical video understanding.
### [132] [Enhancing Personality Recognition by Comparing the Predictive Power of Traits, Facets, and Nuances](https://arxiv.org/abs/2602.05650) *Amir Ansari,Jana Subirana,Bruna Silva,Sergio Escalera,David Gallardo-Pujol,Cristina Palmero* Main category: cs.CV TL;DR: This paper examines whether the more granular levels of the Big-Five personality model (such as nuances) improve personality recognition from audiovisual interaction data, and finds that nuance-level models significantly outperform models trained at coarser levels.
Details Motivation: Existing methods rely on broad personality trait scores as ground-truth labels and have limited training data, which leads to poor generalization; similar trait scores can arise from diverse, context-dependent behaviors. Method: Using the UDIVA v0.5 dataset, builds a Transformer model that combines cross-modal (audiovisual) and cross-subject (dyad-aware) attention mechanisms, and trains and compares it at three levels: traits, facets, and nuances. Result: Nuance-level models consistently outperform facet-level and trait-level models across all interaction scenarios, reducing mean squared error by up to 74%. Conclusion: Personality recognition should favor more granular annotation levels (such as nuances), which can substantially improve model performance and generalization. Abstract: Personality is a complex, hierarchical construct typically assessed through item-level questionnaires aggregated into broad trait scores. Personality recognition models aim to infer personality traits from different sources of behavioral data. However, reliance on broad trait scores as ground truth, combined with limited training data, poses challenges for generalization, as similar trait scores can manifest through diverse, context-dependent behaviors. In this work, we explore the predictive impact of the more granular hierarchical levels of the Big-Five Personality Model, facets and nuances, to enhance personality recognition from audiovisual interaction data. Using the UDIVA v0.5 dataset, we trained a transformer-based model including cross-modal (audiovisual) and cross-subject (dyad-aware) attention mechanisms. Results show that nuance-level models consistently outperform facet and trait-level models, reducing mean squared error by up to 74% across interaction scenarios.
### [133] [ShapeUP: Scalable Image-Conditioned 3D Editing](https://arxiv.org/abs/2602.05676) *Inbar Gat,Dana Cohen-Bar,Guy Levy,Elad Richardson,Daniel Cohen-Or* Main category: cs.CV TL;DR: ShapeUP is a scalable, image-prompted 3D editing framework that achieves high-fidelity, structurally consistent editing of 3D assets through supervised latent-space translation.
Details Motivation: Existing 3D editing methods struggle to balance visual controllability, geometric consistency, and scalability: optimization-based methods are slow, multi-view 2D propagation is prone to drift, and training-free latent editing is bound by frozen priors. Method: Proposes the ShapeUP framework, which formulates 3D editing as a supervised latent-to-latent translation task within a native 3D representation; it uses a pretrained 3D foundation model as a strong generative prior and trains a 3D Diffusion Transformer (DiT) with supervision on triplets of source 3D shape, edited image, and target 3D shape. Result: Outperforms existing trained and training-free baselines in identity preservation and edit fidelity, supports fine-grained local and global edits and implicit mask-free localization, and strictly preserves the structure of the original asset. Conclusion: ShapeUP offers a robust, scalable paradigm for native 3D content creation that eases the current trade-off among controllability, consistency, and scalability in 3D editing. Abstract: Recent advancements in 3D foundation models have enabled the generation of high-fidelity assets, yet precise 3D manipulation remains a significant challenge. Existing 3D editing frameworks often face a difficult trade-off between visual controllability, geometric consistency, and scalability. Specifically, optimization-based methods are prohibitively slow, multi-view 2D propagation techniques suffer from visual drift, and training-free latent manipulation methods are inherently bound by frozen priors and cannot directly benefit from scaling. In this work, we present ShapeUP, a scalable, image-conditioned 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation. This formulation allows ShapeUP to build on a pretrained 3D foundation model, leveraging its strong generative prior while adapting it to editing through supervised training. In practice, ShapeUP is trained on triplets consisting of a source 3D shape, an edited 2D image, and the corresponding edited 3D shape, and learns a direct mapping using a 3D Diffusion Transformer (DiT). This image-as-prompt approach enables fine-grained visual control over both local and global edits and achieves implicit, mask-free localization, while maintaining strict structural consistency with the original asset. Our extensive evaluations demonstrate that ShapeUP consistently outperforms current trained and training-free baselines in both identity preservation and edit fidelity, offering a robust and scalable paradigm for native 3D content creation.
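### [134] [Poster: Camera Tampering Detection for Outdoor IoT Systems](https://arxiv.org/abs/2602.05706) *Shadi Attarha,Kanaga Shanmugi,Anna Förster* Main category: cs.CV TL;DR: This paper proposes two camera tampering detection methods, one rule-based and one based on deep learning, and evaluates their accuracy, computational demands, and training data requirements in real-world scenarios. The deep-learning model is more accurate, while the rule-based method is better suited to settings with limited resources where prolonged calibration is impractical; public datasets are provided to support follow-up research.
Details Motivation: Smart cameras deployed outdoors are vulnerable to deliberate vandalism and harsh environmental conditions, which can disable monitoring; detecting tampering from still images is harder than from video because there is no sequence of continuous frames. Method: Proposes two tampering detection methods, a rule-based method and a deep-learning-based method, and compares their accuracy, computational overhead, and training data requirements in real-world scenarios. Result: The deep-learning model is more accurate; the rule-based method is better suited to resource-constrained settings where prolonged calibration is not possible; public datasets containing normal, blurred, and rotated images are released. Conclusion: Each method has its own use case: deep learning for high-accuracy requirements and the rule-based method for low-resource environments; the public datasets fill a resource gap in this area. Abstract: Recently, the use of smart cameras in outdoor settings has grown to improve surveillance and security. Nonetheless, these systems are susceptible to tampering, whether from deliberate vandalism or harsh environmental conditions, which can undermine their monitoring effectiveness. In this context, detecting camera tampering is more challenging when a camera is capturing still images rather than video as there is no sequence of continuous frames over time. In this study, we propose two approaches for detecting tampered images: a rule-based method and a deep-learning-based method. The aim is to evaluate how each method performs in terms of accuracy, computational demands, and the data required for training when applied to real-world scenarios. Our results show that the deep-learning model provides higher accuracy, while the rule-based method is more appropriate for scenarios where resources are limited and a prolonged calibration phase is impractical. We also offer publicly available datasets with normal, blurred, and rotated images to support the development and evaluation of camera tampering detection methods, addressing the need for such resources.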
### [135] [Exploring the Temporal Consistency for Point-Level Weakly-Supervised Temporal Action Localization](https://arxiv.org/abs/2602.05718) *Yunchuan Ma,Laiyun Qing,Guorong Li,Yuqing Liu,Yuankai Qi,Qingming Huang* Main category: cs.CV TL;DR: This paper proposes a multi-task learning framework that uses three self-supervised temporal understanding tasks (action completion, action order understanding, and action regularity understanding) to strengthen a point-supervised temporal action localization model's grasp of temporal relationships between frames.
Details Motivation: Existing point-supervised temporal action localization methods do not explicitly model the temporal relationships among the frames of an action, yet this temporal understanding is crucial for accurately localizing complete action segments. Method: Designs a multi-task learning framework with three self-supervised temporal understanding tasks (action completion, action order understanding, and action regularity understanding) to make full use of point supervision and improve the model's temporal understanding. Result: Extensive experiments on four benchmark datasets show that the proposed method outperforms several state-of-the-art methods. Conclusion: Explicitly modeling the temporal consistency of actions can significantly improve point-supervised temporal action localization; according to the authors, this is the first work to explore it explicitly. Abstract: Point-supervised Temporal Action Localization (PTAL) adopts a lightly frame-annotated paradigm (i.e., labeling only a single frame per action instance) to train a model to effectively locate action instances within untrimmed videos. Most existing approaches design the task head of models with only a point-supervised snippet-level classification, without explicitly modeling the temporal relationships among the frames of an action. However, understanding the temporal relationships of frames is crucial because it can help a model understand how an action is defined and therefore benefits localizing the full frames of an action. To this end, in this paper, we design a multi-task learning framework that fully utilizes point supervision to boost the model's temporal understanding capability for action localization. Specifically, we design three self-supervised temporal understanding tasks: (i) Action Completion, (ii) Action Order Understanding, and (iii) Action Regularity Understanding. These tasks help a model understand the temporal consistency of actions across videos. To the best of our knowledge, this is the first attempt to explicitly explore temporal consistency for point-supervised action localization. Extensive experimental results on four benchmark datasets demonstrate the effectiveness of the proposed method compared to several state-of-the-art approaches.
### [136] [Adaptive Global and Fine-Grained Perceptual Fusion for MLLM Embeddings Compatible with Hard Negative Amplification](https://arxiv.org/abs/2602.05729) *Lexiang Hu,Youze Xue,Dian Li,Gang Liu,Zhouchen Lin* Main category: cs.CV TL;DR: This paper proposes AGFF-Embed, which prompts an MLLM to generate semantic embeddings along multiple dimensions and fuses them adaptively, and combines this with the EGA technique to strengthen hard negatives, significantly improving multimodal embeddings on both global and fine-grained understanding.
Details Motivation: Existing CLIP-based and MLLM-based embedding models capture only global semantics, whereas complex real-world scenarios require both global and fine-grained perception, and a compatible fusion mechanism is missing. Method: Proposes AGFF-Embed, which uses the MLLM to generate multiple embeddings that target different semantic dimensions and aggregates them adaptively and smoothly; introduces Explicit Gradient Amplification (EGA) to enhance in-batch hard negatives without editing the dataset. Result: On the MMEB and MMVP-VLM benchmarks, AGFF-Embed reaches SOTA performance on both general and fine-grained understanding tasks. Conclusion: AGFF-Embed unifies the modeling of global and fine-grained perception and gives MLLM embeddings a more robust and flexible multi-granularity representation. Abstract: Multimodal embeddings serve as a bridge for aligning vision and language, with the two primary implementations -- CLIP-based and MLLM-based embedding models -- both limited to capturing only global semantic information. Although numerous studies have focused on fine-grained understanding, we observe that complex scenarios currently targeted by MLLM embeddings often involve a hybrid perceptual pattern of both global and fine-grained elements, thus necessitating a compatible fusion mechanism. In this paper, we propose Adaptive Global and Fine-grained perceptual Fusion for MLLM Embeddings (AGFF-Embed), a method that prompts the MLLM to generate multiple embeddings focusing on different dimensions of semantic information, which are then adaptively and smoothly aggregated. Furthermore, we adapt AGFF-Embed with the Explicit Gradient Amplification (EGA) technique to achieve in-batch hard negatives enhancement without requiring fine-grained editing of the dataset. Evaluation on the MMEB and MMVP-VLM benchmarks shows that AGFF-Embed comprehensively achieves state-of-the-art performance in both general and fine-grained understanding compared to other multimodal embedding models.
### [137] [Depth as Prior Knowledge for Object Detection](https://arxiv.org/abs/2602.05730) *Moussa Kassem Sbeyti,Nadja Klein* Main category: cs.CV TL;DR: This paper proposes DepthPrior, a framework that uses depth information as prior knowledge to improve small-object detection without modifying the detector architecture. Through depth-based loss weighting, depth-based loss stratification, and depth-aware confidence thresholding, it significantly improves small-object detection accuracy across several datasets and detectors.
Details Motivation: Small and distant objects are hard to detect because of scale variation, low resolution, and background clutter, and safety-critical applications need reliable detection; existing methods that exploit depth require complex, model-specific architectural modifications. Method: Analyzes the depth-detection relationship theoretically and empirically, and proposes the DepthPrior framework, consisting of Depth-Based Loss Weighting (DLW) and Depth-Based Loss Stratification (DLS) during training and Depth-Aware Confidence Thresholding (DCT) during inference. Result: Across four benchmarks (KITTI, MS COCO, VisDrone, SUN RGB-D) and two detectors (YOLOv11, EfficientDet), small-object mAP_S improves by up to +9% and mAR_S by up to +7%, with inference recovery rates as high as 95:1 (true vs. false detections). Conclusion: DepthPrior improves the robustness and accuracy of small-object detection without additional sensors, architectural changes, or performance costs, and generalizes well in practice. Abstract: Detecting small and distant objects remains challenging for object detectors due to scale variation, low resolution, and background clutter. Safety-critical applications require reliable detection of these objects for safe planning. Depth information can improve detection, but existing approaches require complex, model-specific architectural modifications. We provide a theoretical analysis followed by an empirical investigation of the depth-detection relationship. Together, they explain how depth causes systematic performance degradation and why depth-informed supervision mitigates it. We introduce DepthPrior, a framework that uses depth as prior knowledge rather than as a fused feature, providing comparable benefits without modifying detector architectures. DepthPrior consists of Depth-Based Loss Weighting (DLW) and Depth-Based Loss Stratification (DLS) during training, and Depth-Aware Confidence Thresholding (DCT) during inference. The only overhead is the initial cost of depth estimation. Experiments across four benchmarks (KITTI, MS COCO, VisDrone, SUN RGB-D) and two detectors (YOLOv11, EfficientDet) demonstrate the effectiveness of DepthPrior, achieving up to +9% mAP$_S$ and +7% mAR$_S$ for small objects, with inference recovery rates as high as 95:1 (true vs. false detections). DepthPrior offers these benefits without additional sensors, architectural changes, or performance costs. Code is available at https://github.com/mos-ks/DepthPrior.
### [138] [Neuro-Inspired Visual Pattern Recognition via Biological Reservoir Computing](https://arxiv.org/abs/2602.05737) *Luca Ciampi,Ludovico Iannello,Fabrizio Tonelli,Gabriele Lagani,Angelo Di Garbo,Federico Cremisi,Giuseppe Amato* Main category: cs.CV TL;DR: This paper proposes a biological reservoir computing (BRC) approach based on in vitro cultured cortical neurons. A high-density multi-electrode array (HD-MEA) stimulates and records neural activity, and a linear readout layer performs static visual pattern recognition, validating living neural networks as a computational reservoir.
Details Motivation: To move beyond artificial recurrent models that only approximate neural dynamics, and to explore the spontaneous and evoked activity of real living neural circuits as a natural computational substrate, advancing neuromorphic computing and biologically inspired machine learning. Method: Uses a network of in vitro cultured cortical neurons as the physical reservoir; input stimuli are delivered through the HD-MEA while neural responses are recorded simultaneously across hundreds of channels; the high-dimensional neural responses serve as the feature representation for training a single-layer perceptron classifier. Result: The system classifies accurately on tasks of increasing difficulty, from pointwise stimuli and oriented bars to clock-digit-like shapes and MNIST handwritten digits, showing that living cortical networks can produce robust, high-dimensional representations despite biological variability (noise, spontaneous activity, and inter-session differences). Conclusion: In vitro cortical networks can serve as effective biological reservoirs for static visual pattern recognition, providing empirical grounding for integrating living neural hardware into neuromorphic computing and supporting the design of new, efficient computational models driven by biological principles. Abstract: In this paper, we present a neuro-inspired approach to reservoir computing (RC) in which a network of in vitro cultured cortical neurons serves as the physical reservoir. Rather than relying on artificial recurrent models to approximate neural dynamics, our biological reservoir computing (BRC) system leverages the spontaneous and stimulus-evoked activity of living neural circuits as its computational substrate. A high-density multi-electrode array (HD-MEA) provides simultaneous stimulation and readout across hundreds of channels: input patterns are delivered through selected electrodes, while the remaining ones capture the resulting high-dimensional neural responses, yielding a biologically grounded feature representation. A linear readout layer (single-layer perceptron) is then trained to classify these reservoir states, enabling the living neural network to perform static visual pattern-recognition tasks within a computer-vision framework. We evaluate the system across a sequence of tasks of increasing difficulty, ranging from pointwise stimuli to oriented bars, clock-digit-like shapes, and handwritten digits from the MNIST dataset. Despite the inherent variability of biological neural responses, arising from noise, spontaneous activity, and inter-session differences, the system consistently generates high-dimensional representations that support accurate classification. These results demonstrate that in vitro cortical networks can function as effective reservoirs for static visual pattern recognition, opening new avenues for integrating living neural substrates into neuromorphic computing frameworks. More broadly, this work contributes to the effort to incorporate biological principles into machine learning and supports the goals of neuro-inspired vision by illustrating how living neural systems can inform the design of efficient and biologically grounded computational models.
### [139] [FMPose3D: monocular 3D pose estimation via flow matching](https://arxiv.org/abs/2602.05755) *Ti Wang,Xiaohang Yu,Mackenzie Weygandt Mathis* Main category: cs.CV TL;DR: This paper proposes FMPose3D, an efficient monocular 3D pose estimation framework based on Flow Matching. It uses ODE modeling to generate multiple hypotheses quickly and a Reprojection-based Posterior Expectation Aggregation (RPEA) module to improve accuracy, reaching SOTA on both human and animal 3D pose datasets.
Details Motivation: Monocular 3D pose estimation is ill-posed because of depth ambiguity and occlusion; existing diffusion models perform well but are slow at inference, requiring many denoising steps. Method: Proposes the FMPose3D framework, which formulates 3D pose estimation as a conditional distribution transport problem and uses flow matching to learn a continuous ODE velocity field from a standard Gaussian prior to the conditional 3D pose distribution; diverse pose hypotheses are generated by sampling different noise seeds, and a Reprojection-based Posterior Expectation Aggregation (RPEA) module fuses the hypotheses into a single prediction. Result: Achieves SOTA performance on the human datasets Human3.6M and MPI-INF-3DHP and on the animal datasets Animal3D and CtrlAni3D. Conclusion: Flow matching provides a more efficient and more general probabilistic modeling paradigm for monocular 3D pose estimation; FMPose3D balances hypothesis diversity with prediction accuracy and extends to cross-species 3D pose estimation. Abstract: Monocular 3D pose estimation is fundamentally ill-posed due to depth ambiguity and occlusions, thereby motivating probabilistic methods that generate multiple plausible 3D pose hypotheses. In particular, diffusion-based models have recently demonstrated strong performance, but their iterative denoising process typically requires many timesteps for each prediction, making inference computationally expensive. In contrast, we leverage Flow Matching (FM) to learn a velocity field defined by an Ordinary Differential Equation (ODE), enabling efficient generation of 3D pose samples with only a few integration steps. We propose a novel generative pose estimation framework, FMPose3D, that formulates 3D pose estimation as a conditional distribution transport problem. It continuously transports samples from a standard Gaussian prior to the distribution of plausible 3D poses conditioned only on 2D inputs. Although ODE trajectories are deterministic, FMPose3D naturally generates various pose hypotheses by sampling different noise seeds. To obtain a single accurate prediction from those hypotheses, we further introduce a Reprojection-based Posterior Expectation Aggregation (RPEA) module, which approximates the Bayesian posterior expectation over 3D hypotheses. FMPose3D surpasses existing methods on the widely used human pose estimation benchmarks Human3.6M and MPI-INF-3DHP, and further achieves state-of-the-art performance on the 3D animal pose datasets Animal3D and CtrlAni3D, demonstrating strong performance across both 3D pose domains. The code is available at https://github.com/AdaptiveMotorControlLab/FMPose3D.
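The sampling idea in the FMPose3D abstract (integrate an ODE velocity field for a few steps, starting from different noise seeds, then aggregate the hypotheses) can be sketched on a toy problem. Everything below is an assumption made for illustration: the hand-written `toy_velocity` stands in for a learned network, a single 3D point stands in for a pose, the two depth modes mimic a front/back ambiguity, and `aggregate` is a simple error-weighted average standing in for RPEA, whose exact form the abstract does not specify.

```python
import math
import random


def toy_velocity(x, t, obs_xy, depth=1.0):
    """Stand-in for a learned field v(x, t | 2D input): move along a
    straight line toward the depth mode that shares the sign of the
    current z (two plausible depths for the same 2D observation)."""
    z_mode = depth if x[2] >= 0 else -depth
    target = (obs_xy[0], obs_xy[1], z_mode)
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_hypothesis(obs_xy, steps, rng):
    """Euler integration from t=0 to t=1, starting at Gaussian noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(3)]
    for k in range(steps):
        t = k / steps
        v = toy_velocity(x, t, obs_xy)
        x = [xi + vi / steps for xi, vi in zip(x, v)]
    return x


def aggregate(hypotheses, errors, temperature=1.0):
    """Weighted mean of hypotheses with weights exp(-error / temperature);
    the errors would come from reprojecting each hypothesis to 2D."""
    weights = [math.exp(-e / temperature) for e in errors]
    total = sum(weights)
    return [sum(w * h[i] for w, h in zip(weights, hypotheses)) / total
            for i in range(len(hypotheses[0]))]


rng = random.Random(0)
hypotheses = [sample_hypothesis((0.3, -0.2), steps=4, rng=rng) for _ in range(8)]
```

The trajectories are deterministic given the starting noise, so diversity comes only from the seed, as the abstract notes. Because the toy field follows straight lines, four Euler steps already land on a mode; a learned field would generally be less exact with so few steps.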
### [140] [ReText: Text Boosts Generalization in Image-Based Person Re-identification](https://arxiv.org/abs/2602.05785) *Timur Mamedov,Karina Kvanchiani,Anton Konushin,Vadim Konushin* Main category: cs.CV TL;DR: ReText is a novel image-based person re-identification method that performs multi-task joint learning on multi-camera Re-ID data combined with single-camera data accompanied by textual descriptions, significantly improving cross-domain generalization.
Details Motivation: Existing methods reduce the domain gap but depend on complex architectures; single-camera data is easy to collect and stylistically diverse, but it lacks cross-view variation and carries limited semantic information. Method: ReText jointly optimizes three tasks on a mixture of multi-camera Re-ID data and single-camera data with textual descriptions: (1) multi-camera Re-ID, (2) image-text matching, and (3) text-guided image reconstruction on single-camera data. Result: ReText significantly outperforms state-of-the-art methods on multiple cross-domain Re-ID benchmarks and generalizes well. Conclusion: According to the authors, this is the first work to apply multimodal joint learning to image-based person Re-ID on a mixture of multi-camera and single-camera data, and it shows that text augmentation improves generalization. Abstract: Generalizable image-based person re-identification (Re-ID) aims to recognize individuals across cameras in unseen domains without retraining. While multiple existing approaches address the domain gap through complex architectures, recent findings indicate that better generalization can be achieved by stylistically diverse single-camera data. Although this data is easy to collect, it lacks complexity due to minimal cross-view variation. We propose ReText, a novel method trained on a mixture of multi-camera Re-ID data and single-camera data, where the latter is complemented by textual descriptions to enrich semantic cues. During training, ReText jointly optimizes three tasks: (1) Re-ID on multi-camera data, (2) image-text matching, and (3) image reconstruction guided by text on single-camera data. Experiments demonstrate that ReText achieves strong generalization and significantly outperforms state-of-the-art methods on cross-domain Re-ID benchmarks. To the best of our knowledge, this is the first work to explore multimodal joint learning on a mixture of multi-camera and single-camera data in image-based person Re-ID.
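### [141] [Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation](https://arxiv.org/abs/2602.05789) *Hengyi Wang,Ruiqiang Zhang,Chang Liu,Guanjie Wang,Zehua Ma,Han Fang,Weiming Zhang* Main category: cs.CV TL;DR: This paper proposes Allocentric Perceiver, a training-free strategy that uses geometric experts to recover 3D states from images and builds an instruction-aligned allocentric reference frame, improving VLM performance on allocentric spatial reasoning tasks that require a perspective shift.
Details Motivation: Existing vision-language models (VLMs) are brittle on allocentric spatial queries that require an explicit perspective shift, because they rely on observer-centered (egocentric) reasoning and struggle with spatial reasoning in a target-centric coordinate frame. Method: Uses off-the-shelf geometric experts to recover metric 3D states from one or more images without training, then builds a query-conditioned allocentric reference frame aligned with the semantic intent of the instruction; the reconstructed geometry is deterministically transformed into this target frame, producing geometry-grounded structured representations that are used to prompt the backbone VLM. Result: Across multiple backbone models and spatial reasoning benchmarks, Allocentric Perceiver yields consistent gains of about 10% on allocentric tasks while maintaining strong egocentric performance, surpassing both models finetuned for spatial perception and state-of-the-art open-source and proprietary models. Conclusion: Turning implicit mental rotation into explicit geometric computation can substantially strengthen a VLM's allocentric spatial understanding without extra training, and the approach is general and efficient. Abstract: With the rising need for spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception capabilities in Vision-Language Models (VLMs) are receiving growing focus. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. Thus, we introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts, and then instantiates a query-conditioned allocentric reference frame aligned with the instruction's semantic intent. By deterministically transforming reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perceiver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains ($\sim$10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.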
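The core step in the Allocentric Perceiver abstract, deterministically transforming reconstructed geometry into a target-centric frame, amounts to a rigid change of coordinates. The sketch below illustrates it on the ground plane only. It assumes a camera-frame convention of x to the right and z forward, and all names are hypothetical; how the paper recovers object positions and headings is not shown.

```python
import math


def to_object_frame(point_xz, obj_xz, obj_forward_xz):
    """Express a camera-frame ground-plane point in the frame of an
    object located at obj_xz and facing along obj_forward_xz.
    Returns (right, forward) coordinates relative to the object."""
    fx, fz = obj_forward_xz
    norm = math.hypot(fx, fz)
    fx, fz = fx / norm, fz / norm
    # Right-hand direction of the object seen from above; for a forward
    # direction of (0, 1) this is (1, 0), matching the camera convention.
    rx, rz = fz, -fx
    dx, dz = point_xz[0] - obj_xz[0], point_xz[1] - obj_xz[1]
    return (dx * rx + dz * rz, dx * fx + dz * fz)


def side_of(point_xz, obj_xz, obj_forward_xz):
    """Answer 'is the point on the object's left or right?'"""
    right, _ = to_object_frame(point_xz, obj_xz, obj_forward_xz)
    if right > 0:
        return "right"
    if right < 0:
        return "left"
    return "aligned"


# A chair 5 m ahead that faces the camera, and a lamp 2 m to the
# camera's right of the chair.
answer = side_of((2.0, 5.0), (0.0, 5.0), (0.0, -1.0))
```

From the camera the lamp appears to the right of the chair, but in the chair's own frame it is on the left. This is the kind of perspective shift that the abstract argues is better computed explicitly than left to a VLM's implicit reasoning.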
### [142] [Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning](https://arxiv.org/abs/2602.05809) *Enwei Tong,Yuanchao Bai,Yao Zhu,Junjun Jiang,Xianming Liu* Main category: cs.CV TL;DR: This paper proposes Focus-Scan-Refine (FSR), a training-free visual token pruning method inspired by how humans answer visual questions. It focuses on key evidence, scans for complementary context, and refines by aggregating relevant information, improving the balance between VLM accuracy and inference efficiency without increasing the number of tokens.
Details Motivation: Existing training-free visual token pruning methods struggle to preserve both local evidence and global context at high compression rates, which degrades performance. Method: FSR has three stages: 1) Focus: select key tokens by combining visual importance with instruction relevance, avoiding a bias toward visually salient regions; 2) Scan: conditioned on the focused tokens, select the complementary tokens that differ most from them; 3) Refine: aggregate nearby information into the scan anchors through similarity-based assignment and score-weighted merging, without increasing the total token count. Result: Across multiple VLM backbones and vision-language benchmarks, FSR consistently outperforms existing SOTA pruning methods, achieving a better trade-off between accuracy and efficiency. Conclusion: FSR is a plug-and-play, training-free visual token pruning framework that reduces the high latency and memory overhead of VLMs while maintaining or even improving task performance. Abstract: Vision-language models (VLMs) often generate massive visual tokens that greatly increase inference latency and memory footprint; while training-free token pruning offers a practical remedy, existing methods still struggle to balance local evidence and global context under aggressive compression. We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions: focus on key evidence, then scan globally if needed, and refine the scanned context by aggregating relevant details. FSR first focuses on key evidence by combining visual importance with instruction relevance, avoiding the bias toward visually salient but query-irrelevant regions. It then scans for complementary context conditioned on the focused set, selecting tokens that are most different from the focused evidence. Finally, FSR refines the scanned context by aggregating nearby informative tokens into the scan anchors via similarity-based assignment and score-weighted merging, without increasing the token budget. Extensive experiments across multiple VLM backbones and vision-language benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods. The source code is available at https://github.com/ILOT-code/FSR.
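The three FSR stages described above can be sketched as a small pruning routine. This is a simplified reading of the abstract, not the released implementation: combining importance and relevance by multiplication, using cosine similarity, and merging every leftover token into its most similar scan anchor are all assumptions, and the function names are hypothetical. The sketch assumes `n_focus >= 1` and strictly positive importance scores.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def focus_scan_refine(tokens, importance, relevance, n_focus, n_scan):
    """Return n_focus + n_scan token vectors from the input tokens."""
    indices = list(range(len(tokens)))

    # Focus: keep tokens that are both visually important and relevant
    # to the instruction (product of the two scores is an assumption).
    score = [i * r for i, r in zip(importance, relevance)]
    focus = sorted(indices, key=lambda i: score[i], reverse=True)[:n_focus]

    # Scan: among the rest, pick the tokens least similar to the focused
    # set, so they add complementary context.
    rest = [i for i in indices if i not in focus]

    def max_sim_to_focus(i):
        return max(cosine(tokens[i], tokens[j]) for j in focus)

    scan = sorted(rest, key=max_sim_to_focus)[:n_scan]

    # Refine: assign each leftover token to its most similar scan anchor,
    # then merge each group by an importance-weighted average.
    groups = {a: [a] for a in scan}
    if scan:
        for i in rest:
            if i not in scan:
                anchor = max(scan, key=lambda a: cosine(tokens[i], tokens[a]))
                groups[anchor].append(i)

    dim = len(tokens[0])
    merged = []
    for a in scan:
        members = groups[a]
        weight_sum = sum(importance[m] for m in members)
        merged.append([sum(importance[m] * tokens[m][d] for m in members) / weight_sum
                       for d in range(dim)])

    return [tokens[i] for i in focus] + merged
```

For example, with five 2-D tokens and a budget of one focus token plus two scan anchors, the routine returns three vectors, and the leftover tokens contribute to the anchors instead of being discarded:

```python
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]]
importance = [0.9, 0.5, 0.4, 0.3, 0.2]
relevance = [1.0, 0.2, 0.5, 0.5, 0.1]
pruned = focus_scan_refine(tokens, importance, relevance, n_focus=1, n_scan=2)
```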
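### [143] [NVS-HO: A Benchmark for Novel View Synthesis of Handheld Objects](https://arxiv.org/abs/2602.05822) *Musawar Ali,Manuel Carranza-García,Nicola Fioraio,Samuele Salti,Luigi Di Stefano* Main category: cs.CV TL;DR: This paper presents NVS-HO, the first benchmark for novel view synthesis (NVS) of handheld objects in real-world environments using only RGB inputs. The dataset pairs a handheld sequence (for training) with a calibration board sequence (providing ground truth for evaluation), and baselines built on NeRF and Gaussian Splatting expose the performance bottlenecks of existing methods in unconstrained handheld settings.
Details Motivation: Existing novel view synthesis methods lack a reliable benchmark for real-world handheld object scenarios, making it hard to assess robustness to complex motion, occlusion, and lighting changes. Method: Builds a two-sequence RGB dataset consisting of a handheld sequence (the object is manipulated by hand) and a board sequence (a ChArUco board provides accurate camera poses); uses SfM and a pre-trained VGGT as pose estimators, and NeRF and Gaussian Splatting for NVS modeling. Result: Experiments show that current mainstream NVS methods (such as NeRF and Gaussian Splatting) degrade significantly under handheld conditions, exposing their sensitivity to pose errors and dynamic interaction. Conclusion: NVS-HO provides the first realistic, challenging benchmark for RGB-based novel view synthesis of handheld objects and encourages the development of more robust and practical NVS methods. Abstract: We propose NVS-HO, the first benchmark designed for novel view synthesis of handheld objects in real-world environments using only RGB inputs. Each object is recorded in two complementary RGB sequences: (1) a handheld sequence, where the object is manipulated in front of a static camera, and (2) a board sequence, where the object is fixed on a ChArUco board to provide accurate camera poses via marker detection. The goal of NVS-HO is to learn an NVS model that captures the full appearance of an object from (1), whereas (2) provides the ground-truth images used for evaluation. To establish baselines, we consider both a classical SfM pipeline and a state-of-the-art pre-trained feed-forward neural network (VGGT) as pose estimators, and train NVS models based on NeRF and Gaussian Splatting. Our experiments reveal significant performance gaps in current methods under unconstrained handheld conditions, highlighting the need for more robust approaches. NVS-HO thus offers a challenging real-world benchmark to drive progress in RGB-based novel view synthesis of handheld objects.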
### [144] [Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation](https://arxiv.org/abs/2602.05827) *Hai Zhang,Siqi Liang,Li Chen,Yuxian Li,Yukuan Xu,Yichao Zhong,Fu Zhang,Hongyang Li* Main category: cs.CV TL;DR: This paper proposes SparseVideoNav, the first work to bring video generation models to Beyond-the-View Navigation (BVN). By generating a sparse future video, it achieves sub-second trajectory inference and significantly raises the success rate in real-world zero-shot experiments, including the first effective navigation in night scenes.
Details Motivation: Existing vision-language navigation depends on long, detailed language instructions, which conflicts with the real-world goal of autonomous navigation guided only by simple, high-level intents; meanwhile, LLM-based methods are trained with short-horizon supervision and struggle with BVN tasks that require long-horizon planning. Method: Exploits the fact that video generation models are naturally suited to long-horizon supervision and introduces them to BVN for the first time; to address the latency of video generation, proposes SparseVideoNav, which performs fast trajectory inference guided by a generated sparse future spanning 20 seconds. Result: On real-world zero-shot BVN tasks, SparseVideoNav achieves 2.5x the success rate of SOTA LLM baselines and realizes this capability in challenging night scenes for the first time; inference is 27x faster, reaching sub-second latency. Conclusion: Video generation models offer a new paradigm for BVN, a long-horizon navigation task that demands high autonomy; SparseVideoNav demonstrates their effectiveness and practicality and opens a new path for real-world embodied navigation. Abstract: Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal for navigation in the real world. Ideally, agents should possess the autonomy to navigate in unknown environments guided solely by simple and high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense and step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often suffer from short-sighted behaviors due to their reliance on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision to align with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we propose introducing the video generation model into this field for the first time. Yet, the prohibitive latency for generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, achieving sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a remarkable 27x speed-up compared to the unoptimized counterpart. Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.
### [145] [Weaver: End-to-End Agentic System Training for Video Interleaved Reasoning](https://arxiv.org/abs/2602.05829) *Yudi Shi,Shangzhe Di,Qirui Chen,Qinian Wang,Jiayin Cai,Xiaolong Jiang,Yao Hu,Weidi Xie* Main category: cs.CV TL;DR: This paper proposes Weaver, an end-to-end trainable multimodal reasoning agentic system that improves video reasoning by dynamically invoking diverse tools and applying reinforcement learning, with particularly strong results on long-video tasks.
Details Motivation: Existing video reasoning methods based on text-centric Chain-of-Thought suffer from representational mismatch and limited perceptual ability. Method: Proposes the Weaver system: 1) a policy model dynamically invokes a variety of visual/multimodal tools to progressively acquire key visual cues; 2) reinforcement learning without trajectory supervision lets the system explore tool usage and combination strategies on its own. Result: Significantly improves performance on several complex video reasoning benchmarks, especially long-video tasks. Conclusion: Through a multimodal reasoning paradigm that is embodied, tool-augmented, and driven by reinforcement learning, Weaver effectively overcomes the limitations of traditional text-centric reasoning in video understanding. Abstract: Video reasoning constitutes a comprehensive assessment of a model's capabilities, as it demands robust perceptual and interpretive skills, thereby serving as a means to explore the boundaries of model performance. While recent research has leveraged text-centric Chain-of-Thought reasoning to augment these capabilities, such approaches frequently suffer from representational mismatch and are restricted by limited perceptual acuity. To address these limitations, we propose Weaver, a novel, end-to-end trainable multimodal reasoning agentic system. Weaver empowers its policy model to dynamically invoke diverse tools throughout the reasoning process, enabling progressive acquisition of crucial visual cues and construction of authentic multimodal reasoning trajectories. Furthermore, we integrate a reinforcement learning algorithm to allow the system to freely explore strategies for employing and combining these tools with trajectory-free data. Extensive experiments demonstrate that our system, Weaver, enhances performance on several complex video reasoning benchmarks, particularly those involving long videos.
### [146] [UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents](https://arxiv.org/abs/2602.05832) *Han Xiao,Guozhi Wang,Hao Wang,Shilong Liu,Yuxiang Chai,Yue Pan,Yufeng Zhou,Xiaoxin Chen,Yafei Wen,Hongsheng Li* Main category: cs.CV TL;DR: This paper proposes the UI-Mem framework, which uses a hierarchical experience memory (workflows, subtask skills, and failure patterns) and a stratified group sampling strategy to improve credit assignment and cross-task experience transfer for GUI agents in online reinforcement learning; a self-evolving loop continuously updates the memory, significantly improving performance and generalization.
Details Motivation: Online reinforcement learning for GUI agents suffers from inefficient credit assignment in long-horizon tasks and repeated errors across tasks, both rooted in the lack of an effective experience transfer mechanism. Method: Proposes the UI-Mem framework, comprising: 1) a structured hierarchical experience memory (parameterized templates storing workflows, skills, and failure patterns); 2) stratified group sampling (injecting multiple levels of memory guidance within each rollout group to maintain diversity); 3) a self-evolving loop (automatically abstracting new strategies and errors to update the memory). Result: On online GUI benchmarks, UI-Mem significantly outperforms traditional RL baselines and static reuse methods and generalizes strongly to unseen applications. Conclusion: A structured, evolvable experience memory effectively improves the efficiency, robustness, and cross-task transfer of online learning for GUI agents, offering a new paradigm for agent learning in real interactive settings. Abstract: Online Reinforcement Learning (RL) offers a promising paradigm for enhancing GUI agents through direct environment interaction. However, its effectiveness is severely hindered by inefficient credit assignment in long-horizon tasks and repetitive errors across tasks due to the lack of experience transfer. To address these challenges, we propose UI-Mem, a novel framework that enhances GUI online RL with a Hierarchical Experience Memory. Unlike traditional replay buffers, our memory accumulates structured knowledge, including high-level workflows, subtask skills, and failure patterns. These experiences are stored as parameterized templates that enable cross-task and cross-application transfer. To effectively integrate memory guidance into online RL, we introduce Stratified Group Sampling, which injects varying levels of guidance across trajectories within each rollout group to maintain outcome diversity, driving the unguided policy toward internalizing guided behaviors. Furthermore, a Self-Evolving Loop continuously abstracts novel strategies and errors to keep the memory aligned with the agent's evolving policy. Experiments on online GUI benchmarks demonstrate that UI-Mem significantly outperforms traditional RL baselines and static reuse strategies, with strong generalization to unseen applications. Project page: https://ui-mem.github.io
### [147] [Self-Supervised Learning with a Multi-Task Latent Space Objective](https://arxiv.org/abs/2602.05845) *Pierre-François De Plaen,Abhishek Jha,Luc Van Gool,Tinne Tuytelaars,Marc Proesmans* Main category: cs.CV TL;DR: This paper proposes a new method for stabilizing multi-crop self-supervised learning: it assigns a separate predictor to each view type and adds masked (cutout) views, building a multi-task asymmetric Siamese framework that combines global, local, and masked views and significantly improves models such as BYOL and SimSiam on ImageNet.
Details Motivation: The multi-crop strategy improves SSL performance but causes training instability in predictor-based architectures (such as BYOL, SimSiam, and MoCo v3); the authors aim to resolve this instability and further exploit multi-view alignment. Method: Assign separate predictors to the different view types (global, local, cutout), treat each spatial transformation as a distinct alignment task, and build a multi-task asymmetric Siamese SSL framework. Result: Significantly improves the linear evaluation accuracy of ResNet and ViT on ImageNet, with more stable training and applicability across backbones. Conclusion: View-specific predictors are key to stabilizing multi-crop SSL; extending the idea to masked views such as cutout further improves representation quality; the multi-task framework is simple, effective, and generalizes well. Abstract: Self-supervised learning (SSL) methods based on Siamese networks learn visual representations by aligning different views of the same image. The multi-crop strategy, which adds small local crops to global ones, enhances many SSL frameworks but causes instability in predictor-based architectures such as BYOL, SimSiam, and MoCo v3. We trace this failure to the shared predictor used across all views and demonstrate that assigning a separate predictor to each view type stabilizes multi-crop training, resulting in significant performance gains. Extending this idea, we treat each spatial transformation as a distinct alignment task and add cutout views, where part of the image is masked before encoding. This yields a simple multi-task formulation of asymmetric Siamese SSL that combines global, local, and masked views into a single framework. The approach is stable, generally applicable across backbones, and consistently improves the performance of ResNet and ViT models on ImageNet.
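### [148] [Pathwise Test-Time Correction for Autoregressive Long Video Generation](https://arxiv.org/abs/2602.05871) *Xunzhi Xiang,Zixuan Duan,Guiyu Zhang,Haiyu Zhang,Zhe Gao,Junta Wu,Shaofeng Zhang,Tengfei Wang,Qi Fan,Chunchao Guo* Main category: cs.CV TL;DR: This paper proposes Test-Time Correction (TTC), which uses the initial frame as a stable reference anchor to calibrate intermediate stochastic states during sampling, mitigating error accumulation in distilled autoregressive diffusion models for long video generation without any training.
Details Motivation: Distilled autoregressive diffusion models suffer severe error accumulation in long video synthesis; existing Test-Time Optimization (TTO) methods struggle to mitigate long-sequence drift because of unstable reward landscapes and the hypersensitivity of distilled parameters. Method: Proposes training-free Test-Time Correction (TTC), which uses the initial frame as a stable reference anchor to dynamically calibrate intermediate stochastic states along the sampling trajectory. Result: TTC integrates seamlessly with various distilled models, substantially extends generation length with almost no extra overhead, and matches the quality of resource-intensive training-based methods on 30-second benchmarks. Conclusion: TTC is an efficient, general, training-free test-time correction mechanism that effectively addresses drift in long video generation and makes distilled diffusion models more practical. Abstract: Distilled autoregressive diffusion models facilitate real-time short video synthesis but suffer from severe error accumulation during long-sequence generation. While existing Test-Time Optimization (TTO) methods prove effective for images or short clips, we identify that they fail to mitigate drift in extended sequences due to unstable reward landscapes and the hypersensitivity of distilled parameters. To overcome these limitations, we introduce Test-Time Correction (TTC), a training-free alternative. Specifically, TTC utilizes the initial frame as a stable reference anchor to calibrate intermediate stochastic states along the sampling trajectory. Extensive experiments demonstrate that our method seamlessly integrates with various distilled models, extending generation lengths with negligible overhead while matching the quality of resource-intensive training-based methods on 30-second benchmarks.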
### [149] [Contour Refinement using Discrete Diffusion in Low Data Regime](https://arxiv.org/abs/2602.05880) *Fei Yu Guan,Ian Keefe,Sophie Wilkinson,Daniel D. B. Perrakis,Steven Waslander* Main category: cs.CV TL;DR: This paper proposes a lightweight discrete diffusion contour refinement method for boundary detection of irregular, translucent objects in the low-data regime; combining a CNN with self-attention and a simplified diffusion process, it achieves SOTA performance on several datasets and significantly faster inference.
Details Motivation: Boundary detection is critical in medical imaging, environmental monitoring, and manufacturing, but existing work mostly focuses on segmentation mask alignment, and boundary detection itself remains understudied, especially when labeled data is scarce. Method: Proposes a lightweight discrete diffusion contour refinement pipeline built on a CNN with self-attention that conditions on a segmentation mask and iteratively denoises a sparse contour representation; introduces a simplified diffusion process, a customized network architecture, and minimal post-processing. Result: Outperforms several SOTA baselines on the KVASIR medical dataset, is competitive on HAM10K and the custom wildfire smoke dataset, and improves inference framerate by 3.5x. Conclusion: With fewer than 500 training images, the method produces dense, isolated contours, balancing accuracy and efficiency, and is suited to robust boundary detection in low-resource settings. Abstract: Boundary detection of irregular and translucent objects is an important problem with applications in medical imaging, environmental monitoring and manufacturing, where many of these applications are plagued with scarce labeled data and low in situ computational resources. While recent image segmentation studies focus on segmentation mask alignment with ground-truth, the task of boundary detection remains understudied, especially in the low data regime. In this work, we present a lightweight discrete diffusion contour refinement pipeline for robust boundary detection in the low data regime. We use a Convolutional Neural Network (CNN) architecture with self-attention layers as the core of our pipeline, and condition on a segmentation mask, iteratively denoising a sparse contour representation. We introduce multiple novel adaptations for improved low-data efficacy and inference efficiency, including using a simplified diffusion process, a customized model architecture, and minimal post-processing to produce a dense, isolated contour given a dataset of fewer than 500 training images. Our method outperforms several SOTA baselines on the medical imaging dataset KVASIR, is competitive on HAM10K and our custom wildfire dataset, Smoke, while improving inference framerate by 3.5x.
### [150] [EoCD: Encoder only Remote Sensing Change Detection](https://arxiv.org/abs/2602.05882) *Mubashir Noman,Mustansar Fiaz,Hiyam Debary,Abdul Hannan,Shah Nawaz,Fahad Shahbaz Khan,Salman Khan* Main category: cs.CV TL;DR: This paper proposes EoCD, an encoder-only change detection method that fuses the temporal images early and replaces the decoder with a parameter-free multiscale feature fusion module, significantly reducing model complexity while achieving the best balance between performance and prediction speed.
Details Motivation: Existing change detection methods rely on Siamese encoders and sophisticated decoders, leading to high computational cost and model complexity; early-fusion methods reduce the overhead but perform worse. A simple, efficient method that balances performance and efficiency is needed. Method: Proposes Encoder-only Change Detection (EoCD): it fuses the temporal images with an early-fusion strategy and replaces the conventional decoder with a parameter-free multiscale feature fusion module, relying on the encoder alone to extract and fuse features. Result: EoCD is validated on four challenging change detection datasets, maintaining high performance while greatly increasing prediction speed; performance depends mainly on the encoder, making the decoder a non-essential component. Conclusion: EoCD is a simple, efficient, low-complexity paradigm for change detection; it shows that the encoder dominates performance and that the decoder can be simplified or even removed, suggesting a new direction for lightweight remote sensing change detection. Abstract: Being a cornerstone of temporal analysis, change detection has been playing a pivotal role in modern earth observation. Existing change detection methods rely on the Siamese encoder to individually extract temporal features followed by temporal fusion. Subsequently, these methods design sophisticated decoders to improve the change detection performance without taking into consideration the complexity of the model. These issues increase the overall computational cost as well as the network's complexity, which is undesirable. Alternatively, a few methods utilize the early fusion scheme to combine the temporal images. These methods avoid the extra overhead of the Siamese encoder; however, they also rely on sophisticated decoders for better performance. In addition, these methods demonstrate inferior performance compared to late-fusion-based methods. To bridge these gaps, we introduce encoder only change detection (EoCD), a simple and effective method for the change detection task. The proposed method performs the early fusion of the temporal data and replaces the decoder with a parameter-free multiscale feature fusion module, thereby significantly reducing the overall complexity of the model. EoCD demonstrates the optimal balance between the change detection performance and the prediction speed across a variety of encoder architectures. Additionally, EoCD demonstrates that the performance of the model is predominantly dependent on the encoder network, making the decoder an additional component. 
Extensive experimentation on four challenging change detection datasets reveals the effectiveness of the proposed method.
### [151] [Neural Implicit 3D Cardiac Shape Reconstruction from Sparse CT Angiography Slices Mimicking 2D Transthoracic Echocardiography Views](https://arxiv.org/abs/2602.05884) *Gino E. Jansen,Carolina Brás,R. Nils Planken,Mark J. Schuuring,Berto J. Bouma,Ivana Išgum* Main category: cs.CV TL;DR: This paper proposes a neural implicit function method that reconstructs complete 3D cardiac structures from segmentations of sparse CTA slices mimicking standard transthoracic echocardiography (TTE) views, significantly outperforming the clinically used Simpson's biplane rule.
Details Motivation: Improve the accuracy of 3D quantitative analysis of the cardiac chambers from 2D transthoracic echocardiography (TTE) and overcome its inherent 2D limitation. Method: A multi-layer perceptron learns shape priors from 3D CTA segmentations; at test time, the latent code and the rigid transforms are jointly optimized to map the sparse TTE-mimicking slices into 3D space, and a neural implicit function reconstructs the 3D shapes of the cardiac chambers and left-ventricle myocardium. Result: On a held-out CTA test set, the average Dice coefficient across all structures is 0.86±0.04; left-ventricle and left-atrium volume errors are markedly lower than with Simpson's biplane rule (4.88±4.26 mL vs. 8.14±6.04 mL and 6.40±7.37 mL vs. 37.76±22.96 mL, respectively). Conclusion: The method offers a viable and more accurate route to 3D cardiac chamber quantification from 2D TTE. Abstract: Accurate 3D representations of cardiac structures allow quantitative analysis of anatomy and function. In this work, we propose a method for reconstructing complete 3D cardiac shapes from segmentations of sparse planes in CT angiography (CTA) for application in 2D transthoracic echocardiography (TTE). Our method uses a neural implicit function to reconstruct the 3D shape of the cardiac chambers and left-ventricle myocardium from sparse CTA planes. To investigate the feasibility of achieving 3D reconstruction from 2D TTE, we select planes that mimic the standard apical 2D TTE views. During training, a multi-layer perceptron learns shape priors from 3D segmentations of the target structures in CTA. At test time, the network reconstructs 3D cardiac shapes from segmentations of TTE-mimicking CTA planes by jointly optimizing the latent code and the rigid transforms that map the observed planes into 3D space. For each heart, we simulate four realistic apical views, and we compare reconstructed multi-class volumes with the reference CTA volumes. On a held-out set of CTA segmentations, our approach achieves an average Dice coefficient of 0.86 $\pm$ 0.04 across all structures. Our method also achieves markedly lower volume errors than the clinical standard, Simpson's biplane rule: 4.88 $\pm$ 4.26 mL vs. 8.14 $\pm$ 6.04 mL, respectively, for the left ventricle; and 6.40 $\pm$ 7.37 mL vs. 37.76 $\pm$ 22.96 mL, respectively, for the left atrium. This suggests that our approach offers a viable route to more accurate 3D chamber quantification in 2D transthoracic echocardiography.
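The Dice coefficient reported for entry [151] measures the overlap between a reconstructed mask and a reference mask. The following is a minimal sketch of the standard definition, Dice = 2|A∩B| / (|A| + |B|), on flattened binary masks. The function name and the toy masks are illustrative assumptions, not the authors' evaluation code, and the paper's multi-class averaging is not reproduced here.

```python
def dice_coefficient(pred, ref):
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    # Total foreground voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example (hypothetical data): 2 shared voxels, 3 foreground voxels each.
predicted_mask = [1, 1, 1, 0, 0, 0]
reference_mask = [1, 1, 0, 1, 0, 0]
score = dice_coefficient(predicted_mask, reference_mask)
```

In a multi-structure setting such as the one described, a score like this would typically be computed per structure and then averaged.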
### [152] [CLIP-Map: Structured Matrix Mapping for Parameter-Efficient CLIP Compression](https://arxiv.org/abs/2602.05909) *Kangjie Zhang,Wenxuan Huang,Xin Zhou,Boxiang Zhou,Dejia Song,Yuan Xie,Baochang Zhang,Lizhuang Ma,Nemo Chen,Xu Tang,Yao Hu,Shaohui Lin* Main category: cs.CV TL;DR: This paper proposes CLIP-Map, a mapping-based CLIP compression framework that performs full mapping with learnable matrices and Kronecker factorization and uses diagonal inheritance initialization to ease optimization; it significantly outperforms conventional weight-selection-based compression at high compression ratios.
Details Motivation: CLIP models have high computation and memory costs, making them hard to deploy in resource-limited scenarios; existing weight-selection-based compression methods severely damage feature representation under extreme compression. Method: Proposes the CLIP-Map framework: Full-Mapping with Kronecker Factorization uses learnable matrices to combine the original pretrained weights; Diagonal Inheritance Initialization mitigates distribution shift and makes mapping learning more efficient. Result: Outperforms selection-based compression methods across various compression ratios, with especially large gains at high compression. Conclusion: A mapping-based compression strategy preserves the information and representational capacity of the original CLIP weights better than a selection-based one, making it a better lightweighting approach. Abstract: Contrastive Language-Image Pre-training (CLIP) has been widely applied in various computer vision tasks, e.g., text-to-image generation, image-text retrieval, and image captioning. However, CLIP suffers from high memory and computation cost, which prohibits its use in resource-limited application scenarios. Existing CLIP compression methods typically reduce the size of pre-trained CLIP weights by selecting a subset of them as weight inheritance for further retraining via mask optimization or important weight measurement. However, this selection-based weight inheritance often compromises the feature representation ability, especially under extreme compression. In this paper, we propose a novel mapping-based CLIP compression framework, CLIP-Map. It leverages learnable matrices to map and combine pretrained weights by Full-Mapping with Kronecker Factorization, aiming to preserve as much information from the original weights as possible. To mitigate the optimization challenges introduced by the learnable mapping, we propose Diagonal Inheritance Initialization to reduce the distribution shifting problem for efficient and effective mapping learning. Extensive experimental results demonstrate that the proposed CLIP-Map outperforms select-based frameworks across various compression ratios, with particularly significant gains observed under high compression settings.
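To make the Kronecker-factorized mapping idea in entry [152] concrete, here is a toy sketch of how a mapping matrix built as a Kronecker product of two small factors can combine the rows of a pretrained weight matrix into fewer rows. Everything here is an assumption for illustration: the factor shapes, the values, and the choice to map along rows. The abstract does not specify how CLIP-Map applies its mapping, and Diagonal Inheritance Initialization is not shown.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
         for j in range(len(a[0]) * cols_b)]
        for i in range(len(a) * rows_b)
    ]


def matmul(x, y):
    """Plain matrix product for lists of rows."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


# Hypothetical learnable factors: (1x2) kron (2x2) gives a 2x4 mapping matrix,
# parameterized by 2 + 4 = 6 numbers instead of 8 for a dense 2x4 matrix.
factor_a = [[1.0, 0.5]]
factor_b = [[1.0, 0.0], [0.0, 1.0]]
mapping = kron(factor_a, factor_b)

# Toy "pretrained" weight with 4 rows, combined into 2 rows by the mapping.
pretrained = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
compressed = matmul(mapping, pretrained)
```

The point of the sketch is that every compressed row is a learned combination of all original rows, in contrast to selection-based compression, which keeps a subset and discards the rest. The parameter savings of the factorization grow with matrix size.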
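### [153] [Multi-Scale Global-Instance Prompt Tuning for Continual Test-time Adaptation in Medical Image Segmentation](https://arxiv.org/abs/2602.05937) *Lingrui Li,Yanfeng Zhou,Nan Pu,Xin Chen,Zhun Zhong* Main category: cs.CV TL;DR: This paper proposes Multi-scale Global-Instance Prompt Tuning (MGIPT) to address error accumulation, catastrophic forgetting, and privacy leakage in continual test-time adaptation (CTTA) for medical image semantic segmentation. An Adaptive-scale Instance Prompt (AIP) and a Multi-scale Global-level Prompt (MGP) jointly model instance-level and global domain-level knowledge, and the method's advantage is validated on multiple medical image segmentation benchmarks.
Details Motivation: Medical images exhibit distribution shift across clinical centers, making pre-trained models hard to deploy across domains; existing CTTA methods are prone to error accumulation and catastrophic forgetting caused by parameter updates; prompt-tuning-based methods show promise but still lack multi-scale prompt diversity, incorporate instance knowledge inadequately, and risk privacy leakage. Method: Proposes Multi-scale Global-Instance Prompt Tuning (MGIPT) with two core modules: 1) the Adaptive-scale Instance Prompt (AIP), which dynamically learns lightweight, instance-specific prompts and mitigates error accumulation through an adaptive optimal-scale selection mechanism; 2) the Multi-scale Global-level Prompt (MGP), which captures domain-level knowledge across scales to strengthen resistance to forgetting; the two are combined through a weighted ensemble for robust dual-level (global + instance) adaptation. Result: On multiple medical image segmentation benchmarks, MGIPT significantly outperforms current state-of-the-art methods and adapts robustly to continually changing target domains. Conclusion: By introducing multi-scale and global-instance dual-level prompts, MGIPT effectively mitigates error accumulation, catastrophic forgetting, and privacy issues in CTTA, offering a new approach to robust cross-center deployment of medical imaging models. Abstract: Distribution shift is a common challenge in medical images obtained from different clinical centers, significantly hindering the deployment of pre-trained semantic segmentation models in real-world applications across multiple domains. Continual Test-Time Adaptation (CTTA) has emerged as a promising approach to address cross-domain shifts during continually evolving target domains. Most existing CTTA methods rely on incrementally updating model parameters, which inevitably suffers from error accumulation and catastrophic forgetting, especially in long-term adaptation. Recent prompt-tuning-based works have shown potential to mitigate the two issues above by updating only visual prompts. While these approaches have demonstrated promising performance, several limitations remain: 1) lacking multi-scale prompt diversity, 2) inadequate incorporation of instance-specific knowledge, and 3) risk of privacy leakage. To overcome these limitations, we propose Multi-scale Global-Instance Prompt Tuning (MGIPT) to enhance scale diversity of prompts and capture both global- and instance-level knowledge for robust CTTA. Specifically, MGIPT consists of an Adaptive-scale Instance Prompt (AIP) and a Multi-scale Global-level Prompt (MGP). AIP dynamically learns lightweight and instance-specific prompts to mitigate error accumulation with an adaptive optimal-scale selection mechanism. MGP captures domain-level knowledge across different scales to ensure robust adaptation with anti-forgetting capabilities. 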
These complementary components are combined through a weighted ensemble approach, enabling effective dual-level adaptation that integrates both global and local information. Extensive experiments on medical image segmentation benchmarks demonstrate that our MGIPT outperforms state-of-the-art methods, achieving robust adaptation across continually changing target domains.
### [154] [Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching](https://arxiv.org/abs/2602.05951) *Junwan Kim,Jiho Park,Seonghu Jeon,Seungryong Kim* Main category: cs.CV TL;DR: This paper proposes a source distribution design for conditional flow matching: it learns a condition-dependent source distribution and introduces variance regularization and directional alignment to avoid collapse and instability, significantly improving the convergence speed and performance of text-to-image generation.
Details Motivation: Existing flow matching methods mostly inherit the Gaussian source distribution of diffusion models and do not treat the source distribution itself as something to optimize; principled source design is especially lacking in strongly conditioned generation tasks such as text-to-image. Method: Learn a condition-dependent source distribution; introduce variance regularization and directional alignment between the source and target distributions to stabilize training; analyze how the choice of target representation space affects the effectiveness of structured source distributions. Result: Consistent improvements on multiple text-to-image benchmarks, with up to 3x faster convergence in FID, validating the effectiveness and practicality of source distribution design. Conclusion: Designing the source distribution is not only feasible but also important for modern conditional flow matching systems; a structured, condition-dependent source distribution combined with regularization and alignment significantly improves generation quality and training stability. Abstract: Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generation. Despite its flexibility in allowing arbitrary source distributions, most existing approaches rely on a standard Gaussian distribution, a choice inherited from diffusion models, and rarely consider the source distribution itself as an optimization target in such settings. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems. Specifically, we propose learning a condition-dependent source distribution under the flow matching objective that better exploits rich conditioning signals. We identify key failure modes that arise when directly incorporating conditioning into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space impacts flow matching with structured sources, revealing regimes in which such designs are most effective. Extensive experiments across multiple text-to-image benchmarks demonstrate consistent and robust improvements, including up to 3x faster convergence in FID, highlighting the practical benefits of a principled source distribution design for conditional flow matching.
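As a rough illustration of where a condition-dependent source enters flow matching in entry [154], the sketch below uses the common linear-path formulation: sample a source point x0, interpolate toward a data point x1, and regress a predicted velocity onto x1 - x0. The linear path, the Gaussian form of the source, and all function names are assumptions for illustration. The abstract does not state the paper's exact path or parameterization, and its variance regularization and directional alignment terms are not shown.

```python
import random


def sample_source(cond_mean, sigma, rng):
    """Draw x0 from a Gaussian whose mean depends on the condition (assumed form)."""
    return [m + sigma * rng.gauss(0.0, 1.0) for m in cond_mean]


def interpolate(x0, x1, t):
    """Point on the straight path from source x0 (t=0) to target x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path, constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted_velocity, x0, x1):
    """Mean squared error between a predicted velocity and the path velocity."""
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target)) / len(target)


rng = random.Random(0)
# Hypothetical condition-dependent mean, e.g. produced from a text embedding.
x0 = sample_source(cond_mean=[0.5, -0.5], sigma=1.0, rng=rng)
x1 = [2.0, 1.0]  # stand-in for a data sample
x_mid = interpolate(x0, x1, t=0.5)
```

With a standard Gaussian source, `cond_mean` would be all zeros and `sigma` fixed at 1. Letting the mean (and possibly the scale) depend on the condition is the kind of design freedom the abstract refers to, and an unconstrained `sigma` shrinking toward zero is one way the collapse it mentions could arise.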
### [155] [LSA: Localized Semantic Alignment for Enhancing Temporal Consistency in Traffic Video Generation](https://arxiv.org/abs/2602.05966) *Mirlan Karimov,Teodora Spasojevic,Markus Braun,Julian Wiederer,Vasileios Belagiannis,Marc Pollefeys* Main category: cs.CV TL;DR: This paper proposes Localized Semantic Alignment (LSA), which improves the temporal consistency of pre-trained video generation models by aligning semantic features between real and generated videos in local regions around dynamic objects, with no need for external control signals at inference time.
Details Motivation: Existing controllable video generation methods depend on control signals at inference time to ensure temporal consistency, which limits their potential as scalable, general-purpose data engines. Method: Proposes the Localized Semantic Alignment (LSA) framework, which adds a semantic feature consistency loss over local dynamic-object regions (computed with an off-the-shelf feature extraction model) to the fine-tuning of a pre-trained video generation model and optimizes it jointly with the standard diffusion loss. Result: With only a single epoch of fine-tuning, the method outperforms the baselines on the nuScenes and KITTI datasets; improved temporal consistency is confirmed both by common video generation metrics and by adapted object detection metrics (mAP, mIoU). Conclusion: LSA is a simple, effective method with no inference overhead and no dependence on external control signals, and it significantly improves the temporal consistency and practicality of video generation for autonomous driving scenes. Abstract: Controllable video generation has emerged as a versatile tool for autonomous driving, enabling realistic synthesis of traffic scenarios. However, existing methods depend on control signals at inference time to guide the generative model towards temporally consistent generation of dynamic objects, limiting their utility as scalable and generalizable data engines. In this work, we propose Localized Semantic Alignment (LSA), a simple yet effective framework for fine-tuning pre-trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground-truth and generated video clips. Specifically, we compare the output of an off-the-shelf feature extraction model between the ground-truth and generated video clips localized around dynamic objects, inducing a semantic feature consistency loss. We fine-tune the base model by combining this loss with the standard diffusion loss. The model fine-tuned for a single epoch with our novel loss outperforms the baselines in common video generation evaluation metrics. To further test the temporal consistency in generated videos, we adapt two additional metrics from the object detection task, namely mAP and mIoU. Extensive experiments on nuScenes and KITTI datasets show the effectiveness of our approach in enhancing temporal consistency in video generation without the need for external control signals during inference or any computational overhead.
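### [156] [RISE-Video: Can Video Generators Decode Implicit World Rules?](https://arxiv.org/abs/2602.05986) *Mingxin Liu,Shuran Ma,Shibei Meng,Xiangyu Zhao,Zicheng Zhang,Shaofeng Zhang,Zhihang Zhong,Peixian Chen,Haoyu Cao,Xing Sun,Haodong Duan,Xue Yang* Main category: cs.CV TL;DR: This paper proposes RISE-Video, the first reasoning-oriented benchmark for text-image-to-video generation, which emphasizes understanding of and reasoning over implicit world rules rather than visual fidelity alone. It contains 467 human-annotated samples, eight categories of reasoning tasks, and four evaluation dimensions, and introduces an automated evaluation pipeline based on large multimodal models; experiments show that 11 current mainstream TI2V models have widespread reasoning deficiencies in complex scenarios with implicit constraints.
Details Motivation: Existing generative video models achieve high visual quality but lack the ability to model and reason over implicit world rules (such as commonsense, physical laws, and spatio-temporal logic), so a dedicated reasoning-oriented benchmark is needed. Method: Builds the RISE-Video benchmark with 467 human-annotated samples in eight categories; designs a four-dimensional evaluation protocol (Reasoning Alignment, Temporal Consistency, Physical Rationality, Visual Quality); develops an automated evaluation pipeline based on large multimodal models (LMMs). Result: A systematic evaluation of 11 SOTA TI2V models shows that all of them fall well short on complex reasoning tasks involving implicit constraints, with widespread deficiencies in commonsense and physical rationality in particular. Conclusion: RISE-Video offers a new paradigm for evaluating the cognitive abilities of generative video models, exposes the fundamental limitation that current models favor appearance over reasoning, and points the way toward next-generation video generation models capable of world simulation. Abstract: While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: *Reasoning Alignment*, *Temporal Consistency*, *Physical Rationality*, and *Visual Quality*. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.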
### [157] [VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation](https://arxiv.org/abs/2602.05998) *Jie Deng,Kaichun Yao,Libo Zhang* Main category: cs.CV TL;DR: This paper proposes the VisRefiner framework, which improves screenshot-to-code generation by having the model learn from the visual differences between rendered results and reference designs; combining difference-aligned supervision with a reinforcement learning self-refinement stage, it significantly improves single-step generation quality and layout fidelity.
Details Motivation: Existing models generate code directly from screenshots without observing the visual outcome of the generated code, whereas human developers repeatedly render, compare against the design, and adjust the code based on visual differences; the authors therefore want models to acquire this ability to learn from visual differences. Method: Proposes the VisRefiner training framework: 1) constructs difference-aligned supervision that associates visual differences with the corresponding code edits; 2) introduces a reinforcement learning self-refinement stage in which the model iteratively improves its code by observing the visual differences between the rendered result and the target design. Result: Experiments show that VisRefiner significantly improves single-step generation quality and layout fidelity and gives the model strong self-refinement ability. Conclusion: Learning from visual differences effectively advances screenshot-to-code generation, and VisRefiner offers a new paradigm for this direction. Abstract: Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without observing the visual outcomes of their generated code. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, where the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with strong self-refinement ability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.
### [158] [GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?](https://arxiv.org/abs/2602.06013) *Ruihang Li,Leigang Qu,Jingxu Zhang,Dongnan Gui,Mengde Xu,Xiaosong Zhang,Han Hu,Wenjie Wang,Jiaqi Wang* Main category: cs.CV TL;DR: This paper proposes the GenArena framework, which replaces conventional pointwise scoring with pairwise comparison, significantly improving the stability of visual generation evaluation and its agreement with human perception, and enabling open-source models to outperform top-tier proprietary models as evaluators.
Details Motivation: Visual generation models are advancing rapidly and traditional evaluation methods can no longer keep up, so vision-language models must serve as surrogate judges; however, the prevailing absolute pointwise scoring standard suffers from stochastic inconsistency and poor alignment with human perception. Method: Proposes GenArena, a unified evaluation framework that uses a pairwise comparison paradigm instead of pointwise scoring, and systematically validates its reliability across a range of visual generation tasks. Result: GenArena improves evaluation accuracy by over 20% and reaches a Spearman correlation of 0.86 with the LMArena leaderboard, far above the 0.36 of pointwise methods; with this protocol alone, open-source models outperform top-tier proprietary models as evaluators. Conclusion: Pairwise comparison is a more stable and more human-aligned paradigm for evaluating visual generation, and GenArena provides the community with a rigorous, automated evaluation standard. Abstract: The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard, across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
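The two quantities discussed in entry [158], a ranking derived from pairwise comparisons and its Spearman correlation with a reference leaderboard, can be sketched as follows. This is a simplified illustration under stated assumptions: win rate stands in for whatever aggregation GenArena actually uses (the abstract does not say), the model names and outcomes are made up, and the Spearman formula used is the textbook one that is valid only when there are no tied ranks.

```python
def win_rates(models, outcomes):
    """Fraction of pairwise comparisons each model won; outcomes are (winner, loser)."""
    wins = {m: 0 for m in models}
    games = {m: 0 for m in models}
    for winner, loser in outcomes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {m: (wins[m] / games[m] if games[m] else 0.0) for m in models}


def ranks(values):
    """Rank positions (1 = highest value); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(x, y):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2-1)); no ties assumed."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical judge outcomes over three generators.
models = ["model_a", "model_b", "model_c"]
outcomes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
rates = win_rates(models, outcomes)

# Hypothetical reference leaderboard scores in the same model order.
reference_scores = [3.0, 2.0, 1.0]
rho = spearman([rates[m] for m in models], reference_scores)
```

A pairwise judge only has to decide which of two outputs is better, which is one plausible reason the abstract reports more stable results than asking a judge for an absolute score.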
### [159] [MambaVF: State Space Model for Efficient Video Fusion](https://arxiv.org/abs/2602.06017) *Zixiang Zhao,Yukun Cui,Lilun Deng,Haowen Bai,Haotong Qin,Tao Feng,Konrad Schindler* Main category: cs.CV TL;DR: This paper proposes MambaVF, an efficient video fusion framework based on state space models (SSMs) that models long-range temporal dependencies without optical flow estimation, greatly reducing computation and parameter count while reaching SOTA performance on multiple video fusion tasks.
Details Motivation: Existing video fusion methods rely heavily on optical flow estimation and feature warping, which leads to high computational overhead and poor scalability. Method: Reformulates video fusion as a sequential state update process and uses a lightweight SSM-based fusion module with a spatio-temporal bidirectional scanning mechanism in place of conventional flow-guided alignment. Result: Reaches SOTA on multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks; reduces parameters by up to 92.25% and FLOPs by up to 88.79%, with a 2.1x inference speedup. Conclusion: MambaVF shows that state space models can effectively and efficiently replace explicit motion modeling in video fusion, offering a new paradigm for low-overhead, scalable video fusion. Abstract: Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment via a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that our MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. We highlight that MambaVF is highly efficient, reducing parameters by up to 92.25% and computational FLOPs by up to 88.79% while running 2.1x faster than existing methods. Project page: https://mambavf.github.io
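The "sequential state update" view in entry [159] can be illustrated with the simplest possible discrete state space recurrence, h_t = a·h_{t-1} + b·x_t with output y_t = c·h_t, scanned forward and backward over a sequence. This is a generic scalar toy with fixed, hypothetical coefficients. It is not MambaVF's fusion module, whose scanning is spatio-temporal and whose parameters are presumably learned, and summing the two directions is just one possible way to combine them.

```python
def ssm_scan(inputs, a, b, c):
    """Linear recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t, in one pass."""
    state = 0.0
    outputs = []
    for x in inputs:
        state = a * state + b * x   # state carries information from earlier steps
        outputs.append(c * state)
    return outputs


def bidirectional_scan(inputs, a, b, c):
    """Combine a forward scan and a backward scan so each step sees both directions."""
    forward = ssm_scan(inputs, a, b, c)
    backward = ssm_scan(inputs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


# Toy per-frame signal (hypothetical): an impulse at the first frame.
frames = [1.0, 0.0, 0.0]
forward_only = ssm_scan(frames, a=0.5, b=1.0, c=1.0)
both_directions = bidirectional_scan(frames, a=0.5, b=1.0, c=1.0)
```

Each scan visits every frame once, so the cost grows linearly with sequence length, which matches the linear-complexity claim in the abstract. The forward scan alone lets the first frame influence later frames but not the reverse, which is the gap a bidirectional scan closes.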
### [160] [Context Forcing: Consistent Autoregressive Video Generation with Long Context](https://arxiv.org/abs/2602.06028) *Shuo Chen,Cong Wei,Sun Sun,Ping Nie,Kai Zhou,Ge Zhang,Ming-Hsuan Yang,Wenhu Chen* Main category: cs.CV TL;DR: This paper proposes the Context Forcing framework, which trains a long-context student under the guidance of a long-context teacher to resolve the student-teacher mismatch in existing streaming tuning, and introduces a Slow-Fast Memory architecture for efficient long video generation, significantly improving temporal consistency and context length (>20 seconds).
Details Motivation: Existing real-time long video generation methods supervise a long-context student with a short-context teacher, creating a structural mismatch between the two in modeling long-term temporal dependencies and limiting the student's global consistency. Method: Proposes the Context Forcing framework: 1) builds a long-context teacher that is aware of the full generation history; 2) designs a Slow-Fast Memory context management system that compresses the linearly growing visual context into a low-redundancy dual-rate memory structure. Result: On long video generation at the 2-minute scale, the effective context length exceeds 20 seconds, 2 to 10 times that of SOTA methods such as LongLive and Infinite-RoPE; the method significantly surpasses baselines on multiple long video evaluation metrics. Conclusion: By eliminating the student-teacher context-scale mismatch and improving memory efficiency, Context Forcing achieves more robust and more consistent long video generation and offers a new paradigm for overcoming the long-horizon modeling bottleneck. Abstract: Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical **student-teacher mismatch**: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose **Context Forcing**, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a **Slow-Fast Memory** architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds, 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.
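### [161] [Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation](https://arxiv.org/abs/2602.06032) *David Shavin,Sagie Benaim* Main category: cs.CV TL;DR: This paper proposes Splat and Distill, a framework that lifts the features of a 2D vision foundation model (VFM) into an explicit 3D Gaussian representation and projects them onto novel viewpoints to distill geometrically grounded knowledge, substantially improving the 3D awareness and semantic richness of VFMs.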
Details Motivation: Existing 2D vision foundation models lack 3D awareness, which limits their performance on downstream tasks that require geometric understanding. Method: A feed-forward 3D reconstruction pipeline lifts the teacher's 2D features into a 3D Gaussian representation and "splats" (projects) them onto novel viewpoints, producing supervision signals for distilling the student model. This avoids the feature-averaging artifacts caused by the per-scene optimization used in prior work. Result: The method significantly outperforms prior work on monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation, improving both 3D awareness and the semantic richness of 2D features. Conclusion: Splat and Distill offers an efficient, scalable way to inject 3D geometric priors into 2D VFMs, improving their multi-task generalization and geometric understanding without sacrificing efficiency. Abstract: Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation, in a feed-forward manner. These 3D features are then "splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, "distilling" geometrically grounded knowledge. By replacing slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts, creating a dynamic learning process where the teacher's consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. Our method significantly outperforms prior works, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of 2D features. Project page is available at https://davidshavin4.github.io/Splat-and-Distill/
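### [162] [V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval](https://arxiv.org/abs/2602.06034) *Dongyang Chen,Chaoyang Wang,Dezhao SU,Xi Xiao,Zeyu Zhang,Jing Xiong,Qing Li,Yuzhang Shang,Shichao Ka* Main category: cs.CV TL;DR: This paper proposes V-Retrver, a framework that recasts multimodal retrieval as an agentic reasoning process grounded in visual inspection. The model calls external visual tools to actively gather fine-grained visual evidence, alternating between hypothesis generation and targeted visual verification. The evidence-gathering retrieval agent is trained with a curriculum learning strategy and improves retrieval accuracy and reasoning reliability across multiple benchmarks.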
Details Motivation: Existing methods are largely language-driven and rely on static visual encodings. They cannot actively verify fine-grained visual evidence, which leads to speculative reasoning in visually ambiguous cases. Method: V-Retrver models multimodal retrieval as an agentic reasoning process grounded in visual inspection. The MLLM selectively calls external visual tools to acquire visual evidence and performs multimodal interleaved reasoning that alternates between hypothesis generation and visual verification. Training follows a curriculum that combines supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Result: Across multiple multimodal retrieval benchmarks, retrieval accuracy improves by 23.0% on average, together with gains in the reliability and generalization of perception-driven reasoning. Conclusion: By introducing an evidence-driven mechanism for active visual inspection, V-Retrver overcomes the limitations of the conventional language-centric paradigm and improves the accuracy and robustness of multimodal retrieval. Abstract: Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (with 23.0% improvements on average), perception-driven reasoning reliability, and generalization.
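### [163] [InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions](https://arxiv.org/abs/2602.06035) *Sirui Xu,Samuel Schulter,Morteza Ziyadi,Xialin He,Xiaohan Fei,Yu-Xiong Wang,Liangyan Gui* Main category: cs.CV TL;DR: This paper proposes InterPrior, a framework that learns a unified generative controller through large-scale imitation pretraining followed by reinforcement learning finetuning, enabling humanoids to generalize and compose whole-body loco-manipulation skills across diverse contexts.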
Details Motivation: Humans rarely interact with objects through explicit whole-body motion planning. They rely on high-level intentions (such as affordances) and on underlying physical and motor priors, from which coordinated balance, contact, and manipulation emerge naturally. Scaling such priors is key to improving how well humanoids generalize in complex human-object interactions. Method: InterPrior first uses large-scale imitation learning to distill a full-reference expert policy into a variational policy driven by multimodal observations and high-level intent. It then applies data augmentation with physical perturbations and reinforcement learning finetuning so that the policy generalizes robustly to unseen goals and initial states, consolidating the learned skills into a valid manifold. Result: InterPrior generalizes to new behaviors such as interactions with unseen objects, supports user-interactive control, and shows potential for real robot deployment. Conclusion: By combining imitation learning with reinforcement learning, InterPrior builds a scalable, generalizable motion prior and offers a new approach to physically coherent, flexible whole-body loco-manipulation for humanoids. Abstract: Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified generative controller through large-scale imitation pretraining and post-training by reinforcement learning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. While the distilled policy reconstructs training behaviors, it does not generalize reliably due to the vast configuration space of large-scale human-object interactions. To address this, we apply data augmentation with physical perturbations, and then perform reinforcement learning finetuning to improve competence on unseen goals and initializations. Together, these steps consolidate the reconstructed latent skills into a valid manifold, yielding a motion prior that generalizes beyond the training data, e.g., it can incorporate new behaviors such as interactions with unseen objects. We further demonstrate its effectiveness for user-interactive control and its potential for real robot deployment.
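### [164] [Thinking with Geometry: Active Geometry Integration for Spatial Reasoning](https://arxiv.org/abs/2602.06037) *Haoyuan Li,Qihang Cao,Tao Tang,Kun Xiang,Zihan Guo,Jianhua Han,Hang Xu,Xiaodan Liang* Main category: cs.CV TL;DR: This paper proposes GeoThinker, a framework that uses Spatial-Grounded Fusion and Importance Gating to let multimodal large models integrate geometric information actively and selectively, substantially improving spatial reasoning and reaching a state-of-the-art peak score of 72.6 on VSI-Bench.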
Details Motivation: Existing MLLMs fuse 3D geometric features passively during spatial reasoning, which tends to cause semantic-geometry misalignment and redundant signals. Method: GeoThinker combines Spatial-Grounded Fusion (frame-strict cross-attention at selected VLM layers) with an Importance Gating mechanism, so that the model actively retrieves and fuses task-relevant geometric evidence according to its internal reasoning demands. Result: GeoThinker reaches a peak score of 72.6 on VSI-Bench, a new state of the art in spatial intelligence, and shows strong generalization and improved spatial perception on complex downstream tasks such as embodied referring and autonomous driving. Conclusion: The ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Abstract: Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.
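To make "frame-strict cross-attention" with an importance bias more concrete, here is a minimal sketch under stated assumptions. It assumes that "frame-strict" means a semantic token attends only to geometry tokens from its own frame, and that the gate enters as an additive log-importance bias on the attention logits. Both are interpretations of the abstract, not the released GeoThinker code, and every name in the block is hypothetical.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def frame_strict_gated_attention(
    query: Sequence[float],
    query_frame: int,
    geo_keys: Sequence[Sequence[float]],
    geo_values: Sequence[Sequence[float]],
    geo_frames: Sequence[int],
    importance: Sequence[float],
) -> List[float]:
    """Attend from one semantic token to same-frame geometry tokens only.

    Assumptions (not from the paper): `importance[i]` lies in (0, 1] and is
    added to the scaled dot-product logit as log(importance[i]).
    """
    value_dim = len(geo_values[0]) if geo_values else 0
    # Frame-strict mask: keep only geometry tokens from the query's frame.
    idx = [i for i, f in enumerate(geo_frames) if f == query_frame]
    if not idx:
        return [0.0] * value_dim
    scale = 1.0 / math.sqrt(len(query))
    logits = []
    for i in idx:
        if importance[i] <= 0.0:
            raise ValueError("importance values must be positive")
        dot = sum(q * k for q, k in zip(query, geo_keys[i]))
        logits.append(dot * scale + math.log(importance[i]))
    weights = softmax(logits)
    # Weighted sum of the selected geometry values.
    return [
        sum(w * geo_values[i][d] for w, i in zip(weights, idx))
        for d in range(value_dim)
    ]
```

With all importance values set to 1.0 this reduces to ordinary masked cross-attention. Lowering the importance of a token shifts attention mass toward the other tokens in the same frame.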
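### [165] [SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs](https://arxiv.org/abs/2602.06040) *Jintao Tong,Shilin Yan,Hongwei Xue,Xiaojun Tang,Kunyu Shi,Guannan Zhang,Ruixuan Li,Yixiong Zou* Main category: cs.CV TL;DR: SwimBird is a multimodal large language model with switchable reasoning modes. Conditioned on the input, it adaptively chooses among text-only, vision-only, and interleaved vision-text reasoning, preserving textual logic while substantially improving performance on vision-dense tasks.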
Details Motivation: Existing MLLMs reason mainly with textual chain-of-thought (CoT), which handles vision-intensive tasks poorly. Injecting a fixed number of visual hidden states improves visual ability but degrades text-based logical reasoning. The underlying problem is the inability to adaptively choose the best reasoning modality for each query. Method: SwimBird adopts a hybrid autoregressive formulation that unifies next-token prediction for text with next-embedding prediction for visual thoughts. A reasoning-mode curation strategy is used to build SwimBird-SFT-92K, a supervised fine-tuning dataset that covers all three reasoning modes and supports dynamic mode switching. Result: SwimBird achieves state-of-the-art results on multiple benchmarks covering textual reasoning and challenging visual understanding, with robust gains over methods that use a fixed reasoning pattern. Conclusion: Adaptive switching of reasoning mode is a key route to stronger general multimodal capability in MLLMs, and SwimBird demonstrates the effectiveness and practicality of this paradigm. Abstract: Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs perform reasoning primarily with textual CoT, which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as "visual thoughts" into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for different user queries. We introduce SwimBird, a reasoning-switchable MLLM that dynamically switches among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-dense tasks. Experiments across diverse benchmarks covering textual reasoning and challenging visual understanding demonstrate that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.
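The hybrid autoregressive formulation can be pictured as a single sequence loss in which each step is either a text step or a visual step. The sketch below is an assumption-laden illustration: it uses cross-entropy for text steps and mean squared error for visual-embedding steps, with a hypothetical weighting factor `visual_weight`. The abstract does not state which regression loss or weighting SwimBird actually uses.

```python
import math
from typing import List, Sequence, Tuple, Union

# One step is (kind, prediction, target):
#   kind == "text":   prediction = vocabulary logits, target = token id
#   kind == "visual": prediction = predicted embedding, target = embedding
Step = Tuple[str, Sequence[float], Union[int, Sequence[float]]]


def cross_entropy(logits: Sequence[float], target_id: int) -> float:
    """Negative log-likelihood of `target_id` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def hybrid_loss(steps: List[Step], visual_weight: float = 1.0) -> float:
    """Average per-step loss over an interleaved text/visual sequence.

    Assumption (not from the paper): MSE for visual steps and a scalar
    `visual_weight` balancing the two terms.
    """
    if not steps:
        raise ValueError("steps must be non-empty")
    total = 0.0
    for kind, pred, target in steps:
        if kind == "text":
            total += cross_entropy(pred, target)
        elif kind == "visual":
            total += visual_weight * mse(pred, target)
        else:
            raise ValueError(f"unknown step kind: {kind!r}")
    return total / len(steps)
```

A text-only sequence falls back to ordinary next-token training and a vision-only sequence to pure embedding regression, which mirrors the three reasoning modes described in the abstract.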
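### [166] [Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning](https://arxiv.org/abs/2602.06041) *Xuejun Zhang,Aditi Tiwari,Zhenhailong Wang,Heng Ji* Main category: cs.CV TL;DR: This paper proposes CAMCUE, a framework that uses camera pose as a geometric anchor for cross-view fusion of multi-view images and for novel-view reasoning, and validates its effectiveness and efficiency on perspective understanding and reasoning tasks using the authors' own CAMCUE-DATA dataset.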
Details Motivation: Current multimodal large language models are weak at multi-image spatial reasoning, especially perspective taking, which requires building a coherent 3D scene understanding from multi-view 2D images and reasoning from a new, language-specified viewpoint. Method: CAMCUE injects per-view camera pose into visual tokens, maps natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support question answering. The authors also build CAMCUE-DATA, with 27,668 training and 508 test instances containing multi-view images, poses, viewpoint descriptions, and perspective-shift questions. Result: Overall accuracy on CAMCUE-DATA improves by 9.06%. Target poses predicted from natural-language viewpoint descriptions reach over 90% rotation accuracy (within 20°) and translation accuracy (within a 0.5 error threshold). Inference time drops from 256.6s to 1.45s per example. Conclusion: Through explicit pose modeling and direct language-to-pose grounding, CAMCUE improves both the capability and the efficiency of multi-view spatial reasoning and supports fast, interactive use. Abstract: Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. We also include human-annotated viewpoint descriptions in the test split to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.
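Threshold-based pose accuracy of the kind reported above can be computed as in the sketch below. It assumes the common conventions of geodesic angle between rotation matrices for rotation error and Euclidean distance for translation error. The abstract does not define either error or the unit of the 0.5 threshold, so these choices and all function names are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]   # 3x3 rotation matrix, row-major
Vector = Sequence[float]             # 3D translation
Pose = Tuple[Matrix, Vector]


def rotation_error_deg(r_pred: Matrix, r_gt: Matrix) -> float:
    """Geodesic angle (degrees) between two rotation matrices."""
    # trace(r_pred^T @ r_gt) equals the sum of elementwise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.degrees(math.acos(cos_angle))


def translation_error(t_pred: Vector, t_gt: Vector) -> float:
    """Euclidean distance between two translations (unit left unspecified)."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(t_pred, t_gt)))


def pose_accuracy(
    preds: List[Pose],
    gts: List[Pose],
    rot_thresh_deg: float = 20.0,
    trans_thresh: float = 0.5,
) -> Tuple[float, float]:
    """Fraction of poses within the rotation and translation thresholds."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and equal length")
    rot_ok = sum(
        rotation_error_deg(p[0], g[0]) <= rot_thresh_deg for p, g in zip(preds, gts)
    )
    trans_ok = sum(
        translation_error(p[1], g[1]) <= trans_thresh for p, g in zip(preds, gts)
    )
    return rot_ok / len(preds), trans_ok / len(preds)
```

The two accuracies are reported separately, matching how the abstract quotes a rotation threshold (20°) and a translation threshold (0.5) as distinct criteria.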